Pictures Worth All Their Words: Adding Search to Images in PDFs

March 1, 2019, 9:55 am

Datalogics has recently added OCR (Optical Character Recognition) support to the .NET and Java interfaces of the Adobe PDF Library. Available on Windows and Linux for many Latin character-based languages, OCR support allows users to recover text from images in PDFs. So, why would you want to consider doing this?

The Challenges With Pictures

Images and pictures are an important part of information transfer and archiving. Whether they come from smartphone pictures of receipts, scans of paper documents, or newspaper archives on film, important information is often communicated through images of letters and words rather than actual text. Many pictures within PDF files are only pictures, which leaves the information in them unavailable for indexing and inaccessible to most programs.

Occasionally, we will find “searchable images” inside PDFs – pictures whose text can be searched by PDF viewers. Searchable images go a long way toward turning images of information into actual, accessible data. This is accomplished by using OCR to apply machine vision and reading techniques to images. OCR “reads” pictures much as humans do: it recognizes individual letters and combines them back into actual words.

Enhancing the PDF Process

With our recent enhancements, users of the Datalogics-distributed Adobe PDF Library SDK can take their PDF workflows in new directions, including:

Creating PDFs with searchable images: with OCR, users creating PDFs can scan and recover text from images when importing images into a PDF document. Adding text along with images at document creation unlocks more capabilities for “born-digital” documents. Read-aloud, information interchange, and long-term usability are all better enabled when creating PDFs that contain machine readable text with images.

Enhancing existing PDFs with searchable images: existing PDFs with information locked inside images can be scanned with OCR. Searchable text layers can then be placed underneath the existing images. This enables searching and textual copy and paste from these images – without changing the appearance of these PDFs.

Replacing pictures with text: for files where you know the important content in pictures is text, you can take the OCR process one step further. With our easy-to-use PDF .NET and Java APIs, you can not only recover text from pictures – you can eliminate the pictures and only keep the text portions. This leads to smaller files, faster processing, and more usable data.

Recovering Information From PDFs

At its heart, PDF is a container for various types of information: textual and visual. Optical character recognition in the Datalogics interfaces for the Adobe PDF Library can help you transform pictures into useful text, enhancing the usability and value of new and existing PDF files.

Those who are interested in transforming PDF files into machine-readable information and responsive HTML representations should know that the OCR capabilities discussed above are shared with Datalogics PDF Alchemist. PDF Alchemist takes information retrieval and recovery even further, transforming visually-oriented PDF files into reflowed, re-structured XML and HTML that is suited for information processing tools and workflows.

Whether your interest is in making better PDF files, making your existing PDFs better, or making better use of your PDF files – Datalogics has the technology for you! Feel free to request your free evaluation today.

The post Pictures Worth All Their Words: Adding Search to Images in PDFs appeared first on Datalogics Blog.

↧

OCR with the Adobe PDF Library .NET and Java Interface

May 23, 2019, 7:47 am

Here at Datalogics, we are continuously innovating and providing our customers with more value to better assist them with their PDF document needs. Over the past few months, we’ve added Optical Character Recognition (OCR) support to many of our products. We are excited to announce that OCR support is now available within the Java and .NET interfaces of the Adobe PDF Library. We’ve combined the power of the Adobe PDF Library with Tesseract (a widely used open source OCR engine) to allow users to access and process the data and text within images.

One of the most common use cases for OCR is in preparing documents for searching or extracting the data into another process. By using our OCR APIs, the text data within these images is accessible without modifying the look of the input document. Let’s walk through some of the key components of the API using .NET. You can view the full code by visiting our public sample GitHub repository.

Setting the PageSegmentationMode to Automatic lets the OCR engine choose how to segment the page for text detection. The Performance parameter offers several levels of trade-off between speed and accuracy. In this case, we are selecting the mode that will output the best accuracy. This is a common setting when you are unsure of the quality of your input document. The OCRParams will default to English; you’ll need to use the Languages parameter to select other languages. Multiple languages can be selected at the same time.

Once the OCREngine is configured, we can loop through the content of the document, identify the images, and apply the OCR processing:

The image object is replaced by a form, which contains the original image and the identified text laid out behind it. Once this step is complete, the resulting document can be saved and it will contain the original content and the identified text.
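The replacement step described above can be illustrated with a small, library-independent sketch. It is written in Python rather than .NET and does not use the Adobe PDF Library API: the `Image`, `Text`, and `Form` classes and the `recognize` callback are hypothetical stand-ins for page content elements and the OCR engine. It only shows the shape of the loop: walk the content, recurse into nested forms, and swap each image for a form that holds the recognized text followed by the original image.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union


@dataclass
class Image:
    name: str


@dataclass
class Text:
    value: str


@dataclass
class Form:
    # A form holds its own content list, so forms can contain images too.
    content: List["Element"] = field(default_factory=list)


Element = Union[Image, Text, Form]


def place_text_under(content: List[Element],
                     recognize: Callable[[Image], str]) -> int:
    """Replace each image with a form holding recognized text, then the image.

    `recognize` is a hypothetical OCR callback. Returns the number of images
    that were wrapped.
    """
    wrapped = 0
    for index, element in enumerate(content):
        if isinstance(element, Form):
            # Recurse so images nested inside existing forms are found as well.
            wrapped += place_text_under(element.content, recognize)
        elif isinstance(element, Image):
            recognized = Text(recognize(element))
            # Text comes first (drawn underneath), the untouched image second.
            content[index] = Form([recognized, element])
            wrapped += 1
    return wrapped
```

Because the original image is kept inside the new form and drawn last, the page looks the same as before, while the text behind it becomes available to search and extraction.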

As an added benefit, the .NET and Java interfaces currently support Dutch, English, French, German, Italian, Portuguese, and Spanish, with Chinese, Japanese, and Korean to be added shortly. Try it out yourself by requesting a free evaluation, and feel free to take a look at our full sample code for Java and .NET (which includes how to start this process from an image rather than a PDF) under the OpticalCharacterRecognition section inside Sample_Source.

↧

Sample of the Week: Fixing Blank AcroForms Using APDFL

July 23, 2019, 9:30 am

Every so often, we run into a situation where a user is printing an AcroForm PDF with APDFL, and the print output is blank. The information is there when the document is viewed in or printed from Acrobat, but the same document is blank when printed or viewed from APDFL. Oftentimes, it’s reported as a bug when it’s not actually a bug.

What’s often happening in those cases is that the form fields’ values were updated by an automated process, but without updating the appearance of the widgets associated with those form fields. Instead, the ‘NeedAppearances’ flag is set. This flag tells Acrobat to regenerate the appearances of those widgets from the properties of the field and/or widget annotation. Acrobat looks for this flag when it opens a document and regenerates those appearances automatically.

Luckily, we can use APDFL to approximate Acrobat’s process of regenerating Normal Appearance Streams for Text Widget Annotations. In this case, we are going to use our .NET interface rather than the APDFL C interface directly.

The first step will be to populate a font lookup table with the names of the PDF Standard-14 fonts:
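As a language-neutral illustration (in Python, not the .NET sample itself), such a table can be as simple as a set of the fourteen standard font names defined by the PDF specification. The helper name `is_standard_14` is our own invention for this sketch.

```python
# The PDF Standard-14 font names, as listed in the PDF specification.
STANDARD_14_FONTS = frozenset({
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Symbol", "ZapfDingbats",
})


def is_standard_14(font_name: str) -> bool:
    """Return True if the name (with or without a leading '/') is a Standard-14 font."""
    return font_name.lstrip("/") in STANDARD_14_FONTS
```

A real implementation would likely map each name to a font object rather than just testing membership; the set above only captures the names.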

Next, we are going to open the file and determine if it contains an AcroForm, and if it contains a Calculation Order entry. If it does contain such an entry, we are not going to process this document. This entry implies that some field values are calculated from other field values via a JavaScript function. Fortunately, most of the time, we won’t see this entry and all field values are final.
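The decision above can be sketched with plain dictionaries standing in for PDF dictionaries. This is a Python illustration under stated assumptions, not APDFL code: `can_regenerate_appearances` is a hypothetical helper, and the catalog is assumed to have been read into nested dicts keyed by PDF names (`AcroForm` for the form dictionary, `CO` for the Calculation Order array).

```python
from typing import Any, Dict


def can_regenerate_appearances(catalog: Dict[str, Any]) -> bool:
    """Decide whether a document is a candidate for appearance regeneration."""
    acro_form = catalog.get("AcroForm")
    if acro_form is None:
        # No AcroForm at all: there are no form fields to fix.
        return False
    if "CO" in acro_form:
        # A Calculation Order entry implies calculated fields; skip the file.
        return False
    return True
```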

Next, we are going to check each annotation of the document to see if it might be a Widget annotation for a Text Field. A Widget Annotation often shares the same dictionary as its Text Field; however, a field could be displayed in multiple places in the document. In that case, the text field would have more than one Widget Annotation associated with it. Since we are approaching the text field from the Annotation side, we need to check whether the annotation is itself the Text Field or whether its parent is the Text Field. Then we regenerate the Normal Appearance Stream.
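Here is a minimal Python sketch of that check, again using plain dicts in place of PDF dictionaries rather than the APDFL object model. The helper names `inherited` and `is_text_widget` are hypothetical. The sketch walks the whole `Parent` chain instead of looking only one level up, which is a small generalization of the check described above.

```python
from typing import Any, Dict, Optional


def inherited(node: Optional[Dict[str, Any]], key: str) -> Any:
    """Look up `key` on the annotation, then on each ancestor via 'Parent'."""
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return None


def is_text_widget(annot: Dict[str, Any]) -> bool:
    """True for a Widget annotation whose field type (FT) resolves to Tx."""
    return annot.get("Subtype") == "Widget" and inherited(annot, "FT") == "Tx"
```

The same `inherited` lookup works for other entries that may live on either the annotation or its parent field, such as the DA string.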

Generating the Appearance Stream

When generating the Appearance Stream, the first thing we need to find is the DA (default appearance) string, which contains initialization options such as the font, font size, and color. Again, for this option, we will need to check both the Annotation itself and its Text Field parent.

The DA string contains options which are formatted like PDF instructions. That makes it easy to parse if you have a PDF instruction parser on hand. If not, DA strings are generally short enough that brute-force string-parsing also works.
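As an example of the brute-force approach, a typical DA string looks like `/Helv 12 Tf 0 g`: a font name and size followed by the `Tf` operator, then a color set with `g` (gray), `rg` (RGB), or `k` (CMYK). The Python sketch below handles only those operators; `parse_da` is a hypothetical helper, not part of any library.

```python
from typing import Any, Dict, List

# Number of operands each supported color operator takes.
COLOR_OPERANDS = {"g": 1, "rg": 3, "k": 4}


def parse_da(da: str) -> Dict[str, Any]:
    """Extract font name, font size, and fill color from a DA string."""
    result: Dict[str, Any] = {"font": None, "size": None, "color": None}
    operands: List[str] = []
    for token in da.split():
        if token == "Tf" and len(operands) >= 2:
            # Operands are: /FontName size
            result["font"] = operands[-2].lstrip("/")
            result["size"] = float(operands[-1])
            operands = []
        elif token in COLOR_OPERANDS and len(operands) >= COLOR_OPERANDS[token]:
            count = COLOR_OPERANDS[token]
            result["color"] = tuple(float(v) for v in operands[-count:])
            operands = []
        else:
            # Anything else is treated as an operand for a later operator.
            operands.append(token)
    return result
```

For `/Helv 12 Tf 0 g` this yields the font `Helv`, size `12.0`, and a one-component gray color of `(0.0,)`.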

After that, we are going to use the information gleaned from the DA string to initialize things. We’re going to generate TextRuns and add them to a Text Element. The Text Element will be added to a Group Element, the Group Element added to a Container, and the Container added to a Form Element, which will be returned. The PDF instructions generated from nesting the Text Element inside the Group Element are slightly different than if the Text Element is added directly to the Container. The end result is an Appearance Stream closer to how Acrobat generates the PDF instructions for this same Widget Annotation. I did take one shortcut here, where I make no attempt to line-wrap the text to fit the box; it’s not often needed for forms.
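To give a feel for the instructions that end up in such an Appearance Stream, here is a Python sketch that writes the content directly as text instead of building it through Text, Group, Container, and Form elements. It follows the general shape the PDF specification shows for variable-text fields (a `/Tx BMC` ... `EMC` marked-content section around a text object). The function names and the default text offset are assumptions made for illustration, and like the sample described above it makes no attempt to line-wrap.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def text_appearance_stream(value: str, font: str, size: float,
                           gray: float = 0.0,
                           x: float = 2.0, y: float = 2.0) -> str:
    """Build single-line appearance content for a text widget (no wrapping)."""
    lines = [
        "/Tx BMC",                             # begin marked content for the field
        "q",                                   # save graphics state
        "BT",                                  # begin text object
        f"{gray:g} g",                         # fill color (gray) from the DA string
        f"/{font} {size:g} Tf",                # font and size from the DA string
        f"{x:g} {y:g} Td",                     # assumed offset inside the widget box
        f"({escape_pdf_string(value)}) Tj",    # show the field value
        "ET",
        "Q",
        "EMC",
    ]
    return "\n".join(lines)
```

The font name used in `Tf` must also appear in the form's font resources, and this sketch assumes the value contains only characters the chosen font encoding can represent.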

One last fix-up to the Form dictionary, and we are all done regenerating the appearance for this annotation:

Results

So let’s take a look at how this handles an example form to which I added values to the text fields. Acrobat renders the form with those values visible.

This same file displayed in our DisplayPDF sample app does not display these values.

But if I run that file through the AddTextWidgetAppearances sample app and take a look at the output file in DisplayPDF, then the field values look pretty close to how Acrobat rendered them.

I hope you found this post helpful! Feel free to leave us a comment with any questions or tips of your own.

↧

How to Optimize your PDFs for a More Streamlined Workflow – Part One

August 7, 2019, 9:42 am

Are you looking for ways to better optimize your PDF files to make them more streamlined for your workflow? In this two-part post, we’ll help you go from ‘bloated’ PDFs (and we’ll get into what we mean by ‘bloated’) to what we like to call ‘leaner, cleaner’ PDFs.

First, it’s important to understand some of the common potential problems that can occur with PDF documents. PDF is a vast, feature-rich format that nearly everyone is familiar with, but not everyone realizes the complexity behind it. Today, hundreds if not thousands of different PDF processors exist. Some have been written from scratch and some have been spun off from others, in a variety of programming languages. On top of that, a PDF file can contain images, fonts, forms, embedded thumbnails, annotations, metadata, and more. It’s easy to see how working with PDF across different products can be challenging.

Let’s look at an example of how a PDF document can be problematic:

Imagine you have a 100,000-page document with millions of images and forms throughout it. Let’s say you take your favorite PDF software and extract 10 pages out of it to create a new document. Depending on the software you’re using, all of those millions of images and forms may have ‘come along for the ride’ when creating the extracted document. So even though those 10 pages may use only a small fraction of these images and forms, they all exist in the extracted document. This is what we mean by a ‘bloated PDF.’
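One way to picture the problem is as a reachability question: starting from the pages that were kept, which objects can actually be reached by following references? Everything else is dead weight. The Python sketch below models a file as a map from object number to the object numbers it references. That model and the function names are simplifications for illustration only, not how any particular optimizer works.

```python
from typing import Dict, Iterable, List, Set


def reachable_objects(references: Dict[int, List[int]],
                      roots: Iterable[int]) -> Set[int]:
    """Return every object reachable from the root objects (e.g. kept pages)."""
    seen: Set[int] = set()
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if obj in seen:
            continue
        seen.add(obj)
        stack.extend(references.get(obj, []))
    return seen


def unused_objects(references: Dict[int, List[int]],
                   roots: Iterable[int]) -> Set[int]:
    """Objects present in the file that no kept page refers to."""
    return set(references) - reachable_objects(references, roots)
```

In the extraction example, the roots would be the 10 extracted pages, and `unused_objects` would report the images and forms that merely came along for the ride.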

The common problems with so-called ‘bloated’ PDFs are:

They can slow down processing speed of PDF Viewers
They can significantly impact workflows such as PDF conversion, printing, editing, or document processing
They can cause problems like hanging or crashing

For example, your PDF software may appear to hang while processing a bloated PDF, even though it’s not a true hang. Perhaps after 7 hours the processing will be complete, and things will return to normal. But in the real world, no user is going to spend 7 hours waiting for your software to finish running to continue processing documents. This leads to frustrated end users filing bug reports against your software, something like “Your software creates PDFs that cause all of my customers to experience crashes when trying to open them, your software stinks.” Harsh, but that’s the reality.

PDF Optimization Tools: What to Know

Be careful with the many so-called PDF optimization tools in the marketplace, because they don’t all behave well with PDFs. Some will create PDF documents that are simply corrupt and will refuse to open in your PDF Viewer. Some, despite the name of the product, produce output that is larger than the input PDF. Some miss opportunities: a bloated PDF may suffer from easily correctable issues, but for whatever reason (time, resources, understanding, etc.) the authors didn’t implement fixes for them, so the output doesn’t get reduced in size at all. You can also get inaccurate or incorrect output, such as mishandled complex ColorSpace representations in a PDF. Sometimes content that the software can’t understand is simply dropped from the page and lost forever. JBIG2 compression is one example: it is typically handled by commercial software, and support is not widespread.

As an example of inaccurate output, let’s say you are in the healthcare space and you have patients who have to sign a form authorizing the release of information. The patient signs, the form is saved as a PDF, and the signature is saved as an image on the page. Let’s say you pass such medical release forms through some PDF optimization software that obliterates the image so that the signature is no longer readable. In the name of saving space, the optimization software wrecked the image, and all a person can make out is little scratch marks from what used to be a signature. Now you’re in a pickle: if you’re faced with a lawsuit and need to show the court that you had authorization from the patient, but the judge can’t read the signature and it doesn’t look like a real person’s, you’re really stuck. So, it’s easy to see how fidelity is important when dealing with PDF optimization.

When it comes to PDF and PDF optimization, the do-it-yourself (DIY) route is not the best option. As we mentioned before, the format is quite complex and support among vendors varies dramatically. It’s estimated that writing just a subset of a good PDF optimizer would take several years of effort. The full set of what a good PDF optimizer should do would take decades of effort. No company today that’s looking to be profitable can sacrifice that much time on what may only be one aspect of their business.

Now that we’ve talked about the problems of bloated PDFs, what can you do to achieve a leaner, cleaner PDF? Check out part two of this post to find out!

↧

PDF/A Parts 2 and 3 and ZUGFeRD Support

September 10, 2019, 7:11 am

PDF/A has always been an important part of document management, and the Adobe PDF Library offers support for creating PDF/A documents that can adhere to Part 2 and Part 3 of the standard. This means you can create a Part 1, Part 2, or Part 3 PDF/A-compliant document. Specifically, we have added support for Levels B and U, and because of this, users can now create ZUGFeRD 2.0 compliant documents, which are based on Part 3 of the PDF/A standard. This update also includes fixes and improvements to PDF/A conversion in general.

In case you’re not familiar with it already, PDF/A itself is a long-term archival standard (hence the ‘A’) for preservation of documents. The underlying theme of this standard is that a document is self-contained, with all of the resources it needs to display its contents. This means it can’t rely on external resources for its PDF content. What you get is a consistent, expected presentation of a document: even 100 years from now, when software has changed dramatically, you can count on your document to be viewed in a predictable way. The specification for Part 1 was released about 15 years ago, but adoption was fairly slow in those first few years. However, in recent years, adoption has become widespread. Acceptance has been most prevalent in the European Union, so much so that many governments and municipalities have now made it a requirement to be used over the regular PDF format.

A simple Google search will reveal that law firms, government agencies, and court systems have dedicated instructions for how to convert PDF documents to be PDF/A compliant. Instructions typically walk users through the steps in Adobe Acrobat to do the conversion. For a one-off exercise or small scale conversions this certainly works, but these manual steps are not practical for bulk processing hundreds, thousands, or potentially millions of documents in larger use cases. As a company, you don’t want to have to hire people to literally press buttons and click through the conversion process when you can simply design software to do this automatically.

You may be wondering what’s behind the evolution of the PDF/A standard. Part 1 was based on PDF v1.4, which was older at the time but was widespread among PDF vendors. Things introduced since v1.4, such as transparency, for example, are not allowed — it’s believed this led to the standard’s slow adoption. Part 1 specifically prohibits attachments in the ‘spirit’ of being an archival standard, so the PDF is not dependent on external software being used to open the attachment. But many real-world users found this made it impractical for documents that needed associated files in order for the document to make sense and be useful.

Parts 2 and 3 are based on v1.7 of the PDF standard, so features that were not allowed in Part 1 are now legal in Parts 2 and 3, such as JPEG2000 compression, attachments, transparency, and more. A new level of compliance was also added, known as Level U.

As a primer on the different levels, we’ll start with Level A. The ‘A’ stands for ‘Accessibility’ or ‘All,’ and it meets all requirements of the standard. This includes those with regard to Accessibility by including structure information (tagging). However, conversion of a non-structured PDF to have structure information can’t be done automatically. This has led to confusion among users with little background in structural information and is also another suspected reason for slow adoption of the standard by users.

Level B stands for ‘Basic’ (Visual) support and only includes requirements for reliable visual reproduction of the document. This has been the most popular choice among PDF/A users. Parts 2 and 3 introduce a new Level U, which stands for ‘Unicode.’ This level is similar to ‘A’ but doesn’t include logical structure information. It requires Unicode equivalents of text to be present and was designed to get past the difficulties of achieving Level A compliance while including more than just the visual representation that you get with Level B.

For Part 2, there is an additional requirement that all attachments must be PDF/A compliant (Part 1 or 2). For Part 3, any type of attachment is allowed, as long as a relationship between the attachment and document content is specified. This loosened requirement for Part 3 has not been without controversy, as it tends to go against the original spirit of being a completely self-contained document that doesn’t rely on anything external. However, it was driven by the real-world desire to include important non-PDF associated files and maintain the originating data formats behind certain PDF documents.
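The rules described in the last few paragraphs can be summarized as a small lookup, shown here as a Python sketch. It encodes only what this post states about levels and attachments; the function names are hypothetical, and it is not a conformance checker.

```python
# Conformance levels available in each part, as described above.
CONFORMANCE_LEVELS = {
    1: {"A", "B"},
    2: {"A", "B", "U"},
    3: {"A", "B", "U"},
}


def is_valid_conformance(part: int, level: str) -> bool:
    """True if the part/level combination exists, e.g. (3, 'U') for PDF/A-3u."""
    return level.upper() in CONFORMANCE_LEVELS.get(part, set())


def attachment_allowed(part: int, attachment_is_pdfa: bool = False,
                       relationship_specified: bool = False) -> bool:
    """Apply the per-part attachment rules summarized in this post."""
    if part == 1:
        return False                    # Part 1 prohibits attachments
    if part == 2:
        return attachment_is_pdfa       # attachments must themselves be PDF/A
    if part == 3:
        return relationship_specified   # any type, if its relationship is given
    raise ValueError(f"unknown PDF/A part: {part}")
```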

ZUGFeRD (a pun on ‘Zugpferd,’ German for ‘draft horse’) is a new invoice standard which is based on PDF/A-3 plus XML data. It’s similarly experiencing its own surge in interest, and there is a big push for governments to use it. In Germany, there will soon be expanded requirements for documents to comply with this standard. This interest is expected to expand to other markets, including the United States. At Datalogics, we know the ability to convert a PDF to be a ZUGFeRD document is highly desired, which is why we added a dedicated C++, C#, and Java sample in the PDF Library to demonstrate its usage. Our sample program illustrates how to easily convert a PDF document to be PDF/A-3, how to add the ZUGFeRD XML invoice as an attachment to the document, and how to add the metadata entries unique to ZUGFeRD and the required extension schema, which are not part of the PDF/A-3 standard itself.

Below is a chart comparing the PDF/A file types.

With all of these changes, you now have much more flexible PDF/A conversion options for all of your PDF document conversion needs. We invite you to download the latest version of the Adobe PDF Library, which includes extended PDF/A support and support for creating ZUGFeRD documents!

↧

Redact, Extract, Optimize – 3 Solutions to Help You Achieve More with your PDFs

October 3, 2019, 7:24 am

We all want our PDFs to work smarter and help us achieve our business goals. But in order to do that, it’s important to understand what solutions we’re able to tap into using PDF’s many capabilities. Redaction, extraction, and optimization are a few of the most important capabilities that can help us achieve an array of business solutions. Let’s take a look at how each of these can help us achieve more with our PDF documents.

Redaction

Redaction has been a major topic in the news this year, and you’ve probably heard a lot about it lately. By definition, redaction is the process of removing sensitive or classified information (such as names, social security numbers, phone numbers, etc.) from a document prior to its publication. But did you know that the majority of redaction is done improperly, which can subject those documents to a number of security issues?

According to the nonprofit consumer organization Privacy Rights Clearinghouse, a total of 227,052,199 individual records containing sensitive personal information were involved in security breaches in the United States between January 2005 and May 2008, due to improper redaction.

Improperly redacted documents can put you at risk for potential litigation, especially if the information in the document is exposed in a security breach, and such breaches are becoming more and more common these days. Many tools that claim to redact information actually just put black bars over the text, which only hides the sensitive data and does not remove it. To achieve true and correct redaction, the underlying data must be sanitized, or removed fully from the document.
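The difference between hiding and removing can be shown with a toy page model in Python. The `TextRun` class, the box layout, and the function names are hypothetical and greatly simplified. A real redaction tool also has to deal with partially covered text, images, and metadata, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


@dataclass
class TextRun:
    text: str
    box: Box


def overlaps(a: Box, b: Box) -> bool:
    """True if two rectangles share any area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cover_with_black_bar(bars: List[Box], area: Box) -> None:
    """'Fake' redaction: draws a bar on top, the text underneath is untouched."""
    bars.append(area)


def redact(runs: List[TextRun], area: Box) -> List[TextRun]:
    """True redaction: drop every text run that falls under the redacted area."""
    return [run for run in runs if not overlaps(run.box, area)]


def extract_text(runs: List[TextRun]) -> str:
    """What copy and paste or a text extraction tool would recover."""
    return " ".join(run.text for run in runs)
```

With a page holding `SSN:` and `123-45-6789`, covering the number with a bar leaves `extract_text` returning both runs, while `redact` leaves only `SSN:`.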

Here’s an example of how a document was not fully or correctly redacted…the information is simply blocked out but not removed:

To ensure that your documents are redacted fully and properly, make sure you’re using an advanced and reliable PDF redaction tool. Such a tool can be found in our PDF Java Toolkit or in the Adobe PDF Library.

Extraction

PDFs, by nature, are designed to be viewed consistently across many different platforms and devices. That’s great. However, with over 73 million PDFs saved each day and 2.3 trillion PDFs created each year, there’s a whole lot of information within these PDF files that is not easily accessible in other formats. PDF data extraction allows you to transform the data within PDFs, such as tables and images, into XML and HTML formats so you’re able to access the information you need.

Elements such as varied table formats can be especially challenging for some extraction tools to process effectively. If your solution can’t achieve accurate data correlation from tables, it probably doesn’t offer full OCR integration. OCR offers unique dual processing of text and images and addresses them separately. This will ensure you maintain the text within PDFs as pure text output while also implementing image processing.

With data extraction, you should also keep in mind the need for multilingual support. PDF documents that contain multiple languages are a challenge for many tools to process, so it’s important that you choose an extraction tool like PDF Alchemist that can handle documents with multiple languages.

Here’s an example of PDF data extraction into different formats:

Optimization

37% of B2B content created each year consists of eBooks, white papers, and case studies, where PDF is the ideal format to use. If it takes more than 4 seconds to load a PDF, people are not going to read it. If your documents are loading too slowly, it likely means they’re not optimized, and PDF optimization is very important if you expect users to consume your content. Unoptimized PDFs often result in slow processing speeds with PDF viewers and can have a really negative impact on your workflows.

There are a lot of PDF optimization tools out there, but not all of them are created equal. A bad tool can cause document corruption, mishandled color, and inaccurate output, just to name a few problems. Make sure you choose a tool that can tackle the following: color variations, file size, overprint issues, transparency inconsistencies, indexed color space, and unsupported image types.

Our PDF Optimizer tool is a great solution that can handle every aspect of PDF optimization. See how it helped our customer MTW Solutions alleviate their PDF issues while streamlining their workflow in this success story.

Example of PDF Optimizer’s file size reduction capability:

These are just a few examples of why redaction, extraction, and optimization can be critical for getting the most out of your PDF documents and how our tools can help. If you’re interested in learning more, please contact us or visit our product page, and keep an eye out for more information about the benefits of R-E-O from us in the future!

↧

The ZUGFeRD Standard – What it is, how it affects you, and how we can help

November 18, 2019, 9:58 am

What is the ZUGFeRD Standard?

ZUGFeRD is an e-invoicing standard that has been embraced by German-speaking countries within the European Union (EU). The standard came about after the EU acknowledged the need for an e-invoicing format standard. The “Forum Elektronische Rechnung Deutschland” (FeRD) is the organization that introduced the ZUGFeRD 1.0 specification, which has been available since June 25th, 2014. FeRD has subsequently released ZUGFeRD 2.0, an invoice format designed to meet the needs of companies for an electronic bill: it is available free of charge and is compatible with the European standard EN 16931. ZUGFeRD 2.0 can be used for domestic and international billing and is equally usable by both small and large companies. Creating an invoice in PDF format is simple; the invoice can be evaluated either by an administrator reading the PDF’s visual representation or automatically through the embedded XML file. The standard has several profiles in order to fulfill special requirements for the contents of the invoice.

Why was it introduced?

According to FeRD, there are about 32 billion invoices exchanged annually in Germany, but adoption of electronic invoices is only in the single-figure percentage range. In order to enable small and medium-sized enterprises to benefit from the advantages of e-invoicing, the German Forum on electronic Invoicing (“Forum elektronische Rechnung Deutschland” – FeRD) developed ZUGFeRD – the “Central User Guide of the Forum for Electronic Invoicing in Germany” – as a uniform data format. ZUGFeRD is targeted to meet the digitalization requirements of the public sector and will help facilitate e-invoicing in the business-to-business (B2B) and business-to-government (B2G) sectors as required by the EU. Having a uniform data format will help simplify the exchange of structured electronic invoices. The new ZUGFeRD version 2.0 adheres to the requirements of this new EU standard and is based on the PDF/A ISO standard, which sets strict parameters around long-term archiving.

How will it affect you?

The German government’s e-Bill law (published in April 2017) mandates the receipt and processing of e-invoicing for all federal contracting authorities, regardless of the amount of the invoice. The e-Bill law also provides specific dates for the implementation of electronic invoicing: from November 27th, 2018, the top federal state authorities must be able to receive and process electronic invoices and, from November 27th, 2019, all subordinate institutions and sectorial contracting authorities on the federal level.

How can Datalogics help?

The Datalogics Adobe PDF Library supports both Part 2 and Part 3 of the PDF/A standard. This allows users to create a Part 1, Part 2, or Part 3 PDF/A-compliant document. We have also added support for Levels B and U, and as a result, users can now create ZUGFeRD 2.0 compliant documents, which are based on Part 3 of the PDF/A standard.

Reach out to tech_support@datalogics.com and we will be happy to explain and guide you through a PDF/A implementation that aligns with the ZUGFeRD standard. We also offer free trials for our products – visit our products page to request your free evaluation today!

↧

Form with Function, Part One: Working with Digital Forms

January 2, 2020, 8:49 am

We don’t need to persuade you of the value of creating digital forms and surveys over paper forms handed out on clipboards. Paper is so 20th century.

You see online surveys on your browser from not-for-profits and airlines and political campaigns and banks every day, and pop-ups asking to improve their customer experience before you click the big “X” in the upper right corner. You probably file your tax returns using TurboTax, a practice the Internal Revenue Service has been encouraging for 25 years to make filings faster and more accurate. Since 2014, over 90% of IRS forms have been completed electronically each year. When you rent a car, the agent checks you in or out with a touch-sensitive tablet and asks you to initial the form with your finger on the screen. The Federal Electronic Signatures in Global and National Commerce Act (ESIGN) of 2000 was designed to help replace paper business forms and documents with digital files by making electronic signatures legal and enforceable.

Types of Digital Forms

Two basic varieties of digital forms are available: HTML and PDF.

HTML forms are common and work well in browsers and on mobile devices. Services like SurveyMonkey are great, making it easy for individuals and businesses to create and distribute high-quality forms. But the free Basic plan for SurveyMonkey limits you to 10 questions and permits no more than 100 responses per survey, and an individual user license costs a lot more per month than a comparable license for Adobe Acrobat. As an alternative, you could create your own browser-based survey. That allows you to manage the design and format of your survey and keep complete control over the data returned by your customers, clients, employees, or prospects. But building your own HTML form requires web developer skills and a detailed knowledge of HTML formatting.

PDF form documents are another way to create digital surveys. They cost next to nothing and are easy to create. No technical skills are required, just a basic understanding of document design and Adobe Acrobat. You get to make all of the decisions related to the format, length, and content of your PDF form document. You can distribute and process as many PDF forms as you like, and all of the data you receive remains fully under your control.

Introducing Forms Extension for the Adobe PDF Library (APDFL)

Now, with the Adobe PDF Library (APDFL) and the new Forms Extension for APDFL (to be released soon), both provided by Datalogics, your ability to work with PDF forms expands dramatically.

The Datalogics Forms Extension for APDFL is a software module that allows applications built using the Adobe PDF Library to work with PDF AcroForm and XFA forms documents. When you install the Forms Extension, it becomes a seamless part of the Adobe PDF Library, and provides the APDFL user with all of the functions available in Adobe Acrobat for working with PDF forms documents. Please note that you must purchase or already own APDFL to get the extension.

That means that you can integrate all of the PDF forms features offered in Adobe Acrobat into your own software. With Forms Extension for APDFL, you can quickly process PDF forms in large volumes. Unlike working with Acrobat, you aren’t limited to manipulating one document at a time. Moreover, with Forms Extension and APDFL, you will find it easy to export the data from your PDF form documents to XML or other storage file formats. After you collect the responses to hundreds or thousands of forms or surveys, you can take the data from these form documents and import it into a database, where you can analyze and manipulate that data however you like.
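As a rough illustration of that last step, the sketch below loads exported form data into a SQLite table using only the Python standard library. The XML layout (one `<response>` element per returned form, with one child element per field), the field names, and the `load_responses` helper are all assumptions made for this example. They are not the export format produced by Forms Extension, so the parsing would need to be adjusted to match the files you actually export.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical export: one <response> per returned form, one child per field.
SAMPLE_EXPORT = """
<responses>
  <response><name>Ada</name><satisfaction>5</satisfaction></response>
  <response><name>Grace</name><satisfaction>4</satisfaction></response>
</responses>
"""


def load_responses(xml_text, conn):
    """Insert every <response> element into the survey table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS survey (name TEXT, satisfaction INTEGER)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (r.findtext("name"), int(r.findtext("satisfaction")))
        for r in root.iter("response")
    ]
    conn.executemany("INSERT INTO survey VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
count = load_responses(SAMPLE_EXPORT, conn)
average = conn.execute("SELECT AVG(satisfaction) FROM survey").fetchone()[0]
print(count, average)
```

Once the responses are in a table, ordinary SQL queries such as the average above can be run across every form collected.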

Using Forms Extension for APDFL

Here are some of the features offered with Forms Extension for APDFL and the Adobe PDF Library:

Import data into dynamic and static XFA and AcroForm documents
Export data from these form documents to other files
Flatten dynamic and static XFA and AcroForm documents into regular PDF documents, including generating appearances for bar codes. When you flatten a PDF document, you are removing interactive elements, like form fields, and converting the form and its content into static page content.
Convert dynamic and static XFA forms to AcroForm

We are excited to show you the Forms Extension for APDFL and talk to you about how you can use this new asset within the Adobe PDF Library to expand your ability to efficiently collect and manage information from your customers, prospects, employees, students, and partners. Stay tuned for part two of this post, where we’ll go into more detail about using Forms Extension with XFA and AcroForms. Have questions or want to learn more about Forms Extension? Contact us today.

The post Form with Function, Part One: Working with Digital Forms appeared first on Datalogics Blog.

Introducing the Forms Extension for Adobe PDF Library

February 10, 2020, 12:54 pm

Datalogics is very proud to announce the release of the Forms Extension for the Adobe PDF Library (APDFL)! This exciting new add-on expands the PDF Library’s processing capabilities for XFA (static and dynamic) and AcroForm forms, notably allowing XFA documents to be opened and converted to more widely compatible non-XFA PDF content that can be rendered, viewed, modified, extracted, and printed. The full set of enhanced forms features includes:

Rendering XFA and AcroForm form documents
Converting XFA forms to AcroForm
Importing data into XFA and AcroForm documents
Exporting data from these form documents to files
Flattening XFA and AcroForm documents into regular PDF documents, including generating appearances for bar codes and rendering them as bitmaps

The Forms Extension requires the Adobe PDF Library but is licensed separately, and it is currently available on Windows 32-bit for C++, .NET, and Java.

Why is this so exciting, you ask? Let’s dive into the backstory for more context.

The History of XFA Forms

XFA was developed by the company JetForm for dynamic forms entry, similar to HTML forms. Adobe purchased the company and integrated its technology into products like InDesign and Acrobat. XFA is woven into PDF files via file formats known as Static and Dynamic XFA PDF documents.

Static XFA documents are a hybrid of older AcroForm fields and XFA fields, which can’t dynamically respond to user entry so they’re considered ‘static.’ However, these can be especially problematic, because they contain two representations of the forms data. The idea was that AcroForm-aware viewers (which are very common in the PDF space) can view the PDF and display the AcroForm fields contained in it. Similarly, XFA-aware viewers can view the PDF and display the XFA fields contained in it. Data between the two representations can become out of sync, and viewers that recognize both are challenged with deciding which data to prioritize. This has been a big factor in limiting XFA’s widespread adoption.

Dynamic XFA documents, in contrast, can dynamically change their fields based on user entry. They are a completely new invention and don’t actually contain meaningful PDF content at all! They typically carry only a one-page placeholder, an image or text message stating that the current viewer is unsupported and that the user should upgrade to the latest version of Adobe Acrobat. The biggest problem with these types of documents is that most PDF technology in the marketplace can, at best, display only the placeholder message. This has led to confusion among users and has hindered Dynamic XFA’s widespread success.

While many have made use of XFA over the years, its drawbacks have outweighed its useful features, and it has been deprecated in the recently released PDF 2.0 specification. In the short term, XFA document workflows may continue to be used for their dynamic capabilities, and the millions, or perhaps billions, of XFA documents in existence must continue to be accessible. In the longer term, companies are looking for sustainable forms processing solutions.

How Forms Extension for APDFL Works

With the Forms Extension from Datalogics, you can now open XFA documents reliably, and you can flatten the XFA content into regular PDF content. This can mean the difference between having legacy XFA forms documents that were previously only usable in Adobe Acrobat and a small set of other viewing tools and having the power to transform them into regular PDF content that any PDF processor can access. This is an amazingly useful feature given the previously mentioned drawbacks of XFA. This also offers a strategy for baking in form data when you don’t want content to be editable.

In addition to flattening, we also now offer the ability to convert the XFA fields to AcroForm fields. This is significant for users who want to maintain the ability to interact with their PDF forms but are looking for a format that is compatible with existing PDF software and the latest PDF standard.

Finally, we have added the ability to import and export data from XFA documents and AcroForm documents. This supports populating blank form fields from an external data source and exporting form data to an external source – a crucial component of automated forms processing.

Whether you aim to work with XFA forms or AcroForms or convert your XFA forms to AcroForms, we are excited to announce that the Forms Extension for the Adobe PDF Library supports all of your PDF forms processing needs! The Forms Extension is now available for a free evaluation period, so download it today to get started!

The post Introducing the Forms Extension for Adobe PDF Library appeared first on Datalogics Blog.

Engineering Perspective: Why you Should Upgrade to Adobe PDF Library v18

March 23, 2020, 10:39 am

It’s hard to believe, but Datalogics’ version of the Adobe PDF Library version 15.0.4 was first released nearly three years ago. Since then, we’ve made hundreds of fixes and enhancements, mostly driven by the issues that our customers brought to us to address. Likewise, Adobe has done its own fixes and enhancements to APDFL.

Melding these often-divergent efforts gives our customers the best of both worlds, thanks to the deliberate, test-driven approach that Engineering took with the APDFL v18 port, APDFL v18+P1b.

Over many years, Datalogics’ APDFL team has accumulated a few thousand tests of unique cases, covering APDFL itself across a number of platforms, our own Java and .NET interfaces to APDFL, and the DLI interface for PDF creation. To this point, we’ve been able to address the vast majority of those cases so that the tests now pass.

In addition, Engineering has put a lot of effort into auditing and modernizing the codebase for C++17 compliance, removing the potential for undefined behavior, and making a greater portion of the code safe for more compiler optimizations.

Engineering has overhauled APDFL v18’s build system. This mostly won’t be visible to customers, but it should make it quick and easy to update third-party components should they ever need updating (for security reasons, for example). APDFL v18’s Unicode components are now up to date, and there are no plans to update their equivalents in APDFL v15.0.4.

Because we are now at the stage where fixes for v15.0.4 are added to v18 almost in parallel, the transition from version 15.0.4 to 18 should be nearly seamless.

If you are considering upgrading to version 18 from 15.0.1 or earlier, you may find that quite a bit has changed. For example:

Our added PDF Optimizer functionality
Enhancements to allow APDFL to convert files to PDF/A-2 and PDF/A-3
APDFL’s unlocked ability to process/render truly large images
The ability to use OCR on PDF images from the .NET or Java interfaces

These are just a few of the new enhancements offered in v18. In short, upgrading now should be well worth the effort, and we highly recommend it.

Want to learn more about the enhancements APDFL v18 will bring to your PDF creation and management process? Contact us today!

The post Engineering Perspective: Why you Should Upgrade to Adobe PDF Library v18 appeared first on Datalogics Blog.

Forms Extension Launched on Windows 64-bit

April 1, 2020, 9:29 am

Datalogics is very proud to announce the newest release of Forms Extension for Adobe PDF Library, available now on Windows 64-bit for C++, .NET, and Java! While our initial launch was limited to Windows 32-bit, we’ve worked hard to overcome obstacles to bring you a comprehensive Windows solution. In case you missed our previous announcement about the recent introduction of this exciting new product, please refer to this blog post.

One of the primary reasons this is a really big deal in the PDF space is that most XFA documents are completely unintelligible to the majority of PDF software and simply can’t be processed at all. Thanks to Forms Extension, such documents can be converted to AcroForms or simply flattened to regular PDF content that any PDF software can support.

Let’s take a closer look at what we mean by “unintelligible.” Here’s a dynamic XFA document opened in Acrobat DC, which is among the very few PDF viewers able to understand the XFA data:

Here we see the text from the XFA form rendered to the Acrobat DC viewer. Now let’s look at what this dynamic XFA document looks like when you open it in a non-XFA aware PDF viewer (e.g. the majority of PDF software):

Here we see a message indicating that the PDF viewer cannot display this type of document. So why do we see something completely different? Let’s take a look at the PDF page’s content itself to find out why:

What we have here is a series of commands to display some text, such as the “Please wait…” we see displayed on the page. That’s the content of the actual PDF page, so that’s what the non-XFA aware viewer shows us. This is what is meant by a shell, stock, or placeholder page for an XFA document. It’s simply supplied so non-XFA aware viewers can show the user something rather than nothing. So then where is the “Text Field” content that Acrobat is able to display?

The reason it’s not in the page’s content is that this is a dynamic XFA document, which means all of the meaningful data is locked inside an XFA container (stream or array) at the document level. Let’s take a look at an excerpt of this data:

We see a “TextEdit” field (a user interface element that encloses a widget intended to aid in the manipulation of textual content) using the font “Myriad Pro” with the value “Text Field.” As a human, you can read that and make sense of it. It’s a form of XML known as XDP that’s designed to carry around the XFA information in a PDF. Most PDF parsing tools designed to show the page aren’t going to look at this XDP data. Even if they did look, they wouldn’t find actual Graphics or Text operators that are expected in a PDF content stream, so non-XFA aware viewers are not able to show us what’s hidden in the XDP data when opening the document.
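To make the structure of such data concrete, here is a minimal sketch that parses a small XFA-style template fragment with the Python standard library. The fragment is hypothetical: it was written to match the description above (a text-edit widget, the “Myriad Pro” typeface, and the value “Text Field”) and is not the excerpt from the sample document. The field name `TextField1`, the namespace version, and the `describe_fields` helper are likewise assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed template namespace; real documents may declare a different version.
TEMPLATE_NS = "http://www.xfa.org/schema/xfa-template/3.3/"

# Hypothetical fragment written to match the description in the text.
XDP_EXCERPT = f"""
<template xmlns="{TEMPLATE_NS}">
  <subform name="form1">
    <field name="TextField1">
      <ui><textEdit/></ui>
      <font typeface="Myriad Pro"/>
      <value><text>Text Field</text></value>
    </field>
  </subform>
</template>
"""


def describe_fields(xdp_text):
    """Return (name, widget, typeface, value) for each <field> in the fragment."""
    ns = {"t": TEMPLATE_NS}
    root = ET.fromstring(xdp_text)
    fields = []
    for field in root.iterfind(".//t:field", ns):
        ui = field.find("t:ui", ns)
        # The first child of <ui> names the widget, e.g. textEdit.
        widget = ui[0].tag.split("}")[1] if ui is not None and len(ui) else None
        font = field.find("t:font", ns)
        typeface = font.get("typeface") if font is not None else None
        value = field.findtext("t:value/t:text", default=None, namespaces=ns)
        fields.append((field.get("name"), widget, typeface, value))
    return fields


fields = describe_fields(XDP_EXCERPT)
print(fields)
```

The point of the sketch is that the field’s value lives in XML elements, not in page drawing instructions, which is why a viewer that only reads page content never encounters it.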

Enter Forms Extension for the Adobe PDF Library, which allows us to easily flatten this dynamic XFA document to a regular PDF and save the result. Now let’s examine the page’s content after flattening:

Where T1_0 is defined as:

We see that we begin a text object (BT), the RGB color for fill operations (rg) is set to black (0 0 0), the font T1_0 is chosen (Tf), we set various text state and positioning parameters (Tc, Tw, Ts, Tz, Tr, and Tm), and the string shown is “Text Field” (Tj). This is syntax that all PDF viewers should be able to recognize.
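For readers who want to experiment with this kind of syntax, the sketch below tokenizes a small content stream and groups each operator with its operands. The stream is a hypothetical reconstruction that uses the operators named above. Its operand values (font size, text matrix, and so on) are placeholders and are not the output of Forms Extension. The tokenizer is deliberately simplified and handles only names, numbers, literal strings without escapes or nesting, and alphabetic operators.

```python
import re

# Hypothetical flattened content stream; operand values are illustrative only.
CONTENT = b"""BT
0 0 0 rg
/T1_0 1 Tf
0 Tc 0 Tw 0 Ts 100 Tz 0 Tr
10 0 0 10 36 700 Tm
(Text Field)Tj
ET"""

TOKEN = re.compile(
    rb"\((?P<str>[^()\\]*)\)"      # literal string without escapes or nesting
    rb"|(?P<name>/[^\s()/]+)"      # name object such as /T1_0
    rb"|(?P<num>[-+]?\d*\.?\d+)"   # integer or real number
    rb"|(?P<op>[A-Za-z*]+)"        # operator keyword such as BT, Tf, Tj
)


def parse_operations(stream):
    """Yield (operator, operands) pairs in the order they appear."""
    operands = []
    for match in TOKEN.finditer(stream):
        kind = match.lastgroup
        text = match.group(kind).decode("latin-1")
        if kind == "op":
            # An operator consumes the operands collected before it.
            yield text, operands
            operands = []
        elif kind == "num":
            operands.append(float(text))
        else:
            operands.append(text)


ops = list(parse_operations(CONTENT))
shown = [args[0] for op, args in ops if op == "Tj"]
print(shown)
```

Because the text now sits in the page’s own content stream, even a simple scan like this can recover it, with no XFA awareness required.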

Now, when we view our file processed with Forms Extension in a non-XFA aware PDF viewer, we see:

So to recap, we took a look at how a dynamic XFA document appears both in Acrobat DC viewer and in non-XFA aware viewers. Next we explored the specification of the data in the ‘language’ of the XFA forms technology. Finally, we saw the result of flattening the XFA document into PDF syntax, which PDF software can understand. Hopefully this provides a better appreciation for the power of Forms Extension to take XFA data locked in a format that most PDF software cannot process and transform it into the universally understood PDF format.

Whether you aim to process XFA forms or AcroForms or convert your XFA forms to AcroForms, we are excited to announce that Forms Extension for the Adobe PDF Library supports all of your PDF forms processing needs! Forms Extension is now available for a free evaluation period – download it today to get started!

The post Forms Extension Launched on Windows 64-bit appeared first on Datalogics Blog.
