Channel: Adobe PDF Library – Datalogics Blog

DLE rendering support enhancement: separation channels


Datalogics has been bringing new engineers into working with DLE over the past several months, and we’re excited about the enhancements they have made to DLE in that time. One in particular is the ability to generate color separations as raster images for PDF pages. DLE has been able to generate PDF page separations as EPS files for some time; with these changes it can also render page separations to raster images. The separations can then be saved in any of the file formats that the Image object can be saved as (JPEG, PNG, TIFF) and used with all other functionality of DLE Image objects.

Acquiring PDF page raster separations is easy and is demonstrated in the DrawSeparations samples for C# and Java included with DLE, but it requires a couple of extra steps beyond rendering a PDF page. To illustrate in Java:

1) Make the list of colorants to draw: this should include the process colors (Cyan, Magenta, Yellow and Black) as the first four list elements, followed by the names of separation (spot) color channels to draw. A shortcut is to get all colorants used on the page, and make separation color channels for each of these:

List<Ink> inks = pg.listInks();
List<SeparationColorSpace> colorants = new ArrayList<SeparationColorSpace>();

for (Ink ink : inks) {
  colorants.add(new SeparationColorSpace(pg, ink));
}

2) Set up the transform desired for the page and the other page drawing parameters. Here we transform the media box of the PDF page from PDF orientation to Java image orientation, and request that the image being drawn be placed on a white background (DO_LAZY_ERASE) and for annotations that have appearance streams to be drawn (USE_ANNOT_FACES):

 double width = pg.getMediaBox().getRight() - pg.getMediaBox().getLeft();
 double height = pg.getMediaBox().getTop() - pg.getMediaBox().getBottom();
 Matrix matrix = new Matrix().scale(1, -1).translate(0, -height);
 DrawParams params = new DrawParams();
 params.setMatrix(matrix);
 params.setDestRect(new Rect(0, 0, width, height));
 params.setFlags(EnumSet.of(DrawFlags.DO_LAZY_ERASE, DrawFlags.USE_ANNOT_FACES));

3) The PDF page can be rendered to a list of DLE Image objects, or to a list of Java BufferedImage objects; the list contains one image per channel, in the order the channels were added to the list of colorants to draw:

 List<BufferedImage> separatedImages = pg.drawContents(params, colorants);

And that’s it! We hope you enjoy this new DLE feature.



Adobe Digital Rights Management Technologies – ACS vs. LCRM


When we meet with companies we get a substantial number of questions about how to protect digital content. Adobe has at least 2 offerings in the Digital Rights Management space, Adobe Content Server and Adobe LiveCycle Rights Management. These products both fall into the DRM category, but they solve very different problems for very different markets. This article will discuss the capabilities and licensing of each and what problems they are targeted at solving.

Adobe Content Server – DRM for Commercial eBooks

This product, also known as ACS, is capable of securing content in the ePub and PDF file formats. The product is sold by Adobe partners as a server license, and there are transaction charges incurred for each individual content license granted.

The target market for this offering is eBook or other digital content distributors such as Sony, Kobo, Barnes & Noble or Google. Content is licensed on a one to one basis and rights are applied at the time a specific item is purchased. More specifically, content is licensed to an individual with a specific Adobe ID or Vendor ID, and may not be consumed on reading devices that are not registered with that ID. ACS provides granular control over the rights that may be granted with each content license. For example, a book store may sell a popular title for one price with the rights to read it on multiple devices, re-download it at a future date and print a range of pages, and may also have the same book listed at a different price with more restrictive rights. Once rights are applied, generally at purchase time, these rights cannot be modified or revoked.

The following rights can be controlled with Adobe Content Server:

  • The ability to open and view content is restricted to devices or apps which are activated with the Adobe ID or Vendor ID of the individual who licensed/purchased the content.
  • The number of devices a piece of digital content can be viewed on can be controlled. The limit can be one device or up to 6.
  • The ability to print or not print licensed content, with control of the range of pages that may be printed as well
  • Content licenses may be permanent, or have an expiration date. If the expiration date is 60 days or less the transaction fees are lower. After a loan has expired the document is still physically present, but can no longer be opened.
  • The ability to copy text from licensed content can be granted or withheld.
  • The ability to limit the number of concurrent loans each book in the catalogue will allow.
  • The ability to re-download licensed content at a future date without re-purchasing can be provided or withheld.

Notable – PDF files that are rights managed with ACS cannot be viewed in the Adobe Acrobat family of products. To consume a PDF file that has been rights managed with ACS a PDF viewer must contain the Adobe Reader Mobile SDK technology. There are a multitude of such free, or for fee viewers available for every platform you could imagine, but there are no plug-ins for Acrobat or Reader to support decryption of the Adobe Content Server DRM. (sounds like a product opportunity)

Adobe LiveCycle Rights Management – DRM for Enterprise Documents

LCRM, formerly known as LiveCycle Policy Server, can apply digital rights control to multiple file formats including PDF, Microsoft Office, CAD, Adobe Illustrator, Adobe Photoshop and Flash video files. The product is sold as a server license with additional negotiated charges for each piece of digital content that is rights managed.

The target market for LCRM is enterprises or government agencies that wish to control who has access to sensitive information such as price lists, financial results or medical records. Content is rights managed on a one to one or one to many basis and rights are applied in advance of consumption. Specifically, a set of rights can be applied to a document, and to exercise these rights a user must be authenticated with their LDAP or Microsoft Active Directory credentials. Rights can be granted to individuals or groups, and can be revoked at any time. For example, rights to a financial report may be granted to anyone who is a member of the “accounting” group within an organization. If an individual is removed from the “accounting” group, or leaves the company, they may still have a copy of the document in their possession but they will no longer be able to open and view it. In this regard LCRM differs from ACS: Adobe Content Server does not phone home to validate credentials and rights after they have been granted, but LCRM does.

The following rights can be controlled with LCRM:

  • The ability to open and view content is restricted to individuals or groups of individuals who have been granted such rights.
  • The ability to print or not print managed content, with control of the range of pages that may be printed as well
  • Content licenses may be permanent, or have a predefined expiration date.
  • Content licenses can be revoked at any time. If a document’s validity date has expired it may notify the reader that a newer version of the document is available for download.
  • The ability to copy text from licensed content can be granted or withheld.
  • Content access can be audited. LCRM administrators can track who accessed documents and when, as well as determine who attempted to access a document and was denied access.
  • Content access can be allowed off-line or denied. If offline access is allowed it can be restricted to a defined amount of time before credentials must be re-validated.
  • Content access can be provided to anonymous users. For example a document may be distributed freely on the Internet, but can still have an expiration date after which it cannot be viewed or may direct the viewer to a more current version.
  • Content Rights can be applied automatically. For example a company policy may require that all documents created on its copy machines or received/sent as email attachments must have usage rights applied.

Notable – PDF files that are rights managed with LiveCycle Rights Management can be viewed in the Adobe Acrobat family of products on most platforms including recent versions of Reader for Android and iOS. LCRM also provides plug-ins for decrypting rights managed documents in Microsoft Office, Adobe Creative Suite applications and various CAD tools. LCRM protected files are not viewable within 3rd party PDF tools.

As a final thought, the Adobe Acrobat family of products also offer a level of file protection for PDF files through password protection. Password protected PDF files can prevent or allow viewing, copying, printing, or modification. Password protection can be applied within Acrobat or any of the various products that have been built with the Adobe PDF Library, and can be consumed within all of the Adobe PDF viewing products and most 3rd party PDF viewers.


PDF rendering and coordinate systems


With the PDF Library, we get a number of questions from people who are rendering PDF pages to raster devices or to raster image files. One of the trickier concepts to grasp is the translation of coordinate systems from the system used in PDF files to that used for rasterization. In this article, I’ll briefly discuss the factors involved in rendering PDF files.

Coordinate Systems

The PDF file format uses a different coordinate system – a different means of specifying locations relative to each other – than most raster image formats. In the PDF file format, increasing X values specify the rightward direction and increasing Y values specify the upward direction. That is, the point (X + 1, Y + 1) is one PDF unit above and one PDF unit to the right of the point (X, Y). This differs from most image formats, where the opposite direction for Y values is used. In most image formats, including how raster images are stored in the PDF file format, increasing Y values specify the downward direction: (X + 1, Y + 1) is one pixel to the right of and one pixel below the pixel at (X, Y). Also note the distinction between PDF points and raster pixels. It is perfectly legitimate and expected that content in a PDF content stream can be placed at non-integer points.

The following hold true for PDF files in most cases but not always:

  • The origin of the PDF coordinate system (0, 0) represents the bottom-left corner of the PDF page
  • PDF files specify 72 points to 1 physical inch

It is very important to know that these are true most of the time, but not all of the time. This means that a program that renders a PDF page needs to account for pages where either assumption fails. You must also account for any page rotation that is specified with a Rotate key for the PDF page when rendering.

The PDF API documentation makes reference to these two coordinate systems as user space and device space. User space is used to refer to the PDF page coordinate system, where points are specified in PDF units. Device space is used to refer to the coordinate system of where you are drawing to, where units are those used in your output type – typically pixels for a raster image.

Rendering parameters

There are three input parameters to the rendering APIs in the PDF Library / DLE that control how pages are rendered: the transformation matrix, the user space updateRect and the device space destRect. These parameters are used in the following way:

  • The updateRect is used to clip the PDF page to be rendered, restricting the drawing to a specific region of the PDF page. This is specified in user space (PDF) coordinates.
  • The matrix is used to scale, rotate and transform user space (PDF) coordinates to device space coordinates. Usually this has the following:
    • A scaling factor to transform user space coordinates into suitably sized device space coordinates. In a typical situation, someone who is rendering a PDF page to a 300dpi raster would specify scaling of 300/72 in the X and Y direction.
    • A rotation factor to account for rotated PDF pages. PDF pages with Rotate keys specified need to have a transformation matrix applied to cause a suitable rotating for rasterization.
    • A reflection (a negative scaling factor in Y) to flip the Y coordinates of user space, to account for the different directions that increasing Y values go in between the two coordinate systems.
    • A translation factor in the Y direction to normalize the start of the PDF page to draw in user space to the origin of the device space. This accounts for the flipping of the Y coordinates.
    • A translation factor in the X and/or Y direction to normalize the start of the PDF page to draw in user space to the user space origin (0, 0). This accounts for PDF pages where the visual contents (the CropBox) do not start at the user space origin.
  • The destRect defines the boundaries of device space to draw into. This is specified in device coordinates; typically as the number of raster pixels in the device X and Y coordinates.
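To make the pieces of the matrix concrete, here is a minimal sketch in plain Python of how the translation, Y flip and scaling might be combined. It does not use the PDF Library or DLE; the helper names (concat, apply, page_to_device) are made up for illustration, matrices are written as PDF-style six-number tuples (a, b, c, d, e, f), 72 units per inch is assumed, and page rotation is left out to keep the sketch short.

```python
def concat(m1, m2):
    """Return the matrix that applies m1 first, then m2 (PDF-style 6-tuples)."""
    a1, b1, c1, d1, e1, f1 = m1
    a2, b2, c2, d2, e2, f2 = m2
    return (
        a1 * a2 + b1 * c2,
        a1 * b2 + b1 * d2,
        c1 * a2 + d1 * c2,
        c1 * b2 + d1 * d2,
        e1 * a2 + f1 * c2 + e2,
        e1 * b2 + f1 * d2 + f2,
    )


def apply(m, x, y):
    """Transform the point (x, y) by the matrix m."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def translate(tx, ty):
    return (1.0, 0.0, 0.0, 1.0, tx, ty)


def scale(sx, sy):
    return (sx, 0.0, 0.0, sy, 0.0, 0.0)


def page_to_device(crop_box, dpi):
    """Build a user-space-to-device-space matrix and a matching dest rect.

    crop_box is (left, bottom, right, top) in user space.
    Assumes 72 user space units per inch and an unrotated page.
    """
    left, bottom, right, top = crop_box
    width, height = right - left, top - bottom
    s = dpi / 72.0
    m = translate(-left, -bottom)          # move the crop box corner to the origin
    m = concat(m, scale(1.0, -1.0))        # flip Y so that increasing Y goes down
    m = concat(m, translate(0.0, height))  # shift the flipped page back into view
    m = concat(m, scale(s, s))             # user space units -> pixels
    dest = (0, 0, int(round(width * s)), int(round(height * s)))
    return m, dest


matrix, dest_rect = page_to_device((0, 0, 612, 792), 300)
print(dest_rect)
print(apply(matrix, 0, 792))  # top-left corner of the page in device space
```

With this ordering, the top-left corner of the crop box lands at the device space origin and the bottom-right corner lands at the far corner of the dest rect, which is what most raster formats expect.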

Notes

  • Specification of the updateRect is optional; if it is not specified, no clipping of the PDF page will be carried out by the rendering call. The matrix and destRect are required.
  • PDF points that are transformed by the matrix to values outside of the destRect are not drawn; they are clipped.
  • The matrix is not required to transform user space fully into device space. It is legal to have a matrix that draws the PDF page only to part of the destRect. However, you are strongly advised to use the updateRect to restrict the drawing to the region of the PDF intended for imaging.
  • PDF pages can have content outside of their intended viewing region (the CropBox) and outside of their intended print region (the MediaBox). If you do not restrict your rendering region appropriately, then rasterizing PDF pages that have content outside of these regions will show this content. This may lead to unexpected results.
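One way to pick a rendering region is to clip the page's CropBox to its MediaBox and use the result as the updateRect. The sketch below shows only the rectangle arithmetic in plain Python; the intersect helper and the (left, bottom, right, top) tuple layout are assumptions for illustration and are not part of the PDF Library or DLE.

```python
def intersect(r1, r2):
    """Intersect two (left, bottom, right, top) rects in user space.

    Returns None when the rectangles do not overlap.
    """
    left = max(r1[0], r2[0])
    bottom = max(r1[1], r2[1])
    right = min(r1[2], r2[2])
    top = min(r1[3], r2[3])
    if left >= right or bottom >= top:
        return None
    return (left, bottom, right, top)


media_box = (0, 0, 612, 792)
crop_box = (-10, 36, 600, 800)  # deliberately spills outside the media box
update_rect = intersect(media_box, crop_box)
print(update_rect)
```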

DLE using Python


I’ve always had an appreciation for the higher level languages, the ones that make life easier, that let you code rather than worry about the housekeeping. C# is an improvement over coding in C or C++, since it relieves you of many of the burdens of tracking pointers and object ownership. You still have to compile the program before you can run it.

Scripting languages like Python give the best of both worlds. Programs don’t require compilation before being run, and in fact, you can type commands to an interactive console, just like in the old days of BASIC.

I’ve been something of a Pythonista for a long time now, and I’ve always wanted to access the PDF Library from Python. With DLE we can.

Before you go digging in the distribution to find the secret Python bindings, I’ll tell you there aren’t any. We’re going to use a little trick. There are versions of Python that run on some of the major VMs out there. One of them is IronPython, which runs on .NET, and the other is Jython, which runs on the JVM.

Both mix the ease of use of Python with direct access to the features of the underlying VM. Generally, Python and Java or .NET objects can be freely mixed, and you don’t have to really know in which language classes and objects are declared, especially from the point of view of the Python code.

For this article, I’m going to focus on Jython.

Getting started with Jython

I’ll start by presuming that you’ve installed Jython according to the installation instructions.

Change to the directory that contains your DLE installed files, and make sure that Jython can find the components of DLE.

On Macintosh, we set a few environment variables:

$ export DYLD_FRAMEWORK_PATH=$PWD
$ export DYLD_LIBRARY_PATH=$PWD
$ export JYTHONPATH=$PWD/com.datalogics.PDFL.jar

On Windows it’s sufficient to set JYTHONPATH:

> set JYTHONPATH=com.datalogics.PDFL.jar

And we start up Jython:

$ jython
*sys-package-mgr*: processing new jar, 'E:\Datalogics\APDFL10.1B1a-x64\DLE\com.datalogics.PDFL.jar'
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) 64-Bit Server VM (Sun Microsystems Inc.)] on java1.5.0_19
Type "help", "copyright", "credits" or "license" for more information.
>>>

On Mac OS X, because DLE is 32-bit only, you’ll have to make sure to invoke the 32-bit version of the JVM by using the -J-d32 option:

$ jython -J-d32
Jython 2.5.2 (Release_2_5_2:7206, Mar 2 2011, 23:12:06)
[Java HotSpot(TM) Client VM (Apple Inc.)] on java1.6.0_35
Type "help", "copyright", "credits" or "license" for more information.
>>>

Let’s import the DLE classes for ease of use:

>>> from com.datalogics import *

If you got this far, then the DLE classes are imported into your namespace. Where did that com.datalogics.PDFL module come from? From the .jar file! All the Java classes in that namespace are now imported into our Python interpreter.

It’s important to initialize and terminate the Library on the main thread, so the first thing we do is initialize:

>>> lib = PDFL.Library()

Now let’s open the sample.pdf document (in the sample data that comes with DLE), and see what kinds of attributes a document object has:

>>> doc = PDFL.Document('Samples/Data/sample.pdf')
>>> dir(doc)
['ALL_PAGES', 'BEFORE_FIRST_PAGE', 'LAST_PAGE', 'XMPMetadata', '__class__', '__copy__', '__deepcopy__', '__delattr__', '__doc__', '__eq__', '__getattribute__', '__hash__', '__init__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__str__', '__unicode__', 'author', 'baseURI', 'bookmarkRoot', 'class', 'close', 'compressionLevel', 'countXMPMetadataArrayItems', 'createNameTree', 'createPage', 'creator', 'defaultOptionalContentConfig', 'delete', 'deleteOnClose', 'deletePages', 'embedFonts', 'enumIndirectPDFObjects', 'equals', 'fileName', 'findBookmark', 'findLabelForPageNum', 'findPDFObjectByID', 'findPageNumForLabel', 'flattenOptionalContent', 'flattenTransparency', 'getAuthor', 'getBaseURI', 'getBookmarkRoot', 'getClass', 'getCompressionLevel', 'getCreator', 'getDefaultOptionalContentConfig', 'getDeleteOnClose', 'getFileName', 'getFonts', 'getInfo', 'getInfoDict', 'getInstanceID', 'getIsEmbedded', 'getIsLinearized', 'getIsModified', 'getIsOptimized', 'getIsPxDF', 'getKeywords', 'getLoadedFonts', 'getMajorVersion', 'getMajorVersionIsNewerThanCurrentLibrary', 'getMergedXMPKeywords', 'getMinorVersion', 'getMinorVersionIsNewerThanCurrentLibrary', 'getNameTree', 'getNeedsSave', 'getNumPages', 'getOptionalContentConfigs', 'getOptionalContentContext', 'getOptionalContentGroups', 'getPage', 'getPageLabels', 'getPageMode', 'getPermanentID', 'getProducer', 'getRequiresFullSave', 'getRoot', 'getSubject', 'getSuppressErrors', 'getTitle', 'getVersionIsOlderThanCurrentLibrary', 'getVersionString', 'getWasRepaired', 'getXMPMetadata', 'getXMPMetadataArrayItem', 'getXMPMetadataProperty', 'hashCode', 'infoDict', 'insertPages', 'instanceID', 'isEmbedded', 'isLinearized', 'isModified', 'isOptimized', 'isPxDF', 'keywords', 'loadedFonts', 'majorVersion', 'majorVersionIsNewerThanCurrentLibrary', 'mergeXMPKeywords', 'mergedXMPKeywords', 'minorVersion', 'minorVersionIsNewerThanCurrentLibrary', 'movePage', 'needsSave', 'notify', 'notifyAll', 'numPages', 
'optionalContentConfigs', 'optionalContentContext', 'optionalContentGroups', 'pageLabels', 'pageMode', 'permRequest', 'permanentID', 'print', 'printToFile', 'producer', 'removeNameTree', 'removeOCG', 'replacePages', 'requiresFullSave', 'root', 'save', 'secure', 'setAuthor', 'setBaseURI', 'setCreator', 'setDeleteOnClose', 'setInfo', 'setIsEmbedded', 'setIsOptimized', 'setKeywords', 'setMinorVersion', 'setNeedsSave', 'setPageLabels', 'setPageMode', 'setProducer', 'setRequiresFullSave', 'setSubject', 'setSuppressErrors', 'setTitle', 'setXMPMetadata', 'setXMPMetadataArrayItem', 'setXMPMetadataProperty', 'subject', 'suppressErrors', 'title', 'toString', 'unsecure', 'versionIsOlderThanCurrentLibrary', 'versionString', 'wait', 'wasRepaired', 'watermark']

It’s interesting that Jython infers attributes from get and set calls, so we can get the number of pages without a function call. But we can’t set the number of pages; it’s a read-only attribute.

>>> doc.numPages
2
>>> doc.numPages=2
Traceback (innermost last):
  File "<stdin>", line 1, in ?
AttributeError: read-only attr: numPages
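For readers who haven’t seen this before, the behavior is much like a plain Python property that has a getter but no setter. The class below is a hypothetical stand-in written in ordinary Python to show the idea; it is not the DLE Document class, and it is only an analogy for what Jython does with Java getters.

```python
class Document(object):
    """Hypothetical stand-in for a bean-style class; not the DLE Document."""

    def __init__(self, num_pages):
        self._num_pages = num_pages

    def getNumPages(self):
        return self._num_pages

    # Expose the getter as an attribute. No setter is supplied,
    # so assigning to numPages raises AttributeError.
    numPages = property(getNumPages)


doc = Document(2)
print(doc.numPages)
try:
    doc.numPages = 3
except AttributeError as err:
    print("read-only:", err)
```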

You also might have noticed that there are no declarations of types. That’s because Python is duck-typed: if it looks like a duck, and quacks like a duck, it’s a duck. Every Python object has a specific type, but that type is a property of the object, not the name that is used to reference it.

Time to try something fun, like extracting some text from a PDF file. First, let’s get a word finder:

>>> wf = PDFL.WordFinder(doc, PDFL.WordFinderVersion.LATEST, PDFL.WordFinderConfig())

Now, since we’re in Python, we can get a list of words just by calling getWordList, and it will act like a Python list. So let’s map the unbound function Word.getText onto it, and see what we get.

>>> map(PDFL.Word.getText, wf.getWordList(0))
[u'National', u'Weather', u'Service', u'Zone', u'Forecast', u'http://', u'www.', u'crh.', u'noaa.' ...

There’s a lot of power packed into this one line. First, DLE returns native list types where appropriate, so the list is a java.util.ArrayList. Jython extends Python sequence semantics to Java list types, so we can treat that ArrayList like any other list.

Python’s functional programming (map) turns loops into one-liners, applying a function to each item of a sequence and returning a sequence of results.

Java strings and Python strings are transparently converted.
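If mapping an unbound method looks unfamiliar, here is the same pattern in plain Python with a small stand-in class. The Word class below is hypothetical and only models getText; it is not the DLE Word class. In Python 3, map returns an iterator, so the sketch wraps it in list() to get the same kind of result the console printed above.

```python
class Word(object):
    """Hypothetical stand-in for a word object; only getText is modelled."""

    def __init__(self, text):
        self._text = text

    def getText(self):
        return self._text


words = [Word("National"), Word("Weather"), Word("Service")]

# Call Word.getText on every item of the sequence, without writing a loop.
texts = list(map(Word.getText, words))

# The equivalent list comprehension, for comparison.
same = [w.getText() for w in words]

print(texts)
```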

One thing to remember: The Library object always has to be cleaned up. In the Java-based version of DLE, use the delete method:

>>> lib.delete()

Once an object in DLE has been deleted, dependent objects become invalid. So, if we delete our Document object, the Pages we get from it are no longer usable, and so forth. Deleting the Library object cleans everything up, so we can no longer use our Document.

>>> doc.numPages
  File "<stdin>", line 1, in <module>
java.lang.RuntimeException: Object is no longer valid (perhaps a parent object was already destroyed) ...

Scripting languages offer the ability to explore an API interactively. Code can be tested before it is placed in a complete application, and without requiring compilation. Jython makes an excellent exploration tool for the Java version of DLE.

A future installment will show how Python programs can be written for DLE using Jython.

Other resources

  • Jython Console is a wrapper around a Python console prompt that offers code completion.

Adobe PDF Library: rendering and transparency


Today we have a quick discussion on rendering PDF pages with the Adobe PDF Library C/C++ interface and working with transparency. Read on if you’re familiar with the Adobe PDF Library and are interested in how to draw pages over existing graphics or want to maintain alpha channel information when rendering PDF pages.

The Adobe PDF Library from Datalogics provides support both for rendering PDF pages over top of existing graphic overlays, and supplies alpha channel information for rendered PDF files. Rendering PDF pages is typically done into a byte array supplied to the page rendering APIs (PDPageDrawContentsToMemory or PDPageDrawContentsToMemoryWithParams). Most often, callers are not concerned with preserving transparency information and wish for an opaque white page background. This is easily accomplished by adding the kPDPageDoLazyErase option to the drawing flags (drawFlags) supplied to the rendering call. This option causes the PDF Library to draw a white rectangle in the size of the PDF page’s media box (or the size of the area to draw, whichever is smaller) into the raster byte array as its first element, before drawing any PDF markings. This results in a PDF page rendered over an opaque white background and erases any data that was existing in the raster byte array. The resulting display is what Adobe Reader or Acrobat displays by default when viewing PDF files when the transparency grid is off.

However, in some cases you may want to render a PDF page on top of a pre-existing graphic – for example, to simulate the transparency grid view in Adobe Reader/Acrobat. Because the PDF Library does not disturb or change pixels in the imaging byte array, you can pre-initialize the byte array with existing graphical content in the colorspace and format that you are rendering the PDF page to (24-bit packed RGB for a PDF rendered to DeviceRGB, as an example) and use the page rendering APIs to render into the raster byte array. Here, you would make sure not to specify the kPDPageDoLazyErase flag; because you want to maintain the pixels in the raster byte buffer that are not marked on the PDF page, the page should not be overlaid with an opaque background before the PDF is rendered. As long as the kPDPageDoLazyErase flag is not specified, only the pixels that represent areas marked in the PDF page will be changed.

The above maintains unmarked areas on the PDF page but does not show the full support for alpha channel (opacity) information in the Datalogics release of the Adobe PDF Library [note that the following applies only to the Datalogics release of the Adobe PDF Library]. Datalogics enhances the PDF Library to support rendering to RGBA (RGB + alpha) pixel format with full 8-bit alpha channel support by adding a virtual colorspace, “DeviceRGBA”, to the set of colorspaces allowed as rendering targets. The above discussion of the kPDPageDoLazyErase draw flag remains valid: specifying this drawing option tells the PDF Library to overlay the contents of the raster byte buffer with an opaque white rectangle, in the size of the PDF page’s media box or the size of the area to draw, before drawing any of the PDF page. This results in a PDF page with an alpha channel that is entirely opaque – because the background white fill is opaque.

When the kPDPageDoLazyErase flag is not set, and the DeviceRGBA colorspace is specified, the PDF Library will additively composite opacity channel information with the information existing in the alpha channel of the supplied raster byte array. In the case where you simply are interested in the PDF page’s content and do not have a background image to use as an underlay, you therefore initialize the alpha channel bytes to 0x00 (fully transparent). The PDF Library will not change the value of the alpha channel bytes for pixels that are unmarked by the PDF page, and these remain fully transparent in the resulting raster. The RGB bytes can be initialized to any value since they are fully transparent; we recommend white (0xFF 0xFF 0xFF) but any value can be used. For cases where you’d like to draw over a background, the PDF Library will composite marked pixels in an additive manner towards opaque white for areas in the PDF page that are not fully opaque. Of course, areas in the PDF page that are fully opaque – and those marked in a colorspace not containing transparency information – will be marked in the raster byte array as fully opaque and will overwrite any RGBA value previously existing in the raster byte array.
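As a rough illustration of the buffer setup described above, the following plain Python sketch builds a byte array in which every pixel is white with a fully transparent alpha byte. It makes no PDF Library calls. The function names are invented for this example, and the layout (four bytes per pixel in R, G, B, A order, with rows tightly packed) is an assumption that you should check against the rendering API you are actually calling.

```python
def make_transparent_rgba_buffer(width, height):
    """Return a bytearray of width * height pixels, each white and fully transparent.

    Assumes 4 bytes per pixel in R, G, B, A order with no row padding.
    """
    pixel = bytes([0xFF, 0xFF, 0xFF, 0x00])
    return bytearray(pixel * (width * height))


def count_transparent_pixels(buf):
    """Count pixels whose alpha byte (every 4th byte, starting at offset 3) is 0x00."""
    return sum(1 for alpha in buf[3::4] if alpha == 0x00)


buf = make_transparent_rgba_buffer(4, 3)
print(len(buf), count_transparent_pixels(buf))
```

After rendering into such a buffer without the lazy erase flag, counting the pixels whose alpha byte is still 0x00 is one quick way to see how much of the page was left unmarked.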

We hope you’ve found this brief discussion helpful. Happy imaging!


APDFL and DLE for low-level PDF work


Note: This is the 1st in a series of four articles exploring low-level PDF manipulation using APDFL and DLE

When implementing low-level PDF work using APDFL, we are essentially talking about using the API subset called the Cos Layer.  The Cos Layer functions manipulate objects which correspond to the basic PDF object types as specified in section 3.2 of the PDF v1.7 Reference (or section 7.3 of the PDF 32000-1:2008 specification).   DLE provides an object-oriented interface to the Cos Layer, but uses the PDF prefix instead of Cos.

In general, you need to use Cos-level functions when you want to implement functionality discussed in the PDF spec that is not covered by specific API calls in APDFL.  If you are considering making Cos-level modifications to PDFs using APDFL, you might want to first prototype using DLE.  So let’s discuss how these relate to each other and some gotchas to keep in mind.

Basic PDF object types   | Cos equivalents                | DLE object equivalents
Boolean values           | CosBoolean                     | PDFBoolean
Integer and real numbers | CosInteger, CosFixed, CosReal  | PDFInteger, PDFReal
Strings                  | CosString                      | PDFString
Names                    | CosName                        | PDFName
Arrays                   | CosArray                       | PDFArray
Dictionaries             | CosDict                        | PDFDict
Streams                  | CosStream                      | PDFStream
The null object          | CosNull                        |

Boolean values:

The simplest PDF Object – PDFBoolean – nonetheless comes with a good number of methods inherited from PDFObject. What is unique to PDFBoolean is its bool property called Value, and its constructors, which correspond to CosBooleanValue() and CosNewBoolean(), respectively. The rest of the methods roughly correspond to CosObj functions which apply to all Cos types.

Integer and real numbers:

PDFInteger corresponds to CosInteger, much like PDFBoolean to CosBoolean. But, in addition to the CosIntegerValue() and CosNewInteger() functions, there are also CosIntegerValue64() and CosNewInteger64() if a 32-bit int is not sufficient for you.

PDFReal likewise corresponds to CosReal, and from there to a number of different Cos functions, but CosDoubleValue() and CosNewDouble()  (or perhaps CosNewDoubleEx, if you need to specify significant digits) are the ones you will want to use.

Avoid using CosFixed as it has 16-bit limitations and is partly deprecated.

Strings:

PDF Strings are actually a bit more complicated than what is discussed in section 3.2.3 of the PDF Reference; you also need to take into account the string types described in section 3.8.1. Using APDFL, you might find some of the ASText functions helpful (e.g. ASTextFromSizedUnicode() and ASTextGetUnicodeCopy()).  Using DLE, that logic is already in place underneath the hood.

Names:

PDF Names represent tokens.  The biggest pitfall is that Names are case sensitive.  Otherwise, while you can use CosNameFromString() to make a Name directly from a string, going the other way takes two steps: CosNameValue() returns an ASAtom, and you will have to pass that ASAtom to ASAtomGetString() in order to get a string.

Arrays:

A PDF array is a one-dimensional collection of any and all types of PDF objects, including nested PDF Arrays. As such, it is relatively straightforward to use both in DLE and using the CosArray functions.

Dictionaries:

PDF Dictionaries are the heart of the PDF format. A dictionary is an associative table of key and value pairs, with the keys being PDFNames and the values being any PDF object.  There are a number of CosDict functions with KeyString() suffixes; these are helper functions which eliminate the need to create a separate PDF Name for each dictionary lookup.  In DLE, the same effect is achieved using method overloading.

Streams:

A Stream is basically a block of raw data with a dictionary associated with it. It may be compressed or encrypted.
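To see how these basic types fit together, here is a small plain Python sketch that writes Python values out in PDF syntax. It is only a teaching aid: it uses neither the Cos functions nor the DLE classes, and the Name class and to_pdf function are invented for this example. It also assumes names contain no characters that need escaping, and it leaves out streams.

```python
class Name(object):
    """Hypothetical stand-in for a PDF name; comparison is case sensitive."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Name) and self.value == other.value

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash(("Name", self.value))


def to_pdf(obj):
    """Write a Python value using PDF object syntax."""
    if obj is None:
        return "null"
    if isinstance(obj, bool):  # check bool before int: bool is a subclass of int
        return "true" if obj else "false"
    if isinstance(obj, int):
        return str(obj)
    if isinstance(obj, float):
        # PDF reals have no exponent form, so use fixed notation and trim zeros.
        return ("%.6f" % obj).rstrip("0").rstrip(".")
    if isinstance(obj, Name):
        return "/" + obj.value
    if isinstance(obj, str):
        escaped = obj.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
        return "(" + escaped + ")"
    if isinstance(obj, (list, tuple)):
        return "[" + " ".join(to_pdf(item) for item in obj) + "]"
    if isinstance(obj, dict):
        parts = []
        for key, value in obj.items():
            if not isinstance(key, Name):
                raise TypeError("dictionary keys must be names")
            parts.append(to_pdf(key) + " " + to_pdf(value))
        return "<< " + " ".join(parts) + " >>"
    raise TypeError("unsupported type: %r" % type(obj))


page = {Name("Type"): Name("Page"), Name("Rotate"): 90,
        Name("MediaBox"): [0, 0, 612, 792]}
print(to_pdf(page))
```

Note that Name("Type") and Name("type") compare as different keys here, which mirrors the case sensitivity pitfall mentioned above.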

We’ll continue with the discussion of low-level PDF work in an upcoming article.


Solaris and the case of failing exception handling


In the course of bringing our Adobe PDF Library Java interface to Solaris for the 64-bit AMD/Intel platform, we encountered an interesting issue regarding exception handling in shared libraries. We share our story in case it helps someone in the future.

Symptoms: Our APDFL Java interface was nearly complete: code had been ported, examples and tests had been run, and the results were promising. There were several areas with unexpected failures, however. These all manifested as Java virtual machine crashes, and all occurred while attempting to handle a C++ exception raised inside one of the APDFL C++ shared libraries:

Stack: [0xfffffd7fffbff000,0xfffffd7fffe00000),  sp=0xfffffd7fffdfdfd8,  free space=2043k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code, ...=previous line continue)
C  0x0000000000b34196
C  [libc.so.1+0xd9579]  _Unwind_RaiseException+0x46
C  [libstdc++.so.6.0.9+0xf9cf5]  __cxa_throw+0x55
C  [libDL100pdfl.so.10.1.0.18+0x8ae446]  ASPushExceptionFrame+0x136
C  [libDL100pdfl.so.10.1.0.18+0x823cb7]  PDEPrefGet+0x307
C  [libDL100pdfl.so.10.1.0.18+0x7ece1d]  PDEFontIsMultiByte+0x36dd
C  [libDL100pdfl.so.10.1.0.18+0x7f00fb]  PDEFontSetSysFont+0x6b
C  [libDL100pdfl.so.10.1.0.18+0x495571]  PDDrawCosObjToWindow+0x1461
C  [libDL100pdfl.so.10.1.0.18+0x496182]  PDDrawCosObjToWindow+0x2072
C  [libcom-datalogics-DL100PDFL.so+0x365624]
…_ZN9CDocument10EmbedFontsEN15CEmbedFlagsEnum11CEmbedFlags
…EP16CProgressMonitorP11CCancelProcP11CReportProc+0x47a
C  [libcom-datalogics-DL100PDFL.so+0x365721]
…_ZN9CDocument10EmbedFontsEN15CEmbedFlagsEnum11CEmbedFlagsE+0x19
C  [libcom-datalogics-DL100PDFL.so+0x234714]
…Java_com_datalogics_PDFL_PDFLJNI_Document_1
…embedFonts_1_1SWIG_11+0x70
j  com.datalogics.PDFL.PDFLJNI.Document_embedFonts__SWIG_1
…(JLcom/datalogics/PDFL/Document;I)V+0
j  com.datalogics.PDFL.Document.embedFonts(Ljava/util/EnumSet;)V+9
j  com.datalogics.PDFL.Samples.AddUnicodeText.main([Ljava/lang/String;)V+416

This was only an issue with the x64 version of Solaris 10; the 32-bit x86 Intel version worked OK, as did both versions on the SPARC platform. Nor was this an issue with C++-based programs using the same PDF Library via its C language interface.

A clue: Digging through, we found a note in our C++ language sample makefile from 2012 detailing a need to explicitly add the GNU C++ libgcc_s library when linking our sample programs, and to place it before the C runtime linkage. This resolved some issues with crashing when handling exceptions. Some research into past notes indicated an ABI mismatch between the C++ exception handling mechanism in the Solaris C runtime and the mechanism used by GCC in its runtime.

There is a known issue with a slight mismatch in the ABI between these two (libgcc_s.so: _Unwind_RaiseException and Solaris libc.so: _Unwind_RaiseException). Binding symbols to the GCC runtime first causes it to be loaded before the Solaris runtime, and everything works out well. But, simply adding this explicitly to our shared library link line did not help anything.

The problem: After some thought, we realized that running the PDF Library through a Java interface is somewhat different than calling it directly from an executable program in one important way. When building a C++ program, the linker can prompt the runtime linker along and give some help in the order of shared libraries to link – this is manifested by the order of shared libraries specified on the link line. But in our case, the Java interface is a native shared library that is loaded by the Java Virtual Machine dynamically, when it is invoked by a call to initialize the PDF Library. The executable running is the JVM – so the runtime linker will have taken its shared library loading information already from the ‘java’ executable. This executable, built with Oracle’s Solaris Studio compiler (formerly known as Sun Studio), will have the Solaris C runtime and its incompatible C++ exception ABI loaded long before the PDF Library is ever requested to be loaded by the dynamic loader.

It was a catch-22: we had to have the GCC runtime loaded before the Solaris runtime, by a program that knew nothing of GCC. Recompiling the PDF Library and the Java interface with Sun Studio was not an option: the C++ language support was not up to the needs of the Java interface and would require significant rewriting. We were stuck – or were we?

The solution: While not known by many, most UNIX systems (including Solaris) include a mechanism to explicitly load one or more shared libraries before starting an executable program: the LD_PRELOAD environment variable. This defines a shared library or a list of shared libraries to be loaded before program execution, and allows users of the PDF Library Java interface to ensure that the GCC runtime and its exception handling is loaded before the Solaris runtime. This causes references at runtime to the exception handling to be routed correctly. While a somewhat drastic measure, this causes all the exception handling to be wired correctly and resolved this issue.

In brief: Shared libraries built with GCC on Solaris for the x64 platform that are called from programs built with a different compiler should use the LD_PRELOAD environment variable to load the GCC runtime. This includes shared libraries built with GCC and used by Java programs via JNI when running in Oracle’s JVM.
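As a rough illustration of the workaround, a small Java launcher can set LD_PRELOAD for a child JVM before the PDF Library is ever loaded. This is only a sketch: the library path, the classpath, and the main class name below are assumptions chosen for illustration, and the actual location of libgcc_s depends on how GCC is installed on the machine.

```java
import java.util.ArrayList;
import java.util.List;

public class PreloadLauncher {
    // Assumed path for illustration only; locate libgcc_s for your own GCC install.
    static final String GCC_RUNTIME = "/usr/sfw/lib/amd64/libgcc_s.so.1";

    /** Builds (but does not start) a child JVM whose environment preloads the GCC runtime. */
    static ProcessBuilder buildLauncher(String gccRuntime, String classpath, String mainClass) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-cp");
        command.add(classpath);
        command.add(mainClass);

        ProcessBuilder builder = new ProcessBuilder(command);
        // The runtime linker reads LD_PRELOAD when the child process starts,
        // so the GCC runtime is bound before the JVM loads any JNI library.
        builder.environment().put("LD_PRELOAD", gccRuntime);
        builder.inheritIO();
        return builder;
    }

    public static void main(String[] args) {
        // "app.jar" and "com.example.Main" are hypothetical placeholders.
        ProcessBuilder builder = buildLauncher(GCC_RUNTIME, "app.jar", "com.example.Main");
        System.out.println("LD_PRELOAD=" + builder.environment().get("LD_PRELOAD")
                + " " + String.join(" ", builder.command()));
        // Calling builder.start() would launch the child JVM with the preload in effect.
    }
}
```

Setting the variable in a shell wrapper script before invoking java achieves the same effect; the launcher form is simply convenient when the application is already started from Java.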


The Adobe LiveCycle Portable Protection Library: an SDK to Manage your Rights Management Server


First, a little background information about Adobe LiveCycle

Adobe LiveCycle is an enterprise suite of server-based software products that are meant to help structure business processes. One of its key features is the ability to protect a document (PDF, Microsoft Word, Microsoft Excel, Microsoft PowerPoint, text, and more!) with a policy to restrict access to a subset of users. Adobe LiveCycle offers this ability through one of its components, called Rights Management.

Large organizations, or organizations that have been using Adobe LiveCycle for an extended period of time, may eventually accumulate a significant number of policies and documents. These can become tiresome to manage through the web interface for the Rights Management component of LiveCycle. Users of LiveCycle can also fall into performing the same action over and over, such as protecting documents with the same policy again and again. What if there was a programmatic way to apply these policies to new documents? What if there was a programmatic way to use a policy from one document in another? What if there was a programmatic way to manage these?

Now there is! It’s called the Adobe LiveCycle Portable Protection Library!

What is the Adobe LiveCycle Portable Protection Library?

The Adobe LiveCycle Portable Protection Library (PPL) is a C++ SDK that allows programmatic access to the Rights Management component of your Adobe LiveCycle instance. Currently the Portable Protection Library is supported on Windows (32/64 bit), Solaris (SPARC 64/AMD 64), and HP-UX PA-RISC. It requires the use of HTTPS for its connections to the Rights Management server. Documents that are secured using the Portable Protection Library are encrypted with either 128 or 256 bit AES encryption, and many (if not all) of the permissions that are available for configuration in LiveCycle are still available through this SDK. PPL is built to work with Adobe LiveCycle ES2 and later, including the recently released Adobe LiveCycle ES4.

What does the Portable Protection Library have to offer?

The Portable Protection Library allows users to write applications or plugins that:

  • create, update, and delete policies
  • secure/unsecure documents with a policy
  • revoke access to a document that has been secured with a policy
  • provide offline access to documents that have been secured with a policy
  • redirect users to your Rights Management web application

PPL can be used in conjunction with the Adobe PDF Library (APDFL) to write custom PDF viewing applications, using PPL to manage the decryption/encryption of files and using APDFL to render documents. PPL also has the capability to cache user credentials (if you are using username and password) so that once a user has been authenticated they can access documents offline, until the authentication timeout limit has been reached. Applications written with the Portable Protection Library can supply application specific permissions and log a number of user events so that there is a record of these in your Rights Management server.

Look for more information about the Adobe LiveCycle Portable Protection Library from Datalogics in the upcoming months as we learn more about it!


pdfaPilot from callas online demo now available for PDF/A-2 and -3


callas pdfaPilot 4 now has full support for the new PDF/A-3 standard and the PDF/UA standard for accessibility. It includes advanced features for highly flexible PDF document processing via new process plans and allows export of PDF to EPUB for mobile publishing. Datalogics has updated our online demonstration to this new level of PDF/A support.

The pdfaPilot toolkit can correct PDF files for PDF/A compliance, thus enabling you to provide robust PDF/A support within your application. It can also validate against both the PDF/A-1a and PDF/A-1b levels of conformance and now supports the PDF/A-3 standard.

Datalogics offers callas pdfaPilot as an add-on to the Adobe PDF Library, so that developers can add PDF/A support to their applications that create and manipulate PDF.

More information on this product is available on the website, as well as the online demonstration.

Stay tuned to this blog, as we will be rolling out other web services in the very near future.

JBIG2 Compression: Lossy vs. Lossless


Recently it was reported that the JBIG2 compression implementation in some Xerox scanners has the undesired side-effect of changing some characters under specific circumstances to other characters when scanning documents. Datalogics has received inquiries about the JBIG2 compression that is accessible in the Adobe PDF Library. I’d like to provide a bit of information about JBIG2 and how it may interact with your application.

Overview: JBIG2 (http://en.wikipedia.org/wiki/JBIG2) is a standard compression algorithm for bitonal (1 bit, black & white) raster images that can be used either in a lossless or a lossy mode. In both modes, JBIG2 works by searching for and creating reusable compression dictionary entries for a given raster image being compressed, and then re-composing an image as a series of compression dictionary references. In lossless mode, every raster image area will be expressed exactly by one of the dictionary entries. In lossy modes, JBIG2 compressors can start to substitute compression dictionary entries that are close in appearance to a given raster image area. This enhances compression capability by allowing for more references to fewer entries. However, because these substitutions are sometimes closest-match references, this can cause subtle changes in image appearance due to the choice of reusing an existing, close match for a given image area rather than creating a new entry. Higher levels of lossy compression are accomplished by allowing for matches to be further away from exact and therefore using more references to fewer compression dictionary entries. In some limited circumstances, this can cause characters to change appearance as portions of characters (letters or numbers) are evaluated and substituted with references to close visual approximations that, unfortunately, add up to a visual appearance of a different character.
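The substitution behavior described above can be illustrated with a toy model. This is not JBIG2 and not the PDF Library's compressor: the 8x8 glyphs packed into a long, the pixel-mismatch threshold, and the class and method names are all assumptions made for the sketch. A threshold of zero behaves losslessly, while a larger threshold lets a near match stand in for a distinct glyph.

```java
import java.util.ArrayList;
import java.util.List;

public class SymbolDictionaryDemo {
    /**
     * Returns the dictionary index used to represent the glyph. The closest
     * stored entry is reused when it differs by at most maxMismatchedPixels;
     * otherwise the glyph is added as a new entry.
     */
    static int encode(List<Long> dictionary, long glyph, int maxMismatchedPixels) {
        int best = -1;
        int bestDistance = Integer.MAX_VALUE;
        for (int i = 0; i < dictionary.size(); i++) {
            // Number of pixels that differ between the stored entry and this glyph.
            int distance = Long.bitCount(dictionary.get(i) ^ glyph);
            if (distance < bestDistance) {
                best = i;
                bestDistance = distance;
            }
        }
        if (best >= 0 && bestDistance <= maxMismatchedPixels) {
            return best;
        }
        dictionary.add(glyph);
        return dictionary.size() - 1;
    }

    public static void main(String[] args) {
        long glyphA = 0x3C4242423C000000L; // an arbitrary 8x8 bitonal shape
        long glyphB = glyphA ^ 0x5L;       // the same shape with two pixels flipped

        List<Long> lossless = new ArrayList<>();
        encode(lossless, glyphA, 0);
        encode(lossless, glyphB, 0);
        System.out.println("lossless entries: " + lossless.size());

        List<Long> lossy = new ArrayList<>();
        encode(lossy, glyphA, 2);
        encode(lossy, glyphB, 2);
        System.out.println("lossy entries: " + lossy.size());
    }
}
```

In the lossy run the second glyph is decoded as the first, which is exactly the kind of substitution that is harmless for specks of noise and harmful when the two shapes are different characters.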

Impact: the Adobe PDF Library can be used to compress bitonal images in PDF files with the JBIG2 compressor. When compressing, callers have control over the level of compression applied to an image – from lossy control that aggressively searches for close compression dictionary matches, up to completely lossless compression. Callers that specify the most aggressive compression levels for JBIG2 compression might, in theory, see similar issues to those reported against Xerox scanners – though Datalogics has never replicated the specific concern seen with these Xerox scanners. While examples Datalogics has distributed in the past of using JBIG2 compression for PDF images have used aggressive compression, the default mode of the JBIG2 compressor as implemented in the PDF Library is a lossless mode, and will never change the appearance of compressed images.

Recommendation: for archival purposes, Datalogics recommends always using lossless compression when applying JBIG2 compression to images. This can be assured by not setting the JB2Quality compression encoding dictionary value, or by setting this value to 10 or greater. For long-term readability, or for extra assurance that future readers will be able to retrieve and decode bitonal images correctly, consider using CCITT G4 encoding. While CCITT G4 encoding does not compress as well as JBIG2, it always compresses losslessly and is supported by a wider variety of current PDF file consumers.

Threads vs. processes for program parallelization


For most computing tasks, there is great advantage to splitting up workload into multiple actors and partitioning the task into different, multiple tasks for these multiple actors. Two common ways of doing this are multi-threaded programs and multi-process systems. In a multi-threaded program, multiple actors live in a shared program context. In multi-process systems, there are multiple actors but each lives in its own independent program context. Understanding the best choice for your program and workload requires understanding the advantages and disadvantages of multi-threaded programs:

Multi-threaded program advantages:

  • Less overhead to establish and terminate vs. a process: because very little memory copying is required (just the thread stack), threads are faster to start than processes. To start a process, the whole process area must be duplicated for the new process copy to start. While some operating systems only copy memory once it is modified (copy-on-write), this is not universally guaranteed.
  • Faster task-switching: in many cases, it is faster for an operating system to switch between threads for the active CPU task than it is to switch between different processes. The CPU caches and program context can be maintained between threads in a process, rather than being reloaded as in the case of switching a CPU to a different process.
  • Data sharing with other threads in a process: for tasks that require sharing large amounts of data, the fact that threads all share a process’s memory pool is very beneficial. Not having separate copies means that different threads can read and modify a shared pool of memory easily. While data sharing is possible with separate processes through shared memory and inter-process communication, this sharing is of an arms-length nature and is not inherently built into the process model.

Threads are a useful choice when you have a workload that consists of lightweight tasks (in terms of processing effort or memory size) that come in, for example with a web server servicing page requests. There, each request is small in scope and in memory usage. Threads are also useful in situations where multi-part information is being processed – for example, separating a multi-page TIFF image into separate TIFF files for separate pages. In that situation, being able to load the TIFF into memory once and have multiple threads access the same memory buffer leads to performance benefits.

Multi-threaded program disadvantages:

  • Synchronization overhead of shared data: shared data that is modified requires special handling in the form of locks, mutexes and other primitives to ensure that data is not being read while written, nor written by multiple threads at the same time.
  • Shared process memory space: all threads in a process share the same memory space. If something goes wrong in one thread and causes data corruption or an access violation, then this affects and corrupts all the threads in that process. This is a special concern for cross-language environments where it is very easy to have subtle ABI interaction problems, such as Java-based web servers calling upon native libraries via the JNI (Java Native Interface) ABI.
  • Program debugging: multi-threaded programs present difficulties in finding and resolving bugs over and beyond the normal difficulties of debugging programs. Synchronization issues, non-deterministic timing and accidental data corruption all conspire to make debugging multi-threaded programs an order of magnitude more difficult than single-threaded programs.

Processes are a useful choice for parallel programming with workloads where tasks take significant computing power, memory or both. For example, rendering or printing complicated file formats (such as PDF) can sometimes take significant amounts of time – many milliseconds per page – and involve significant memory and I/O requirements. In this situation, using a single-threaded process and using one process per file to process allows for better throughput due to increased independence and isolation between the tasks vs. using one process with multiple threads.

Virtualized and cloud environments such as VMWare and Amazon’s AWS platform complicate this situation somewhat. In these environments, hardware is shared with other virtualized environments and wide variance in CPU allocation as well as I/O times can be seen. Higher variance in context switching return time can also be observed. As Jayasinghe et al. observe (http://www.cercs.gatech.edu/opencirrus/OCsummit11/presentations/jayasinghe.pdf) for the Amazon platform, reducing the number of threads in an application running in a cloud environment can increase performance. Designing for elastic demand at the outset is also an important factor: a multiple-process application where each process assumes limited communication with and reliance on other processes is much easier to scale horizontally to meet demand, by instantiating new server instances, than an application that relies exclusively on multiple threads and can only scale vertically.

There are many factors to consider for your specific application and environment, and I’ve only provided an overview of the most important considerations. For applications that use the Adobe PDF Library, we have found that most workloads benefit from a multiple-process approach when possible. At the end of the day, however, benchmarking, profiling, and usage testing of your application are the only reliable ways of knowing what will work best in your specific situation. I hope the guidelines above help you decide where to start in writing or parallelizing your application.
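A minimal sketch of the one-process-per-file approach follows, assuming a separate single-threaded worker program exists that processes one PDF per invocation. The worker jar and class name (worker.jar, RenderWorker) are hypothetical placeholders. The coordinating threads here do no PDF work themselves; they only wait on child processes, so a crash in one worker cannot corrupt the others.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ProcessPerFile {
    /** Command line for the hypothetical single-file worker program. */
    static List<String> workerCommand(String pdfPath) {
        return Arrays.asList("java", "-cp", "worker.jar", "RenderWorker", pdfPath);
    }

    /** Runs one worker process per file, at most maxParallel at a time; returns the failure count. */
    static int runAll(List<String> pdfPaths, int maxParallel) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        List<Future<Integer>> exitCodes = new ArrayList<>();
        for (String path : pdfPaths) {
            // Each task starts an isolated process and waits for its exit code.
            Callable<Integer> task =
                    () -> new ProcessBuilder(workerCommand(path)).inheritIO().start().waitFor();
            exitCodes.add(pool.submit(task));
        }
        pool.shutdown();

        int failures = 0;
        for (Future<Integer> exitCode : exitCodes) {
            try {
                if (exitCode.get() != 0) {
                    failures++; // the worker ran but reported an error or crashed
                }
            } catch (ExecutionException e) {
                failures++; // the worker could not be started at all
            }
        }
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        int failures = runAll(Arrays.asList(args), 4);
        System.out.println(failures + " file(s) failed");
    }
}
```

The limit of four concurrent workers is arbitrary; as noted above, the right number is something to find by benchmarking on the target hardware.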

Introduction to PDF Color: Fun with Separations


This is the last of the Ducky mini-series within the Introduction to PDF Color story arc. Only four Duckies were not harmed in the making of this particular slide compared to the body counts in the previous slides.  However, you may want to avert your eyes because this slide is particularly gruesome as we are doing the color printing equivalent of a vivisection here.

DrawSeparationImages-out1

Here we are separating the colors of the PDF (ducky.pdf) into their individual process colors, which generally results in grayscale images, since the output color space is DeviceGray. But we are re-inserting the separation images into a same-sized page, scaling them down to one-quarter size, and changing the color space of the images to a DeviceCMYK Separation color space corresponding to the appropriate color component, or ink.

A lot of this is drawn from the DrawSeparations sample app that ships with APDFL, but the magic is in the snippet below, both because of the Matrix we generate to scale and position the image and because we are replacing the Image’s ColorSpace:

And in more detail, here is where we set up the separation color spaces for each ink:

 

Start your free eval of the Adobe PDF Library today.

 

Introduction to PDF color: Experiments with Patterns


It’s a bit ironic, but I used the Pattern Colorspace all over my Introduction to PDF Color presentation before rushing through Patterns as the last topic in its own right.

You might recall the following slides, all of which use an Axial Shading pattern:

Different ways of specifying a white to black shading pattern. 0OpticalIllusions3 DrawSeparationImages-out1

 

But one of the following Radial Shading Patterns was used as the background for every slide with text on it:

TestPatterns22 TestPatterns21 TestPatterns23

 

Now, it’s relatively straightforward to create Axial Shading patterns with the DotNet Interface, but creating Radial Shading Patterns and even tiling Patterns isn’t supported with that API. (Could be worse: the fruit company’s fondleslabs don’t support displaying PDF shading patterns at all.) So for the above and the slides below, I explored creating them using APDFL’s C interface, starting from the standard createPatterns sample app (which I’ve also translated to PDFJT code) and extending it in a manner that’s more random-walk than depth-first or breadth-first search.

TestPatterns8 TestPatterns14 TestPatterns13

Let’s start by looking at the code for the tiling patterns.

Here we are basically setting up the Graphic State, used to set up the content, which is done in this routine which essentially draws an octagon:

and this routine takes PDEContent and creates a Pattern from it:

The next Pattern is much like the first, using the same PDEContent in fact, except we are going to fill the tiling pattern with a CalGray-based Axial Shading pattern.

The last pattern uses an octagram shape rather than an octagon, and also uses an axial shading pattern (DeviceRGB-based) as a fill, but the tile itself is rotated 30°.

Now, for the backgrounds, I created a radial shading pattern.

Note that the color values were initially pulled from Example 4.25 from the PDF v1.7 Reference, page 313. They were nice cool greens that I color shifted to red by having the cyan, magenta, and yellow values do a circular shift. And again for blue.

Also note that a Radial Shading Pattern uses a stitched function, which is two (or more) functions appended together:

Learn more about Adobe PDF Library here.

Introduction to PDF Color: Fun with tiling Patterns


For today’s episode, we have old-time Ducky (from the silent era, before the talkies nearly ruined everything by revealing that Ducky’s voice actually squeaked).

ducky_GrayScale_recombined1

Actually, this is the same old Ducky, rendered to grayscale, split apart by gray-scale level, vectorized with POTrace, and then recombined. We are going to use this PDF to demonstrate some other ways to make shades of gray; this time using tiling Patterns. The key to this is this illusion:

0OpticalIllusions14

In this illusion we have a gray background upon which are white lines and black lines; in the square delineated with the black lines we see a darker gray than the outer square with its white lines, hence the illusion. But if you think of this printed on paper, gray is essentially created by varying the density of ink in an area, so if this illusion were printed on paper, your eyes would correctly be reporting that the inner square has a higher ink density than the area around it.

So now, what we are going to do with old-time Ducky is replace the deviceGray fills with tiling patterns whose content fills the tile with roughly the same amount of (black) ink as an entirely gray tile.

In this first variation, we are going to create a pattern of lines, varying the line width to control how much black will be in the tile.

The result (click to see full-size):

ducky_grayscale_recombined_mod11

In this next variation, instead of using straight lines, we are going to draw sawtoothed lines.

ducky_grayscale_recombined_mod21

And finally, instead of lines, we’ll draw rectangles to fill the tile area:

ducky_grayscale_recombined_mod51

The code for replacing  DeviceGray fills with Pattern fills is basically the same:

Note that this code takes advantage of the fact that the recombined color shapes are all grouped together by color value, so it starts a new pattern when the grayscale color value changes.  If this assumption does not apply, then you would want another method of re-using patterns for a given color value.
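The arithmetic behind matching a tile's ink to a gray level can be sketched as follows. This is not the code used for the slides: it assumes a square tile, a DeviceGray value where 0 is black and 1 is white, and hypothetical method names. The idea is simply that the fraction of the tile painted black should equal 1 minus the gray value.

```java
public class TileInkCoverage {
    /**
     * Width of a single black line running the full length of a square tile
     * so that the line covers the fraction (1 - gray) of the tile area.
     */
    static double lineWidth(double gray, double tileSize) {
        return (1.0 - gray) * tileSize;
    }

    /**
     * Side of a single black square placed in the tile so that its area is
     * the fraction (1 - gray) of the tile area.
     */
    static double squareSide(double gray, double tileSize) {
        return Math.sqrt(1.0 - gray) * tileSize;
    }

    public static void main(String[] args) {
        double tileSize = 8.0;
        for (double gray : new double[] {0.25, 0.5, 0.75}) {
            System.out.println("gray " + gray
                    + ": line width " + lineWidth(gray, tileSize)
                    + ", square side " + squareSide(gray, tileSize));
        }
    }
}
```

For a light gray of 0.75 on an 8-unit tile, a quarter of the tile should be black: a line 2 units wide (2 × 8 = 16 of 64 square units) or a square 4 units on a side (4 × 4 = 16).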

Start your free evaluation of Adobe PDF Library today.

Adobe PDF Library Beta Program Comes to a Close – Production Release Soon


Exciting times at Datalogics with the Adobe PDF Library! We’ve been hard at work on the Datalogics distribution of Adobe’s PDF Library version 15 and, after a lot of good times and some challenges, we’ve concluded the beta program and are readying our general release. Our great thanks to everyone who gave us feedback and suggestions through the beta program. If you weren’t able to participate this time around – fear not, the production release is coming up just around the corner on May 17th. What’s new and improved since our release of version 10.1, you ask? Here are the highlights:

  • PDF to PDF/X conversion for PDF/X-1a and PDF/X-3 outputs. PDF/X is the standard in the graphic arts and press fields for conveying accurate, complete PDF proofs and documents for printing.
  • Revamped our existing line of samples
  • Performance improvements in importing pages and merging documents for large PDF files
  • Performance and stability enhancements for multithreaded applications and workflows
  • Updates to the PDF/A converter for better performance and smaller file sizes
  • Users can now set the transparency blending color space used for rendering, printing, and flattening PDF documents
  • Black point compensation can now be enabled/disabled when using the Adobe Color Engine color transformation API in the PDF Library

Of course, this update brings the Adobe PDF Library offered by Datalogics in alignment with the same core PDF processing code that underpins Adobe’s PDF Creative Cloud lineup – including Reader and Acrobat DC as well as Photoshop, Illustrator, and the rest of the Creative Cloud suite. Having the same common PDF core shared across Adobe applications and your applications means maximum compatibility and interoperability with the leading set of PDF tools and solutions.

The Adobe PDF Library is a constantly growing and evolving product. Datalogics continues to be actively involved with developments such as the coming PDF 2.0 standard (where we actively participate in defining and refining this upcoming standard), and we continue to bring updates and value to our Adobe PDF Library offering as guided by feedback from you – our customers and our users. Look out for some exciting updates coming up soon!

Learn more about PDF Library and start your free eval today.


Image Masking: Now with Vector Paths


In my last article I ended on a small tease:

At a lull in between, I took a small break to put together a quick DLE program to demonstrate how one masks an Image with a vector path, which I’ll discuss in a follow-up article.

Since the solution isn’t as exciting as you might think, I’m going to discuss the other ways that you can mask an image before revealing the solution and discussing it, because there are lots of ways to mask an image in PDF, so it’s a bit more interesting a topic than you might think.

If you search the PDF Specification (as I did just now) you will find that there are essentially two sections that discuss Masking: Section 8.9.6 Masked Images and section 11.6.5 Specifying Soft Masks, and neither discusses masking with vector paths.

Soft Masks are part of the chapter on transparency and come in two flavors: Images and dictionaries.  If you are familiar with RGBA images, an image soft mask is essentially the Alpha (transparency) channel separated from the RGB image and made to stand alone within the parent image XObject, which is essentially what DLE does with PNG files that have RGBA images.  SoftMask dictionaries, on the other hand, are part of the extended graphic state, and while they contain a Transparency Group Form XObject, which could contain a vector path, it’s not the mechanism that comes to mind as the ideal way to mask an image with a vector path.

Masked images come in three flavors: Stencil Masking, Explicit Masking, and Colour Key Masking. Explicit Masking is like an image soft mask except that the alpha channel pixel can only be all on or all off, so there’s no degree of transparency.  Colour key masking is a bit like how transparency works in GIF files, where one of the indexed colors is designated as the transparency color, but with colour key masking you can specify a range for each color component of the image; if all the component values fall within those ranges, then that pixel is masked (or not depending on the Decode parameter, but that’s the idea). Stencil Masking is like Explicit Masking, except that you have thrown away the primary image and are masking whatever is already in the background.
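The colour key test described above can be sketched in a few lines. This is a plain-Java illustration of the rule, not DLE or APDFL code; the class and method names are hypothetical, and the ranges array follows the min/max pair layout that the PDF specification uses for a colour key Mask array.

```java
public class ColourKeyMask {
    /**
     * Returns true when every colour component of the pixel lies inside its
     * [min, max] range, meaning the pixel is masked out and not painted.
     * ranges holds one min/max pair per component: [min1 max1 ... minN maxN].
     */
    static boolean isMasked(int[] components, int[] ranges) {
        for (int i = 0; i < components.length; i++) {
            int min = ranges[2 * i];
            int max = ranges[2 * i + 1];
            if (components[i] < min || components[i] > max) {
                return false; // one component outside its range keeps the pixel visible
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] nearWhite = {240, 255, 240, 255, 240, 255}; // mask out near-white RGB pixels
        System.out.println(isMasked(new int[] {250, 252, 255}, nearWhite));
        System.out.println(isMasked(new int[] {250, 100, 255}, nearWhite));
    }
}
```

With the near-white ranges shown, the first pixel is masked and the second is painted, because its green component falls outside the 240 to 255 range.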

None of these variants allow for vector paths, so how do you mask with a vector path? Masking is the right concept, but the wrong term in this case because to mask an image with a vector path, you simply apply a clipping path to the image.

Or at least, that was the theory, and I needed some code to prove it.  I decided to clip this image: VectorPath_Before  with a 17-point star created with the following code:

 

The result was: VectorPath_After1

However, I was then challenged by my neighbor at the table to turn the clip inside out, or effectively Outside In(!).

Appending the star to the existing clip path, which contains a rectangle around the entire image, required a slight alteration to how I created the star:

And this time the result was: VectorPath_After2

Mission accomplished: An image masked with a vector path.
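The difference between the two clips can be checked with a small geometric sketch. This is not the DLE code from the article: it uses java.awt.geom from the Java standard library, the class and method names are hypothetical, and it swaps in the even-odd fill rule as one way to get the outside-in effect when a star is appended to a bounding rectangle.

```java
import java.awt.geom.Path2D;
import java.awt.geom.Rectangle2D;

public class StarClip {
    /** Builds a closed star outline by alternating between outer and inner radii. */
    static Path2D.Double star(double cx, double cy, double outer, double inner, int points) {
        Path2D.Double path = new Path2D.Double();
        int vertices = points * 2;
        for (int i = 0; i < vertices; i++) {
            double radius = (i % 2 == 0) ? outer : inner;
            double angle = Math.PI / 2 + i * Math.PI / points;
            double x = cx + radius * Math.cos(angle);
            double y = cy + radius * Math.sin(angle);
            if (i == 0) {
                path.moveTo(x, y);
            } else {
                path.lineTo(x, y);
            }
        }
        path.closePath();
        return path;
    }

    /** Rectangle plus star under the even-odd rule: the star becomes a hole in the rectangle. */
    static Path2D.Double outsideIn(Rectangle2D bounds, Path2D star) {
        Path2D.Double clip = new Path2D.Double(Path2D.WIND_EVEN_ODD);
        clip.append(bounds, false);
        clip.append(star, false);
        return clip;
    }

    public static void main(String[] args) {
        Path2D.Double star = star(50, 50, 40, 20, 17);
        Path2D.Double inverted = outsideIn(new Rectangle2D.Double(0, 0, 100, 100), star);
        System.out.println("centre inside star clip: " + star.contains(50, 50));
        System.out.println("centre inside inverted clip: " + inverted.contains(50, 50));
        System.out.println("corner inside inverted clip: " + inverted.contains(5, 5));
    }
}
```

With a nonzero winding rule the same effect depends on the star being drawn in the opposite direction to the rectangle, which may be the kind of alteration the star-creation code needed.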

The full code is here.

Evaluate the Adobe PDF Library for free!

PDF Page rendering: Draw (part of a page) to Memory!


I was recently helping a customer reconcile theory and practice with respect to our drawtomemory sample and their desire to render just a portion of a page instead of the entire page.

Shorn of the details, the theory is simple enough: You have an UpdateRect parameter for specifying which area of the page you want to render, a DestRect that represents the destination bitmap, and a matrix with which coordinates on the page can be translated to the destination bitmap via a matrix transformation. In theory, again, you can use your matrix to transform your UpdateRect to get your DestRect.

Actually, that is also true in practice, but figuring out your matrix gets a bit trickier when you are only interested in part of the page and…the page is rotated.

And then, you have to consider our drawtomemory sample app which wasn’t designed to render anything less than the full page and thus had some assumptions that had to be backed out.

Let’s start with MainProc() in mainproc.cpp. The biggest changes were to take optional command-line parameters for the document, page number, and update rectangle values as I personally prefer to test by changing command-line parameters rather than by making code modifications and recompiling.  But the most significant change was adding/requiring an explicit parameter for the UpdateRect:

To replicate the old behavior of rendering the entire page, you would need to explicitly pass in the page’s CropBox, which is still the default if no update rect values are specified on the command-line.

The DrawToMemory class declaration also changed a little bit:

Notably the SetPageRect method is now the SetDestRect method. The old SetPageRect method would take an ASFixedRect parameter and set pageRect to [0 0 rectWidth rectHeight]. The new SetDestRect effectively does the same thing in the first line of code, but then calculates another rectangle with the dimensions swapped and chooses between the two based on the page rotation to set the destRect. The pageRect class variable is now just going to be the page’s CropBox.

And GetImageRect() now returns destRect rather than pageRect.

The next big block of changes is in the DrawToMemory constructor:

Adobe PDF Library (APDFL) has this nifty little function called PDPageGetFlippedMatrix that returns a matrix for mapping page coordinates to a bitmap that takes into account page dimensions and page rotation. In the original drawtomemory sample, we could use the matrix it returns as-is because we were rendering the entire page.

When we are rendering just a part of the page, we need to shift our little update rectangle into the destination rectangle, and we do that with a translation matrix. But what direction we shift, and how much, is going to depend on how the page is rotated.
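One rotation-agnostic way to think about that shift (a sketch, and not necessarily how the sample does it) is to map the update rectangle through the page-to-bitmap matrix, see where its top-left corner lands, and translate by the negative of that. The types and function names below are hypothetical, and the matrix passed in is assumed to be whatever page-to-bitmap matrix you already have.

```cpp
#include <algorithm>

// Hypothetical stand-ins using doubles; x' = a*x + c*y + h, y' = b*x + d*y + v.
struct Matrix { double a, b, c, d, h, v; };
struct Box    { double x0, y0, x1, y1; };  // min corner, max corner

// Bounding box of the rectangle after the transform.
Box transformBox(const Matrix& m, const Box& r) {
    const double xs[2] = {r.x0, r.x1};
    const double ys[2] = {r.y0, r.y1};
    Box out{};
    bool first = true;
    for (double x : xs) {
        for (double y : ys) {
            const double tx = m.a * x + m.c * y + m.h;
            const double ty = m.b * x + m.d * y + m.v;
            if (first) { out = Box{tx, ty, tx, ty}; first = false; }
            out.x0 = std::min(out.x0, tx);
            out.y0 = std::min(out.y0, ty);
            out.x1 = std::max(out.x1, tx);
            out.y1 = std::max(out.y1, ty);
        }
    }
    return out;
}

// Append a translation so the mapped update rectangle starts at (0, 0)
// in the destination bitmap, whatever the page rotation happens to be.
Matrix shiftUpdateToOrigin(const Matrix& pageToBitmap, const Box& update) {
    const Box mapped = transformBox(pageToBitmap, update);
    Matrix shifted = pageToBitmap;
    shifted.h -= mapped.x0;  // a translation applied after the transform
    shifted.v -= mapped.y0;  // only changes the h and v terms
    return shifted;
}
```

Because the translation is derived from where the rectangle actually lands, the direction and size of the shift fall out of the matrix rather than needing a separate case per rotation.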

Otherwise, the calls to PDPageDrawContentsToMemory() now explicitly pass in the UpdateRect parameter instead of setting it to NULL to have the page CropBox used by default. Instead of initializing the (RGB) buffer to white (0xffffff), I initialized it to medium gray (0x7f7f7f), which comes in handy in debugging while things are a bit Lost in Translation.
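The gray-fill trick is easy to reproduce. This sketch assumes a tightly packed 8-bit RGB buffer with no row padding; the real buffer layout expected by the library may differ, so treat `makeGrayRgbBuffer` as a hypothetical helper rather than the sample's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Allocate an RGB buffer and pre-fill every byte with 0x7f so that any
// region the renderer never touches shows up as medium gray.
// Assumption: 3 bytes per pixel, rows tightly packed with no padding.
std::vector<std::uint8_t> makeGrayRgbBuffer(std::size_t width, std::size_t height) {
    const std::size_t rowBytes = width * 3;
    return std::vector<std::uint8_t>(rowBytes * height, 0x7f);
}
```

Since the renderer draws on a white background, gray areas in the output stand out immediately as parts of the bitmap that were never drawn into.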

The full code is here.

Ready to try out APDFL? Start your free evaluation today!

Adobe PDF Library v15: Out and Ready!

I’m excited to announce the much-anticipated new version of the Datalogics release of the Adobe PDF Library. Version 15 includes a lot of new features and updates that I mentioned in my previous blog post, Adobe PDF Library Beta Program Comes to a Close. I won’t bore you by repeating them: you’re familiar with those updates, and re-reading them is time you could be spending with APDFL 15 for your Windows, Linux, and Mac applications.

What about the other platforms, you ask? They are still in the hands of the UNIX porting elves, busy at their porting benches with their porting tools – look for these platforms later on in the year! We’ll be rolling out pre-release versions as these are available for those who want to play on the edge. If that describes you, drop us a line and we’ll make sure to sign you up.

Check out our Press Release to learn more about APDFL v15 and sign up for your free evaluation today!

We know some of you didn’t have a chance to join our APDFL 15 beta program for the platforms we released. We want to give you another chance at joining our pre-release program. But this time, it will be for the PDF file optimization feature we’ll be adding to our APDFL distribution over the course of 2016. Are you interested in early access to our file size reduction APIs? Let us know if this is you!

Recreating Acrobat’s Document Properties Advanced Tab

Advanced is the fourth Document Properties tab that we’ve recreated with APDFL. Previously we’ve covered recreating the tabs for Fonts, Description, and Initial View. To some extent, the Advanced tab is a continuation of the Initial View tab, but with more arcane options. And because these options are less common, there are fewer helper functions for extracting this information.

The Base URL is a way of specifying the document’s home address, and allows for relative URI actions in the document to be fully qualified with the base URI. It’s a PDF 1.1 feature, but like most of the document properties on this tab, it is fairly rarely used. Nonetheless, the code to extract it is:

The Search Index is an undocumented Adobe feature for an external index file.  It is largely obsolete now that computers are fast enough to search PDFs without requiring prebuilt index files.  The following code was arrived at with a bit of reverse engineering rather than official documentation.

Trapped is an indicator that the line art has been prepared for production printing. It’s not an obsolete feature, but it’s only of concern for professional printing. This is the only Advanced tab feature for which there is a dedicated function.

Most of the rest of the Print Dialog portion of the Advanced tab consists of properties that are also stored in the ViewerPreferences dictionary, and while they are slightly more complicated than the boolean checkboxes extracted for the Initial View tab, they aren’t that complicated to figure out. The one exception is the Print Page Range, which is also described in Table 8.1, section 8.1 of the PDF Reference, but the description of the PrintPageRange array could stand to be a bit clearer.
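As I read the PDF Reference, the PrintPageRange array is a flat list of integers taken in pairs, each pair giving the first and last page of a sub-range, and an array with an odd count or a negative number is ignored. Here is a sketch of turning such an array into a display string. `formatPrintPageRange` is a hypothetical helper that works on integers already pulled out of the dictionary, and it prints the numbers exactly as stored, leaving any page-number offset to the caller.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Format pairs like {1, 3, 5, 5, 7, 9} as "1-3, 5, 7-9".
// Returns an empty string when the array should be ignored
// (empty, odd number of entries, or any negative value).
std::string formatPrintPageRange(const std::vector<int>& range) {
    if (range.empty() || range.size() % 2 != 0) {
        return "";
    }
    for (int n : range) {
        if (n < 0) {
            return "";
        }
    }
    std::string out;
    for (std::size_t i = 0; i < range.size(); i += 2) {
        if (!out.empty()) {
            out += ", ";
        }
        out += std::to_string(range[i]);
        if (range[i + 1] != range[i]) {
            out += "-" + std::to_string(range[i + 1]);
        }
    }
    return out;
}
```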

The Binding setting is the last of the ViewerPreferences entries: Direction, specifically. The PDF Reference makes the Direction entry sound like it indicates a bi-directional text reading order, which, I guess, might have some bearing on where the binding would be for a document.

The last property on the Advanced tab is Language, and while I could have done more to imitate the drop-down that Acrobat provides, the list that Acrobat provides isn’t complete, and if the Lang entry is unrecognized, then it is shown as is. So for the purposes of this exercise, we are just going to show the Lang value as is.

The full code is available here.

Start your free trial of Adobe PDF Library today! 

Recreating Acrobat’s Document Properties Description tab with APDFL

Several months ago, I wrote up a sample app which recreated Acrobat’s Font list in the Document Properties dialog. That is now part of a series where we use APDFL to extract or recreate the information contained in Adobe Acrobat’s Document Properties tabs.

The Description tab seems relatively straightforward, but there are a couple of gotchas and in some cases more than one way to extract the information with APDFL.

I’m going to skip going over how to extract File(name), Location and File Size as you don’t really need APDFL to get that information. But otherwise, let’s proceed from the top down in order, starting with the first four at the top:

Nothing too difficult here. I use PDDocGetInfoASText mainly because these fields could contain Unicode text, and shoving it into an ASText variable makes it easier to handle, even if all I’m doing is converting it to UTF-8 for extraction purposes.

If you don’t need Unicode text extraction (for example, if you are extracting date properties), the following also works:

Note that I could have parsed the date string and formatted it per the current locale, but I’m lazy, and that’s outside the scope of APDFL per se.
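For the curious, parsing the date string is not much work. PDF dates follow the pattern D:YYYYMMDDHHmmSS plus an optional time zone suffix, with everything after the year optional. The sketch below uses a hypothetical `PdfDate` struct and `parsePdfDate` function, needs nothing from APDFL, and deliberately ignores the time zone part; locale formatting is left out as well.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical holder for the parsed fields; defaults follow the PDF
// convention that missing month/day are 1 and missing times are 0.
struct PdfDate {
    int year = 0, month = 1, day = 1, hour = 0, minute = 0, second = 0;
};

// Read exactly `count` digits at `pos`; on success store the value
// and advance `pos`. On failure nothing is modified.
static bool readDigits(const std::string& s, std::size_t& pos, int count, int& out) {
    if (pos + static_cast<std::size_t>(count) > s.size()) {
        return false;
    }
    int value = 0;
    for (int i = 0; i < count; ++i) {
        const unsigned char ch = static_cast<unsigned char>(s[pos + i]);
        if (!std::isdigit(ch)) {
            return false;
        }
        value = value * 10 + (ch - '0');
    }
    pos += static_cast<std::size_t>(count);
    out = value;
    return true;
}

// Parse "D:YYYYMMDDHHmmSS..." (the "D:" prefix is optional).
// Only the year is required; the time zone suffix is ignored.
bool parsePdfDate(const std::string& text, PdfDate& date) {
    std::size_t pos = (text.compare(0, 2, "D:") == 0) ? 2 : 0;
    date = PdfDate{};
    if (!readDigits(text, pos, 4, date.year)) {
        return false;
    }
    int* fields[] = {&date.month, &date.day, &date.hour, &date.minute, &date.second};
    for (int* field : fields) {
        if (!readDigits(text, pos, 2, *field)) {
            break;  // remaining fields keep their defaults
        }
    }
    return true;
}
```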

For Application and PDF Producer, I’m pulling these properties directly out of the XMP metadata embedded in the file using the PDDocGetXAPMetadataProperty call…if the metadata stream is actually in the file. The reason that you might want to use this call instead of PDDocGetInfoASText is if you want to extract other metadata that PDDocGetInfo doesn’t know about, such as the PDF/A or PDF/UA flags.

Next up is checking the PDF version, the corresponding version of Acrobat that can open that file, and the Adobe Extension Level. The PDF format has been stuck at version 1.7 for the past decade, waiting for Godot (that is, the ISO 32000 committee) to finalize PDF version 2.0, so Adobe snuck a few new features into PDF by declaring them to be Adobe extensions. The extension levels map to unofficial PDF versions. The code below matches Acrobat’s secret decoder ring:
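A rough sketch of that decoder ring, working on plain integers rather than APDFL calls: PDF 1.0 through 1.7 line up with Acrobat 1 through 8, and the publicly documented extension levels 3, 5, and 8 on top of 1.7 correspond to Acrobat 9, 9.1, and X. The function name and the exact wording of the output string are assumptions for illustration, not a copy of what Acrobat displays.

```cpp
#include <string>

// Describe a PDF version plus optional Adobe extension level.
// extensionLevel <= 0 means "no extension level present".
std::string describePdfVersion(int major, int minor, int extensionLevel) {
    std::string text = std::to_string(major) + "." + std::to_string(minor);
    std::string acrobat;
    if (major == 1 && minor >= 0 && minor <= 7) {
        // PDF 1.n was introduced with Acrobat n + 1.
        acrobat = std::to_string(minor + 1) + ".x";
    }
    if (major == 1 && minor == 7 && extensionLevel > 0) {
        text += ", Adobe Extension Level " + std::to_string(extensionLevel);
        switch (extensionLevel) {
            case 3: acrobat = "9.x";  break;
            case 5: acrobat = "9.1";  break;
            case 8: acrobat = "10.x"; break;
            default: break;  // unknown level: keep the base 1.7 mapping
        }
    }
    if (!acrobat.empty()) {
        text += " (Acrobat " + acrobat + ")";
    }
    return text;
}
```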

A little-known feature of the Description tab is that it will provide page size information about the current page. While you could calculate this from the page CropBox, there are a couple of other factors that could come into play. In the code below, since we don’t have a current page, we’ll just grab the information from the first page.
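As a sketch of the arithmetic involved, the function below converts CropBox dimensions in points to inches. It assumes the "other factors" are the page rotation (which swaps width and height) and the page's UserUnit scale (PDF 1.6 and later); that is my guess at the likely candidates, and `pageSizeInches` is a hypothetical helper that takes plain numbers rather than querying the page through APDFL.

```cpp
#include <utility>

struct PageSize { double widthIn, heightIn; };

// Convert CropBox dimensions (in default user space units) to inches.
// userUnit scales the default 1/72 inch unit; rotation is 0/90/180/270.
PageSize pageSizeInches(double cropWidth, double cropHeight,
                        int rotationDegrees, double userUnit = 1.0) {
    double width  = cropWidth  * userUnit / 72.0;
    double height = cropHeight * userUnit / 72.0;
    if (rotationDegrees % 180 != 0) {
        std::swap(width, height);  // a page on its side reports swapped dimensions
    }
    return PageSize{width, height};
}
```

A 612 x 792 CropBox with no rotation comes out as 8.5 x 11 inches, and the same page rotated 90 degrees as 11 x 8.5.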

Grabbing the number of pages is one call:

Determining if the document is a tagged PDF, however, is a bit more complicated. I didn’t find a good call or flag for determining if the document is tagged or not, so I had to drop to the Cos level to find the information. It’s also slightly more complicated than the PDF Reference makes it out to be, as the document needs to have both a StructTreeRoot and a MarkInfo:Marked entry set to true in order for Acrobat to consider it to be a tagged PDF:

Lastly, Fast Web View means that the document is linearized so that the first page that gets opened when viewing the document (which isn’t necessarily page 1) is at the very beginning of the file, along with all the resources it needs to display, so that the page could be displayed while the rest of the document was slowly being downloaded over a 56k modem.

And that’s that. Full code is available here.

Interested in trying Adobe PDF Library? Sign up for your free eval today!


