MacTech, Vol. 15 (1999)

PDFLib

Volume Number: 15

Issue Number: 12

Column Tag: Programming Techniques


by Kas Thomas

A great freeware library makes adding PDF support to an app easy

Adobe's Portable Document Format (PDF) has become a de facto standard for electronic
document interchange, based on its ability to deliver graphically rich, structured
content in a consistent manner across multiple operating environments. Almost every
large web site offers at least some PDF-based content, making the Acrobat Reader one
of the most popular downloads on the web. (Incredibly, Adobe claims to average some
100,000 downloads of the Reader from its web site per day.)

Because of its support for vector graphics, font embedding, hypertext links, and other
advanced features, PDF is a powerful, far-reaching document standard. But that also
means it's a relatively complex standard (for details, see the September 1999
MacTech), and therefore far from trivial to support in an application.

From a programming standpoint, one can talk about two types of PDF support: support
for PDF reading (import), and support for PDF writing (export). As with TIFF,
QuickTime, and many other complex formats, it's much easier to provide write
support than read support. A comprehensive PDF-read capability means implementing
the entire, rather ponderous PDF specification (see
http://partners.adobe.com/asn/developer/PDFS/TN/PDFSPEC.PDF), whereas a
write-only facility may mean implementing only a tiny subset of the PDF spec: the
subset of particular interest to your application. For example, if your application
primarily outputs ASCII text, there is no need to implement graphics embedding,
halftoning, transfer functions, etc., in order to support PDF output.
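
To give a feel for how small a write-only subset can be, here is a minimal sketch in Python (chosen only for brevity; it is not the language of the plug-in discussed in this article and it does not use PDFLib). It assumes plain ASCII input, a single U.S. Letter page, and the built-in Helvetica font. The function name build_minimal_pdf and the object numbering are illustrative choices, not anything taken from a library.

```python
def build_minimal_pdf(text: str) -> bytes:
    """Return a one-page PDF that draws `text` in 12-point Helvetica.

    Illustrative sketch only: ASCII input, one page, one line of text.
    """
    # Backslashes and parentheses are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Page content: begin text, pick font, move to (72, 720), show string, end.
    content = b"BT /F1 12 Tf 72 720 Td (" + escaped.encode("ascii") + b") Tj ET"

    # Objects 1..5: catalog, page tree, page, content stream, font.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.3\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # remember where each object starts
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the returned bytes to a file, for example with open("hello.pdf", "wb"), should yield a document a PDF viewer can open. Everything the article lists as unnecessary for text output (embedded graphics, halftoning, transfer functions) is simply absent.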

Adding a well-defined PDF-output capability to an application can be surprisingly
quick and easy, if you make full use of existing tools. For this article, I decided to add
PDF export capability to BBEdit (the popular text editor), with the aid of a
third-party freeware PDF library called PDFLib. Source code for the BBEdit plug-in
accompanies this article. (The complete CW Pro 5 project, including PDFLib and its
source files, can be found online at ftp://www.mactech.com.) But before we start
talking code, let's take a moment to review the basics of the PDF format, then look at
what kinds of development paths one might take to arrive at a PDF-export capability,
and what sorts of tools are currently available to make the programmer's life easier.

PDF Fundamentals

Adobe's Portable Document Format is a kind of gigantic, special-purpose markup
language, based largely on PostScript (the postfix-notation page description language)
but lacking PostScript's control-flow constructs. PDF is a sort of "unrolled" version
of PostScript, in which all graphics operations are inline (rather than relying on
loops) and therefore speedy. Lookups and indexing operations are likewise fast because
of PDF's extensive use of associative arrays (or "dictionaries," in Adobe parlance),
organized into treelike structures in which all nodes have forward and/or
back-pointers to other nodes; plus, every indirect object has an entry in a giant
'xref' (cross-reference) table, so that the byte offset of any object can be looked
up instantly.
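
The reason an xref lookup is "instant" is that each table entry has a fixed width of 20 bytes, so finding the entry for object N is plain arithmetic. The Python sketch below illustrates the idea under simplifying assumptions: a classic (uncompressed) xref table with a single subsection and newline-terminated header lines. The names make_sample and object_offset are hypothetical helpers written for this illustration.

```python
def make_sample() -> bytes:
    """Build a tiny two-object file with a matching xref table."""
    body = b"%PDF-1.3\n"
    first = len(body)
    body += b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    second = len(body)
    body += b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
    table = len(body)
    body += b"xref\n0 3\n0000000000 65535 f \n"
    body += b"%010d 00000 n \n%010d 00000 n \n" % (first, second)
    body += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % table
    return body


def object_offset(pdf: bytes, number: int) -> int:
    """Return the byte offset of object `number`, read from the xref table."""
    # The last "startxref" keyword is followed by the table's byte offset.
    marker = pdf.rindex(b"startxref")
    table = int(pdf[marker + len(b"startxref"):].split()[0])

    # Table header: the keyword "xref", then "<first object> <entry count>".
    lines = pdf[table:].split(b"\n", 2)
    if lines[0].strip() != b"xref":
        raise ValueError("no xref table at the startxref offset")
    first, count = (int(field) for field in lines[1].split())
    if not first <= number < first + count:
        raise KeyError(number)

    # Entries are exactly 20 bytes each, so index straight to the one we want.
    entries = table + len(lines[0]) + 1 + len(lines[1]) + 1
    start = entries + 20 * (number - first)
    offset, _generation, kind = pdf[start:start + 20].split()
    if kind != b"n":  # "f" marks a free (unused) entry
        raise KeyError(number)
    return int(offset)
```

With the sample file, object_offset(make_sample(), 2) returns the position at which the text "2 0 obj" begins, without scanning any of the objects that precede it.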

Pages are organized into sets of objects that describe a page's resources and content.
The objects are human-readable ASCII and look like this:

4 0 obj
<<
/Type /Page
/Parent 1 0 R
/Resources 8 0 R
/MediaBox [0 0 612 792]
/Contents [5 0 R ]
>>
endobj

In this case, the top line tells us we're dealing with Object No. 4, generation zero. The
object is a dictionary object, as indicated by the double angle brackets, << and >>,
enclosing the object. The first entry in the dictionary is a label telling the type of
dictionary (in this case, a Page). The next label/value pair is a backpointer to the
parent of this object, namely Object No. 1. (A reference ending in 'R', such as 1 0 R, is
a pointer to an object.) The next entry tells where the page's resources can be found
(namely, in Object No. 8). The MediaBox entry gives the page's dimensions, in points
(72 points to the inch); here, 612 by 792 means that we're dealing with a standard
U.S. Letter-size page (8.5 by 11 inches). The final entry, in the above example,
shows where the page's Contents (probably a stream object) can be found, namely in
Object No. 5.
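
The walkthrough above can be checked mechanically. The following Python sketch pulls the object references and the page size out of the example page object, written out here with its enclosing << >> and its /Type entry as the text describes. It relies on regular expressions rather than a real PDF parser, so it is only adequate for a flat dictionary like this one; indirect_refs and page_size_inches are hypothetical helper names.

```python
import re

PAGE_OBJECT = b"""4 0 obj
<<
/Type /Page
/Parent 1 0 R
/Resources 8 0 R
/MediaBox [0 0 612 792]
/Contents [5 0 R ]
>>
endobj
"""


def indirect_refs(obj: bytes) -> dict:
    """Map each dictionary key to the object numbers it points at ("N G R")."""
    refs = {}
    pattern = rb"/(\w+)\s+(\[[^\]]*\]|\d+\s+\d+\s+R)"
    for key, value in re.findall(pattern, obj):
        numbers = [int(n) for n, _gen in re.findall(rb"(\d+)\s+(\d+)\s+R", value)]
        if numbers:  # skip plain arrays such as /MediaBox
            refs[key.decode("ascii")] = numbers
    return refs


def page_size_inches(obj: bytes) -> tuple:
    """Convert /MediaBox from points to inches (72 points to the inch)."""
    match = re.search(rb"/MediaBox\s*\[\s*([^\]]+)\]", obj)
    if match is None:
        raise ValueError("no /MediaBox entry found")
    llx, lly, urx, ury = (float(v) for v in match.group(1).split())
    return ((urx - llx) / 72.0, (ury - lly) / 72.0)
```

For the example object, indirect_refs reports Parent pointing at object 1, Resources at object 8, and Contents at object 5, while page_size_inches gives 8.5 by 11.0, matching the U.S. Letter dimensions quoted in the text.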