Sep 00 Online

Volume Number: 16

Issue Number: 9

Column Tag: MacTech Online

PDF and XML

by Jeff Clites <online@mactech.com>

Last month we covered Adobe's Portable Document Format (PDF), focusing on how it

relates to Quartz, Apple's new imaging model. In brief, PDF originated as a

simplification of PostScript, retaining PostScript's primitive graphics operators

while discarding its programming-language constructs and adding file and document

structure specifications. Quartz (specifically, the Core Graphics Rendering API) is

again based on this same set of operators, making it natural to "record" graphics

operations into a PDF file, and just as natural to "play back" a PDF into a series of

native drawing instructions. At its simplest, PDF is the new PICT; more interestingly,

the Quartz imaging model is at the center of all 2-D graphics on Mac OS X, providing a

centralized facility for rendering drawing commands from different APIs (such as

QuickDraw) into different output formats, be they destined for the screen, a printer,

or a file.
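
To make the "record" and "play back" idea concrete, here is a minimal sketch in Python.
It is not Quartz code: the Recorder class and its method names are invented for this
illustration. The only things borrowed from PDF are the path operators m (move to),
l (line to), and S (stroke), which are written after their operands.

```python
# Hypothetical sketch: "record" drawing calls as PDF content-stream operators.
# Recorder and its method names are invented for illustration; they are not
# part of Quartz or of any Apple API.

class Recorder:
    def __init__(self):
        self.ops = []

    def move_to(self, x, y):
        self.ops.append(f"{x} {y} m")   # m: begin a new subpath at (x, y)

    def line_to(self, x, y):
        self.ops.append(f"{x} {y} l")   # l: append a straight line segment

    def stroke(self):
        self.ops.append("S")            # S: stroke the current path

    def content_stream(self):
        return "\n".join(self.ops)


def play_back(stream, target):
    """Replay recorded operators against any object offering the same methods."""
    handlers = {"m": target.move_to, "l": target.line_to, "S": target.stroke}
    for line in stream.splitlines():
        *operands, operator = line.split()
        handlers[operator](*(int(v) for v in operands))


rec = Recorder()
rec.move_to(10, 10)
rec.line_to(100, 10)
rec.stroke()
stream = rec.content_stream()

# Playing the stream back into a second recorder reproduces it exactly.
copy = Recorder()
play_back(stream, copy)
```

Because the recorded form and the drawing calls share one vocabulary, the same stream
could in principle be replayed into a screen, a printer, or a file target.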

Of course, part of the beauty of Quartz is that it frees the programmer from having to

worry about the details of this process. At the same time, Quartz is certain to increase

the popularity of PDF, and in particular expand its use beyond just a format for

traditional documents. Accordingly, it will be to a programmer's advantage to know as

much as possible about PDF, and to be aware of its strengths and its weaknesses.

As touched on above, PDF defines a file format in addition to a graphics model. In the

abstract, a PDF file describes a tree of objects, with a significant separation between

document content and document layout. This should set off bells in a developer's head,

because it sounds similar to XML, and it's natural to wonder how deep this connection

is, and to ask questions like, "Can a PDF document be represented in XML?" The short

answer is "probably not", but it's interesting to investigate the parallels between the

two formats.

Intersections with XML

PDF and XML are similar in that they define a file structure which is designed to

encapsulate a wide range of data in a fairly generalized, hierarchical fashion. Although

PDF is designed to be extensible, it does define an interpretation for the information it

contains, and it's not clear how well current PDF-rendering applications would handle

PDF documents with content which they don't recognize. XML is at the other extreme.

At its core, it says nothing about the semantics of the data which it can contain, and it's

often used as a format for information which isn't naturally thought of as a

"document." But given its generality, it would certainly be possible to devise an

XML-based format to encapsulate page-descriptions in a manner similar to PDF. On

the other hand, there are several facilities of PDF which are not easily mimicked using

XML: features dealing more with practical performance issues than with conceptual

structure.

PDF was designed to be a final format, so that PDFs represent finished documents,

rather than in-progress works (such as word-processing documents) which will be

extensively changed. Still, it is possible to make limited modifications to PDFs, and

interestingly this can be done by appending the "change" information to the end of a

PDF, without requiring the entire document to be rewritten. This makes it convenient

to prepare an initial document and at a later stage add annotations or hyperlinks. This

approach also provides a measure of safety, as previous versions of a document can be

recovered simply by truncating the changes off the end, and modifications cannot cause

complete corruption of the base document. This also means that it is possible to modify

large documents without large resource requirements.
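
The append-and-truncate behavior described above can be modeled in a few lines of Python.
This is a toy sketch, not a PDF writer: the byte strings are placeholders rather than valid
PDF, and the function names are made up for the example. The one detail taken from PDF is
that the file, and each appended update section, ends with a "%%EOF" marker. A real tool
would also have to cope with that byte sequence turning up inside embedded data.

```python
# Toy model of append-only updates. The byte strings are placeholders, not
# valid PDF; only the "%%EOF" end-of-file marker is taken from the format.

EOF_MARKER = b"%%EOF\n"


def append_update(document, change):
    """Return the document with a change section appended; earlier bytes are untouched."""
    return document + change + EOF_MARKER


def revision_ends(document):
    """Byte offsets at which each revision ends (just past each marker)."""
    ends = []
    pos = document.find(EOF_MARKER)
    while pos != -1:
        ends.append(pos + len(EOF_MARKER))
        pos = document.find(EOF_MARKER, pos + 1)
    return ends


def roll_back(document, revision):
    """Recover an earlier revision by truncating everything after it."""
    return document[: revision_ends(document)[revision]]


base = b"(base document)\n" + EOF_MARKER
annotated = append_update(base, b"(added annotation)\n")
recovered = roll_back(annotated, 0)
```

Note that append_update never rewrites the existing bytes, which is why the cost of a
change depends on the size of the change rather than the size of the document.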

Despite XML's flexibility, it isn't possible to create a well-formed XML document by

appending information directly to another document, because of the requirement that

there be a single root element. (It is possible to work around this limitation, but only

by splitting the document into multiple files.) Additionally, PDF documents frequently

encapsulate binary data (such as images or compressed text), and it is not convenient

to embed such data into XML documents directly: XML is a text-based format, and

binary data could be interpreted as markup, or mangled if the document is converted to

a different character encoding. XML-based formats traditionally handle this by storing

the data in a separate file which is then referenced from the base document, just as

images are included in HTML files. This is less convenient than PDF's single-file

approach. (It would be possible to include binary data in XML documents by converting

it into a text-based representation, such as Base-64 encoding, but this tends to offset

the benefits of compression.) Finally, PDF has a higher structural flexibility, in that

logical containment is not always represented by physical containment. In other

words, structures which logically contain other objects may do so by referencing the

objects by name, whereas in XML such containment is almost always represented by

physically nesting elements. This flexibility allows the same PDF to be represented in

different ways, so that for example a PDF file may be optimized for page-at-a-time

delivery over the internet, or alternatively it could be created in a single pass by a

printer driver.
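
Two of the points above are easy to check with Python's standard library. The sketch below
uses made-up sample data: it first shows a parser rejecting a document that has had a second
top-level element appended to it, and then measures how much Base-64 encoding inflates a
block of compressed bytes.

```python
import base64
import xml.etree.ElementTree as ET
import zlib

original = "<doc><page>one</page></doc>"
appended = original + "<note>added later</note>"   # naive PDF-style append

ET.fromstring(original)            # parses: exactly one root element
try:
    ET.fromstring(appended)
    append_is_well_formed = True
except ET.ParseError:
    append_is_well_formed = False  # the second top-level element is rejected

# Base-64 maps every 3 bytes to 4 text characters, so encoded data is
# about a third larger than the binary it represents.
binary = zlib.compress(b"some page text " * 200)   # stand-in for compressed content
encoded = base64.b64encode(binary)
overhead = len(encoded) / len(binary)
```

The overhead value comes out at roughly 1.33 or slightly more, which is the sense in which
a text encoding eats into the savings from compression.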

FOP

So despite the current popularity of XML, it isn't likely that PDF is going to be

superseded any time soon. So where do PDF and XML intersect? Well, as we observed

before, it's natural to think of XML as unformatted data, and to think of PDF as an

output format. The preferred way to get from XML to something with formatting is by

way of XSL Transformations (XSLT). In the case of XML-to-PDF transformation,

there's a tool to help with the process, FOP. (It's part of the Apache XML project.) To

use FOP, you first use an XSLT processor to convert your XML document into a tree of

formatting objects, which may itself be represented as an XML document. This is

where you determine the form of your final document. Since, as mentioned above, XML

documents are traditionally devoid of formatting information and are often viewed as

pure data, any decisions about how this information will be presented must be

encapsulated in the style sheet. Once this is done, and you have your tree of formatting

objects, you feed this into FOP, which produces your final PDF. FOP is very much a

work in progress, and does not yet support all of the formatting objects defined in the

XSL specification, but even as-is it appears quite useful. IBM has an informative

tutorial on transforming XML documents. (A free registration is required to access the

tutorial.) It discusses using FOP to create PDF documents, and in addition shows you

how to generate SVG (Scalable Vector Graphics), which is useful for creating things

like charts and graphs from XML-encapsulated data.
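
To give a feel for what a "tree of formatting objects" looks like, here is a small Python
sketch that builds one by hand. It is only a stand-in for the style-sheet stage: a real
workflow would use an XSLT processor, and the function name and the <article>/<para> input
vocabulary are invented for the example. The element and attribute names are meant to follow
the XSL 1.0 Recommendation; FOP releases that track earlier drafts may expect different
names, so treat the details as assumptions.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"   # XSL formatting-object namespace
ET.register_namespace("fo", FO)


def fo(tag):
    """Qualify a formatting-object element name with the FO namespace."""
    return f"{{{FO}}}{tag}"


def to_formatting_objects(source_xml):
    """Hand-written stand-in for a style sheet: one fo:block per <para>."""
    src = ET.fromstring(source_xml)
    root = ET.Element(fo("root"))

    # Page geometry: a single page master with a body region.
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {"master-name": "page"})
    ET.SubElement(page, fo("region-body"))

    # Content: all presentation decisions (here, font size) are made at this stage.
    seq = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(seq, fo("flow"), {"flow-name": "xsl-region-body"})
    for para in src.iter("para"):
        block = ET.SubElement(flow, fo("block"), {"font-size": "12pt"})
        block.text = para.text
    return ET.tostring(root, encoding="unicode")


fo_document = to_formatting_objects(
    "<article><para>Hello</para><para>World</para></article>"
)
```

The resulting fo_document string is itself an XML document, and it is this kind of file
that would be handed to a formatter such as FOP to produce the PDF.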

• FOP

<http://xml.apache.org/fop/>

• Tutorial: Transforming XML documents

<http://www.ibm.com/software/developer/education/transforming-xml/>

OmniPDF

Finally, while you're playing with PDF, be sure to check out OmniPDF if you are

running Mac OS X. It's a very cool PDF viewer. It's still under development, but it's

Cocoa-native (and hence Mac-OS-X-native), and it really shows off the power of

Quartz, as it uses Core Graphics Rendering to do its magic. (OmniPDF is from the Omni

Group, who also created OmniWeb, which is currently the only Cocoa-native web

browser available. You should check it out too; it's a refreshing alternative, and it has

many fun features which set it apart from your usual browser choices.)

• OmniPDF

<http://www.omnigroup.com/products/omnipdf/>

• OmniWeb

<http://www.omnigroup.com/products/omniweb/>
