Sep 00 Online
Volume Number: 16
Issue Number: 9
Column Tag: MacTech Online
PDF and XML
by Jeff Clites <online@mactech.com>
Last month we covered Adobe's Portable Document Format (PDF), focusing on how it
relates to Quartz, Apple's new imaging model. In brief, PDF originated as a
simplification of PostScript, retaining PostScript's primitive graphics operators
while discarding its programming-language constructs and adding file and document
structure specifications. Quartz (specifically, the Core Graphics Rendering API) is
again based on this same set of operators, making it natural to "record" graphics
operations into a PDF file, and just as natural to "play back" a PDF into a series of
native drawing instructions. At its simplest, PDF is the new PICT; more interestingly,
the Quartz imaging model is at the center of all 2-D graphics on Mac OS X, providing a
centralized facility for rendering drawing commands from different APIs (such as
QuickDraw) into different output formats, be they destined for the screen, a printer,
or a file.
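To make the "record" and "play back" idea concrete, here is a toy sketch in Python. It is not the Core Graphics API: the class, method, and function names are hypothetical, and only the three operators it emits (m, l, and S, PDF's move-to, line-to, and stroke) come from PDF itself.

```python
class PathRecorder:
    """Toy recorder (hypothetical) that logs drawing calls as PDF path operators."""

    def __init__(self):
        self.ops = []

    def move_to(self, x, y):
        self.ops.append(f"{x:g} {y:g} m")   # m: begin a new subpath

    def line_to(self, x, y):
        self.ops.append(f"{x:g} {y:g} l")   # l: append a straight line segment

    def stroke(self):
        self.ops.append("S")                # S: stroke the current path

    def content_stream(self):
        return "\n".join(self.ops)


def play_back(stream, target):
    """Replay a recorded stream by calling the matching method on target."""
    dispatch = {"m": target.move_to, "l": target.line_to, "S": target.stroke}
    for line in stream.splitlines():
        *args, op = line.split()            # operands come first, operator last
        dispatch[op](*(float(a) for a in args))


recorder = PathRecorder()
recorder.move_to(10, 20)
recorder.line_to(100, 20)
recorder.stroke()
recorded = recorder.content_stream()

# Playing the recording back into a second recorder reproduces the same stream.
replica = PathRecorder()
play_back(recorded, replica)
```

Because the recorded form and the drawing calls share one vocabulary, going in either direction is a simple one-to-one mapping, which is the point of basing both on the same operators.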
Of course, part of the beauty of Quartz is that it frees the programmer from having to
worry about the details of this process. At the same time, Quartz is certain to increase
the popularity of PDF, and in particular expand its use beyond just a format for
traditional documents. Accordingly, it will be to a programmer's advantage to know as
much as possible about PDF, and to be aware of its strengths and its weaknesses.
As touched on above, PDF defines a file format in addition to a graphics model. In the
abstract, a PDF file describes a tree of objects, with a significant separation between
document content and document layout. This should set off bells in a developer's head,
because it sounds similar to XML, and it's natural to wonder how deep this connection
is, and to ask questions like, "Can a PDF document be represented in XML?" The short
answer is "probably not", but it's interesting to investigate the parallels between the
two formats.
Intersections with XML
PDF and XML are similar in that they define a file structure which is designed to
encapsulate a wide range of data in a fairly generalized, hierarchical fashion. Although
PDF is designed to be extensible, it does define an interpretation for the information it
contains, and it's not clear how well current PDF-rendering applications would handle
PDF documents with content which they don't recognize. XML is at the other extreme.
At its core, it says nothing about the semantics of the data which it can contain, and it's
often used as a format for information which isn't naturally thought of as a
"document." But given its generality, it would certainly be possible to devise an
XML-based format to encapsulate page-descriptions in a manner similar to PDF. On
the other hand, there are several facilities of PDF which are not easily mimicked using
XML: features dealing more with practical performance issues than with conceptual
structure.
PDF was designed to be a final format, so that PDFs represent finished documents,
rather than in-progress works (such as word-processing documents) which will be
extensively changed. Still, it is possible to make limited modifications to PDFs, and
interestingly this can be done by appending the "change" information to the end of a
PDF, without requiring the entire document to be rewritten. This makes it convenient
to prepare an initial document and at a later stage add annotations or hyperlinks. This
approach also provides a measure of safety, as previous versions of a document can be
recovered simply by truncating the changes off the end, and modifications cannot cause
complete corruption of the base document. This also means that it is possible to modify
large documents without large resource requirements.
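The append-and-truncate idea can be sketched in a few lines of Python. This is a simplified illustration on made-up bytes, not a PDF writer: a real incremental update also appends a new cross-reference section and trailer, and the string %%EOF could in principle occur inside binary stream data, so a naive scan like this one is for demonstration only. The function names are hypothetical.

```python
EOF_MARKER = b"%%EOF"


def revision_ends(data: bytes) -> list:
    """Return the offset just past each %%EOF marker, oldest revision first."""
    ends = []
    pos = data.find(EOF_MARKER)
    while pos != -1:
        ends.append(pos + len(EOF_MARKER))
        pos = data.find(EOF_MARKER, pos + 1)
    return ends


def append_update(data: bytes, update: bytes) -> bytes:
    """Append an update section; the earlier bytes are left untouched."""
    return data + b"\n" + update + b"\n" + EOF_MARKER


def roll_back(data: bytes, revision: int) -> bytes:
    """Recover revision N (0 = the original) by truncating what follows it."""
    return data[: revision_ends(data)[revision]]


# Placeholder content standing in for a real file body and a later annotation.
base = b"%PDF-1.3\n(base objects)\n" + EOF_MARKER
annotated = append_update(base, b"(annotation objects)")
recovered = roll_back(annotated, 0)
```

Note that append_update never touches the existing bytes, so the cost of a change depends on the size of the change rather than on the size of the document.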
Despite XML's flexibility, it isn't possible to create a well-formed XML document by
appending information directly to another document, because of the requirement that
there be a single root element. (It is possible to work around this limitation, but only
by splitting the document into multiple files.) Additionally, PDF documents frequently
encapsulate binary data (such as images or compressed text), and it is not convenient
to embed such data into XML documents directly: XML is a text-based format, and
binary data could be interpreted as markup, or mangled if the document is converted to
a different character encoding. XML-based formats traditionally handle this by storing
the data in a separate file which is then referenced from the base document, just as
images are included in HTML files. This is less convenient than PDF's single-file
approach. (It would be possible to include binary data in XML documents by converting
it into a text-based representation, such as Base-64 encoding, but this tends to offset
the benefits of compression.) Finally, PDF has greater structural flexibility, in that
logical containment is not always represented by physical containment. In other
words, structures which logically contain other objects may do so by referencing the
objects by name, whereas in XML such containment is almost always represented by
physically nesting elements. This flexibility allows the same PDF to be represented in
different ways, so that for example a PDF file may be optimized for page-at-a-time
delivery over the internet, or alternatively it could be created in a single pass by a
printer driver.
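Two of the points above are easy to check with Python's standard library. The sketch below uses made-up sample data: first it shows that appending a second top-level element makes a document fail a well-formedness check, and then that Base-64 encoding expands compressed data by roughly a third.

```python
import base64
import xml.etree.ElementTree as ET
import zlib


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Appending a second top-level element violates the single-root requirement.
original = "<doc><page>one</page></doc>"
appended = original + "<annotation>added later</annotation>"

# Compressed bytes must be re-encoded as text before they can sit inside XML.
text = b"PDF and XML " * 500
packed = zlib.compress(text)
encoded = base64.b64encode(packed)

# Base-64 emits 4 output characters for every 3 input bytes (with padding).
growth = len(encoded) / len(packed)
```

The growth factor is close to 4/3 regardless of the input, so part of whatever compression achieved is given back as soon as the data is made safe for a text format.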
FOP
So despite the current popularity of XML, it isn't likely that PDF is going to be
superseded any time soon. Where, then, do PDF and XML intersect? Well, as we observed
before, it's natural to think of XML as unformatted data, and to think of PDF as an
output format. The preferred way to get from XML to something with formatting is by
way of XSL Transformations (XSLT). In the case of XML-to-PDF transformation,
there's a tool to help with the process, FOP. (It's part of the Apache XML project.) To
use FOP, you first use an XSLT processor to convert your XML document into a tree of
formatting objects, which may itself be represented as an XML document. This is
where you determine the form of your final document. Since, as mentioned above, XML
documents are traditionally devoid of formatting information and are often viewed as
pure data, any decisions about how this information will be presented must be
encapsulated in the style sheet. Once this is done, and you have your tree of formatting
objects, you feed this into FOP, which produces your final PDF. FOP is very much a
work in progress, and does not yet support all of the formatting objects defined in the
XSL specification, but even as-is it appears quite useful. IBM has an informative
tutorial on transforming XML documents. (A free registration is required to access the
tutorial.) It discusses using FOP to create PDF documents, and in addition shows you
how to generate SVG (Scalable Vector Graphics), which is useful for creating things
like charts and graphs from XML-encapsulated data.
• FOP
<http://xml.apache.org/fop/>
• Tutorial: Transforming XML documents
<http://www.ibm.com/software/developer/education/transforming-xml/>
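To give a feel for what a tree of formatting objects looks like, here is a minimal Python sketch. It stands in for the style-sheet step only: Python's standard library has no XSLT processor, so the mapping is written by hand. The input vocabulary (article, title, para) is invented for illustration, and the element and attribute names follow the XSL 1.0 Recommendation, which may differ from the draft that a given FOP release supports. Running FOP itself is not shown.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"   # XSL formatting-objects namespace
ET.register_namespace("fo", FO_NS)


def fo(tag):
    """Qualify a tag name with the formatting-objects namespace."""
    return f"{{{FO_NS}}}{tag}"


def to_formatting_objects(source_xml: str) -> str:
    """Map a hypothetical <article> document onto a small formatting-object tree."""
    src = ET.fromstring(source_xml)

    root = ET.Element(fo("root"))
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {"master-name": "page"})
    ET.SubElement(page, fo("region-body"))

    seq = ET.SubElement(root, fo("page-sequence"), {"master-reference": "page"})
    flow = ET.SubElement(seq, fo("flow"), {"flow-name": "xsl-region-body"})

    # Presentation decisions live here, not in the source document.
    title = ET.SubElement(flow, fo("block"), {"font-size": "18pt", "font-weight": "bold"})
    title.text = src.findtext("title")
    for para in src.findall("para"):
        block = ET.SubElement(flow, fo("block"), {"font-size": "11pt"})
        block.text = para.text

    return ET.tostring(root, encoding="unicode")


sample = "<article><title>PDF and XML</title><para>First.</para><para>Second.</para></article>"
fo_document = to_formatting_objects(sample)
```

The source document carries no font sizes or page geometry at all; every such decision appears only in the mapping, which is the role the style sheet plays in the real pipeline.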
OmniPDF
Finally, while you're playing with PDF, be sure to check out OmniPDF if you are
running Mac OS X. It's a very cool PDF viewer. It's still under development, but it's
Cocoa-native (and hence Mac-OS-X-native), and it really shows off the power of
Quartz, as it uses Core Graphics Rendering to do its magic. (OmniPDF is from the Omni
Group, who also created OmniWeb, which is currently the only Cocoa-native web
browser available. You should check it out also; it's a refreshing alternative, and it has
many fun features which set it apart from your usual browser choices.)
• OmniPDF
<http://www.omnigroup.com/products/omnipdf/>
• OmniWeb
<http://www.omnigroup.com/products/omniweb/>