Dynamic PDF
Volume Number: 16
Issue Number: 3
Column Tag: Programming Techniques
Dynamic PDF Made Easy
by Kas Thomas
It's simpler than you think to generate custom PDF
documents in real time using Perl and CGI
Adobe's Portable Document Format (PDF) has won universal acclaim as a document
interchange format, because of its ability to deliver graphically rich, structured
content consistently across multiple operating environments. Because of its strong
support for vector graphics, font embedding, hypertext links, and other advanced
features, PDF is a powerful, far-reaching document standard. But that also makes it a
relatively complex standard. (For details, see the September 1999 MacTech.) PDF's
complexity, in turn, has discouraged many web developers from attempting to write
PDF files dynamically, at the server. Automatically generated HTML pages are
common; but when was the last time you saw an auto-generated PDF web page?
It turns out that with a little foreknowledge of PDF's internal workings, and a good
grasp of CGI fundamentals, it's not that hard to put together server scripts that will
generate PDF on demand. There's no question that dynamic PDF can be a challenge to do
right. But if your dynamic document needs are modest (for example, if you'd simply
like to be able to generate a custom "Thank You" page at run time) and your Perl skills
are passable (i.e., you've graduated from writing "Hello World" to generating the much
more impressive string "Internal Server Error"), you can automate the serving of
custom PDF pages with surprisingly little effort. In fact, a Perl script comprising no
more than five or six dozen lines of code will get you started.
Our Strategy
Our strategy will be simple: We want to be able to collect an arbitrary glob of "user
text" from a web form, and convert that text to a PDF page served back to the user via
HTTP, suitable for viewing in a browser equipped with the free Acrobat Viewer
plug-in. For simplicity's sake, we'll start by omitting any attempts at vector
graphics, colored text, rotated text, styled text in various point sizes, etc. (But I
hasten to add that before this article is finished, we're going to be doing all of those
things and more.) For starters, we just want to serve the user's words back to him, at
any convenient point size, in black and white, on a letter-sized PDF page. For the sake of
convenience, we're also going to generate the "data collection" form and our PDF reply
from one and the same Perl script. That way, we don't have to keep track of multiple
documents: an HTML form, Perl scripts, CSS stylesheets, etc. If you want to add that
complexity to the picture yourself later, so be it. Here, we're going to keep things
stone-simple.
I might point out that "live" copies of the dynamic-PDF scripts developed below can be
found at http://www.acroforms.com/dynamicPDF.html (which is a portal page with
links to several working scripts).
I also want to stress that when we're done, the PDF pages we will have generated can be
saved to disk using your browser's Save As feature. (Reader won't save the docs, but
your browser will.) Thus, we will have achieved the Holy Grail of the PDF world:
writing PDF without using Distiller or Acrobat.
Once we've gotten our basic approach nailed down, we'll branch out into fancy features
like colored backgrounds, stroked/filled paths, rotated text, styled text, etc. But first,
a detour into the land of PDF internal structure.
PDF at the Subatomic Level
Internally, a PDF file consists of plain-ASCII object descriptions, following a notably
PostScript-like (i.e., postfix-notation) syntax. All PDF files have a header
(containing version information), a cross-reference (or 'xref') table containing
offsets to the various objects that make up the file, and a trailer containing references
to the root object as well as the byte offset from the start of the file to the beginning of
the 'xref' table. (See my story in the September 1999 MacTech for more
information.) The formal PDF specification, available online at
http://partners.adobe.com/asn/developer/PDFS/TN/PDFSPEC.PDF, describes the
relationships between PDF objects (and the meanings of various dictionary "key"
entries) in comprehensive detail, if you're interested. But we're going to skip all that,
because most of it is not necessary for what we're doing.
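Still, for readers who would like to see the header, 'xref' table, and trailer in one place, here is a minimal sketch of how a Perl script might assemble them and compute the byte offsets. Everything specific in it is my own assumption for illustration, not part of the scripts developed in this article: the function name build_pdf, the object numbering, the /F1 resource name, the choice of Helvetica, and the "%PDF-1.3" header. It also assumes plain ASCII text with no unescaped parentheses or backslashes.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a one-page PDF as a string, with a real
# 'xref' table and trailer. Assumes $text is plain ASCII with no
# unescaped parentheses or backslashes.
sub build_pdf {
    my ($text) = @_;

    # Page content: 14-point text, one inch in and ten inches up.
    my $stream = "BT /F1 14 Tf 1 0 0 1 72 720 Tm ($text) Tj ET";

    # Object bodies, in order; object N is element N-1 of this list.
    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
            . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($stream) . " >>\nstream\n$stream\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    );

    my $pdf = "%PDF-1.3\n";               # header with version information
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length $pdf;       # byte offset where this object starts
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    # Cross-reference table: one 20-byte entry per object, plus the
    # mandatory "free" entry for object 0.
    my $xref_at = length $pdf;
    my $count   = @objects + 1;
    $pdf .= "xref\n0 $count\n";
    $pdf .= "0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;

    # Trailer: names the root object and the offset of the 'xref' table.
    $pdf .= "trailer\n<< /Size $count /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_at\n%%EOF\n";
    return $pdf;
}
```

Calling print build_pdf('Hello World!') after binmode(STDOUT) should write the file. The offsets are only correct if nothing (such as newline translation) changes the byte count on the way out.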
It turns out you can bend (or even break) a lot of the "rules" spelled out in the PDF
spec, without making Acrobat Reader unhappy, if you know a thing or two about PDF
internals. An instructive exercise in this regard is to make Adobe's Distiller program
generate a small "Hello World" type of file, then examine the file in a text editor and
try to find superfluous data in it that can be removed. If you try this, you will quickly
discover that /Info objects, for example, contain meta-data about the PDF file (such as
the doc's author, subject, producer, keywords, etc.) that can easily be dispensed with.
Likewise, Adobe tends to insert long /ID objects (containing a kind of "digital
fingerprint" for identification of the file) in PDF files; these aren't strictly needed.
If you keep removing superfluous objects and data items from a PDF file, you may be
amazed at how much "data" you can actually get rid of without making Reader complain.
A careful re-reading of the PDF Specification usually reveals that items you thought
were mandatory are actually optional. Even the all-important 'xref' table isn't
strictly needed, since Acrobat Reader can (and will, and does) generate the necessary
byte offsets itself, at runtime, if the table is defective or missing.
In late 1999, I sponsored a contest at in which the goal was to
produce the smallest possible PDF file that wrote "Hello World" to the screen without
making Acrobat Reader generate any error dialogs. Amazingly, the winner of that
contest submitted a file that was only a little more than 200 bytes long:
%PDF-1.
1 0 obj<>
2 0 obj<
R/Resources<
<>
>>>>/Contents<<>>stream
BT/R 14
Tf(Hello World!)Tj
endstream
>>]>>
trailer<>
Note that the second object (starting with "2 0 obj...") is extra-long, with many
nested "dictionary" entries, ending only just before the word "trailer" at the bottom of
the file. This particular PDF file uses 14-point Arial as the typeface. No font
encodings are embedded in the file, however, because Arial is one of the base-14 fonts
that Acrobat Reader ships with. (Arial replaced Helvetica in the version-4.0 release
of Reader.) Every machine that has Acrobat Reader is guaranteed to have Arial.
There are many things technically "wrong" with this short PDF file, including the fact
that it has no 'xref' table and contains a text stream that is opened with 'BT' (for Begin
Text) but is never closed with 'ET' (End Text). In many ways, it's a wonder Reader can
even parse and display this file. Yet it does.
There is one small technical difficulty with this file that impacts viewing: Namely, no
positioning information is given for the text - which means (since Reader "zeroes out"
the current transformation matrix, or CTM, before displaying any block of text) that
the starting "pen position" for displaying "Hello World!" in this instance is the origin
of user space: i.e., the lower left corner of the page. You have to scroll down to the very
bottom of the page to see the single line of text. To fix this requires that we add a
transformation matrix of our own to the text stream:
%PDF-1.
1 0 obj<>
2 0 obj<
R/Resources<
<>
>>>>/Contents<<>>stream
BT/R 14
1 0 0 1 72 720 Tm
Tf(Hello World!)Tj
endstream
>>]>>
trailer<>
Can you spot the change? We've added a line that ends with 'Tm' (the
text-matrix operator). The six-number matrix will look familiar to you if
you've studied PostScript. The first four numbers are scaling and skewing factors
(signifying no change in those characteristics, in this case); the final two numbers
are translation values in 'x' and 'y', respectively. Since the default user space in PDF
has 72 units to the inch, values of 72 and 720 result in our text being drawn
one inch from the left edge of the page and ten
inches up from the bottom. (In PDF, 'y' units get bigger as you go up.) Now our
14-point text will be drawn where the user can see it.
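The inches-to-units arithmetic is easy to wrap in a small helper. The following is only a sketch: the names POINTS_PER_INCH, pdf_escape, and text_at are my own inventions for illustration. It also escapes parentheses and backslashes, since those characters are special inside a PDF string written between parentheses.

```perl
use strict;
use warnings;

use constant POINTS_PER_INCH => 72;   # default PDF user space: 72 units per inch

# Hypothetical helper: escape the characters that are special inside a
# parenthesized PDF string (backslash and both parentheses).
sub pdf_escape {
    my ($s) = @_;
    $s =~ s/([\\()])/\\$1/g;
    return $s;
}

# Hypothetical helper: return the 'Tm' and 'Tj' lines that place $text
# at ($x_in, $y_in) inches from the lower left corner of the page.
# The first four matrix numbers (1 0 0 1) mean "no scaling, no skewing."
sub text_at {
    my ($x_in, $y_in, $text) = @_;
    my $x = $x_in * POINTS_PER_INCH;
    my $y = $y_in * POINTS_PER_INCH;
    return sprintf("1 0 0 1 %g %g Tm\n(%s)Tj\n", $x, $y, pdf_escape($text));
}
```

With these definitions, text_at(1, 10, 'Hello World!') produces the same 'Tm' line used in the file above, followed by the text-showing line.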
In the Perl script that follows, we're going to use a variation of the foregoing PDF file
as the basis (the "template," if you will) for our dynamically generated PDF page.
While some experts may cringe at the thought of using a "broken" PDF file as the
starting point for this kind of exercise, the fact is that for illustrative purposes, a
"subatomic" template file of this kind is hard to beat. Besides, the proof of the pudding
is that Acrobat Reader doesn't in any way "choke" on the end result. What the user sees
in his (or her) browser is a fully functional PDF file: one that can be saved to disk
(using the browser's Save As feature) for later use.
The Common Gateway
Collecting information from users is a common task in the world of the Web, calling
for the use of interactive forms. Sometimes, the form in question is a static HTML file
stored on a server; but increasingly, HTML forms themselves are dynamically
generated at the server in response to user requests. That is, the form itself doesn't
exist until a CGI script, often written in Perl, generates it.
CGI (the Common Gateway Interface) is nothing more than a set of conventions for
pushing and pulling information across an HTTP connection. Forms, and the scripts
that retrieve information from web forms, use the CGI protocol to get the job done.
Learning to write CGI scripts in Perl is an art unto itself, but the job is made
tremendously easier if you take advantage of some of the excellent free Perl libraries
available for implementing CGI processes.
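To make those "conventions" a little more concrete, here is a bare-bones sketch that uses no library at all. Under CGI, the server hands the script its request details in environment variables (REQUEST_METHOD and QUERY_STRING are two of the standard ones), and the script answers with header lines, a blank line, and then the body. The function name cgi_reply is hypothetical, and taking the environment as a hash reference is simply my choice so the routine can be exercised outside a web server.

```perl
use strict;
use warnings;

# Hypothetical helper: build a complete CGI response from the request
# environment. In a live script you would pass a reference to %ENV.
sub cgi_reply {
    my ($env) = @_;

    my $method = $env->{REQUEST_METHOD};
    $method = 'GET' unless defined $method;

    my $query = $env->{QUERY_STRING};     # still urlencoded at this point
    $query = '' unless defined $query;

    my $body = "Request method: $method\nRaw query string: $query\n";

    # Header line, then a blank line, then the body.
    return "Content-Type: text/plain\r\n\r\n" . $body;
}
```

In a real script the last line would be print cgi_reply(\%ENV); the server relays whatever the script prints back to the browser.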
Arguably the best Perl library for this task is Lincoln Stein's powerful CGI.pm
package. This set of routines (which takes the pain out of writing HTML at the server
and dealing with web forms) is so useful that it now ships with Perl as part of the
"full install" of the language. What this means is that any Unix server that has Perl
5.003, patchlevel 7 or higher (as almost all do) automatically has CGI.pm, waiting to
be called from any Perl script.
When you use CGI.pm, you don't have to worry about whether incoming form data is
being supplied via GET or POST, where it's being stored on the server, or any of the
nasty parsing details involved in extracting urlencoded data from HTTP streams. To get
the value of a form field called "UserName" from a Perl script at runtime using
CGI.pm, all you have to do is:
$name = param('UserName');
Here, $name refers to the local scalar variable into which we wish to store the value
from "UserName." If the user has typed "John Smith" into the relevant form field
before submitting the form data to the server, then $name will contain "John Smith"
after this line of code executes. The param() function in CGI.pm takes care of finding
the incoming data, decoding it if it's urlencoded, etc.
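To appreciate what param() is sparing you, here is a rough sketch of the urlencoded-decoding step done by hand. It is an illustration only, not how CGI.pm is implemented: the name parse_urlencoded is hypothetical, and the sketch ignores things the real library handles, such as POST bodies, fields with multiple values, and file uploads.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a urlencoded string such as
# "UserName=John+Smith" into a hash of field names and values.
sub parse_urlencoded {
    my ($query) = @_;
    my %fields;
    for my $pair (split /[&;]/, $query) {
        next unless length $pair;               # skip empty segments
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($key, $value) {
            tr/+/ /;                              # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX stands for one byte
        }
        $fields{$key} = $value;                 # a repeated key keeps its last value
    }
    return %fields;
}
```

Given the string 'UserName=John+Smith', the returned hash maps UserName to "John Smith", which is the value param('UserName') would hand back.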
Dynamic Forms Using CGI.pm
One task CGI.pm excels at is generating HTML code dynamically. For example, to create
the HTML form shown in Figure 1, all we have to do is execute the following few lines
of Perl:
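use CGI qw/:standard/;   # import CGI.pm's function-style interface

print header,
      start_html(-title=>'PDF Bounceback, by Kas Thomas',
                 -bgcolor=>'#FFFFFF'),
      p(h2('PDF Text Entry'));
print start_form,
      textarea(-name=>'UsersText',
               -rows=>24,
               -columns=>60), p,
      submit(-name=>'action', -value=>'Generate PDF'),
      end_form,          # must be in the print list, or the closing tag is never sent
      end_html;          # close the body and html tags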
Figure 1. A form created dynamically using CGI.pm.
The function h2('content here') applies an HTML level-2 headline tag around the text
"content here"; likewise, p() applies paragraph tags around a block of text, and so
forth. Nesting the function calls causes the relevant HTML tags to nest correctly. A
common idiom in Perl for getting multiple strings to print to an output stream is to
separate the strings with commas and put them between print at the start and a
semicolon at the end. Functions like header(), start_form(), and end_form() are part
of the CGI.pm package. (In Perl, parentheses after a function name do not constitute a
"function call operator." They are therefore optional, if no arguments are given.)
If you write a Perl script containing the foregoing lines of code, name the script
"Form.pl", and place it in the "cgi-bin" directory of your web server, then whenever
anybody goes to the script's URL, the script will launch automatically and write the
form shown in Figure 1 to the caller's browser. This is how many (if not most)
dynamic forms work.
A Dynamic PDF Script
At this point, believe it or not, we're in a position to put together a full, working Perl
script for producing an HTML form dynamically, retrieving the contents of that
(filled-out) form, and auto-generating a PDF reply back to the user's browser.
Listing 1 shows such a script, complete.
Listing 1: dPDF.pl
______________________________
dPDF.pl
A complete Perl script for generating an HTML form, retrieving the
form's contents, and writing the contents back out as a PDF file.
#!/usr/bin/perl
# ------------------------------
# Script: dPDF.pl (simple script to output PDF dynamically)
# © 2000 by Kas Thomas. Updates at www.acroforms.com.
# ------------------------------
use CGI qw/:standard/;
# Check params, and if null, show form...
DrawForm() unless param();
$allText = param('UsersText');
print header("application/pdf"); # output PDF header
Send_PDF_Leader();
foreach $line (split /\n/,$allText) {
    print '(' . $line . ')Tj';
    print "\n";
    print 'T*'; # go to start of next line
    print "\n";
}