All Databases MacTech Vol 16-2000

Dynamic PDF

Volume Number: 16

Issue Number: 3

Column Tag: Programming Techniques

Dynamic PDF Made Easy

by Kas Thomas

It's simpler than you think to generate custom PDF

documents in real time using Perl and CGI

Adobe's Portable Document Format (PDF) has won universal acclaim as a document

interchange format, because of its ability to deliver graphically rich, structured

content consistently across multiple operating environments. Because of its strong

support for vector graphics, font embedding, hypertext links, and other advanced

features, PDF is a powerful, far-reaching document standard. But that also makes it a

relatively complex standard. (For details, see the September 1999 MacTech.) PDF's

complexity, in turn, has discouraged many web developers from attempting to write

PDF files dynamically, at the server. Automatically generated HTML pages are

common; but when was the last time you saw an auto-generated PDF web page?

It turns out that with a little foreknowledge of PDF's internal workings, and a good

grasp of CGI fundamentals, it's not that hard to put together server scripts that will

generate PDF on demand. There's no question that dynamic PDF can be a challenge to do

right. But if your dynamic document needs are modest (for example, if you'd simply

like to be able to generate a custom "Thank You" page at run time) and your Perl skills

are passable (i.e., you've graduated from writing "Hello World" to generating the much

more impressive string "Internal Server Error"), you can automate the serving of

custom PDF pages with surprisingly little effort. In fact, a Perl script comprising no

more than five or six dozen lines of code will get you started.

Our Strategy

Our strategy will be simple: We want to be able to collect an arbitrary glob of "user

text" from a web form, and convert that text to a PDF page served back to the user via

HTTP, suitable for viewing in a browser equipped with the free Acrobat Viewer

plug-in. For simplicity's sake, we'll start by omitting any attempts at vector

graphics, colored text, rotated text, styled text in various point sizes, etc. (But I

hasten to add that before this article is finished, we're going to be doing all of those

things and more.) For starters, we just want to serve the user's words back to him, at

any convenient point size, in black and white, on a letter-sized PDF page. For the sake of

convenience, we're also going to generate the "data collection" form and our PDF reply

from one and the same Perl script. That way, we don't have to keep track of multiple

documents: an HTML form, Perl scripts, CSS stylesheets, etc. If you want to add that

complexity to the picture yourself later, so be it. Here, we're going to keep things

stone-simple.

I might point out that "live" copies of the dynamic-PDF scripts developed below can be

found at http://www.acroforms.com/dynamicPDF.html (which is a portal page with

links to several working scripts).

I also want to stress that when we're done, the PDF pages we will have generated can be

saved to disk using your browser's Save As feature. (Reader won't save the docs, but

your browser will.) Thus, we will have achieved the Holy Grail of the PDF world:

writing PDF without using Distiller or Acrobat.

Once we've gotten our basic approach nailed down, we'll branch out into fancy features

like colored backgrounds, stroked/filled paths, rotated text, styled text, etc. But first,

a detour into the land of PDF internal structure.

PDF at the Subatomic Level

Internally, a PDF file consists of plain-ASCII object descriptions, following a notably

Postscript-like (i.e., postfix-notation) syntax. All PDF files have a header

(containing version information), a cross-reference (or 'xref') table containing

offsets to the various objects that make up the file, and a trailer containing references

to the root object as well as the byte offset from the start of the file to the beginning of

the 'xref' table. (See my story in the September 1999 MacTech for more

information.) The formal PDF specification, available online at

http://partners.adobe.com/asn/developer/PDFS/TN/PDFSPEC.PDF, describes the

relationships between PDF objects (and the meanings of various dictionary "key"

entries) in comprehensive detail, if you're interested. But we're going to skip all that,

because most of it is not necessary for what we're doing.
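For readers who would rather emit a well-formed file than lean on Reader's tolerance, the header/objects/xref/trailer layout just described can be sketched in a few lines of Perl. This is only an illustrative sketch, not code from the scripts in this article: the helper name build_pdf, the object numbering, and the choice of Helvetica are assumptions, and it assumes every object body is plain ASCII so that length() counts bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: assemble a PDF from a list of object bodies,
# recording each object's byte offset for the xref table.
# Assumes plain-ASCII bodies, so length() counts bytes.
sub build_pdf {
    my @bodies = @_;                    # body of object 1, object 2, ...
    my $pdf    = "%PDF-1.3\n";          # header with version information
    my @offsets;
    my $num = 0;
    for my $body (@bodies) {
        $num++;
        push @offsets, length $pdf;     # where "N 0 obj" begins
        $pdf .= "$num 0 obj\n$body\nendobj\n";
    }
    my $xref_pos = length $pdf;         # offset of the xref table itself
    my $count    = @bodies + 1;         # +1 for the reserved object 0
    $pdf .= "xref\n0 $count\n";
    $pdf .= "0000000000 65535 f \n";    # every xref entry is 20 bytes long
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size $count /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_pos\n%%EOF\n";
    return $pdf;
}

# A one-page "Hello World!" document built from five objects.
my $content = "BT /F1 14 Tf 1 0 0 1 72 720 Tm (Hello World!) Tj ET";
my $pdf = build_pdf(
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]\n"
      . "   /Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
    "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    "<< /Length " . length($content) . " >>\nstream\n$content\nendstream",
);
```

The point of the sketch is the bookkeeping: each offset is taken just before the object is appended, and the trailer's startxref value is taken just before the table is written.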

It turns out you can bend (or even break) a lot of the "rules" spelled out in the PDF

spec, without making Acrobat Reader unhappy, if you know a thing or two about PDF

internals. An instructive exercise in this regard is to make Adobe's Distiller program

generate a small "Hello World" type of file, then examine the file in a text editor, and

try to find superfluous data in it that can be removed. If you try this, you will quickly

discover that /Info objects, for example, contain meta-data about the PDF file (such as

the doc's author, subject, producer, keywords, etc.) that can easily be dispensed with.

Likewise, Adobe tends to insert long /ID objects (containing a kind of "digital

fingerprint" for identification of the file) in PDF files; these aren't strictly needed.

If you keep removing superfluous objects and data items from a PDF file, you may be

amazed at how much "data" you can actually get rid of without making Reader complain.

A careful re-reading of the PDF Specification usually reveals that items you thought

were mandatory are actually optional. Even the all-important 'xref' table isn't

strictly needed, since Acrobat Reader can (and will, and does) generate the necessary

byte offsets itself, at runtime, if the table is defective or missing.

In late 1999, I sponsored a contest in which the goal was to

produce the smallest possible PDF file that wrote "Hello World" to the screen without

making Acrobat Reader generate any error dialogs. Amazingly, the winner of that

contest submitted a file that was only a little more than 200 bytes long:

%PDF-1.

1 0 obj<>

2 0 obj<

R/Resources<

>>>>/Contents<<>>stream

BT/R 14

Tf(Hello World!)Tj

endstream

>>]>>

trailer<>

Note that the second object (starting with "2 0 obj...") is extra-long, with many

nested "dictionary" entries, ending only just before the word "trailer" at the bottom of

the file. This particular PDF file uses 14-point Arial as the typeface. No font

encodings are embedded in the file, however, because Arial is one of the base-14 fonts

that Acrobat Reader ships with. (Arial replaced Helvetica in the version-4.0 release

of Reader.) Every machine that has Acrobat Reader is guaranteed to have Arial.

There are many things technically "wrong" with this short PDF file, including the fact

that it has no 'xref' table and contains a text stream that is opened with 'BT' (for Begin

Text) but is never closed with 'ET' (End Text). In many ways, it's a wonder Reader can

even parse and display this file. Yet it does.

There is one small technical difficulty with this file that impacts viewing: Namely, no

positioning information is given for the text - which means (since Reader "zeroes out"

the current transformation matrix, or CTM, before displaying any block of text) that

the starting "pen position" for displaying "Hello World!" in this instance is the origin

of user space: i.e., the lower left corner of the page. You have to scroll down to the very

bottom of the page to see the single line of text. To fix this requires that we add a

transformation matrix of our own to the text stream:

%PDF-1.

1 0 obj<>

2 0 obj<

R/Resources<

>>>>/Contents<<>>stream

BT/R 14

1 0 0 1 72 720 Tm

Tf(Hello World!)Tj

endstream

>>]>>

trailer<>

Can you spot the change? We've added a line that ends with 'Tm' (the

transform-matrix operator). The six-number matrix will look familiar to you if

you've studied Postscript. The first four numbers are scaling and skewing factors

(signifying no change in those characteristics, in this case); the final two numbers

are translation values in 'x' and 'y', respectively. Since the default user space in PDF

has 72 units to the inch, using values of 72 and 720 results in our text being drawn

one inch from the left edge of the page and ten

inches up from the bottom. (In PDF, 'y' units get bigger as you go up.) Now our

14-point text will be drawn where the user can see it.
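The arithmetic behind those two translation values is easy to wrap in a helper. The following is a minimal sketch, assuming the default 72-units-per-inch user space; the function name text_matrix and the constant UNITS_PER_INCH are made up for illustration and do not appear in the scripts that follow.

```perl
use strict;
use warnings;

use constant UNITS_PER_INCH => 72;   # default PDF user space

# Hypothetical helper: build a translation-only Tm operator from
# distances in inches, measured from the lower left corner of the page.
sub text_matrix {
    my ($x_inches, $y_inches) = @_;
    return sprintf "1 0 0 1 %g %g Tm",
        $x_inches * UNITS_PER_INCH, $y_inches * UNITS_PER_INCH;
}

print text_matrix(1, 10), "\n";   # prints: 1 0 0 1 72 720 Tm
```

The first four numbers stay fixed at 1 0 0 1 (no scaling or skewing); only the last two change as the starting position moves.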

In the Perl script that follows, we're going to use a variation of the foregoing PDF file

as the basis (the "template," if you will) for our dynamically generated PDF page.

While some experts may cringe at the thought of using a "broken" PDF file as the

starting point for this kind of exercise, the fact is that for illustrative purposes, a

"subatomic" template file of this kind is hard to beat. Besides, the proof of the pudding

is that Acrobat Reader doesn't in any way "choke" on the end result. What the user sees

in his (or her) browser is a fully functional PDF file: one that can be saved to disk

(using the browser's Save As feature) for later use.

The Common Gateway

Collecting information from users is a common task in the world of the Web, calling

for the use of interactive forms. Sometimes, the form in question is a static HTML file

stored on a server; but increasingly, HTML forms themselves are dynamically

generated at the server in response to user requests. That is, the form itself doesn't

exist until a CGI script, often written in Perl, generates it.

CGI (the Common Gateway Interface) is nothing more than a set of conventions for

pushing and pulling information across an HTTP connection. Forms, and the scripts

that retrieve information from web forms, use the CGI protocol to get the job done.

Learning to write CGI scripts in Perl is an art unto itself, but the job is made

tremendously easier if you take advantage of some of the excellent free Perl libraries

available for implementing CGI processes.

Arguably the best Perl library for this task is Lincoln Stein's powerful CGI.pm

package. This set of routines (which takes the pain out of writing HTML at the server

and dealing with web forms) is so useful that it now ships with Perl as part of the

"full install" of the language. What this means is that any Unix server that has Perl

5.003, patchlevel 7 or higher (as almost all do) automatically has CGI.pm, waiting to

be called from any Perl script.

When you use CGI.pm, you don't have to worry about whether incoming form data is

being supplied via GET or POST, where it's being stored on the server, or any of the

nasty parsing details involved in extracting urlencoded data from HTTP streams. To get

the value of a form field called "UserName" from a Perl script at runtime using

CGI.pm, all you have to do is:

$name = param('UserName');

Here, $name refers to the local scalar variable into which we wish to store the value

from "UserName." If the user has typed "John Smith" into the relevant form field

before submitting the form data to the server, then $name will contain "John Smith"

after this line of code executes. The param() function in CGI.pm takes care of finding

the incoming data, decoding it if it's urlencoded, etc.
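To get a feel for the work param() hides, here is a rough sketch of decoding a urlencoded query string by hand. It is a deliberately simplified illustration, not CGI.pm's actual implementation: the function name parse_query is hypothetical, and the sketch ignores repeated field names, POST bodies, and character-set issues.

```perl
use strict;
use warnings;

# Hypothetical helper: split a urlencoded query string into a hash of
# field names and values. Later duplicates overwrite earlier ones.
sub parse_query {
    my ($query) = @_;
    my %field;
    for my $pair (split /[&;]/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($key, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX encodes one byte
        }
        $field{$key} = $value;
    }
    return %field;
}

my %form = parse_query('UserName=John+Smith&action=Generate%20PDF');
print "$form{UserName}\n";   # prints: John Smith
```

With CGI.pm, all of this (plus the GET-versus-POST bookkeeping) collapses into the single param() call shown above.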

Dynamic Forms Using CGI.pm

One task CGI.pm excels at is generating HTML code dynamically. For example, to create

the HTML form shown in Figure 1, all we have to do is execute the following few lines

of Perl:

print header,

start_html(-title=>'PDF Bounceback, by Kas Thomas',

-bgcolor=>'#FFFFFF'),

p(h2('PDF Text Entry'));

print start_form,

textarea(-name=>'UsersText',

-rows=>24,

-columns=>60),p,

submit(-name=>'action',-value=>'Generate PDF'),

end_form, end_html;

Figure 1. A form created dynamically using CGI.pm.

The function h2('content here') applies an HTML level-2 headline tag around the text

"content here"; likewise, p() applies paragraph tags around a block of text, and so

forth. Nesting the function calls causes the relevant HTML tags to nest correctly. A

common idiom in Perl for getting multiple strings to print to an output stream is to

separate the strings with commas and put them between print at the start and a

semicolon at the end. Functions like header(), start_form(), and end_form() are part

of the CGI.pm package. (In Perl, parentheses after a function name do not constitute a

"function call operator." They are therefore optional, if no arguments are given.)
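The way nested calls produce nested tags is easier to see with a toy version. The sketch below uses hypothetical stand-ins (tag, my_h2, my_p) written in plain Perl; they are not CGI.pm's functions and ignore attributes entirely, but they show why wrapping one call inside another wraps one tag inside another.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for tag-generating functions such as h2() and
# p(); this is not how CGI.pm itself is implemented.
sub tag {
    my ($name, @content) = @_;
    return "<$name>" . join('', @content) . "</$name>";
}
sub my_h2 { return tag('h2', @_) }
sub my_p  { return tag('p',  @_) }

# The inner call runs first; its result becomes the outer tag's content.
print my_p(my_h2('PDF Text Entry')), "\n";
# prints: <p><h2>PDF Text Entry</h2></p>
```

Each function simply returns a string, which is why the results can be chained with commas after a single print.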

If you write a Perl script containing the foregoing lines of code, name the script

"Form.pl", and place it in the "cgi-bin" directory of your web server, then whenever

anybody goes to the script's URL, the script will launch automatically and write the

form shown in Figure 1 to the caller's browser. This is how many (if not most)

dynamic forms work.

A Dynamic PDF Script

At this point, believe it or not, we're in a position to put together a full, working Perl

script for producing an HTML form dynamically, retrieving the contents of that

(filled-out) form, and auto-generating a PDF reply back to the user's browser.

Listing 1 shows such a script, complete.

Listing 1: dPDF.pl

______________________________

dPDF.pl

A complete Perl script for generating an HTML form, retrieving the

form's contents, and writing the contents back out as a PDF file.

#!/usr/bin/perl

# ------------------------------

# Script: dPDF.pl (simple script to output PDF dynamically)

# ------------------------------

use CGI qw/:standard/;

# Check params, and if null, show form...

DrawForm() unless param();

$allText = param('UsersText');

print header("application/pdf"); # output PDF header

Send_PDF_Leader();

foreach $line (split /\r?\n/,$allText) {

$line =~ s/([\\()])/\\$1/g; # escape \ ( ) so the PDF string stays balanced

print '(' . $line . ')Tj';

print "\n";

print 'T*'; # go to start of next line

print "\n";

}
