generate_fdf extracts fields in UTF-16 format

Bug #192398 reported by Adam Buchbinder
This bug affects 2 people
Affects          Status        Importance  Assigned to  Milestone
pdftk (Debian)   Fix Released  Unknown
pdftk (Ubuntu)   Confirmed     Undecided   Unassigned

Bug Description

Binary package hint: pdftk

The generate_fdf tool outputs field names and field values in what appears to be UTF-16 format. To verify:

$ wget http://koivi.com/fill-pdf-form-fields/Project2.pdf
$ pdftk Project2.pdf generate_fdf output Project2.fdf
$ less Project2.fdf

(The "may be a binary file" warning will display.) The field titles ("Text1", "Text2", and so on) are self-contained UTF-16 strings, with their own Byte Order Marks (FE FF) at the beginning. Additionally, the field values consist only of a bare BOM.

This makes it very difficult to manually edit the fields; it also appears to be unnecessary, since entering plain ASCII text in the fields generates the same output as entering UTF-16 text when merging the FDF file back in with fill_form.
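
For comparison, here is roughly what a plain-ASCII FDF looks like. This is a minimal hand-written sketch (hand-edited.fdf and filled.pdf are just illustrative names; Text1 is one of the field titles in the sample form), not what pdftk currently emits:

$ cat > hand-edited.fdf <<'EOF'
%FDF-1.2
1 0 obj
<< /FDF << /Fields [ << /T (Text1) /V (Hello) >> ] >> >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
EOF
$ pdftk Project2.pdf fill_form hand-edited.fdf output filled.pdf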

I am running pdftk 1.40-2ubuntu3 on Ubuntu Dapper.

description: updated
Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

The following workaround will turn the fields in the generated FDF files into plain ASCII, assuming they're convertible, by filtering out the BOMs and the embedded NULLs. (ASCII text converted to UTF-16 looks exactly like the original with a NULL inserted before or after each character, depending on byte order.)

I doubt it will work if the field names contain anything other than ASCII.

$ cat Project2.fdf | sed -e's/\x00//g' | sed -e's/\xFE\xFF//g' | less
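
To see why the stripping works, here is a quick check (assuming GNU iconv and xxd are available) that ASCII encoded as UTF-16BE is just the original bytes with a NULL in front of each character:

$ printf 'Hi' | iconv -f ASCII -t UTF-16BE | xxd

which prints something like:

00000000: 0048 0069                                .H.i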

Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

Consulting the PDF Reference 1.6 ( http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf ), there's an optional "Encoding" field (p. 674) in the FDF dictionary that defines how strings which don't begin with a BOM are interpreted. It defaults to PDFDocEncoding, which seems reasonable. To generate human-readable strings, it would make sense to convert them to PDFDocEncoding when they're extracted.

Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

I commented too soon. The list of supported encodings in Adobe's implementations is very short (p. 1025): in Acrobat 4.0 it consists only of Shift-JIS, and in 5.0 only Shift-JIS, UHC, GBK, and BigFive. (The spec doesn't say what later versions accept.) I had assumed that PDFDocEncoding was something like UTF-8, but it's essentially a superset of Latin-1, so converting to PDFDocEncoding by default would mangle any text that uses characters outside that range. There's also a note (p. 132) explaining that Unicode strings must be encoded as UTF-16BE with a leading BOM in order to distinguish them unambiguously from PDFDocEncoding strings. Converting to UTF-8 would therefore make the exported form information incompatible with at least some implementations.

The best solution I can think of here is to check whether the string can be re-encoded in PDFDocEncoding without losing any characters, and if it can't, to leave it in UTF-16. This would maintain backwards compatibility while making the output far more hand-editable.
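
As a rough sketch of that check using shell tools (this uses Latin-1 as a stand-in for PDFDocEncoding, which it only approximates, and assumes GNU iconv):

$ printf '\xFE\xFF\x00T\x00e\x00s\x00t' | iconv -f UTF-16 -t ISO-8859-1 >/dev/null 2>&1 \
    && echo "safe to re-encode as a plain string" || echo "leave it as UTF-16 with a BOM"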

Changed in pdftk:
status: Unknown → Confirmed
Changed in pdftk:
status: New → Confirmed
Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

I should also add that acroread 7 (on Linux) exports at least ASCII-only text as plain ASCII (it may be PDFDocEncoding, but I didn't have any special characters in it), so we wouldn't be breaking compatibility by doing that.

Changed in pdftk (Debian):
status: Confirmed → Fix Released