generate_fdf extracts fields in UTF-16 format

Bug #192398 reported by Adam Buchbinder
This bug affects 2 people
Affects          Status        Importance  Assigned to  Milestone
pdftk (Debian)   Fix Released  Unknown
pdftk (Ubuntu)   Confirmed     Undecided   Unassigned

Bug Description

Binary package hint: pdftk

The generate_fdf tool outputs field names and field values in what appears to be UTF-16 format. To verify:

$ wget http://koivi.com/fill-pdf-form-fields/Project2.pdf
$ pdftk Project2.pdf generate_fdf output Project2.fdf
$ less Project2.fdf

(The "may be a binary file" warning will display.) The field titles ("Text1", "Text2", and so on) are self-contained UTF-16 strings, with their own Byte Order Marks (FE FF) at the beginning. Additionally, the field values consist only of a bare BOM.

This makes it very difficult to manually edit the fields; it also appears to be unnecessary, since entering plain ASCII text in the fields generates the same output as entering UTF-16 text when merging the FDF file back in with fill_form.
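
For comparison, here is roughly what a plain-ASCII FDF looks like. This is a minimal hand-written sketch (hand-edited.fdf and filled.pdf are just illustrative names; Text1 is one of the field titles in the sample form), not what pdftk currently emits:

$ cat > hand-edited.fdf <<'EOF'
%FDF-1.2
1 0 obj
<< /FDF << /Fields [ << /T (Text1) /V (Hello) >> ] >> >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
EOF
$ pdftk Project2.pdf fill_form hand-edited.fdf output filled.pdf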

I am running pdftk 1.40-2ubuntu3 on Ubuntu Dapper.

description: updated
Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

The following workaround will turn the fields in the generated FDF files into plain ASCII, assuming they're convertible, by filtering out the BOMs and the embedded NULLs. (ASCII text converted to UTF-16 looks exactly like the original with a NULL inserted before or after each character, depending on byte order.)

I doubt it will work if the field names contain anything other than ASCII.

$ cat Project2.fdf | sed -e's/\x00//g' | sed -e's/\xFE\xFF//g' | less
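
To see why the stripping works, here is a quick check (assuming GNU iconv and xxd are available) that ASCII encoded as UTF-16BE is just the original bytes with a NULL in front of each character:

$ printf 'Hi' | iconv -f ASCII -t UTF-16BE | xxd

which prints something like:

00000000: 0048 0069                                .H.i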

Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

Consulting the PDF Reference 1.6 ( http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf ), there's an optional "Encoding" field (p. 674) in the FDF dictionary that defines how strings which don't begin with a BOM are interpreted. It defaults to PDFDocEncoding, which seems reasonable. To generate human-readable strings, it would make sense to convert them to PDFDocEncoding when they're extracted.

Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

I commented too soon. The list of supported encodings in Adobe's implementations is very short (p. 1025): in Acrobat 4.0 it consists only of Shift-JIS, and in 5.0 only Shift-JIS, UHC, GBK, and BigFive. (The spec doesn't say what later versions accept.) I had assumed that PDFDocEncoding was something like UTF-8, but it's essentially a superset of Latin-1, so converting to PDFDocEncoding by default would mangle any text that uses characters outside that range. There's also a note (p. 132) explaining that Unicode strings must be encoded as UTF-16BE with a leading BOM in order to distinguish them unambiguously from PDFDocEncoding strings. Converting to UTF-8 would therefore make the exported form information incompatible with at least some implementations.

The best solution I can think of here is to check whether the string can be re-encoded in PDFDocEncoding without losing any characters, and if it can't, to leave it in UTF-16. This would maintain backwards compatibility while making the output far more hand-editable.
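
As a rough sketch of that check using shell tools (this uses Latin-1 as a stand-in for PDFDocEncoding, which it only approximates, and assumes GNU iconv):

$ printf '\xFE\xFF\x00T\x00e\x00s\x00t' | iconv -f UTF-16 -t ISO-8859-1 >/dev/null 2>&1 \
    && echo "safe to re-encode as a plain string" || echo "leave it as UTF-16 with a BOM"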

Changed in pdftk:
status: Unknown → Confirmed
Changed in pdftk:
status: New → Confirmed
Revision history for this message
Adam Buchbinder (adam-buchbinder) wrote :

I should also add that acroread 7 (on Linux) exports at least ASCII-only text as plain ASCII (it may be PDFDocEncoding, but I didn't have any special characters in it), so we wouldn't be breaking compatibility by doing that.

Changed in pdftk (Debian):
status: Confirmed → Fix Released