Zope 2

PageTemplateFile opens XML files in binary mode

Bug #143131 reported by yuppie on 2004-08-19

Affects   Status    Importance   Assigned to   Milestone
Zope 2    Invalid   Low          Unassigned

Bug Description

This is a problem on Windows. If I read the specs ( http://www.w3.org/TR/2004/REC-xml-20040204/#sec-line-ends ) correctly, Windows newlines are allowed within XML. But PageTemplateFile opens XML files in binary mode, ignoring the fact that they might contain CRs. As a result, the parsed files contain a mix of CR/LF, LF and even bare CR newlines.

Is there any good reason why this was fixed for HTML, but not for XML files?

Revision history for this message

yuppie (yuppie3) wrote on 2004-10-05:

Fred Drake wrote:
> This report isn't clear. Please update the issue and explain what the
> problem is; glancing at the code on the Zope 2 and Zope 3 trunks, the
> only thing that looks suspicious to me is that re-opening an HTML file
> doesn't use Python's universal newline support.
>
> HTML is always text, so should be treated that way on input. XML may
> contain textual content, but should always be handed to the XML parser
> as a raw byte stream to allow the proper decoding machinery a shot at
> doing the right thing.

Let me try to restate the issue:

This is a problem in CMFSetup. CMFSetup creates XML using PageTemplateFiles. These files are checked in to CVS in text mode. So depending on the platform, they contain different newlines. If opened as text files, these newlines are normalized to LF. But opened as binary files, newlines are not normalized. Normalizing could be done at a later point, but that does not happen. So line breaks are not normalized before parsing, but the parser expects LF newlines.
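
A minimal sketch of the difference between the two open modes, written for Python 3 rather than the Python 2 this report was filed against. The template content and the helper name read_both_ways are made up for illustration; this is not code from PageTemplateFile.

```python
import os
import tempfile

# Hypothetical template content with Windows (CR/LF) line endings.
RAW = b"<root>\r\n  <item/>\r\n</root>\r\n"

def read_both_ways(raw):
    """Write raw bytes to a temp file, then read it in binary and in text mode."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        # Binary mode: bytes come back exactly as stored, CRs included.
        with open(path, "rb") as f:
            as_bytes = f.read()
        # Text mode with newline=None (the default): universal newlines,
        # so CR/LF and bare CR are translated to LF on read.
        with open(path, "r", encoding="utf-8", newline=None) as f:
            as_text = f.read()
    finally:
        os.remove(path)
    return as_bytes, as_text

as_bytes, as_text = read_both_ways(RAW)
print(repr(as_bytes))
print(repr(as_text))
```

Anything that later processes the binary result line by line while assuming LF-only input will see the stray CRs.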

When removing newlines, the parser removes only the LF and leaves the CR in place. When adding newlines, it adds LF. Existing newlines are preserved as CR/LF. So the returned XML contains all three kinds of newlines.

This is what the XML 1.0 spec says:

"""2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character."""
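
The normalization rule quoted above can be sketched in a few lines of Python. The function name normalize_line_ends is hypothetical, and the sketch assumes the input has already been decoded to text; a real XML processor does this internally as part of parsing.

```python
import re

# Match the two-character sequence CR LF first, then any remaining bare CR.
_LINE_BREAK = re.compile(r"\r\n|\r")

def normalize_line_ends(text):
    """Translate CR LF and lone CR to a single LF, as XML 1.0 section 2.11 describes."""
    return _LINE_BREAK.sub("\n", text)

sample = "a\r\nb\rc\nd\r\r\ne"
print(repr(normalize_line_ends(sample)))
```

Putting the CR LF alternative first matters: it keeps a Windows line break from being turned into two LFs.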

Fred Drake (fdrake) wrote on 2004-10-05:

Entry #2 by yuppie on Oct 5, 2004 12:34 pm:
> This is a problem in CMFSetup. CMFSetup creates XML using
> PageTemplateFiles. These files are checked in to CVS in text mode. So
> depending on the platform, they contain different newlines. If opened as
> text files, these newlines are normalized to LF. But opened as binary
> files, newlines are not normalized. Normalizing could be done at a later
> point, but that does not happen. So line breaks are not normalized before
> parsing, but the parser expects LF newlines.

Are you actually observing CR characters in the generated output? If so, are you certain these are generated due to template data or do they come from some other source (inserted content, for example)?

XML input is parsed by Expat; I'd be very interested in learning of failures of Expat to properly normalize input data.
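
One way to check Expat's behaviour in isolation is to feed it raw bytes with mixed line endings and inspect the character data it reports. The sketch below uses Python 3's xml.parsers.expat directly; the helper name and the sample document are made up, and it does not exercise the PageTemplates machinery at all.

```python
import xml.parsers.expat

def expat_character_data(document):
    """Parse a byte string with Expat and return all reported character data."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)

# Hypothetical document mixing CR/LF, bare CR and LF line breaks.
text = expat_character_data(b"<doc>one\r\ntwo\rthree\n</doc>")
print(repr(text))
```

If the collected text contains no CR, any CR in the final output must come from a layer other than the XML parser itself.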

If you can generate a template that exhibits this, please attach it to this tracker issue; the shorter, the better.

Thanks.

yuppie (yuppie3) wrote on 2004-10-05:

test_ptfile.py (6.8 KiB, text/plain)

Uploaded: test_ptfile.py

Uploading a modified version of test_ptfile.py. The two new tests in TestPageTemplateFile demonstrate that HTML templates are normalized, while in XML mode the CR is preserved.

Clemens Robbenhaar (crobbenhaar) wrote on 2005-09-04:

PageTemplateFile.patch (564 bytes, text/plain)

Uploaded: PageTemplateFile.patch

I stumbled into this issue while poking around for
issue 1820. When running the attached new tests under Linux,
both tests fail (Python 2.3.5, Zope svn head);
i.e. Windows newlines are preserved verbatim
in both cases.
If I read the Python docs right, this is the expected behaviour,
because files handle Windows / Unix / Mac newlines transparently
the same only if the file is opened with mode "U".

If I patch the PageTemplateFile as attached, at least the
HTML test passes; the XML test still fails for reasons
I have not figured out.
(Note that I really do not suggest applying that patch,
as it does not seem to solve the issue in the first place.)

Martijn Pieters (mjpieters) wrote on 2006-03-11:

Status: Pending => Accepted

Supporters added: mj

PageTemplateFiles now open their files with universal newline support, which solves the problem only for HTML templates on the filesystem. FTP-ed templates still fail, because the PythonExpr code doesn't anticipate line endings other than '\n'.

The correct fix would be to extend PageTemplates.PythonExpr.PythonExpr.__init__ to also deal with '\r\n' and '\r' newlines. I'll take this later today or tomorrow.
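
A rough sketch of what such handling could look like, assuming the expression source is folded onto a single line before it is compiled. The helper prepare_expression is hypothetical and is not the code that went into PythonExpr; it only illustrates treating '\r\n' and '\r' the same as '\n'.

```python
def prepare_expression(source):
    """Normalize every line-break style, then fold the expression onto one line."""
    # Order matters: handle CR LF before the bare CR.
    normalized = source.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: joining lines with a space is acceptable for the expression;
    # this would alter a string literal that spans several lines.
    return " ".join(line.strip() for line in normalized.split("\n")).strip()

# Hypothetical multi-line expression using all three line-break styles.
expr = prepare_expression("max(1,\r\n    2,\r    3)\n")
result = eval(expr)
print(expr, result)
```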

Martijn Pieters (mjpieters) wrote on 2006-03-11:

Status: Accepted => Pending

Supporters removed: mj

I was jumping the gun here; bug 1820 deals with the PythonExpr class being unable to deal with line endings other than '\n'; this bug deals with newline handling in general. Resigning (and accepting 1474 instead).

Martijn Pieters (mjpieters) wrote on 2006-03-11:

*Bug number overload* ;)

I resigned from 1474 and accepted 1820.

TinoW (tino-wildenhain) wrote on 2006-03-11:

Just a note: while this is actually a bug, running into it smells a lot like bad application design. Multi-line Python expressions should not be used in TAL anyway.

Martijn Pieters (mjpieters) wrote on 2006-03-11:

There are perfectly legitimate reasons for multi-line Python expressions; if you want to keep your templates readable, it makes sense to break a Python statement across multiple lines. The expression machinery should treat all line breaks as newlines.

The problem of newlines in Python expressions is bug 1820. However, if that is the only problem that yuppie has with XML PT files, then this bug can be resolved as soon as 1820 is resolved.

yuppie (yuppie3) wrote on 2006-09-05:

#10