PageTemplateFile opens XML files in binary mode

Bug #143131 reported by yuppie
Affects: Zope 2
Status: Invalid
Importance: Low
Assigned to: Unassigned

Bug Description

This is a problem on Windows. If I read the spec (http://www.w3.org/TR/2004/REC-xml-20040204/#sec-line-ends) correctly, Windows newlines are allowed within XML. But PageTemplateFile opens XML files in binary mode, ignoring the fact that the file might contain CRs. As a result, the parsed files contain a mix of CR/LF, LF and even lone CR newlines.

Is there any good reason why this was fixed for HTML, but not for XML files?

Tags: bugday
yuppie (yuppie3) wrote :

Fred Drake wrote:
> This report isn't clear. Please update the issue and explain what the
> problem is; glancing at the code on the Zope 2 and Zope 3 trunks, the
> only thing that looks suspicious to me is that re-opening an HTML file
> doesn't use Python's universal newline support.
>
> HTML is always text, so should be treated that way on input. XML may
> contain textual content, but should always be handed to the XML parser
> as a raw byte stream to allow the proper decoding machinery a shot at
> doing the right thing.

Let me try to restate the issue:

This is a problem in CMFSetup. CMFSetup creates XML using PageTemplateFiles. These files are checked in to CVS in text mode, so depending on the platform they contain different newlines. If they are opened as text files, these newlines are normalized to LF. But if they are opened as binary files, newlines are not normalized. Normalizing could be done at a later point, but that is not the case. So line breaks are not normalized before parsing, but the parser expects LF newlines.

When removing newlines, the parser removes only the LF and leaves the CR in place. When adding newlines, the parser adds LF. Existing newlines are preserved as CR/LF. So the returned XML contains all three kinds of newlines.
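
To make the difference between the two open modes concrete, here is a minimal sketch that reads the same file both ways. It assumes Python 3, where open(path, newline=None) plays the role of the old universal newline mode; the helper names and the sample content are made up for illustration and are not taken from PageTemplateFile.

```python
import os
import tempfile


def read_binary(path):
    """Read raw bytes; CR characters are left untouched."""
    with open(path, "rb") as f:
        return f.read()


def read_text_universal(path, encoding="utf-8"):
    """Read as text; newline=None translates CR/LF and lone CR to LF."""
    with open(path, "r", encoding=encoding, newline=None) as f:
        return f.read()


def demo():
    # Hypothetical template content with Windows line endings.
    raw = b"<?xml version='1.0'?>\r\n<root>\r\n</root>\r\n"
    fd, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        return read_binary(path), read_text_universal(path)
    finally:
        os.remove(path)


binary_data, text_data = demo()
```

Here binary_data still carries the CR characters, while text_data contains only LF newlines, which is the mismatch described above.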

This is what the XML 1.0 spec says:

"""2.11 End-of-Line Handling

XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters CARRIAGE RETURN (#xD) and LINE FEED (#xA).

To simplify the tasks of applications, the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character."""
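
The translation the spec describes can be sketched in a few lines of Python. This is only an illustration of the rule quoted above, not the code used by Zope or by any XML parser; normalize_line_ends is a hypothetical name, and applying it to raw bytes is only safe for ASCII-compatible encodings such as UTF-8.

```python
def normalize_line_ends(data):
    """Translate CR LF and lone CR to LF, as XML 1.0 section 2.11 describes."""
    # Order matters: collapse the two-character sequence first, so the
    # CRs that remain are exactly those not followed by LF.
    if isinstance(data, bytes):
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace("\r\n", "\n").replace("\r", "\n")


# A string mixing all three kinds of line breaks.
mixed = "a\r\nb\rc\nd"
normalized = normalize_line_ends(mixed)
```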

Fred Drake (fdrake) wrote :

Entry #2 by yuppie on Oct 5, 2004 12:34 pm:
> This is a problem in CMFSetup. CMFSetup creates XML using
> PageTemplateFiles. These files are checked in to CVS in text mode. So
> depending on the platform, they contain different newlines. If opened as
> text file, these newlines are normalized to LF. But opened as binary
> files, newlines are not normalized. Normalizing could be done at a later
> point, but that's not the case. So line breaks are not normalized before
> parsing, but the parser expects LF newlines.

Are you actually observing CR characters in the generated output? If so, are you certain these are generated due to template data or do they come from some other source (inserted content, for example)?

XML input is parsed by Expat; I'd be very interested in learning of failures of Expat to properly normalize input data.

If you can generate a template that exhibits this, please attach it to this tracker issue; the shorter, the better.

Thanks.
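
One way to check the Expat side of this in isolation is a small script using Python's xml.parsers.expat module. This is a sketch only: character_data is a hypothetical helper, and it inspects the character data Expat reports, not what the page template machinery finally emits.

```python
from xml.parsers import expat


def character_data(xml_bytes):
    """Parse xml_bytes with Expat and return the concatenated character data."""
    chunks = []
    parser = expat.ParserCreate()
    # Expat may deliver text in several pieces, so collect and join them.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_bytes, True)
    return "".join(chunks)


# Input mixing a CR/LF pair and a lone CR inside element content.
text = character_data(b"<root>one\r\ntwo\rthree</root>")
```

If text contains no CR, then any CR in the generated output must come from somewhere other than Expat's handling of character data.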

yuppie (yuppie3) wrote :

Uploaded: test_ptfile.py

Uploading a modified version of test_ptfile.py. The two new tests in TestPageTemplateFile demonstrate that HTML templates are normalized, while in XML mode the CR is preserved.

Clemens Robbenhaar (crobbenhaar) wrote :

Uploaded: PageTemplateFile.patch

Stumbled into this issue while poking around for
issue 1820. When running the attached new tests under Linux,
both tests fail (Python 2.3.5, Zope svn head);
i.e. Windows newlines are preserved verbatim
in both cases.
 If I read the Python docs right, this is the expected behaviour,
because files handle Windows / Unix / Mac newlines transparently
only if the file is opened with mode "U".

 If I patch PageTemplateFile as attached, at least the
HTML test passes; the XML test still fails for reasons
I have not figured out.
 (Note that I really do not suggest applying that patch,
as it does not seem to solve the issue in the first place.)

Martijn Pieters (mjpieters) wrote :

Status: Pending => Accepted

 Supporters added: mj

PageTemplateFiles now open their files with universal newline-support, solving the problem only for HTML templates on the filesystem. FTP-ed templates still fail, because the PythonExpr code doesn't anticipate line-endings other than '\n' newlines.

The correct fix would be to extend PageTemplates.PythonExpr.PythonExpr.__init__ to also deal with '\r\n' and '\r' newlines. I'll take this later today or tomorrow.
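
A rough sketch of what such handling could look like, written as standalone Python rather than as a patch: normalize_expression and evaluate are hypothetical helpers, not the actual PythonExpr code, and folding line breaks into spaces would also change any multi-line string literal inside the expression.

```python
def normalize_expression(expr):
    """Hypothetical helper: fold any kind of line break in expr into spaces."""
    text = expr.replace("\r\n", "\n").replace("\r", "\n")
    return " ".join(line.strip() for line in text.split("\n")).strip()


def evaluate(expr, names=None):
    """Compile and evaluate a normalized single expression."""
    code = compile(normalize_expression(expr), "<expression>", "eval")
    return eval(code, {"__builtins__": {}}, dict(names or {}))


# An expression broken over three lines with CR/LF and a lone CR.
result = evaluate("a +\r\n    b *\r c", {"a": 1, "b": 2, "c": 3})
```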

Martijn Pieters (mjpieters) wrote :

Status: Accepted => Pending

 Supporters removed: mj

Jumping the gun here; bug 1820 deals with the PythonExpr class being unable to deal with line-endings other than '\n'; this bug deals with newline handling in general. Resigning (and accepting 1474 instead).

Martijn Pieters (mjpieters) wrote :

*Bug number overload* ;)

I resigned from 1474 and accepted 1820.

TinoW (tino-wildenhain) wrote :

Just a note: while this is actually a bug, finding the bug smells a lot like bad application design. Multiline python expressions should not be used in TAL anyway.

Martijn Pieters (mjpieters) wrote :

Multi-line python has perfectly legitimate reasons; if you keep your templates readable it makes sense to break a python statement into multiple lines; the expression machinery should treat all linebreaks as newlines.

The problem of newlines in python expressions is bug 1820. However, if that is the only problem that yuppie has with XML PT files then this bug can be resolved as soon as 1820 is resolved.

yuppie (yuppie3) wrote :

Uploaded: test_ptfile.py.patch

Uploading again the tests from comment #4, this time as a patch against the Zope 2.9 branch. The issue still exists.

Tres Seaver (tseaver)
Changed in zope2:
status: New → Triaged
importance: Medium → Low
tags: added: bugday; removed: bug zope
Colin Watson (cjwatson) wrote :

The zope2 project on Launchpad has been archived at the request of the Zope developers (see https://answers.launchpad.net/launchpad/+question/683589 and https://answers.launchpad.net/launchpad/+question/685285). If this bug is still relevant, please refile it at https://github.com/zopefoundation/zope2.

Changed in zope2:
status: Triaged → Invalid