etree.fromstring replaces newlines with spaces in attributes

Bug #2069088 reported by Michael
This bug affects 1 person
Affects    Status    Importance    Assigned to    Milestone
lxml       New       Undecided     Unassigned

Bug Description

First of all, thank you for your great work. lxml is amazing!

>>> from lxml import etree
>>> etree.fromstring('<img src="\nend"/>').attrib
{'src': ' end'}

lxml replaces the \n with a space inside attributes.
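
For comparison, here is a minimal sketch of the same input run through Python's standard-library xml.etree.ElementTree. That parser is used here only as a stand-in and is an assumption of this example, not part of the lxml report. It suggests the normalization of a literal newline comes from XML attribute-value normalization rather than from lxml alone, while a character reference survives as a real newline.

```python
# Standard-library parser used as a stand-in; lxml itself is not imported here.
import xml.etree.ElementTree as ET

# Literal newline inside the attribute value, as in the report.
literal = ET.fromstring('<img src="\nend"/>').get("src")

# The same newline written as a character reference.
reference = ET.fromstring('<img src="&#10;end"/>').get("src")

print(repr(literal))    # value after the parser has normalized the literal newline
print(repr(reference))  # value when the newline is written as &#10;
```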

[EDIT] This behavior is a bit troublesome. To my understanding, newlines are valid in attribute values; see https://www.w3.org/TR/REC-xml/#NT-AttValue and https://stackoverflow.com/questions/449627/are-line-breaks-in-xml-attribute-values-allowed

This is especially an issue with base64-encoded images in email layouts. (Note: this encoding is the most compatible way to show a company logo without accessing the web, which matters for privacy reasons.) SMTP servers may respond with "501, line too long" errors. The way to fix this is to split the long string into chunks, which is valid and works for base64 as well as for XML and HTML attributes in all browsers. However, replacing the newline with a space breaks the encoding. An option to preserve whitespace (or only newlines) in attributes would be ... just amazing again.
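
The chunking described above can be sketched as follows. The helper name chunk_base64, the 76-character width, and the placeholder bytes are assumptions made for illustration; only the idea of splitting the base64 string into short lines comes from the report.

```python
import base64

def chunk_base64(data: bytes, width: int = 76) -> str:
    """Base64-encode data and break it into lines of at most `width` characters.

    Hypothetical helper for illustration; the width of 76 is an assumption.
    """
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))

logo = bytes(range(256)) * 20            # placeholder bytes standing in for an image file
wrapped = chunk_base64(logo)
src = "data:image/png;base64," + wrapped  # value intended for an <img src="..."> attribute

# Python's decoder discards characters outside the base64 alphabet by default,
# so the inserted newlines do not change the decoded bytes here.
decoded = base64.b64decode(wrapped)
print(decoded == logo)
```

Other decoders may treat whitespace differently, so this only shows that the chunked form is decodable by Python's base64 module, not that every mail client accepts it.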

Note that the tostring method would need the option, too. Otherwise, it converts the \n to &#10;. Also note that lxml is widely used on many platforms, which brings this issue to all of them.
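
Until such an option exists, one possible stopgap is to post-process the serialized string. The sketch below uses the standard-library xml.etree.ElementTree as a stand-in (an assumption; its serializer also writes a newline in an attribute as &#10;), and the replace step is a hypothetical workaround, not an lxml feature.

```python
# Standard-library serializer used as a stand-in; lxml itself is not imported here.
import xml.etree.ElementTree as ET

# Attribute value containing a real newline between two placeholder base64 chunks.
el = ET.Element("img", {"src": "data:image/png;base64,AAAA\nBBBB"})
serialized = ET.tostring(el, encoding="unicode")  # the newline is written as &#10;

# Hypothetical post-processing step: turn the reference back into a literal newline.
# A literal "&#10;" typed into content is serialized as "&amp;#10;", so it is not matched.
raw = serialized.replace("&#10;", "\n")

print(serialized)
print(raw)

# Caveat: feeding `raw` back into an XML parser normalizes the newline to a space again.
reparsed = ET.fromstring(raw).get("src")
```

This only helps when the output goes straight to an HTML renderer or mail client; any later XML parse of the result brings the original problem back.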

Here is the requested information:

>>> import sys
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree : (4, 6, 5, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used : (2, 9, 10)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled : (2, 9, 10)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled : (1, 1, 34)

Michael (arbosh) wrote :

Just a small addition: I'm aware of the W3C recommendation that a parser replace whitespace characters in attribute values with spaces. I also tested:

>>> etree.fromstring('<img src="&#10;end"/>', parser=parser).attrib
{'src': '\nend'}

But then I have no way to write the \n back out as a literal newline. And with &#10; the line becomes too long again, so nothing is gained.

So, if SMTP forces us to stay below 1000 characters per line, if base64 can be split across several lines, and if browsers and email programs accept newlines in attributes because they are allowed and valid ... then the only thing missing is a parser option to do less, that is, not to replace newlines here. Maybe this makes my report a feature request; sorry for that.
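
As a side note, the line-length limit can sometimes be handled at the MIME layer instead of inside the markup. The sketch below is an alternative that is not part of this report: it assumes the message is built with Python's standard email package and asks it for a quoted-printable transfer encoding, which wraps long lines on the wire while leaving the HTML itself unchanged. The payload and subject are placeholders.

```python
from email.message import EmailMessage

# Placeholder payload: one very long attribute value with no line breaks.
long_src = "data:image/png;base64," + "A" * 4000
html = f'<html><body><img src="{long_src}"/></body></html>'

msg = EmailMessage()
msg["Subject"] = "Logo test"
# Ask the email package to wrap the body with quoted-printable soft line breaks.
msg.set_content(html, subtype="html", cte="quoted-printable")

# Length of the longest line in the message as it would be written out.
longest = max(len(line) for line in msg.as_string().splitlines())

# Decoding the body again yields the original markup (plus a trailing newline).
restored = msg.get_content()
print(longest, html in restored)
```

Whether a given mail pipeline lets you control the transfer encoding is a separate question, so this does not replace the requested parser option.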

The point is: how do I explain to clients who simply want their logo shown in emails that "this is impossible because ... well, W3C ... standards ..."?

Is there any chance that such an option can be added to lxml?

As far as I have tried and tested, custom parsers cannot help with this so far.

description: updated