lxml

etree.fromstring replaces newlines with spaces in attributes

Bug #2069088 reported by Michael on 2024-06-11

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	lxml	New	Undecided	Unassigned

Bug Description

First of all, thank you for your great work. lxml is amazing!

>>> from lxml import etree
>>> etree.fromstring('<img src="data:image/png;base64,start\nend"/>').attrib
{'src': 'data:image/png;base64,start end'}

lxml replaces the \n with a space inside attributes.

[EDIT] This behavior is somewhat troublesome. As far as I understand, newlines are valid in attribute values; see https://www.w3.org/TR/REC-xml/#NT-AttValue and https://stackoverflow.com/questions/449627/are-line-breaks-in-xml-attribute-values-allowed

This is especially an issue with base64-encoded images in email layouts. (Note: this encoding is the most compatible way to show a company logo without accessing the web, which matters for privacy reasons.) SMTP servers may respond with "501, line too long" errors. The way to fix this is to split the long string into chunks. That is valid and works for base64 as well as for XML and HTML attributes in all browsers. However, replacing the newline with a space breaks the encoding. An option to preserve whitespace (or only newlines) in attributes would be ... just amazing again.
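For illustration, the chunking described above could look roughly like the following. This is a minimal sketch using only the standard library; the helper name `chunk_base64`, the 76-character width (a common MIME convention, far below the SMTP line limit) and the stand-in payload are assumptions made for the example, not part of the report.

```python
import base64


def chunk_base64(data: bytes, width: int = 76) -> str:
    """Base64-encode `data` and break the result into lines of at most `width` characters."""
    encoded = base64.b64encode(data).decode("ascii")
    # Slice the encoded text into fixed-width pieces and join them with newlines.
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


# Stand-in for real image bytes (hypothetical payload, not an actual PNG).
payload = bytes(range(256)) * 8

chunked = chunk_base64(payload)
# The attribute value that would end up in <img src="...">.
src = "data:image/png;base64," + chunked
```

Removing the newlines from `chunked` gives back the original base64 text, which is why the line breaks are harmless as long as they survive parsing and serialization unchanged.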

Note that the tostring method would need the option too; otherwise it converts the \n to &#10;. Also note that lxml is widely used on many platforms, so this issue affects all of them.

Here is the requested information:

>>> import sys
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python              : sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
lxml.etree          : (4, 6, 5, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
libxml used         : (2, 9, 10)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
libxml compiled     : (2, 9, 10)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
libxslt used        : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))
libxslt compiled    : (1, 1, 34)

Michael (arbosh) wrote on 2024-06-11:

Just a small addition: I'm aware of the W3C recommendation that a parser replace each whitespace character in an attribute value with a space. I also tested a character reference:

>>> etree.fromstring('<img src="data:image/png;base64,start&#10;end"/>', parser=parser).attrib
{'src': 'data:image/png;base64,start\nend'}

But then I have no way to write the \n back out as a literal newline. And with &#10; the line will be too long again, so nothing is gained.
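The round trip described here can be reproduced with a short sketch. To keep it runnable without extra packages it uses the standard library's `xml.etree.ElementTree` instead of lxml; both follow the XML attribute-value normalization rule on parsing and both write a newline in an attribute as `&#10;`. The helper `unescape_attr_newlines` is hypothetical and only a blunt workaround: it rewrites every `&#10;` in the serialized markup, and whether applying the same replace to the output of lxml's `etree.tostring` is acceptable depends on the consumer, since any XML parser reading the result will turn the newlines into spaces again.

```python
import xml.etree.ElementTree as ET

# A literal newline inside an attribute value is normalized to a space when parsed.
literal = ET.fromstring('<img src="data:image/png;base64,start\nend"/>')

# A character reference (&#10;) survives parsing as a real newline.
escaped = ET.fromstring('<img src="data:image/png;base64,start&#10;end"/>')

# On output, the serializer escapes the newline as &#10; again,
# so the physical line is as long as before.
serialized = ET.tostring(escaped, encoding="unicode")


def unescape_attr_newlines(markup: str) -> str:
    """Hypothetical post-processing step: turn &#10; back into literal newlines."""
    return markup.replace("&#10;", "\n")


restored = unescape_attr_newlines(serialized)
```

This only helps when the final consumer (for example an HTML renderer in a mail client) keeps or ignores the line breaks; it does not replace the parser and serializer option requested in this report.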

So, if SMTP forces us to stay below 1000 characters per line, if base64 can be split across several lines, and if browsers and email programs accept newlines in attributes because it's allowed and valid ... then the only thing missing is a parser option to do less: not to strip newlines here. Maybe this makes my report a feature request; sorry for that.

The point is: how do I explain to clients that "this is impossible because ... yeah, W3C ... standards ... (stuttering)" when they just want their logo to be shown in emails?

Is there any chance that such an option can be added to lxml?

As far as I have tried and tested, custom parsers cannot help here so far.

description:

updated
