etree.fromstring replaces newlines with spaces in attributes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
First of all, thank you for your great work. lxml is amazing!
>>> from lxml import etree
>>> etree.fromstrin
{'src': 'data:image/
lxml replaces the \n with a space inside attributes.
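The effect can be reproduced without lxml. A minimal sketch, assuming the standard library's xml.etree.ElementTree as a stand-in (it is a different parser from lxml, but it applies the same attribute-value normalization to this input); the `src` value is a shortened placeholder:

```python
import xml.etree.ElementTree as ET

# Stand-in for lxml.etree: expat (behind ElementTree) also normalizes
# literal line breaks inside attribute values to single spaces.
doc = '<img src="data:image/png;base64,start\nend" />'

src = ET.fromstring(doc).get("src")
print(repr(src))  # the "\n" between "start" and "end" comes back as " "
```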
[EDIT] This behavior is a bit troublesome. As I understand it, newlines are valid in attributes, see https:/
This is especially an issue with base64-encoded images in email layouts. (Note: this encoding is the most compatible way to show a company logo without accessing the web, which matters for privacy reasons.) SMTP servers may respond with "501, line too long" errors. The way to fix this is to split the long string into chunks, which is valid and works for base64 as well as for XML and HTML attributes in all browsers. However, replacing the newline with a space breaks the encoding. An option to preserve whitespace (or only newlines) in attributes would be ... just amazing again.
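The chunking described above can be sketched as follows. This is only an illustration: the 76-character width follows the usual MIME convention and is an assumption here, `chunk_base64` and `unchunk_base64` are hypothetical helper names, and the payload is placeholder data rather than a real image.

```python
import base64

SMTP_MAX_LINE = 998  # RFC 5321 text-line limit, excluding the trailing CRLF


def chunk_base64(data: bytes, width: int = 76) -> str:
    """Base64-encode data and break the result into lines of at most `width` chars."""
    encoded = base64.b64encode(data).decode("ascii")
    return "\n".join(encoded[i:i + width] for i in range(0, len(encoded), width))


def unchunk_base64(text: str) -> bytes:
    """Decode chunked base64 after dropping the line breaks."""
    return base64.b64decode(text.replace("\n", ""), validate=True)


payload = bytes(range(256)) * 8  # placeholder for real image bytes
chunked = chunk_base64(payload)

# Every line stays far below the SMTP limit and the data round-trips.
assert max(len(line) for line in chunked.split("\n")) <= SMTP_MAX_LINE
assert unchunk_base64(chunked) == payload

# Once the line breaks have become spaces, a strict decoder rejects the value.
# (Lenient decoders that skip unknown characters may still accept it.)
try:
    base64.b64decode(chunked.replace("\n", " "), validate=True)
except ValueError as exc:
    print("strict decode failed:", exc)
```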
Note that the tostring method would need the option, too. Otherwise it converts the \n to &#10;. Also note that lxml is widely used on a lot of platforms, which brings this issue to all of them.
Here is the requested information:
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (4, 6, 5, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 9, 10)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 9, 10)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 34)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 34)
Just a small addition: I'm aware of the W3C recommendation that a parser replace whitespace in attribute values with spaces. I also tested:
>>> etree.fromstring('<img src="data:image/png;base64,start&#10;end" />', parser=parser).attrib
{'src': 'data:image/png;base64,start\nend'}
But then I have no way to write out a literal \n. With the character reference in its place, the line is too long again and nothing is gained.
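One blunt workaround is to post-process the serialized markup. A minimal sketch, assuming the standard library's xml.etree.ElementTree as a stand-in for lxml; `serialize_with_literal_newlines` is a hypothetical helper, not an lxml or ElementTree API:

```python
import xml.etree.ElementTree as ET


def serialize_with_literal_newlines(element: ET.Element) -> str:
    """Serialize, then turn the "&#10;" references back into raw line breaks.

    Hypothetical helper. The serializer writes a newline inside an attribute
    value as "&#10;", while newlines in element text are written literally,
    so the replacement is expected to hit attribute values only. Comments or
    processing instructions that contain "&#10;" verbatim would be changed too.
    """
    return ET.tostring(element, encoding="unicode").replace("&#10;", "\n")


# "&#10;" in the input survives parsing as a real newline in the attribute value.
img = ET.fromstring('<img src="data:image/png;base64,start&#10;end" />')
assert img.get("src") == "data:image/png;base64,start\nend"

# Plain serialization keeps everything on one line by emitting "&#10;".
assert "&#10;" in ET.tostring(img, encoding="unicode")

# The post-processed output has a literal line break instead.
markup = serialize_with_literal_newlines(img)
print(markup)
```

The result is only suitable for consumers that do not normalize attribute values themselves: parsing it again with an XML parser yields a space in place of the line break.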
So, if SMTP forces us to stay below 1000 characters per line, if base64 can be split across several lines, and if browsers and email programs accept newlines in attributes because they are allowed and valid ... then the only thing missing is an option for the parser to do less: not to replace newlines here. Maybe this makes my report a feature request; sorry for that.
The point is: how do I explain to clients "this is impossible because ... yeah, W3C ... standards ... (stuttering)" when they just want their logo to be shown in emails?
Is there any chance that such an option can be added to lxml?
As far as I have tried and tested, custom parsers cannot help here so far.