End-of-line normalization differs between etree.XML and etree.iterparse
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
End-of-line normalization (i.e. converting \r\n to \n) behaves differently depending on whether the document is parsed with etree.XML (or etree.parse) or with etree.iterparse.
A small example is attached.
Expected output: none
Current output:
Traceback (most recent call last):
File "lxml-eol-
repr(
AssertionError: 'line1\nline2' != 'line1\r\nline2'
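The attached script is not reproduced in this report. As a rough sketch of the kind of comparison it presumably makes, the following parses the same bytes once with etree.XML and once with etree.iterparse and returns the text of the root element in each case. The document literal and the helper names `text_via_xml` and `text_via_iterparse` are illustrative assumptions, not taken from the attachment. Whether the two results actually differ depends on the libxml2 version in use.

```python
import io

from lxml import etree

# Illustrative input (an assumption): a CDATA section containing a CRLF pair.
DOC = b"<test><![CDATA[line1\r\nline2]]></test>"


def text_via_xml(data):
    """Parse the whole document in one go and return the root element's text."""
    return etree.XML(data).text


def text_via_iterparse(data):
    """Parse the same bytes incrementally and return the root element's text."""
    # iterparse reports 'end' events by default, so the text is complete here.
    for _event, element in etree.iterparse(io.BytesIO(data)):
        if element.tag == "test":
            return element.text
    return None


if __name__ == "__main__":
    print(repr(text_via_xml(DOC)))
    print(repr(text_via_iterparse(DOC)))
```

With the versions listed under Environment, the report suggests the first call yields 'line1\nline2' and the second keeps the '\r\n'.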
Environment:
Python 3.6.5 (default, May 11 2018, 04:00:52)
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> from lxml import etree
>>>
>>> print("%-20s: %s" % ('Python', sys.version_info))
Python : sys.version_
>>> print("%-20s: %s" % ('lxml.etree', etree.LXML_
lxml.etree : (4, 2, 1, 0)
>>> print("%-20s: %s" % ('libxml used', etree.LIBXML_
libxml used : (2, 9, 8)
>>> print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
libxml compiled : (2, 9, 8)
>>> print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
libxslt used : (1, 1, 32)
>>> print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
libxslt compiled : (1, 1, 32)
Thank you for the test script which made it easy to reproduce this. However, I can also reproduce this with xmllint, which means that the problem is in libxml2 and not in lxml.
$ python -c 'print("<test><![CDATA[line1\r\nline2]]></test>")' | xmllint - | python -c 'import sys; print(repr(sys.stdin.read()))'
'<?xml version="1.0"?>\n<test><![CDATA[line1\nline2]]></test>\n'
$ python -c 'print("<test><![CDATA[line1\r\nline2]]></test>")' | xmllint --push - | python -c 'import sys; print(repr(sys.stdin.read()))'
'<?xml version="1.0"?>\n<test><![CDATA[line1\r\nline2]]></test>\n'