Wrong sourcelines 65535 when remove_blank_text=True
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libxml2 | New | Undecided | Unassigned |
lxml | Invalid | Undecided | Unassigned |
Bug Description
The old bug #1341590 reappeared for etree.XMLParser
When loading an XML document with a large number of lines, the sourceline is correct for early lines, then stays constant (and wrong) at 65535, and is correct again for even later lines.
Tested with libxml2 version 2.9.4 and python-lxml version 4.1.0 in Python 2.7.14.
To add to this, I just found an even more strange case where for remove_
Example program:
-------
from lxml.etree import XMLParser, fromstring

for remove_blank_text in [True, False]:
    print('remove_blank_text={}'.format(remove_blank_text))
    lines = 65540
    xmldata = '<a>' + ('<b/>\n' * lines) + '</a>'
    tree = fromstring(xmldata, XMLParser(remove_blank_text=remove_blank_text))
    ok = True
    # Each <b/> child sits on its own line, so child i is expected on line i + 1.
    for i, e in enumerate(tree):
        line = i + 1
        if line != e.sourceline:
            ok = False
            print(' Expected: {}, got: {}'.format(line, e.sourceline))
    if ok:
        print(' OK')
-------
Output:
-------
remove_
Expected: 65536, got: 65535
Expected: 65537, got: 65535
Expected: 65538, got: 65535
Expected: 65539, got: 65535
Expected: 65540, got: 65535
remove_
Expected: 65535, got: 65536
Expected: 65536, got: 65537
Expected: 65537, got: 65538
Expected: 65538, got: 65539
Expected: 65539, got: 65540
Expected: 65540, got: 65541
-------
lxml can't do anything about this, since this would need to be handled in the parser of libxml2. Larger line numbers are stored only in text nodes, many of which are discarded (in fact, the most important ones) if whitespace nodes are removed.
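The mechanism described here can be illustrated with a small toy model. This is only a sketch, not libxml2's actual code: the `Node` class and the `source_line` and `build` helpers are hypothetical names. It assumes that every node has a 16-bit line field that saturates at 65535, that only text nodes keep the full line number, that an element whose field is saturated borrows the number from the text node directly following it, and that the newline text node after `<b/>` is recorded on the following line.

```python
# Toy model only: these names are hypothetical, not libxml2 or lxml API.
MAX_SHORT_LINE = 65535  # largest value a 16-bit unsigned field can hold


class Node:
    def __init__(self, kind, line):
        self.kind = kind                       # "element" or "text"
        self.line = min(line, MAX_SHORT_LINE)  # saturating 16-bit field
        # Assumption: only text nodes keep the untruncated line number.
        self.big_line = line if kind == "text" else None
        self.next = None


def source_line(node):
    """Return the best line number recoverable for a node in this model."""
    if node.line < MAX_SHORT_LINE:
        return node.line
    if node.kind == "text" and node.big_line is not None:
        return node.big_line
    if node.next is not None and node.next.kind == "text":
        return node.next.big_line  # borrowed from the following text node
    return MAX_SHORT_LINE          # nothing better is available


def build(lines, keep_whitespace):
    """Model '<a>' + '<b/>\\n' * lines + '</a>' as a chain of sibling nodes."""
    elements = []
    prev = None
    for n in range(1, lines + 1):
        elem = Node("element", n)
        if prev is not None:
            prev.next = elem
        elements.append(elem)
        prev = elem
        if keep_whitespace:
            # Assumption: the newline after <b/> is recorded on the next line.
            text = Node("text", n + 1)
            prev.next = text
            prev = text
    return elements
```

Under these assumptions, `build(65540, keep_whitespace=False)` leaves every element past line 65535 stuck at 65535, while `keep_whitespace=True` yields the off-by-one values from the report, because each element picks up the line of the whitespace node after it.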
Also, it could even be argued that line numbers are in fact undefined if the whitespace nodes that split the lines are removed during parsing.
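If line numbers are needed regardless of how whitespace is handled, one possible approach is to record them from the raw text in a separate pass. The sketch below uses the standard library's `xml.parsers.expat` rather than lxml; the helper name `element_lines` is made up for illustration, and matching the recorded lines back to lxml elements (for example by document order) is left open.

```python
import xml.parsers.expat


def element_lines(xml_text):
    """Return (tag, line) pairs for every start tag, in document order."""
    parser = xml.parsers.expat.ParserCreate()
    found = []

    def on_start(name, attrs):
        # CurrentLineNumber is the line on which the current event starts.
        found.append((name, parser.CurrentLineNumber))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return found
```

For the document from the example program, `element_lines(xmldata)[1:]` would hold one entry per `<b/>` element, which could be compared against `e.sourceline`.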