Wrong sourcelines 65535 when remove_blank_text=True

Bug #1742121 reported by Volker Diels-Grabsch
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
libxml2   New       Undecided    Unassigned
lxml      Invalid   Undecided    Unassigned

Bug Description

The old bug #1341590 reappeared for etree.XMLParser(remove_blank_text=True).

When loading an XML document with a large number of lines, the sourceline is correct for early lines, then stays constant (and wrong) at 65535 for later lines, and is correct again for even later lines.

Tested with libxml2 version 2.9.4 and python-lxml version 4.1.0 in Python 2.7.14.

To add to this, I just found an even stranger case: with remove_blank_text=False, an off-by-one error appears in sourceline.

Example program:

----------------------------------------------------------------------
from lxml.etree import XMLParser, fromstring
for remove_blank_text in [True, False]:
    print('remove_blank_text={!r}'.format(remove_blank_text))
    lines = 65540
    xmldata = '<a>' + ('<b/>\n' * lines) + '</a>'
    tree = fromstring(xmldata, XMLParser(remove_blank_text=remove_blank_text))
    ok = True
    for i, e in enumerate(tree.iterfind('b')):
        line = i + 1
        if line != e.sourceline:
            ok = False
            print(' Expected: {}, got: {}'.format(line, e.sourceline))
    if ok:
        print(' OK')
----------------------------------------------------------------------

Output:

----------------------------------------------------------------------
remove_blank_text=True
  Expected: 65536, got: 65535
  Expected: 65537, got: 65535
  Expected: 65538, got: 65535
  Expected: 65539, got: 65535
  Expected: 65540, got: 65535
remove_blank_text=False
  Expected: 65535, got: 65536
  Expected: 65536, got: 65537
  Expected: 65537, got: 65538
  Expected: 65538, got: 65539
  Expected: 65539, got: 65540
  Expected: 65540, got: 65541
----------------------------------------------------------------------

scoder (scoder) wrote :

lxml can't do anything about this, since this would need to be handled in the parser of libxml2. Larger line numbers are stored only in text nodes, many of which are discarded (in fact, the most important ones) if whitespace nodes are removed.
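
The explanation above can be illustrated with a toy model in plain Python. It is only a sketch and not libxml2's actual code. It assumes that the per-element line field saturates at 65535, and that the "\n" text node following each element records the line on which that text ends (one past the element's own line). Both assumptions are inferred from the reported output, and the names USHRT_MAX and reported_lines are made up for this illustration.

----------------------------------------------------------------------
USHRT_MAX = 65535  # assumed maximum of the per-element line field

def reported_lines(n_lines, remove_blank_text):
    """Toy model: yield (true_line, reported_line) for each <b/> element."""
    for line in range(1, n_lines + 1):
        stored = min(line, USHRT_MAX)  # the element's own field saturates
        if stored < USHRT_MAX:
            # Small line numbers fit into the element itself.
            yield line, stored
        elif remove_blank_text:
            # The "\n" text node was discarded, so there is no fallback.
            yield line, stored
        else:
            # Assumption: the following "\n" text node carries the line
            # on which it ends, i.e. one past the element's line.
            yield line, line + 1

for flag in (True, False):
    wrong = [(t, r) for t, r in reported_lines(65540, flag) if t != r]
    print('remove_blank_text={!r}: {} mismatches'.format(flag, len(wrong)))
----------------------------------------------------------------------

Under these assumptions the model yields the same mismatches as the output in the bug description: five saturated values with remove_blank_text=True, and six off-by-one values with remove_blank_text=False.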

Also, it could even be argued that line numbers are in fact undefined if the whitespace nodes that split the lines are removed during parsing.
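
If reliable line numbers are needed for such large documents, one possible workaround is to collect them in a separate pass that does not depend on libxml2. The following is a minimal sketch using the standard library's expat bindings, under the assumption that a second parse of the same data is acceptable. The helper name element_lines is hypothetical and not part of lxml.

----------------------------------------------------------------------
from xml.parsers import expat

def element_lines(xmldata, tag):
    """Return the 1-based source line of every start tag named `tag`."""
    lines = []
    parser = expat.ParserCreate()

    def on_start(name, attrs):
        # CurrentLineNumber is the line where the current event starts.
        if name == tag:
            lines.append(parser.CurrentLineNumber)

    parser.StartElementHandler = on_start
    parser.Parse(xmldata, True)  # True: this is the final chunk of data
    return lines

xmldata = '<a>' + ('<b/>\n' * 65540) + '</a>'
lines = element_lines(xmldata, 'b')
print(lines[0], lines[65535], lines[-1])
----------------------------------------------------------------------

The resulting list is in document order, so it can be matched up with tree.iterfind('b') from the example program by position.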

Changed in lxml:
status: New → Invalid