incorrect sourceline for long xmls

Bug #1666195 reported by Daniel PUIU
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

Python : sys.version_info(major=3, minor=4, micro=4, releaselevel='final', serial=0)
lxml.etree : (3, 6, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Windows 10 x64, Python x64, lxml x64

For very long XML files, the returned sourceline is wrong.
I have built a custom XML file to show my problem (it is attached to this bug):

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <child>
  <grandchild>GC0</grandchild>
  <grandchild>GC1</grandchild>
  <grandchild>GC2</grandchild>
  <grandchild>GC3</grandchild>
  <grandchild>GC4</grandchild>
  <grandchild>GC5</grandchild>
  <grandchild>GC6</grandchild>
  <grandchild>GC7</grandchild>
  <grandchild>GC8</grandchild>
  <grandchild>GC9</grandchild>
 </child>
        ...
 <child>
  <grandchild>GC0</grandchild>
  <grandchild>GC1</grandchild>
  <grandchild>GC2</grandchild>
  <grandchild>GC3</grandchild>
  <grandchild>GC4</grandchild>
  <grandchild>GC5</grandchild>
  <grandchild>GC6</grandchild>
  <grandchild>GC7</grandchild>
  <grandchild>GC8</grandchild>
  <grandchild>GC9</grandchild>
 </child>
</root>

I have 32768 child nodes. Starting with the child at index 5461 (children[5461]), the returned sourceline is wrong by at least 1 line:
that child is at line 65535, but sourceline returns 65536.
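The line numbers above can be cross-checked with a little arithmetic. The sketch below assumes the layout of the sample file as shown (XML declaration on line 1, <root> on line 2, 12 lines per <child> block) and takes 65535 as the limit up to which libxml2 stores line numbers exactly, per the maintainer's reply further down. The helper name child_start_line is made up for illustration.

```python
# Assumed layout, taken from the sample above: line 1 is the XML declaration,
# line 2 is <root>, and every <child> block spans 12 lines
# (<child>, ten <grandchild> lines, </child>).
HEADER_LINES = 2
LINES_PER_CHILD = 12
MAX_EXACT_LINE = 65535  # limit quoted in the maintainer's reply


def child_start_line(index):
    """Real 1-based line of the opening <child> tag for a 0-based index."""
    return HEADER_LINES + 1 + LINES_PER_CHILD * index


# First child whose real line number reaches the limit.
first_affected = next(
    i for i in range(32768) if child_start_line(i) >= MAX_EXACT_LINE
)
print(first_affected, child_start_line(first_affected))  # 5461 65535
print(child_start_line(32767))  # 393207, real line of the last <child>
```

So children[5461] really starts on line 65535, and the last child really starts on line 393207, one less than the 393208 that sourceline reports.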

The following code:

for grandchild in children[5461].getchildren():
 print(grandchild.getparent().sourceline, grandchild.sourceline)

prints

65536 65536
65536 65537
65536 65538
65536 65539
65536 65540
65536 65541
65536 65542
65536 65543
65536 65544
65536 65545

For more deeply nested elements, the difference between the real sourceline and the returned sourceline grows.

In this case the following code:

for grandchild in children[-1].getchildren():
 print(grandchild.getparent().sourceline, grandchild.sourceline)

prints

393208 393208
393208 393209
393208 393210
393208 393211
393208 393212
393208 393213
393208 393214
393208 393215
393208 393216
393208 393217

So the difference is still 1.

Tags: sourceline
Daniel PUIU (danielpuiu) wrote :

I created a second xml which looks like this:

<root>
 <level1>
  <child1><child2><child3></child3></child2></child1>
  <child4></child4><child5></child5><child6></child6>
 </level1>
        ...
 <level1>
  <child1><child2><child3></child3></child2></child1>
  <child4></child4><child5></child5><child6></child6>
 </level1>
</root>

c1 = root.xpath('.//child1')

Starting with the child1 element at index 16384, sourceline is no longer correct and stays at 65535:

>>> c1[16382].sourceline
65531
>>> c1[16383].sourceline
65535
>>> c1[16384].sourceline
65535
>>> c1[16385].sourceline
65535
>>> c1[16386].sourceline
65535
>>> c1[-1].sourceline
65535
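For comparison, here is what the real line numbers should be in this second file. This is only a sketch: it assumes the file has no XML declaration (so <root> is on line 1, as shown above) and that every <level1> block spans exactly 4 lines. The min() clamp is just a model that happens to match the output above. It is not a claim about how libxml2 computes the value, and the function names are made up.

```python
# Assumed layout of the second sample: <root> on line 1, then 4 lines per
# <level1> block, with <child1> on the second line of each block.
MAX_EXACT_LINE = 65535


def child1_real_line(index):
    """Real 1-based line of the <child1> element with the given 0-based index."""
    return 3 + 4 * index


def child1_observed_line(index):
    """Model of the values printed above: the real line, capped at 65535."""
    return min(child1_real_line(index), MAX_EXACT_LINE)


for i in (16382, 16383, 16384, 16385):
    print(i, child1_real_line(i), child1_observed_line(i))
# 16382 65531 65531
# 16383 65535 65535
# 16384 65539 65535
# 16385 65543 65535
```

Index 16383 is the last element whose real line still fits, which is why 16384 is the first one reported incorrectly.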

scoder (scoder) wrote :

Sorry, not a bug in lxml itself. This is due to the way libxml2 handles large source line numbers internally. It stores them exactly up to 65535 (for historical and memory efficiency reasons), and then switches to a different scheme that only remembers them in text nodes, not in elements. There is nothing lxml can do about this.
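If exact line numbers past 65535 are needed, one possible workaround is a separate pass over the same file with a parser that does not have this limit. The sketch below is not something suggested in this report. It assumes that Python's standard-library expat bindings (xml.parsers.expat) are acceptable for that extra pass, and the helpers element_lines and build_sample are made up for illustration. build_sample only recreates the layout of the first sample file.

```python
import xml.parsers.expat


def element_lines(xml_bytes, tag):
    """Return the 1-based start line of every <tag> element, in document order."""
    lines = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == tag:
            # Inside a handler, CurrentLineNumber refers to the event being
            # reported, i.e. the start tag of this element.
            lines.append(parser.CurrentLineNumber)

    parser.StartElementHandler = on_start
    parser.Parse(xml_bytes, True)  # True: this is the final chunk of input
    return lines


def build_sample(n_children):
    """Recreate the layout of the first sample: 12 lines per <child> block."""
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<root>"]
    for _ in range(n_children):
        parts.append(" <child>")
        parts.extend("  <grandchild>GC%d</grandchild>" % i for i in range(10))
        parts.append(" </child>")
    parts.append("</root>")
    return "\n".join(parts).encode("utf-8")


child_lines = element_lines(build_sample(6000), "child")
print(child_lines[5461], child_lines[5462])
```

The resulting list is in document order, so it could be paired with root.iter('child') from lxml by position, assuming both parsers see the same bytes.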

Changed in lxml:
status: New → Invalid