Missing tail in iterparse

Bug #1684273 reported by Jason Owen on 2017-04-19
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| lxml | | Medium | Unassigned | |
| lxml (Ubuntu) | | Undecided | Unassigned | |

Bug Description

Given a minimal parser (below) and a particular input file (attached), iterparse is not returning the `tail` of the last `<span>` tag.

I am listening for the `end` event, which is the default, instead of the `start` event.

Changing the input, for example by deleting unrelated tags such as the `<link>` tag in the `<head>`, causes the missing text to reappear. This makes it hard to produce a minimal input file! I was able to remove everything /after/ the element with the missing tail without affecting the bug, so that truncated file is what I attached.

I took the silence on the mailing list to mean that I did not have any obvious problems with the way I was using iterparse. :) https://mailman-mail5.webfaction.com/pipermail/lxml/2017-April/007882.html

---

```python
#!/usr/bin/env python3

import sys
from lxml import etree

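# No `events` argument is passed, so only the default "end" events are
# reported: each element is printed once its closing tag has been parsed.
for _, element in etree.iterparse(sys.argv[1], html=True):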
    print((
        element.tag,
        element.attrib,
        element.text,
        element.tail,
    ))
```

Invoke with:
```sh
$ ./bug.py bug.html | grep "splays their blue cards left"
```

Expected output:
```
('span', {'class': 'age e'}, '4', '.\n... Nnastya splays their blue cards left.\n')
```

Actual output: none, and return code 1.
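One way to narrow this down is to compare the tail seen at each `end` event with the tail the same element carries once parsing has finished. The sketch below is only a diagnostic idea, not part of the original report: `find_truncated_tails` is a hypothetical helper name, and it assumes that elements kept from earlier events still reflect the final tree after the loop ends.

```python
import sys
from lxml import etree


def find_truncated_tails(source, **kwargs):
    """Return (tag, tail_at_end_event, final_tail) for every element whose
    tail changed after its "end" event was reported.

    `source` is a filename or binary file object; extra keyword arguments
    (for example html=True) are passed through to etree.iterparse().
    """
    seen = []
    for _, element in etree.iterparse(source, events=("end",), **kwargs):
        # Remember what the tail looked like when the event was delivered.
        seen.append((element, element.tail))

    # After the loop the whole document has been parsed, so element.tail
    # now holds the final value. Report every element where it differs.
    return [
        (element.tag, early_tail, element.tail)
        for element, early_tail in seen
        if early_tail != element.tail
    ]


if __name__ == "__main__" and len(sys.argv) > 1:
    for row in find_truncated_tails(sys.argv[1], html=True):
        print(row)
```

If the list is empty for a given input, every tail was already complete at its `end` event; any rows it prints point at the elements affected.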

---

Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 7, 3, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Jason Owen (jason-a-owen) wrote :

When used with the system python3-lxml package, rather than the version pip installed into a venv:

Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 5, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

I agree that this is unexpected. It can be fixed by internally passing more data into the parser before generating the "end" parse event, i.e. by waiting for the tail text to end before yielding the element that owns it.
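Until such a fix is available, a possible user-side workaround is to read an element's tail only when its parent's "end" event arrives, since by then the parser has necessarily consumed everything between the child's closing tag and the parent's closing tag. The following is a minimal sketch under that assumption, not the fix described above; `iter_elements_late` is a hypothetical helper name.

```python
import io
from lxml import etree


def iter_elements_late(source, **kwargs):
    """Yield each element only after its parent's "end" event, so that the
    element's tail has been fully parsed by the time it is seen.

    Extra keyword arguments (for example html=True) are passed through to
    etree.iterparse(). The root element has no parent, so it is yielded
    last, after parsing has finished.
    """
    context = etree.iterparse(source, events=("end",), **kwargs)
    for _, parent in context:
        for child in parent:
            # Comments and processing instructions have a non-string tag.
            if isinstance(child.tag, str):
                yield child
    if context.root is not None:
        yield context.root


# Small in-memory example; a real run would pass a filename instead.
sample = io.BytesIO(b"<root><p><span>4</span>. tail text</p></root>")
for element in iter_elements_late(sample):
    print((element.tag, element.text, element.tail))
```

The trade-off is that elements are delivered later than with plain `iterparse`, so a child cannot be cleared from memory before its parent has closed.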

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed