iterparse crashes on parsing misplaced HTML tags

Bug #2044225 reported by micheal
This bug affects 1 person
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 5.0.1

Bug Description

```python
import io
from lxml import etree

# any self closing tag before html tag
body = '\n\n\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<html>\n<head>\n</head>\n<body>\n</body>\n</html>\n'

PARSE_TAGS = { 'meta', 'html', 'body' }

body_io = io.BytesIO(body.encode())
context = etree.iterparse(body_io, events=('start', 'end'), html=True, recover=True, resolve_entities=False, huge_tree=False, tag=PARSE_TAGS)

print([x for x in context])

```

## current output:

```
Traceback (most recent call last):
  File "/tmp/iter.py", line 12, in <module>
    print([x for x in context])
          ^^^^^^^^^^^^^^^^^^^^
  File "/tmp/iter.py", line 12, in <listcomp>
    print([x for x in context])
          ^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1379, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 609, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 724, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 334, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 520, in lxml.etree._handleSaxEndNoNs
  File "src/lxml/saxparser.pxi", line 556, in lxml.etree._pushSaxEndEvent
IndexError: pop from empty list
```

## expected:

The meta tag is parsed and its events are reported without an exception.

scoder (scoder) wrote :

Thanks for the short reproducer.

The issue is not that the tag is self-closing but that it occurs outside of the html/head context. In that case, the parser in libxml2 injects its own "html" and "head" tags before the "meta" tag that it reports, and the names of those injected tags come from a different context: they are C string constants that libxml2 does not intern in the document's hash table. lxml therefore does not recognise them as identical to the expected tag name, and neither reports nor remembers the start tag. By the time the end event is reported, the element has been created and its name is read from the hash table as expected, so the end event is reported without having been prepared by a start event. Hence the failure. I'll see if I can fix it.
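
The mechanism described above can be sketched with a toy model in pure Python. This is not lxml's code: `CName`, `ToyEventFilter`, `INJECTED_HTML`, and the simplified event order are all hypothetical names and assumptions made for illustration. The sketch only shows how matching names by identity (as with interned C string pointers) can skip a start event and later pop from an empty stack on the corresponding end event.

```python
class CName:
    """Stand-in for a C string pointer: two CName objects count as the
    same name only if they are the same object (identity comparison)."""

    def __init__(self, text):
        self.text = text


class ToyEventFilter:
    """Hypothetical start/end event filter that matches names by identity."""

    def __init__(self, wanted):
        self._wanted = wanted   # interned names we want events for
        self._stack = []        # names whose start event was recorded
        self.events = []

    def _matches(self, name):
        # Identity check, mimicking a comparison of interned pointers.
        return any(name is w for w in self._wanted)

    def start(self, name):
        if self._matches(name):
            self._stack.append(name)
            self.events.append(("start", name.text))

    def end(self, name):
        if self._matches(name):
            self._stack.pop()   # assumes a matching start was recorded
            self.events.append(("end", name.text))


# Names interned in the toy document dictionary.
interned = {"html": CName("html"), "meta": CName("meta")}
# Constant used for the tag the parser injects by itself; never interned.
INJECTED_HTML = CName("html")


def replay():
    """Replay a simplified event order and return (events, error message)."""
    f = ToyEventFilter(list(interned.values()))
    try:
        f.start(INJECTED_HTML)       # not recognised, nothing recorded
        f.start(interned["meta"])    # recognised and recorded
        f.end(interned["meta"])      # matching start is popped
        f.end(interned["html"])      # recognised, but the stack is empty
    except IndexError as exc:
        return f.events, str(exc)
    return f.events, None


events, error = replay()
print(events)
print(error)
```

In this toy model the failure disappears if the start handler compares `name.text` instead of object identity. That is one conceivable remedy for the sketch, not necessarily how the fix in lxml works.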

Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
status: New → Confirmed
scoder (scoder) wrote :
Changed in lxml:
milestone: none → 5.0.1
status: Confirmed → Fix Committed
scoder (scoder)
summary: - iterprase crashes on parsing self-closing tags
+ iterparse crashes on parsing misplaced HTML tags
scoder (scoder)
Changed in lxml:
status: Fix Committed → Fix Released