lxml

iterparse crashes on parsing misplaced HTML tags

Bug #2044225 reported by micheal on 2023-11-22

This bug affects 1 person

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| lxml | Fix Released | Medium | scoder | lxml 5.0.1 |

Bug Description

```python
import io
from lxml import etree

# Any self-closing tag placed before the <html> tag triggers the crash
body = '\n\n\n<meta http-equiv="X-UA-Compatible" content="IE=edge" />\n<html>\n<head>\n</head>\n<body>\n</body>\n</html>\n'

PARSE_TAGS = { 'meta', 'html', 'body' }

body_io = io.BytesIO(body.encode())
context = etree.iterparse(body_io, events=('start', 'end'), html=True, recover=True, resolve_entities=False, huge_tree=False, tag=PARSE_TAGS)

print([x for x in context])

```

## current output:

```
Traceback (most recent call last):
  File "/tmp/iter.py", line 12, in <module>
    print([x for x in context])
          ^^^^^^^^^^^^^^^^^^^^
  File "/tmp/iter.py", line 12, in <listcomp>
    print([x for x in context])
          ^^^^^^^^^^^^^^^^^^^^
  File "src/lxml/iterparse.pxi", line 210, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 195, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 230, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1379, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 609, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 618, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 724, in lxml.etree._handleParseResult
  File "src/lxml/etree.pyx", line 334, in lxml.etree._ExceptionContext._raise_if_stored
  File "src/lxml/saxparser.pxi", line 520, in lxml.etree._handleSaxEndNoNs
  File "src/lxml/saxparser.pxi", line 556, in lxml.etree._pushSaxEndEvent
IndexError: pop from empty list
```

## expected:

The `meta` tag is parsed and reported without raising an exception.

scoder (scoder) wrote on 2023-12-30:

Thanks for the short reproducer.

The issue is not that the tag is self-closing but that it occurs outside of the html/head context. As a result, the parser in libxml2 injects its own "html" and "head" tags before the "meta" tag that it reports, but those come from a different context. They are C string constants that libxml2 does not intern in the document's hash table, so lxml does not recognise them as identical to the expected tag name and neither reports nor remembers the start tag. By the time the end event is reported, the element has been created and its name is read from the hash table as expected, so the end event is reported without having been prepared by the start event. Hence the failure. I'll see if I can fix it.
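
The mechanism described above can be illustrated with a toy model in pure Python. This is only a sketch and not lxml's or libxml2's actual code: `Name`, `NameDict`, and `run_events` are hypothetical names invented for the illustration. `Name` objects stand in for C string pointers, and names are matched by identity, the way a comparison of interned names would work.

```python
class Name:
    """Stand-in for a C string pointer: equal text, but distinct identity."""

    def __init__(self, text):
        self.text = text


class NameDict:
    """Toy per-document name table that interns names (hypothetical)."""

    def __init__(self):
        self._names = {}

    def intern(self, text):
        # Always hand back the same Name object for the same text.
        return self._names.setdefault(text, Name(text))


def run_events(events, wanted):
    """Replay (kind, name) events, matching wanted names by identity.

    A matching start event pushes onto a stack; a matching end event
    pops from it. A start event whose name is equal in text but not
    identical is skipped, so nothing is pushed for it.
    """
    stack, reported = [], []
    for kind, name in events:
        if not any(name is w for w in wanted):
            continue  # identity mismatch: the event is silently skipped
        if kind == "start":
            stack.append(name.text)
            reported.append(("start", name.text))
        else:
            reported.append(("end", stack.pop()))
    return reported


names = NameDict()
wanted = [names.intern("html")]

# Both events use the interned name: start and end are paired correctly.
ok = [("start", names.intern("html")), ("end", names.intern("html"))]
print(run_events(ok, wanted))

# The start event carries a constant that bypassed the name table.
injected = Name("html")
bad = [("start", injected), ("end", names.intern("html"))]
try:
    run_events(bad, wanted)
except IndexError as exc:
    print("IndexError:", exc)
```

In the second run the start event is skipped because its name is a different object, so the end event finds an empty stack, which mirrors the `IndexError: pop from empty list` in the traceback.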

Changed in lxml:
assignee:	nobody → scoder (scoder)
importance:	Undecided → Medium
status:	New → Confirmed

scoder (scoder) wrote on 2023-12-30:

Fixed in https://github.com/lxml/lxml/commit/28b416fcada162e8e9e4bd44ec03e7e8fcecc344

Changed in lxml:
milestone:	none → 5.0.1
status:	Confirmed → Fix Committed

scoder (scoder) on 2023-12-30

summary:

- iterprase crashes on parsing self-closing tags
+ iterparse crashes on parsing misplaced HTML tags

scoder (scoder) on 2024-01-06

Changed in lxml:
status:	Fix Committed → Fix Released
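
For code that has to run against several lxml versions, a small check like the following could tell whether the installed release includes the fix. It is a sketch that assumes the fix first shipped in 5.0.1, as the milestone above indicates, and was not backported to earlier release series; `includes_fix` is a hypothetical helper name, while `etree.LXML_VERSION` is lxml's version tuple.

```python
def includes_fix(version):
    """Return True if a version tuple is at least 5.0.1.

    Assumption: the fix is present in 5.0.1 and later only.
    """
    return tuple(version[:3]) >= (5, 0, 1)


try:
    from lxml import etree
except ImportError:
    print("lxml is not installed")
else:
    # LXML_VERSION is a tuple of integers, e.g. (5, 0, 1, 0).
    print(etree.LXML_VERSION, includes_fix(etree.LXML_VERSION))
```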
