iterparse crashes on parsing misplaced HTML tags
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
lxml | Fix Released | Medium | scoder | |
Bug Description
```
import io
from lxml import etree
# any self closing tag before html tag
body = '\n\n\n<meta http-equiv=
PARSE_TAGS = { 'meta', 'html', 'body' }
body_io = io.BytesIO(
context = etree.iterparse
print([x for x in context])
```
## current output:
```
Traceback (most recent call last):
File "/tmp/iter.py", line 12, in <module>
print([x for x in context])
File "/tmp/iter.py", line 12, in <listcomp>
print([x for x in context])
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
File "src/lxml/
IndexError: pop from empty list
```
## expected:
parsed meta tag
summary:
- iterprase crashes on parsing self-closing tags
+ iterparse crashes on parsing misplaced HTML tags

Changed in lxml:
status: Fix Committed → Fix Released
Thanks for the short reproducer.
The issue is not that the tag is self-closing but that it occurs outside of the html/head context. As a result, the parser in libxml2 injects its own "html" and "head" tags before the "meta" tag that it reports, and the names of those injected tags come from a different context: they are C string constants that libxml2 does not intern in the document's hash table. lxml therefore does not recognise them as identical to the expected tag name, and it neither reports nor remembers the start tag. By the time the end event is reported, the element has been created and its name is read from the hash table as expected, so the end event is reported without having been prepared by a start event. Hence the failure. I'll see if I can fix it.
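The mechanism described above can be illustrated with a small pure-Python toy model. This is only a sketch and not lxml's or libxml2's actual code: `Name`, `NameDict` and `replay` are hypothetical names invented here, and the event list is a simplified guess at what the parser reports. It assumes that tag names are matched by identity (like C pointers) rather than by content, and that each matched end event pops an entry that a matched start event was supposed to push.

```python
class Name:
    """Toy stand-in for a C string pointer: two names match only if they are the same object."""
    def __init__(self, text):
        self.text = text


class NameDict:
    """Toy stand-in for the document's hash table of interned names."""
    def __init__(self):
        self._table = {}

    def intern(self, text):
        # Return the single shared Name object for this text.
        if text not in self._table:
            self._table[text] = Name(text)
        return self._table[text]


def replay(events, wanted):
    """Replay start/end events, remembering matched starts and popping them on matched ends."""
    open_elements = []
    reported = []
    for kind, name in events:
        if not any(name is w for w in wanted):  # identity, not string equality
            continue
        if kind == "start":
            open_elements.append(name.text)
        else:
            reported.append(open_elements.pop())
    return reported


names = NameDict()
wanted = [names.intern(t) for t in ("meta", "html", "body")]

# The injected "html" start uses a constant that was never interned in the dict.
injected_html = Name("html")
broken_events = [
    ("start", injected_html),
    ("start", names.intern("meta")),
    ("end", names.intern("meta")),
    ("end", names.intern("html")),  # name now read from the dict, so it matches
]

error = None
try:
    replay(broken_events, wanted)
except IndexError as exc:
    error = str(exc)

# If the injected name were interned as well, start and end would pair up.
fixed_events = [("start", names.intern("html"))] + broken_events[1:]
fixed = replay(fixed_events, wanted)
```

In this model the unmatched "html" start is silently skipped, so the matching "html" end has nothing to pop, which reproduces the same `IndexError` message as in the traceback. Interning the injected name makes the two events pair up again; whether the real fix works this way is not stated in the report.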