Null character at the beginning of a string in HTML results in a partial tree
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
Consider this code:
from lxml import html
ht = html.fromstring
print(html.
I create a tree from five `a` elements and check their text. The second one contains a null character in a non-initial position, while the fourth has it in the initial position. Parsing stops immediately after encountering the latter null, so the output looks like this:
b'<span><a>1</a> <a>2 2+</a> <a>3</a> <a></a></span>'
The rest of the fourth element and everything after it is silently ignored, which doesn't look right. A non-initial null, however, is handled fine.
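A possible workaround, sketched here under the assumption that dropping the null characters is acceptable for the data at hand: remove every NUL from the markup before it reaches the parser. The names `strip_null_chars` and `parse_html_without_nulls` are hypothetical helpers, not part of lxml, and the sample string is illustrative rather than the original snippet.

```python
def strip_null_chars(markup):
    """Return markup with every NUL (U+0000) removed; accepts str or bytes."""
    if isinstance(markup, bytes):
        return markup.replace(b"\x00", b"")
    return markup.replace("\x00", "")


def parse_html_without_nulls(markup):
    """Hypothetical wrapper: sanitize first, then hand the result to lxml."""
    # Imported lazily so strip_null_chars stays usable without lxml installed.
    from lxml import html
    return html.fromstring(strip_null_chars(markup))


# Illustrative input: a NUL in a non-initial position (second element)
# and a NUL in the initial position (fourth element).
sample = "<span><a>1</a> <a>2\x002+</a> <a>3</a> <a>\x004</a> <a>5</a></span>"
cleaned = strip_null_chars(sample)
print(repr(cleaned))
```

With the nulls gone, `parse_html_without_nulls(sample)` passes ordinary markup to `html.fromstring`, so the parser never sees the character that cuts the tree short. Note that this changes the text content: the null is deleted, not replaced.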
Debug info:
Python : sys.version_
lxml.etree : (3, 8, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
I ran into this same bug using element_tree.findall('.//td').
A NULL was embedded in one of the table data rows, and everything after it was silently dropped.
This bug causes data loss.
lxml 4.2.1
libxml2 2.9.4
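Since the loss is silent, one option is to check the input for nulls and fail loudly before parsing. This is only a sketch: `find_null_offsets` and `require_no_nulls` are hypothetical helper names, and raising `ValueError` is an assumption about how a caller would want to be told.

```python
def find_null_offsets(markup):
    """Return the offsets of every NUL in markup (str or bytes)."""
    null = b"\x00" if isinstance(markup, bytes) else "\x00"
    offsets = []
    start = markup.find(null)
    while start != -1:
        offsets.append(start)
        start = markup.find(null, start + 1)
    return offsets


def require_no_nulls(markup):
    """Return markup unchanged, or raise ValueError if it contains a NUL."""
    offsets = find_null_offsets(markup)
    if offsets:
        raise ValueError(
            f"markup contains {len(offsets)} NUL character(s) at offsets {offsets}"
        )
    return markup


# Illustrative table rows with a NUL embedded in each cell.
rows = "<td>a\x00b</td><td>\x00c</td>"
print(find_null_offsets(rows))
```

Calling `require_no_nulls(data)` just before the parse step turns a truncated tree into an explicit error that points at the offending offsets.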