Null character at the beginning of a string in HTML results in a partial tree

Bug #1713329 reported by Dariush
This bug affects 1 person
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (not set)

Bug Description

Consider this code:

from lxml import html
ht = html.fromstring('<a>1</a> <a>2\0 2+</a> <a>3</a> <a>\0 4</a> <a>5</a>')
print(html.tostring(ht))

I build a tree from five `a` elements and serialize it. The second element contains a null character in a non-initial position, while the fourth has one at the very start of its text. Parsing stops as soon as the latter null is encountered, so the output looks like this:

b'<span><a>1</a> <a>2 2+</a> <a>3</a> <a></a></span>'

The rest of the fourth element and everything after it are silently dropped, which doesn't look right. The non-initial null, however, is handled fine.

Debug info:

Python : sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 8, 0, 0)
libxml used : (2, 9, 3)
libxml compiled : (2, 9, 3)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)

Paul Richard (psr520) wrote :

I ran into this same bug using: element_tree.findall('.//td')
A NULL was embedded in one of the table data rows, and everything after it was silently dropped.
This bug causes data loss.
lxml 4.2.1
libxml2 2.9.4
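
Since the truncation is silent, it may help to check the input for null characters before handing it to the parser. The following is only a sketch of such a check, not part of lxml; the helper name `find_nul_offsets` is made up for illustration, and it uses nothing beyond the standard library.

```python
def find_nul_offsets(markup):
    """Return the offsets of every NUL character (or NUL byte) in markup.

    Hypothetical helper, not an lxml API. Accepts str or bytes.
    """
    nul = b"\x00" if isinstance(markup, bytes) else "\x00"
    offsets = []
    pos = markup.find(nul)
    while pos != -1:
        offsets.append(pos)
        pos = markup.find(nul, pos + 1)
    return offsets


# Same markup as in the bug description.
sample = '<a>1</a> <a>2\0 2+</a> <a>3</a> <a>\0 4</a> <a>5</a>'
offsets = find_nul_offsets(sample)
if offsets:
    print("input contains NUL characters at offsets:", offsets)
```

A non-empty result is a hint that the parsed tree could be incomplete, so the caller can log a warning or reject the document instead of losing data unnoticed.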

scoder (scoder) wrote :

I can reproduce this with xmllint, which means that the behaviour is due to libxml2, not lxml:

$ python3 -c 'print("<a>1</a> <a>2\0 2+</a> <a>3</a> <a>\0 4</a> <a>5</a>")' > h.html
$ xmllint --memory --html h.html
h.html:1: HTML parser error : Char 0x0 out of allowed range
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a>1</a> <a>2 2+</a> <a>3</a> <a></a>
</body></html>

Changed in lxml:
status: New → Invalid
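
Because the behaviour comes from libxml2, one possible workaround on the caller's side is to remove null characters before parsing. This is a minimal sketch under that assumption; `strip_nul` is a hypothetical helper rather than anything provided by lxml, and dropping the characters is a choice that may not suit every data set.

```python
def strip_nul(markup):
    """Return markup with every NUL character (or NUL byte) removed.

    Hypothetical helper, not an lxml API. The bytes branch assumes an
    encoding such as UTF-8 or Latin-1; in UTF-16 or UTF-32 data, zero
    bytes are part of ordinary characters and must not be stripped.
    """
    if isinstance(markup, bytes):
        return markup.replace(b"\x00", b"")
    return markup.replace("\x00", "")


# Same markup as in the bug description.
raw = '<a>1</a> <a>2\0 2+</a> <a>3</a> <a>\0 4</a> <a>5</a>'
cleaned = strip_nul(raw)
print(repr(cleaned))
```

The cleaned string can then be passed to `html.fromstring()` as in the original example. With no null characters left in the input, the truncation described in this report should not be triggered.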