HTML parsing with early </html> discards rest of document

Bug #1305381 reported by Christopher Foo
This bug affects 1 person
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

Python : sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

When parsing a complete document that contains tag soup in which </html> occurs in the middle of the document, everything after the </html> is discarded. For example, the <hr> and the trailing text are missing from the output:

    >>> import io, lxml.html
    >>> doc = lxml.html.parse(io.BytesIO(b'<!DOCTYPE html><html><body>1<a href="2"></a></body><img src="3"></html><hr>4'))
    >>> lxml.html.tostring(doc)
    b'<!DOCTYPE html>\n<html><body>1<a href="2"></a></body><img src="3"></html>'

If I leave out the doctype, the <hr> remains, but the document now contains a second <html> element nested inside the first:

    >>> doc = lxml.html.parse(io.BytesIO(b'<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'))
    >>> lxml.html.tostring(doc)
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body>1<a href="2"></a></body><img src="3"><html><hr><p>4</p></html></html>'

I expected that parsing the full document in the first example would still keep the <hr>, as in the second example (even if the resulting tree is not quite right). I need this behavior because I'm working on a Wget-like web crawler that has to archive all links properly and optionally perform link conversion on the document.
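
For the link-archiving part of that use case, one possible workaround is to collect links at the tokenizer level, so that nothing depends on how the tree is built after </html>. The following is only a minimal sketch, assuming the standard-library html.parser module is acceptable for the crawler. LinkCollector, collect_links, and LINK_ATTRS are hypothetical names made up for this illustration and are not part of lxml or libxml2. It does not help with link conversion, which still needs a full tree.

```python
from html.parser import HTMLParser

# Attributes treated as links in this sketch (an assumption, not an exhaustive list).
LINK_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Hypothetical helper: records start tags and link attributes in document order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []   # every start tag seen, including any after </html>
        self.links = []  # (tag, attribute, value) triples

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            if name in LINK_ATTRS and value is not None:
                self.links.append((tag, name, value))


def collect_links(markup, encoding="utf-8"):
    """Tokenize the whole byte string; no tree is built, so </html> ends nothing."""
    collector = LinkCollector()
    collector.feed(markup.decode(encoding, errors="replace"))
    collector.close()
    return collector


# The document from the first example in this report.
SAMPLE = b'<!DOCTYPE html><html><body>1<a href="2"></a></body><img src="3"></html><hr>4'
result = collect_links(SAMPLE)
print(result.links)
print(result.tags)
```

Because the tokenizer only emits events, the <hr> after </html> is still reported as a start tag, and any link attributes appearing there would be collected too.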

scoder (scoder) wrote :

Parsing is done by libxml2.

Changed in lxml:
status: New → Invalid