lxml

HTML parsing with early </html> discards rest of document

Bug #1305381 reported by Christopher Foo on 2014-04-10

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

Python : sys.version_info(major=3, minor=3, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

When parsing a complete document containing tag soup in which </html> occurs in the middle of the document, every element after it is discarded. For example, the <hr> is missing from the output:

    >>> import io
    >>> import lxml.html
    >>> doc = lxml.html.parse(io.BytesIO(b'<!DOCTYPE html><html><body>1<a href="2"></a></body><img src="3"></html><hr>4'))
    >>> lxml.html.tostring(doc)
    b'<!DOCTYPE html>\n<html><body>1<a href="2"></a></body><img src="3"></html>'

If I leave out the doctype, the <hr> remains, but the document now contains a second, nested <html> element:

    >>> doc = lxml.html.parse(io.BytesIO(b'<html><body>1<a href="2"></a></body><img src="3"></html><hr>4'))
    >>> lxml.html.tostring(doc)
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">\n<html><body>1<a href="2"></a></body><img src="3"><html><hr><p>4</p></html></html>'

I expected that parsing the full document in the first example would still keep the <hr>, as in the second example (even if the result is not quite right). I need this behavior because I'm working on a Wget-like web crawler that needs to archive all links properly and optionally perform link conversion on the document.


scoder (scoder) wrote on 2014-04-10:

Parsing is done by libxml2.

Changed in lxml:
status:	New → Invalid
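
Since tree building is delegated to libxml2, one possible workaround for the link-archiving use case described in the report is to skip tree building entirely and collect links from the token stream. The following is only a minimal sketch using Python's standard-library html.parser, not lxml; the names LinkCollector and collect_links are hypothetical, and the sketch assumes that only href and src attributes matter. An event-based parser reports every start tag it sees, so markup after an early </html> is not dropped.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: record start tags and href/src values in order."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tags = []   # every start tag seen, in document order
        self.links = []  # (tag, attribute, value) triples

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            # Assumption: only href and src are treated as links here.
            if name in ("href", "src") and value is not None:
                self.links.append((tag, name, value))


def collect_links(markup: bytes, encoding: str = "utf-8") -> LinkCollector:
    """Feed raw bytes to the collector and return it for inspection."""
    parser = LinkCollector()
    parser.feed(markup.decode(encoding, errors="replace"))
    parser.close()
    return parser


if __name__ == "__main__":
    soup = b'<!DOCTYPE html><html><body>1<a href="2"></a></body><img src="3"></html><hr>4'
    result = collect_links(soup)
    print(result.tags)
    print(result.links)
```

This only extracts links; it does not produce a tree, so link conversion on the document would still need a separate rewriting step.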
