HTML parsing with early </html> discards rest of document
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
Python : sys.version_
lxml.etree : (3, 3, 4, 0)
libxml used : (2, 9, 1)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
When parsing a complete document that contains tag soup in which the </html> occurs in the middle of the document, all elements after it are discarded. For example, the <hr> is missing:
>>> doc = lxml.html.
>>> lxml.html.
b'<!DOCTYPE html>\n<
If I leave out the doctype, the <hr> remains, but the document now contains multiple <html> elements:
>>> doc = lxml.html.
>>> lxml.html.
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
I expected that parsing the full document, as in the first example, would still keep the <hr> as the second example does (even if the result is not quite right). I need this behavior because I'm working on a Wget-like web crawler that needs to archive all links properly and also perform optional link conversion on the document.
Parsing is done by libxml2.
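For the link-archiving half of the use case, one possible workaround is to skip tree building and scan the markup with the standard library's event-based html.parser, which reports tags as it meets them and so keeps going past an early </html>. This is only a sketch, not lxml's or libxml2's behavior: the TagCollector class, the collect() helper, and the SAMPLE markup are hypothetical names made up for illustration, and because no tree is built it would not cover link conversion by itself.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record every start tag and every href/src value, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag, regardless of where </html> appeared.
        self.tags.append(tag)
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def collect(markup):
    """Feed the whole string to the tokenizer and return the collector."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector


# Hypothetical input: </html> appears before the end of the markup.
SAMPLE = (
    "<!DOCTYPE html><html><body><p>before</p></body></html>"
    '<hr><a href="after.html">after</a>'
)

result = collect(SAMPLE)
print(result.tags)
print(result.links)
```

Here the <hr> and the link that follow the early </html> are still reported, since the tokenizer does not track document structure.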