HTMLParser(recover=False) is overly strict and does not understand HTML5 content
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Triaged | Undecided | Unassigned | | |
Bug Description
```
doc = """<!doctype html>
<html>
<head>
</head>
<body>
<article>
</article>
</body>
</html>"""
etree.fromstrin
```
blows up with an "lxml.etree.
This is inconvenient: `doctestcompare` provides potentially neat APIs for testing HTML generation (e.g. template output), but it initialises its HTML parser with `recover=False` and provides no way to override this, short of monkeypatching the module to replace the parser.
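For illustration, the monkeypatching workaround mentioned above might look roughly like the sketch below. It assumes that `lxml.doctestcompare` looks up a module-level helper named `html_fromstring` at call time; that is an implementation detail, not a documented API, and may differ between lxml versions. The names `lenient_html_fromstring` and `use_lenient_html_parser` are made up for this example.

```python
from lxml import doctestcompare, etree

DOC = """<!doctype html>
<html>
<head>
</head>
<body>
<article>
</article>
</body>
</html>"""

# Replacement parser: recover=True keeps elements that libxml2 does not know.
_lenient_html_parser = etree.HTMLParser(recover=True, remove_blank_text=True)


def lenient_html_fromstring(html):
    """Parse HTML with the lenient parser (hypothetical helper)."""
    return etree.fromstring(html, _lenient_html_parser)


def use_lenient_html_parser():
    """Swap the parser helper used by doctestcompare.

    Assumption: the module resolves ``html_fromstring`` by name when
    comparing output, so rebinding the attribute is enough.
    """
    doctestcompare.html_fromstring = lenient_html_fromstring


use_lenient_html_parser()
checker = doctestcompare.LXMLOutputChecker()
# Compare the document with itself in HTML mode.
same = checker.check_output(DOC, DOC, doctestcompare.PARSE_HTML)
print(same)
```

Because this patches a private detail of the module, it is best confined to test setup code and re-checked after upgrading lxml.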
Python : sys.version_
lxml.etree : (3, 4, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
While I agree that this is a problem, it's not something that lxml can help with. The HTML parser is implemented in libxml2, and HTML5 support there is fairly limited. I'm sure Daniel Veillard would be happy to receive patches. The tags are defined in a long list that describes their structure and relationship:
https://git.gnome.org/browse/libxml2/tree/HTMLparser.c?id=b02a167af3d2a47c155bce123820cbb5fa19dc9c#n597
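Until libxml2 knows the newer tags, one possible way to work with such documents in your own code is to parse with `recover=True` and inspect the parser's error log afterwards. The following is only a minimal sketch: `parse_leniently` is a hypothetical helper name, and which messages (if any) end up in the log depends on the libxml2 version in use.

```python
from lxml import etree

DOC = """<!doctype html>
<html>
<head>
</head>
<body>
<article>
</article>
</body>
</html>"""


def parse_leniently(markup):
    """Parse HTML in recovery mode and return (root, parser messages)."""
    parser = etree.HTMLParser(recover=True)
    root = etree.fromstring(markup, parser)
    # Problems are collected in the error log instead of being raised.
    messages = [entry.message for entry in parser.error_log]
    return root, messages


root, messages = parse_leniently(DOC)
# The element stays in the tree even if libxml2 reported it.
print(root.find(".//article") is not None)
for message in messages:
    print(message)
```

This keeps the unknown elements in the tree while still exposing what the parser reported, so a caller can decide which messages to treat as real errors.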