HTMLParser(recover=False) is overly strict and does not understand HTML5 content

Bug #1463885 reported by Xavier (Open ERP) on 2015-06-10
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

```
from lxml import etree

doc = """<!doctype html>
<html>
    <head>
        <title>test</title>
    </head>
    <body>
        <header>Header text</header>
        <footer>Footer text</footer>
        <article>
            <h1>Article Title</h1>
            <section>
                <h2>Item</h2>
            </section>
        </article>
    </body>
</html>"""
etree.fromstring(doc, parser=etree.HTMLParser(recover=False))
```

This raises `lxml.etree.XMLSyntaxError: Tag header invalid, line 7, column 16`, even though <header> is a valid HTML5 element that is allowed as a child of <body>: http://www.w3.org/TR/html-markup/header.html
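For contrast, here is a minimal sketch of a shortened version of the same document parsed with a recovering parser (`recover=True`, which is the default for `etree.HTMLParser`). The shortened document and the variable names are illustrative only. Whether anything shows up in the parser's error log depends on the libxml2 version in use.

```python
from lxml import etree

doc = """<!doctype html>
<html>
    <head><title>test</title></head>
    <body>
        <header>Header text</header>
        <footer>Footer text</footer>
    </body>
</html>"""

# recover=True is the default for HTMLParser; it is spelled out here for contrast
# with the recover=False parser used in the report.
lenient = etree.HTMLParser(recover=True)
root = etree.fromstring(doc, parser=lenient)

# The HTML5 elements are kept in the resulting tree.
header_text = root.findtext(".//header")
footer_text = root.findtext(".//footer")
print(header_text, "|", footer_text)

# Any complaints the parser recorded while recovering (possibly none,
# depending on the libxml2 version) can be inspected afterwards.
for entry in lenient.error_log:
    print(entry.line, entry.column, entry.message)
```

With a recovering parser the problem is at most logged rather than raised, which is why the strictness of `recover=False` is the deciding factor here.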

This is inconvenient because `doctestcompare` offers potentially neat APIs for testing HTML generation (e.g. template output), but it initialises its HTML parser with recover=False and provides no way to override this, short of monkeypatching the module to replace the parser.
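The monkeypatching workaround mentioned above might look roughly like the sketch below. It rests on an assumption: that `doctestcompare` keeps its HTML parser in a private module-level attribute named `_html_parser`. That is an internal detail, not a documented API, so the sketch checks for the attribute before replacing it. The helper name `use_recovering_html_parser` is made up for illustration.

```python
from lxml import etree, doctestcompare

# Assumption: the module stores its HTML parser under this private name.
# It is not public API and may differ between lxml versions.
PARSER_ATTR = "_html_parser"


def use_recovering_html_parser():
    """Replace doctestcompare's HTML parser with a recovering one.

    Returns True if the attribute was found and replaced, False otherwise.
    """
    if not hasattr(doctestcompare, PARSER_ATTR):
        return False
    setattr(
        doctestcompare,
        PARSER_ATTR,
        etree.HTMLParser(recover=True, remove_blank_text=True),
    )
    return True


patched = use_recovering_html_parser()
print("parser replaced:", patched)
```

Whether swapping the attribute is enough depends on how the module looks up its parser at comparison time, so this is best treated as a test-suite-local hack rather than a fix.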

Python : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
lxml.etree : (3, 4, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

While I agree that this is a problem, it's not something that lxml can help with. The HTML parser is implemented in libxml2, and HTML5 support there is fairly limited. I'm Daniel Veillard they would be happy to receive patches. The tags are defined in a long list that describes their structure and relationship:

https://git.gnome.org/browse/libxml2/tree/HTMLparser.c?id=b02a167af3d2a47c155bce123820cbb5fa19dc9c#n597

Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :

Sorry, I meant: "I'm sure Daniel Veillard would be happy to receive patches". The best way to do that, usually, is to send a patch to the libxml2 mailing list. The bug tracker tends to be frequented less regularly.
