lxml

HTMLParser(recover=False) is overly strict and does not understand HTML5 content

Bug #1463885 reported by Xavier (Open ERP) on 2015-06-10

This bug affects 2 people

| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| lxml | Triaged | Undecided | Unassigned | |

Bug Description

```
from lxml import etree

doc = """<!doctype html>
<html>
    <head>
        <title>test</title>
    </head>
    <body>
        <header>Header text</header>
        <footer>Footer text</footer>
        <article>
            <h1>Article Title</h1>
            <section>
                <h2>Item</h2>
            </section>
        </article>
    </body>
</html>"""
etree.fromstring(doc, parser=etree.HTMLParser(recover=False))
```

blows up with "lxml.etree.XMLSyntaxError: Tag header invalid, line 7, column 16", even though <header> is a valid HTML5 element and is perfectly acceptable as a child of <body>: http://www.w3.org/TR/html-markup/header.html
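For contrast, here is a minimal sketch of similar input going through the parser's default recover=True mode. The variable names are only illustrative, and the exact contents of the error log depend on the libxml2 version in use, so no particular messages are assumed:

```python
from lxml import etree

html5_doc = """<!doctype html>
<html>
    <body>
        <header>Header text</header>
        <footer>Footer text</footer>
    </body>
</html>"""

# recover=True is the default for HTMLParser; it is spelled out for contrast.
lenient_parser = etree.HTMLParser(recover=True)
root = etree.fromstring(html5_doc, parser=lenient_parser)

# The HTML5 elements are kept in the resulting tree.
print(root.find(".//header").text)
print(root.find(".//footer").text)

# Recoverable problems are collected in the log instead of being raised.
for entry in lenient_parser.error_log:
    print(entry.line, entry.column, entry.message)
```

So the elements themselves survive parsing; it is only the strict mode that turns the unknown-tag complaint into an exception.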

This is inconvenient because `doctestcompare` provides potentially neat APIs for testing HTML generation (e.g. template output), but it initialises its HTML parser with recover=False and provides no way to override this (short of monkeypatching the module to replace the parser).
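A rough sketch of that monkeypatching workaround, assuming `doctestcompare` keeps its parser in the private module global `_html_parser` and that `html_fromstring()` looks it up on each call. Both are implementation details rather than public API, so this may break between lxml releases:

```python
from lxml import doctestcompare, etree

# Assumption: `_html_parser` is the private module-level parser used by
# doctestcompare.html_fromstring(). Replacing it swaps strict parsing for
# the lenient (recover=True) mode everywhere in the module.
doctestcompare._html_parser = etree.HTMLParser(
    recover=True, remove_blank_text=True)

root = doctestcompare.html_fromstring(
    "<html><body><header>Header text</header></body></html>")
print(root.find(".//header").text)
```

Note that the patch is process-wide: every doctest comparison run afterwards in the same interpreter uses the lenient parser.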

Python : sys.version_info(major=2, minor=7, micro=5, releaselevel='final', serial=0)
lxml.etree : (3, 4, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2015-06-10:

While I agree that this is a problem, it's not something that lxml can help with. The HTML parser is implemented in libxml2, and HTML5 support there is fairly limited. I'm Daniel Veillard they would be happy to receive patches. The tags are defined in a long list that describes their structure and relationship:

https://git.gnome.org/browse/libxml2/tree/HTMLparser.c?id=b02a167af3d2a47c155bce123820cbb5fa19dc9c#n597

Changed in lxml:
status:	New → Triaged

scoder (scoder) wrote on 2015-06-12:

Sorry, I meant: "I'm sure Daniel Veillard would be happy to receive patches". The best way to do that, usually, is to send a patch to the libxml2 mailing list. The bug tracker tends to be frequented less regularly.
