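This is expected behavior, not incorrect parsing. You are using the col tag incorrectly: col is a void element, so it cannot contain text or take a closing tag, and it is only valid inside a colgroup within a table. None of the parsers will parse this HTML the way you expect. In fact, using the lxml library directly gives the same kind of result: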
>>> Code:
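from io import StringIO
from lxml import etree

HTML = '<p><col>text</col></p>'

# Parse with lxml's lenient HTML parser, which repairs invalid markup.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(HTML), parser)

# Serialize the repaired tree back to HTML to see what the parser built.
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)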
>>> Result:
b'<html><body>\n<p></p>\n<col>text</body></html>\n'
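Since the root cause is a void element being used as a container, a quick check can flag such markup before it reaches any parser. The following is only a minimal sketch using the standard library's html.parser. The names VOID_ELEMENTS, VoidMisuseFinder and find_void_misuse are hypothetical helpers made up for this illustration, not part of lxml or BeautifulSoup, and the check only looks for explicit closing tags on void elements.

```python
from html.parser import HTMLParser

# Void elements per the HTML standard: no closing tag, no content.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class VoidMisuseFinder(HTMLParser):
    """Hypothetical helper: record void elements that are closed explicitly."""

    def __init__(self):
        super().__init__()
        self.misused = []

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/> is harmless, so ignore it here.
        # (The default implementation would also call handle_endtag.)
        pass

    def handle_endtag(self, tag):
        # An end tag like </col> means the author treated a void element
        # as if it could wrap content.
        if tag in VOID_ELEMENTS:
            self.misused.append(tag)


def find_void_misuse(html):
    """Return the void element names that appear with a closing tag."""
    finder = VoidMisuseFinder()
    finder.feed(html)
    finder.close()
    return finder.misused


print(find_void_misuse('<p><col>text</col></p>'))    # ['col']
print(find_void_misuse('<p><span>text</span></p>'))  # []
```

If the text needs a wrapper inside the paragraph, an element that may hold content, such as span, avoids the problem altogether.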