lxml incorrect parsing
Bug #1925723 reported by Harnek
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Python version: 3.8.9
bs4 version: 4.9.3
OS: Windows
smallest example:
<p><col>
gives:
<p></p><col/>Text
<p></p>
expected:
<p><col>
Changed in beautifulsoup:
status: New → Invalid
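This is not incorrect parsing; it is the expected behavior. You are using the col tag incorrectly: col is a void element that belongs inside a table (within a colgroup), so it cannot hold text or sit inside a p. None of the parsers will parse the HTML the way you expect. In fact, if you use the lxml library directly, you still won't get what you expect: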
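>>> Code:
from lxml import etree
from io import StringIO, BytesIO
HTML = '<p><col> text</col> </p>'
# Parse with lxml's own HTML parser, without Beautiful Soup in between.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(HTML), parser)
# Serialize the repaired tree back to HTML to see what the parser built.
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)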
>>> Result:
b'<html><body>\n<p></p>\n<col> text</body></html>\n'
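
The point about misusing col can also be checked without any third-party parser. What follows is only a minimal sketch using Python's standard-library html.parser. The names ColUsageChecker and check_markup are hypothetical, made up for this illustration, and the open-tag stack is a naive approximation, not the tree-construction algorithm that lxml or html5lib actually follow.

```python
from html.parser import HTMLParser

# Void elements in the HTML Living Standard: no content, no end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class ColUsageChecker(HTMLParser):
    """Hypothetical helper that records suspicious uses of void elements."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open non-void elements
        self.problems = []   # human-readable findings, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "col":
            parent = self.open_tags[-1] if self.open_tags else None
            if parent != "colgroup":
                where = f"<{parent}>" if parent else "the document root"
                self.problems.append(f"<col> inside {where}, expected <colgroup>")
        if tag not in VOID_ELEMENTS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <col/> is acceptable for void elements.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            self.problems.append(f"</{tag}> found, but <{tag}> is a void element")
        elif tag in self.open_tags:
            # Naive recovery: pop back to the matching open tag.
            while self.open_tags.pop() != tag:
                pass


def check_markup(markup):
    """Return a list of problems found in the given HTML fragment."""
    checker = ColUsageChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    for problem in check_markup("<p><col> text</col> </p>"):
        print(problem)
```

For the markup from the comment above, this reports two problems: a col placed inside a p, and an end tag for a void element. A fragment such as `<table><colgroup><col></colgroup></table>` reports none.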