lxml incorrect parsing

Bug #1925723 reported by Harnek
This bug affects 1 person
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Python version: 3.8.9
bs4 version: 4.9.3
OS: Windows

smallest example:
<p><col>Text</col></p>

gives:
<p></p><col/>Text
<p></p>

expected:
<p><col>Text</col></p>

Isaac Muse (facelessuser) wrote :

This is expected behavior, not incorrect parsing. You are using the col tag incorrectly: col is a void element, so it cannot have content or a closing tag, and it is only valid inside a colgroup. None of the parsers will parse this HTML the way you expect. In fact, if you use the lxml library directly, you still won't get what you expect:

>>> Code:

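from io import StringIO

from lxml import etree

HTML = '<p><col>text</col></p>'

# Parse with lxml's HTML parser, which recovers from invalid markup.
parser = etree.HTMLParser()
tree = etree.parse(StringIO(HTML), parser)

# Serialize the repaired tree back to HTML (tostring returns bytes here).
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
print(result)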

>>> Result:

b'<html><body>\n<p></p>\n<col>text</body></html>\n'
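
To make the void-element rule concrete, here is a minimal sketch that uses only the standard library. It is not how lxml or Beautiful Soup are implemented: `VoidAwareSerializer` and `normalize` are hypothetical names made up for illustration, attributes are ignored for brevity, and the sketch does not reproduce lxml's extra step of moving col out of the paragraph. It only shows why the text can never end up inside col.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard: no content, no end tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class VoidAwareSerializer(HTMLParser):
    """Hypothetical helper: re-serializes markup, closing void elements at once.

    Attributes are dropped to keep the sketch short.
    """

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            # A void element cannot contain anything, so close it right away.
            self.parts.append(f"<{tag}/>")
        else:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # An end tag for a void element is a markup error; drop it.
        if tag not in VOID_ELEMENTS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)


def normalize(markup):
    """Return the markup with void elements self-closed (illustrative only)."""
    parser = VoidAwareSerializer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


print(normalize("<p><col>Text</col></p>"))  # prints <p><col/>Text</p>
```

Because col is closed as soon as it opens, "Text" becomes a sibling of col rather than its child, and the stray `</col>` is discarded. Real parsers apply further recovery rules on top of this, which is why lxml also ends the paragraph before col.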

Harnek (harnek) wrote :

Thanks for the reply.
You are correct.
You can close this topic.

Changed in beautifulsoup:
status: New → Invalid