Parsing of processing instructions broken in HTMLParser
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Low | Unassigned |
Bug Description
The following test script:
-------
import sys, io
from lxml import etree
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_
print("%-20s: %s" % ('libxml used', etree.LIBXML_
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
parser = etree.HTMLParse
doc = etree.parse(
p = doc.getroot(
print(p)
print(p.target)
print(p.text)
-------
outputs:
-------
Python : sys.version_
lxml.etree : (3, 8, 0, 0)
libxml used : (2, 9, 4)
libxml compiled : (2, 9, 4)
libxslt used : (1, 1, 29)
libxslt compiled : (1, 1, 29)
<?gurk hurz??>
gurk
hurz?
-------
A trailing '?' is included in the PI text, which is not what I expected. I expected the output to be:
-------
<?gurk hurz?>
gurk
hurz
-------
Interesting. According to https://en.wikipedia.org/wiki/Processing_Instruction:
"""
An SGML processing instruction is enclosed within <? and >.
An XML processing instruction is enclosed within <? and ?>, and contains a target and optionally some content, which is the node value, that cannot contain the sequence ?>.
"""
Since HTML is based on SGML and not XML, this means that the parser is actually correct, but the display/repr isn't.
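To illustrate the two delimiter rules quoted above, here is a minimal pure-Python sketch. It does not use lxml or libxml2 and is not how either of them parses. `split_pi` is a hypothetical helper written only for this comment, and it assumes the target and the content are separated by a single space. It splits the same markup once with the SGML closing delimiter and once with the XML one, then re-serializes the SGML result with XML delimiters, which appears to be where the doubled `?` in the reported repr comes from.

```python
def split_pi(markup, close):
    """Split a processing instruction into (target, text).

    `close` is the closing delimiter: ">" under SGML rules, "?>" under XML rules.
    Hypothetical helper for illustration only; not part of lxml.
    """
    if not (markup.startswith("<?") and markup.endswith(close)):
        raise ValueError("not a processing instruction: %r" % markup)
    # Drop the leading "<?" and the closing delimiter.
    body = markup[2:len(markup) - len(close)]
    # Everything before the first space is the target, the rest is the content.
    target, _, text = body.partition(" ")
    return target, text.strip()


raw = "<?gurk hurz?>"

# SGML rules: the PI ends at ">", so the "?" belongs to the content.
sgml = split_pi(raw, ">")

# XML rules: the PI ends at "?>", so the "?" belongs to the delimiter.
xml = split_pi(raw, "?>")

# Writing the SGML-parsed PI back out with XML delimiters doubles the "?".
serialized = "<?%s %s?>" % sgml

print(sgml)
print(xml)
print(serialized)
```

Under these assumptions, the SGML split keeps the `?` as part of the text, matching the `hurz?` reported above, while the XML split yields the `hurz` that the reporter expected.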
BTW, note that this:
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
is substantially less readable/efficient than just
p = doc.getroot()[0][0][0]