Parsing of processing instructions broken in HTMLParser

Bug #1708138 reported by Walter Dörwald on 2017-08-02
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

The following test script:
----------------------------------------------------------------------------
import sys, io
from lxml import etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

parser = etree.HTMLParser(encoding='utf-8')
doc = etree.parse(io.BytesIO(b'<a><?gurk hurz?></a>'), parser)
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
print(p)
print(p.target)
print(p.text)
----------------------------------------------------------------------------
outputs:
----------------------------------------------------------------------------
Python              : sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)
lxml.etree          : (3, 8, 0, 0)
libxml used         : (2, 9, 4)
libxml compiled     : (2, 9, 4)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)
<?gurk hurz??>
gurk
hurz?
----------------------------------------------------------------------------
A trailing '?' is included in the PI text, which is not what I expected. I expected the output to be:
----------------------------------------------------------------------------
<?gurk hurz?>
gurk
hurz
----------------------------------------------------------------------------

scoder (scoder) wrote :

Interesting. According to https://en.wikipedia.org/wiki/Processing_Instruction
"""
An SGML processing instruction is enclosed within <? and >.

An XML processing instruction is enclosed within <? and ?>, and contains a target and optionally some content, which is the node value, that cannot contain the sequence ?>.
"""

Since HTML is based on SGML and not XML, this means that the parser is actually correct, but the display/repr isn't.
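
To make the difference between the two terminators concrete, here is a minimal pure-Python sketch. The helper split_pi() and its sgml flag are hypothetical names invented for illustration; this is not how libxml2 or lxml tokenise input, only a demonstration of why the same literal yields different text under the two conventions.

```python
def split_pi(raw, sgml=False):
    """Split a raw PI literal into (target, text).

    Hypothetical helper for illustration only.
    sgml=False: XML convention, the PI is closed by '?>'.
    sgml=True:  SGML convention, the PI is closed by '>' alone.
    """
    closer = '>' if sgml else '?>'
    if not (raw.startswith('<?') and raw.endswith(closer)):
        raise ValueError('not a processing instruction: %r' % raw)
    # Strip the opening '<?' and the convention-specific closer.
    body = raw[2:len(raw) - len(closer)]
    target, _, text = body.partition(' ')
    return target, (text.lstrip() or None)


print(split_pi('<?gurk hurz?>'))             # ('gurk', 'hurz')
print(split_pi('<?gurk hurz?>', sgml=True))  # ('gurk', 'hurz?')
```

Under the SGML reading the '?' before '>' is ordinary content, which matches the 'hurz?' text reported above.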

BTW, note that this:
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
is substantially less readable/efficient than just
p = doc.getroot()[0][0][0]
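
As a small illustration of the indexing form, the sketch below uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without lxml; lxml elements support the same sequence protocol. The html/body wrapper in the sample markup is an assumption that mimics the elements the HTML parser adds around the fragment, which is why three index steps are needed in the original script.

```python
import xml.etree.ElementTree as ET

# Stand-in tree: html > body > a > b (the <b/> plays the role of the PI).
root = ET.fromstring('<html><body><a><b/></a></body></html>')

# Elements behave like sequences of their children, so plain
# indexing walks down the tree one level per subscript.
node = root[0][0][0]

# The same walk spelled out with explicit child lists.
same = list(list(list(root)[0])[0])[0]

print(node.tag)      # b
print(node is same)  # True
```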

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed

Walter Dörwald (doerwalter) wrote :

The link in the Wikipedia article (http://www.is-thought.co.uk/book/sgml-8.htm#PI) is a 404. But https://www.w3.org/TR/NOTE-sgml-xml seems to indicate that the PI terminators in XML and SGML are indeed different. However, this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?

scoder (scoder) wrote :

> this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?

Yes. In fact, its document would know (-> document node type), but for an independently created PI, there is no indication which type was intended (it would always be an XML PI). And if you parse a PI from an HTML document and then stick it into an XML tree, you'd still end up with the same problem, even on serialisation.

It would probably be possible to change the text content when moving PIs between XML and HTML documents, but then it would also have to be done recursively...

Seems like one of those quirks that are best documented somewhere and otherwise forgotten. ;)
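
Until the quirk is documented or changed, a caller who wants XML-style PI text from an HTML-parsed tree could normalise it after parsing. The following is only a sketch under stated assumptions: strip_sgml_pi_tail() and the FakePI stand-in class are hypothetical names, and the function assumes that every trailing '?' is an artefact of the SGML terminator, so a PI whose content genuinely ends in '?' would lose that character.

```python
def strip_sgml_pi_tail(nodes):
    """Remove one trailing '?' from the .text of each node.

    Hypothetical workaround: works on any objects with a writable
    .text attribute and returns the number of nodes changed.
    """
    changed = 0
    for node in nodes:
        text = node.text
        if text and text.endswith('?'):
            node.text = text[:-1]
            changed += 1
    return changed


class FakePI:
    """Stand-in for a parsed PI node so the sketch runs without lxml."""
    def __init__(self, target, text):
        self.target = target
        self.text = text


pis = [FakePI('gurk', 'hurz?'), FakePI('other', 'plain'), FakePI('empty', None)]
print(strip_sgml_pi_tail(pis))  # 1
print(pis[0].text)              # hurz
```

With lxml, the nodes would presumably come from something like doc.iter(etree.ProcessingInstruction) on the HTML-parsed tree; that combination is not tested here.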
