Parsing of processing instructions broken in HTMLParser

Bug #1708138 reported by Walter Dörwald
This bug affects 1 person
Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Low         Unassigned

Bug Description

The following test script:
----------------------------------------------------------------------------
import sys, io
from lxml import etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

parser = etree.HTMLParser(encoding='utf-8')
doc = etree.parse(io.BytesIO(b'<a><?gurk hurz?></a>'), parser)
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
print(p)
print(p.target)
print(p.text)
----------------------------------------------------------------------------
outputs:
----------------------------------------------------------------------------
Python              : sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)
lxml.etree          : (3, 8, 0, 0)
libxml used         : (2, 9, 4)
libxml compiled     : (2, 9, 4)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)
<?gurk hurz??>
gurk
hurz?
----------------------------------------------------------------------------
A trailing '?' is included in the PI text, which is not what I expected. I expected the output to be:
----------------------------------------------------------------------------
<?gurk hurz?>
gurk
hurz
----------------------------------------------------------------------------

scoder (scoder) wrote :

Interesting. According to https://en.wikipedia.org/wiki/Processing_Instruction
"""
An SGML processing instruction is enclosed within <? and >.

An XML processing instruction is enclosed within <? and ?>, and contains a target and optionally some content, which is the node value, that cannot contain the sequence ?>.
"""

Since HTML is based on SGML and not XML, this means that the parser is actually correct, but the display/repr isn't.
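The difference between the two conventions can be seen without lxml. The following is only a sketch using the Python standard library, not lxml or libxml2: `html.parser` ends a PI at the first `>`, while an XML parser (here `xml.dom.minidom`) ends it at `?>`. The class name `PICollector` is made up for this illustration, and note that `html.parser` hands over the PI as one string instead of splitting it into target and text.

```python
from html.parser import HTMLParser
from xml.dom import minidom

SOURCE = "<a><?gurk hurz?></a>"


class PICollector(HTMLParser):
    """Hypothetical helper: records the raw data of every PI it sees."""

    def __init__(self):
        super().__init__()
        self.pis = []

    def handle_pi(self, data):
        # Called with everything between '<?' and the first '>'.
        self.pis.append(data)


# HTML (SGML-style) parsing: the PI is terminated by '>' alone,
# so the '?' before it stays part of the PI data.
collector = PICollector()
collector.feed(SOURCE)
collector.close()
html_pis = collector.pis

# XML parsing: the PI is terminated by '?>', so the '?' is a delimiter.
xml_pi = minidom.parseString(SOURCE).documentElement.firstChild

print(html_pis)
print((xml_pi.target, xml_pi.data))
```

So the trailing '?' in the HTML case is consistent with SGML-style PI termination; what looks wrong is only the XML-style rendering of the node.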

BTW, note that this:
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
is substantially less readable/efficient than just
p = doc.getroot()[0][0][0]

Changed in lxml:
importance: Undecided → Low
status: New → Confirmed
Walter Dörwald (doerwalter) wrote :

The link in the Wikipedia article (http://www.is-thought.co.uk/book/sgml-8.htm#PI) returns a 404. But https://www.w3.org/TR/NOTE-sgml-xml seems to indicate that the PI terminators in XML and SGML are indeed different. However, this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?

scoder (scoder) wrote :

> this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?

Yes. In fact, its document would know (-> document node type), but for an independently created PI, there is no indication which type was intended (it would always be an XML PI). And if you parse a PI from an HTML document and then stick it into an XML tree, you'd still end up with the same problem, even on serialisation.

It would probably be possible to change the text content when moving PIs between XML and HTML documents, but then it would also have to be done recursively...

Seems like one of those quirks that are best documented somewhere and otherwise forgotten. ;)
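
For anyone who needs to carry a PI from an HTML-parsed tree into an XML tree, adjusting the text by hand, as suggested above, might look roughly like this. It is a heuristic sketch, not something lxml provides: the helper name `strip_sgml_pi_tail` is hypothetical, and it cannot tell an XML-style PI that was parsed as HTML apart from a genuine SGML PI whose content really ends in '?'.

```python
def strip_sgml_pi_tail(text):
    """Drop one trailing '?' left over from parsing '<?target text?>' as HTML.

    Hypothetical helper for illustration only. It removes at most one '?'
    and leaves None and any other text untouched.
    """
    if text is not None and text.endswith("?"):
        return text[:-1]
    return text


print(strip_sgml_pi_tail("hurz?"))
```

With lxml, the cleaned text could then be used to build a fresh XML PI, for example via etree.ProcessingInstruction(p.target, strip_sgml_pi_tail(p.text)), before inserting it into the XML tree. This has to be done for every PI in the subtree being moved.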
