lxml

Parsing of processing instructions broken in HTMLParser

Bug #1708138 reported by Walter Dörwald on 2017-08-02

This bug affects 1 person

Affects  Status     Importance  Assigned to  Milestone
lxml     Confirmed  Low         Unassigned

Bug Description

The following test script:
----------------------------------------------------------------------------
import sys, io
from lxml import etree

print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

parser = etree.HTMLParser(encoding='utf-8')
doc = etree.parse(io.BytesIO(b'<a><?gurk hurz?></a>'), parser)
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
print(p)
print(p.target)
print(p.text)
----------------------------------------------------------------------------
outputs:
----------------------------------------------------------------------------
Python              : sys.version_info(major=3, minor=6, micro=2, releaselevel='final', serial=0)
lxml.etree          : (3, 8, 0, 0)
libxml used         : (2, 9, 4)
libxml compiled     : (2, 9, 4)
libxslt used        : (1, 1, 29)
libxslt compiled    : (1, 1, 29)
<?gurk hurz??>
gurk
hurz?
----------------------------------------------------------------------------
A trailing '?' is included in the PI text, which is not what I expected. I expected the output to be:
----------------------------------------------------------------------------
<?gurk hurz?>
gurk
hurz
----------------------------------------------------------------------------


scoder (scoder) wrote on 2017-08-02:

Interesting. According to https://en.wikipedia.org/wiki/Processing_Instruction
"""
An SGML processing instruction is enclosed within <? and >.

An XML processing instruction is enclosed within <? and ?>, and contains a target and optionally some content, which is the node value, that cannot contain the sequence ?>.
"""

Since HTML is based on SGML and not XML, this means that the parser is actually correct: the '?' is part of the PI content. It is the display/repr that is wrong, because it always appends the XML terminator '?>'.
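
To make the distinction concrete, here is a minimal sketch that feeds the same bytes to both parsers. The XML result is fixed by the XML rules quoted above. What the HTML parser returns depends on the libxml2 version in use (this report was made with 2.9.4), so the sketch only prints whatever PI nodes it finds and does not assume a value.

```python
from lxml import etree

data = b'<a><?gurk hurz?></a>'

# XML parser: the PI is terminated by '?>', so the text is just 'hurz'.
xml_root = etree.fromstring(data)
xml_pi = xml_root[0]
print(xml_pi.target, repr(xml_pi.text))

# HTML parser: collect whatever PI nodes this libxml2 version produces.
# With the versions in this report, the text comes back as 'hurz?'.
html_root = etree.fromstring(data, etree.HTMLParser())
html_pis = list(html_root.iter(etree.ProcessingInstruction))
for pi in html_pis:
    print(pi.target, repr(pi.text), repr(pi))
```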

BTW, note that this:
p = doc.getroot().getchildren()[0].getchildren()[0].getchildren()[0]
is substantially less readable/efficient than just
p = doc.getroot()[0][0][0]
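
As a small illustration of that equivalence, the sketch below uses a made-up XML document (the element names are placeholders, not from the report) and reaches the same node both ways.

```python
from lxml import etree

# Placeholder document, nested three levels deep.
root = etree.fromstring(b'<a><b><c><d/></c></b></a>')

# Elements behave like sequences of their children, so indexing works directly.
via_index = root[0][0][0]

# getchildren() builds a new list at every step before indexing into it.
via_getchildren = root.getchildren()[0].getchildren()[0].getchildren()[0]

print(via_index.tag, via_getchildren.tag)
```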

Changed in lxml:
importance:	Undecided → Low
status:	New → Confirmed


Walter Dörwald (doerwalter) wrote on 2017-08-02:

The link in the Wikipedia article (http://www.is-thought.co.uk/book/sgml-8.htm#PI) is a 404. But https://www.w3.org/TR/NOTE-sgml-xml seems to indicate that the PI terminators in XML and SGML are indeed different. However, this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?


scoder (scoder) wrote on 2017-08-02:

> this would mean that the _ProcessingInstruction object might have to remember whether it is an XML or an SGML PI?

Yes. In fact, its document would know (-> document node type), but for an independently created PI, there is no indication which type was intended (it would always be an XML PI). And if you parse a PI from an HTML document and then stick it into an XML tree, you'd still end up with the same problem, even on serialisation.
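
Both points can be reproduced without the HTML parser. In the sketch below, the second PI is built by hand with the text 'hurz?' to stand in for one taken from an HTML document. That stand-in is an assumption made for illustration, not something the HTML parser is asked to produce here.

```python
from lxml import etree

# An independently created PI is serialised with the XML terminator '?>'.
standalone = etree.ProcessingInstruction('gurk', 'hurz')
root = etree.Element('a')
root.append(standalone)
xml_bytes = etree.tostring(root)
print(xml_bytes)

# Stand-in for a PI parsed from HTML: its text already ends in '?'.
# Placed in an XML tree, the '?' is kept and '?>' is appended after it.
carried_over = etree.ProcessingInstruction('gurk', 'hurz?')
root2 = etree.Element('a')
root2.append(carried_over)
carried_bytes = etree.tostring(root2)
print(carried_bytes)
```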

It would probably be possible to change the text content when moving PIs between XML and HTML documents, but then it would also have to be done recursively...

Seems like one of those quirks that are best documented somewhere and otherwise forgotten. ;)
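
For anyone who needs a workaround in their own code, the recursive fix-up mentioned above could look roughly like this. It is only a sketch: strip_sgml_pi_terminator is a hypothetical helper, not part of lxml, and it assumes that every trailing '?' in the subtree is an artefact of HTML parsing, which will be wrong for PIs whose content really ends in '?'.

```python
from lxml import etree

def strip_sgml_pi_terminator(root):
    """Hypothetical helper: drop one trailing '?' from every PI below root.

    Returns the number of PIs that were changed.
    """
    changed = 0
    for pi in root.iter(etree.ProcessingInstruction):
        text = pi.text
        if text and text.endswith('?'):
            pi.text = text[:-1]
            changed += 1
    return changed

# Simulate a subtree taken from an HTML document: PI texts end in '?'.
root = etree.Element('a')
root.append(etree.ProcessingInstruction('gurk', 'hurz?'))
child = etree.SubElement(root, 'b')
child.append(etree.ProcessingInstruction('foo', 'bar?'))

fixed = strip_sgml_pi_terminator(root)
result = etree.tostring(root)
print(fixed, result)
```

Doing this explicitly at the point where nodes move from an HTML tree into an XML tree keeps the decision in the caller's hands, which fits the observation that a PI on its own carries no record of where it came from.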
