Entities vanish when recover=True is set

Bug #1694032 reported by jonas
This bug affects 1 person
Affects   Status      Importance   Assigned to   Milestone
lxml      Confirmed   Undecided    Unassigned

Bug Description

I'm using lxml (together with BeautifulSoup4) in a preprocessing step for transforming legacy XML data with proprietary markup information into HTML5.
The data contains some XML inconsistencies, probably the result of unsupervised manual editing.

While doing that I stumbled on some unexpected behaviour. When recover=True is set on the lxml XMLParser, XML entities in text are dropped during parsing if the preceding XML structure is syntactically invalid. The rest of the text is recovered, though.

I first filed a bug report against BeautifulSoup4 (see https://bugs.launchpad.net/beautifulsoup/+bug/1668070?comments=all ). Leonard Richardson investigated it there and suggested filing a bug report here.

Leonard wrote the following test for illustration:

---
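data = b"<a><b><b></a>&amp;foo"

# lxml alone
# "import lxml" by itself does not load the etree submodule.
import lxml.etree
from io import BytesIO  # available on both Python 2.7 and Python 3

parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(BytesIO(data), parser)
print(lxml.etree.tostring(tree))
# Reported output: <a><b><b/>foo</b></a>  (the "&amp;" before "foo" is missing)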
---
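To see which errors the parser recovered from, the parser's error log can be inspected after the run. The following is a minimal sketch and not part of Leonard's test: the helper name parse_with_recovery is made up for illustration, and the serialized result for the broken input may differ between libxml2 versions, so it is only printed here.

```python
from io import BytesIO
from lxml import etree


def parse_with_recovery(data):
    """Parse bytes with recover=True.

    Returns the serialized tree and the parser errors as strings.
    (Hypothetical helper, for illustration only.)
    """
    parser = etree.XMLParser(recover=True)
    tree = etree.parse(BytesIO(data), parser)
    # error_log holds the problems libxml2 reported during the last parse.
    errors = ["%d:%d %s" % (e.line, e.column, e.message)
              for e in parser.error_log]
    return etree.tostring(tree), errors


# Control case: well-formed input, the entity should survive.
control, control_errors = parse_with_recovery(b"<a>&amp;foo</a>")
print(control)

# Input from this report: mismatched tags followed by an entity.
broken, broken_errors = parse_with_recovery(b"<a><b><b></a>&amp;foo")
print(broken)
for err in broken_errors:
    print(err)
```

Comparing the control case with the broken input makes it easier to tell whether the entity is lost only after the parser has entered recovery.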

The system where I tested this (MacOS):
Python : sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml.etree : (3, 7, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
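For anyone trying to reproduce this on another system, the same version information can be collected with a few lines of Python. This is a sketch using the version constants exposed by lxml.etree; the function name collect_versions is an assumption made for this example.

```python
import sys
from lxml import etree


def collect_versions():
    """Return the version tuples relevant to this report.

    (Hypothetical helper; the constants themselves come from lxml.etree.)
    """
    return {
        "Python": tuple(sys.version_info[:3]),
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


versions = collect_versions()
for name, version in versions.items():
    print("%-17s: %r" % (name, version))
```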

scoder (scoder) wrote :

Sadly, all I can tell you is to go yet another level deeper and report the problem to the libxml2 project, which does the parsing here.

Changed in lxml:
status: New → Invalid
scoder (scoder) wrote :

Hmm, I take that back. I couldn't reproduce this with plain libxml2 (i.e. their xmllint tool), so it must be something in lxml.

$ echo "<a><b><b></a>&amp;foo" | xmllint --html --recover --noent -
-:1: HTML parser error : Opening and ending tag mismatch: a and b
<a><b><b></a>&amp;foo
             ^
-:1: HTML parser error : Opening and ending tag mismatch: a and b
<a><b><b></a>&amp;foo
             ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<a><b><b></b></b></a>&amp;foo
</body></html>
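
Note that the xmllint call above passes --html, so it exercises libxml2's HTML parser, whereas the test in the bug description uses lxml's XMLParser. A rough way to compare both parsers from within lxml is sketched below. It only approximates the xmllint invocation (the --noent flag is not mirrored), the helper name serialize_with is made up for this example, and the outputs are printed rather than assumed because they may vary with the libxml2 version.

```python
from io import BytesIO
from lxml import etree

DATA = b"<a><b><b></a>&amp;foo"


def serialize_with(parser):
    """Parse DATA with the given parser and return the serialized tree.

    (Hypothetical helper, for illustration only.)
    """
    tree = etree.parse(BytesIO(DATA), parser)
    return etree.tostring(tree)


# XML parser in recovery mode, as in the original test.
xml_result = serialize_with(etree.XMLParser(recover=True))
# HTML parser in recovery mode, closer to "xmllint --html --recover".
html_result = serialize_with(etree.HTMLParser(recover=True))

print(xml_result)
print(html_result)
```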

Changed in lxml:
status: Invalid → Confirmed