Entities vanish when recover=True is set
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Confirmed | Undecided | Unassigned |
Bug Description
I'm using lxml (together with BeautifulSoup4) in a preprocessing step for transforming legacy XML data with proprietary markup information into HTML5.
The data contains some XML inconsistencies, probably because of unsupervised manual editing.
During that work I stumbled on some unexpected behaviour. When setting recover=True on the lxml parser, XML entities in text are dropped during parsing if the preceding XML structure has invalid syntax. The rest of the text is recovered, though.
I wrote an initial bug report for BeautifulSoup4 (see https:/
Leonard wrote the following test for illustration:
---
data = "<a><b>
# lxml alone
import lxml.etree  # a bare "import lxml" does not load the etree submodule
from StringIO import StringIO
parser = lxml.etree.
tree = lxml.etree.
print lxml.etree.
# <a><b><
---
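For readers who want to check their own setup, here is a minimal Python 3 sketch of the same kind of test. The sample string is hypothetical and is not the data from the report above: it simply places a mismatched closing tag before an entity reference. Whether the entities survive appears to depend on the libxml2 version in use, so no expected output is given.

```python
try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None

# Hypothetical sample (not the data from the report): the stray </x> makes the
# markup malformed before the second entity reference is reached.
SAMPLE = b"<a><b>one &amp; two</x><c>three &amp; four</c></a>"


def recover_and_serialize(data):
    """Parse possibly malformed XML with recover=True and serialize the result."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    if root is None:  # nothing could be recovered
        return None
    return etree.tostring(root)


if etree is not None:
    result = recover_and_serialize(SAMPLE)
    print(result)
    if result is not None:
        # The input contains two &amp; references; a lower count here means
        # entities were dropped during recovery.
        print("entities kept:", result.count(b"&amp;"))
```

Comparing the printed count against the two references in the input shows quickly whether a given lxml/libxml2 combination is affected.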
The system where I tested this (MacOS):
Python : sys.version_
lxml.etree : (3, 7, 1, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
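A version listing like the one above can be collected with a short script. This is a sketch that relies on the version constants lxml.etree exposes at module level; the helper name collect_versions is made up for illustration.

```python
import sys

try:
    from lxml import etree  # third-party package, not in the standard library
except ImportError:
    etree = None


def collect_versions():
    """Return the version tuples that are useful to include in a bug report."""
    versions = {"Python": tuple(sys.version_info)}
    if etree is not None:
        versions.update({
            "lxml.etree": etree.LXML_VERSION,
            "libxml used": etree.LIBXML_VERSION,
            "libxml compiled": etree.LIBXML_COMPILED_VERSION,
            "libxslt used": etree.LIBXSLT_VERSION,
            "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
        })
    return versions


for name, value in collect_versions().items():
    print("%-20s: %s" % (name, value))
```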
Sadly, all I can tell you is to go yet another level deeper and report the problem to the libxml2 project, which does the parsing here.