Comment 3 for bug 1668070

Leonard Richardson (leonardr) wrote :

Thanks for the bug report. This looks like a behavior of lxml. I get the same output when running the bad markup through a similar process that doesn't use any Beautiful Soup code:

---
data = "<a><b><b></a>&amp;foo"

# Beautiful Soup + lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
print soup
# <a><b><b/>foo</b></a>

# lxml alone
import lxml.etree  # a bare "import lxml" does not load the etree submodule
from StringIO import StringIO
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(StringIO(data), parser)
print lxml.etree.tostring(tree)
# <a><b><b/>foo</b></a>
---

I can't do anything about this within Beautiful Soup because lxml's XMLParser doesn't specially notify the parser target about entities. From Beautiful Soup's perspective, it's as if the "&amp;" were never in the markup.
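To make the point about the target concrete, here is a minimal sketch of a parser target that just records the callbacks lxml makes. The names RecordingTarget and record_events are hypothetical, made up for this illustration; this is not Beautiful Soup's actual tree builder, only the same kind of object. The target interface has callbacks for start tags, end tags, and character data, but none for entity references, so a target can only ever see text that lxml has already resolved (or dropped).

```python
from lxml import etree


class RecordingTarget(object):
    """Hypothetical parser target that records each callback lxml makes."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        # Called for every opening tag.
        self.events.append(("start", tag))

    def end(self, tag):
        # Called for every closing tag.
        self.events.append(("end", tag))

    def data(self, text):
        # Called with character data; entities arrive already resolved.
        self.events.append(("data", text))

    def close(self):
        # Whatever this returns is what parser.close() returns.
        return self.events


def record_events(markup):
    """Feed markup to a recovering XMLParser and return the recorded events."""
    parser = etree.XMLParser(target=RecordingTarget(), recover=True)
    parser.feed(markup)
    return parser.close()


if __name__ == "__main__":
    # Well-formed input: the entity shows up only as ordinary text.
    for event in record_events("<a>&amp;foo</a>"):
        print(event)
```

With well-formed input like the one above, the "data" events join up to "&foo", with no separate event saying an entity was involved. Passing the markup from this bug report to record_events should show what a target receives in the recovery case.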

I suggest filing an issue against lxml. This seems like a problem that should be recoverable.