Thanks for the bug report. This looks like a behavior of lxml. I get the same output when running the bad markup through a similar process that doesn't use any Beautiful Soup code:
---
data = "<a><b><b></a>&foo"

# Beautiful Soup + lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
print soup
# <a><b><b/>foo</b></a>

# lxml alone
import lxml.etree
from StringIO import StringIO
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(StringIO(data), parser)
print lxml.etree.tostring(tree)
# <a><b><b/>foo</b></a>
---
I can't do anything about this within Beautiful Soup because lxml's XMLParser doesn't specially notify the target about entities. From my perspective it's like the markup doesn't exist.
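To make the point about parser targets concrete, here is a minimal sketch (written for Python 3, assuming lxml is installed) of a target that records every callback it receives. `EventRecorder` and `record_events` are hypothetical names made up for this illustration, not part of Beautiful Soup or lxml. The target interface has callbacks for start tags, end tags, text and comments, and none of them is specific to entities, so a resolved entity arrives as ordinary text.

```python
from lxml import etree


class EventRecorder(object):
    """Hypothetical parser target that records each callback lxml makes."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # Text may arrive in several chunks; entities show up here
        # already resolved, with no separate notification.
        self.events.append(("data", text))

    def comment(self, text):
        self.events.append(("comment", text))

    def close(self):
        # Whatever close() returns is handed back by the parse call.
        events, self.events = self.events, []
        return events


def record_events(markup):
    """Parse markup with a recovering XMLParser and return the callbacks seen."""
    parser = etree.XMLParser(target=EventRecorder(), recover=True)
    return etree.fromstring(markup, parser)


# Well-formed input: the target only ever sees start, data and end events.
for event in record_events("<a>x &amp; y</a>"):
    print(event)
```

The same helper can be pointed at the `data` string from this report to see exactly which callbacks fire for the broken entity. I haven't listed the expected events here because they may vary between libxml2 versions.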
I suggest filing an issue against lxml. This seems like a problem that should be recoverable.