Vanishing ampersands when processing broken XML

Bug #1668070 reported by jonas
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

I'm using BeautifulSoup4 + lxml in a preprocessing step that transforms legacy XML data with proprietary markup into HTML5.
The data contains some XML inconsistencies, probably because of unsupervised manual editing. In the following example, the second bold tag was intended as a closing tag, but the slash was forgotten:

<Text><Paragraph>Some <bold>important<bold> text.</Paragraph><Paragraph>Do you know Pat &amp; Patachon?</Paragraph></Text>

Interestingly, when I create the soup from this document, the ampersand in the following paragraph gets lost, so when I convert the soup back to a string, I get:

<?xml version="1.0" encoding="utf-8"?>\n<Text><Paragraph>Some <bold>important<bold> text.</bold><Paragraph>Do you know Pat  Patachon?</Paragraph></bold></Paragraph></Text>

There are two spaces between Pat and Patachon, but no ampersand!
The ampersand is not lost when the document has a valid structure, which is why I think this might be a bug in BeautifulSoup.

I attached a unit test which shows how I'm calling bs4. I ran it on Ubuntu as well as on Mac OS X with Python 2.7. Both showed the same results.

MacOS:
platform: Darwin-13.4.0-x86_64-i386-64bit
python: sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml: (3, 7, 1, 0)

Ubuntu:
platform: Linux-4.4.0-51-generic-x86_64-with-debian-stretch-sid
python: sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml: (3, 7, 3, 0)

jonas (jonsinge) wrote :

BeautifulSoup version: 4.5.1

Leonard Richardson (leonardr) wrote :

Thanks for the bug report. This looks like a behavior of lxml. I get the same output when running the bad markup through a similar process that doesn't use any Beautiful Soup code:

---
data = "<a><b><b></a>&amp;foo"

# Beautiful Soup + lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
print soup
# <a><b><b/>foo</b></a>

# lxml alone
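from lxml import etree  # a bare "import lxml" does not load the etree submodule
from StringIO import StringIO
# recover=True makes lxml try to parse through broken markup
parser = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(data), parser)
print etree.tostring(tree)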
# <a><b><b/>foo</b></a>
---

I can't do anything about this within Beautiful Soup because lxml's XMLParser doesn't specially notify the target about entities. From my perspective it's like the markup doesn't exist.

I suggest filing an issue against lxml. This seems like a problem that should be recoverable.

Changed in beautifulsoup:
status: New → Won't Fix
jonas (jonsinge) wrote :

Hmm, thanks for your investigation. I didn't know about that recover parameter of lxml before.
I filed a new bug report here: https://bugs.launchpad.net/lxml/+bug/1694032
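
Since the ampersand is only lost when the document is malformed, one possible stopgap is to run a strict well-formedness check before handing a document to a recovering parser, so that affected files can be flagged for manual repair. The following is only a minimal sketch under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than lxml, and the helper name find_xml_error is hypothetical, not part of Beautiful Soup or lxml.

```python
import xml.etree.ElementTree as ET

# The sample document from the bug description (missing slash in the second <bold>).
BROKEN = ("<Text><Paragraph>Some <bold>important<bold> text.</Paragraph>"
          "<Paragraph>Do you know Pat &amp; Patachon?</Paragraph></Text>")
# The same document with the intended closing tag restored.
FIXED = BROKEN.replace("important<bold>", "important</bold>")


def find_xml_error(markup):
    """Hypothetical helper: return None if markup is well-formed XML,
    otherwise the strict parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


for name, doc in (("broken", BROKEN), ("fixed", FIXED)):
    error = find_xml_error(doc)
    print("%s: %s" % (name, error or "well-formed"))
```

A document for which the helper returns an error message is one where a recovering parser has to guess at the structure, so its output deserves a closer look.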
