Beautiful Soup

Vanishing ampersands when processing broken XML

Bug #1668070 reported by jonas on 2017-02-26

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

I'm using BeautifulSoup4 + lxml in a preprocessing step for transforming legacy XML data with proprietary markup information into HTML5.
The data contains some XML inconsistencies, probably because of unsupervised manual editing. For example, in the following sample the second bold tag was intended as a closing tag, but the slash was forgotten:

<Text><Paragraph>Some <bold>important<bold> text.</Paragraph><Paragraph>Do you know Pat & Patachon?</Paragraph></Text>

Interestingly, when creating the soup from this document, the ampersand in the following paragraph gets lost, so when converting the soup back to a string, I'm getting:

<?xml version="1.0" encoding="utf-8"?>\n<Text><Paragraph>Some <bold>important<bold> text.</bold><Paragraph>Do you know Pat  Patachon?</Paragraph></bold></Paragraph></Text>

Note the two spaces between Pat and Patachon, but no ampersand!
The ampersand is not lost when the document has a valid structure, which is why I think this might be a bug in BeautifulSoup.
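
For context, a bare "&" is itself not well-formed XML, so a strict parser rejects the sample outright instead of silently repairing it. The following is a minimal sketch, not taken from the attached unit test: it uses the standard library's xml.etree.ElementTree rather than lxml, and the helper name is_well_formed is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if a strict (non-recovering) XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Raised for unclosed or mismatched tags and for unescaped ampersands.
        return False
    return True


# Sample from this report: unclosed <bold> tags plus a bare ampersand.
broken = ("<Text><Paragraph>Some <bold>important<bold> text.</Paragraph>"
          "<Paragraph>Do you know Pat & Patachon?</Paragraph></Text>")

# Hand-corrected variant: closing </bold> tag and an escaped ampersand.
fixed = ("<Text><Paragraph>Some <bold>important</bold> text.</Paragraph>"
         "<Paragraph>Do you know Pat &amp; Patachon?</Paragraph></Text>")
```

A check like this could flag damaged documents before they reach a recovering parser, where data may be dropped without any warning.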

I attached a unit test that shows how I'm calling bs4. I ran it on Ubuntu and on Mac OS X with Python 2.7; both showed the same results.

MacOS:
platform: Darwin-13.4.0-x86_64-i386-64bit
python: sys.version_info(major=2, minor=7, micro=12, releaselevel='final', serial=0)
lxml: (3, 7, 1, 0)

Ubuntu:
platform: Linux-4.4.0-51-generic-x86_64-with-debian-stretch-sid
python: sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml: (3, 7, 3, 0)

jonas (jonsinge) wrote on 2017-02-26:

Attachment: Failing unit test that shows that ampersand gets lost (1021 bytes, text/x-python)

jonas (jonsinge) wrote on 2017-02-26:

BeautifulSoup version: 4.5.1

Leonard Richardson (leonardr) wrote on 2017-05-07:

Thanks for the bug report. This looks like a behavior of lxml. I get the same output when running the bad markup through a similar process that doesn't use any Beautiful Soup code:

---
data = "<a></a>&foo"

# Beautiful Soup + lxml
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'xml')
print soup
# <a>foo</a>

# lxml alone
import lxml.etree  # a bare "import lxml" does not load the etree submodule
from StringIO import StringIO
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(StringIO(data), parser)
print lxml.etree.tostring(tree)
# <a>foo</a>
---

I can't do anything about this within Beautiful Soup because lxml's XMLParser doesn't specially notify the target about entities. From my perspective it's like the markup doesn't exist.

I suggest filing an issue against lxml. This seems like a problem that should be recoverable.
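
In the meantime, one possible workaround is to escape stray ampersands before the markup reaches the parser. The following is a minimal sketch, not part of Beautiful Soup or lxml: the helper name escape_bare_ampersands and the regular expression are assumptions made for illustration, and the pattern also rewrites ampersands inside CDATA sections and comments, which may not be what you want.

```python
import re

# Matches "&" unless it already starts a named, decimal, or hexadecimal
# reference such as "&amp;", "&#38;", or "&#x26;".
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(markup):
    """Replace every stray "&" with "&amp;" and leave existing references alone."""
    return BARE_AMP.sub("&amp;", markup)


# Inputs taken from this report.
escaped_short = escape_bare_ampersands("<a></a>&foo")
escaped_text = escape_bare_ampersands("Do you know Pat & Patachon?")
```

The escaped string can then be handed to BeautifulSoup(data, 'xml') as before. This only addresses the ampersand; unclosed tags such as the second bold tag are still left to the parser's recovery logic.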

Changed in beautifulsoup:
status:	New → Won't Fix

jonas (jonsinge) wrote on 2017-05-27:

Hmm, thanks for your investigation. I didn't know about that recover parameter of lxml before.
I filed a new bug report here: https://bugs.launchpad.net/lxml/+bug/1694032
