Vanishing ampersands when processing broken XML
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Beautiful Soup | Won't Fix | Undecided | Unassigned | | |
Bug Description
I'm using BeautifulSoup4 + lxml in a preprocessing step for transforming legacy XML data with proprietary markup information into HTML5.
The data contains some XML inconsistencies, probably caused by unsupervised manual editing. For example, in the following snippet the second bold tag was intended as a closing tag, but the slash was forgotten:
<Text><
Interestingly, when creating the soup from this document, the ampersand in the following paragraph gets lost, so when converting the soup back to a string, I'm getting:
<?xml version="1.0" encoding=
The result has two spaces between Pat and Patachon, but no ampersand!
The ampersand does not get lost when the document has a valid structure, which is why I think this might be a bug in BeautifulSoup.
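Since the problem only shows up with structurally invalid input, one way to flag such documents before they reach bs4 is a well-formedness check. The following is a minimal sketch using only the standard library; the `is_well_formed` helper and both sample documents are hypothetical and are not taken from the attached unit test:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text is well-formed XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # The strict parser rejects mismatched or unclosed tags.
        return False, str(exc)
    return True, None


# Hypothetical sample documents, not the data from this report.
valid_doc = "<Text><b>Pat</b> &amp; Patachon</Text>"
broken_doc = "<Text><b>Pat<b> &amp; Patachon</Text>"  # second <b> lacks its slash

print(is_well_formed(valid_doc))
print(is_well_formed(broken_doc))
```

This only detects the broken structure; it does not repair it, and it says nothing about how lxml's recovering parser treats the text that follows the error.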
I attached a unit test that shows how I'm calling bs4. I ran it on Ubuntu as well as on MacOS X with Python 2.7; both showed the same results.
MacOS:
platform: Darwin-
python: sys.version_
lxml: (3, 7, 1, 0)
Ubuntu:
platform: Linux-4.
python: sys.version_
lxml: (3, 7, 3, 0)
BeautifulSoup version: 4.5.1
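Version details like the ones above can be gathered with a short script. This is a sketch, not the script used for this report; the `collect_versions` helper is hypothetical, and it assumes `lxml.etree.LXML_VERSION` and `bs4.__version__` are the relevant attributes, falling back to `None` when a package is not installed:

```python
import platform
import sys


def collect_versions():
    """Gather platform and library versions for a bug report."""
    info = {
        "platform": platform.platform(),
        "python": tuple(sys.version_info),
    }
    try:
        from lxml import etree
        info["lxml"] = etree.LXML_VERSION  # tuple such as (3, 7, 1, 0)
    except ImportError:
        info["lxml"] = None  # lxml is not installed
    try:
        import bs4
        info["bs4"] = bs4.__version__
    except ImportError:
        info["bs4"] = None  # Beautiful Soup is not installed
    return info


for name, value in sorted(collect_versions().items()):
    print("%s: %s" % (name, value))
```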