XML+Unicode: Certain input strings fail to fully parse
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
Using the BeautifulSoup + XML option on my machine, certain input strings fail to fully parse.
For example:
from bs4 import BeautifulSoup
xml = "<ElementA>
soup = BeautifulSoup(
soup
Gives incorrect output of:
<?xml version="1.0" encoding="utf-8"?>
<ElementA>
But if I change just one thing and remove a single character 'X':
from bs4 import BeautifulSoup
xml = "<ElementA>
soup = BeautifulSoup(
soup
Then I get the more complete output of:
<?xml version="1.0" encoding="utf-8"?>
<ElementA>
BAD
After bad character<
The same problem also occurs with this string (this one contains only valid characters):
xml = "<ElementA>
Note I queried this bug on StackOverflow. I thought I had a workaround, but unfortunately it turns out this causes other problems, so I can no longer use it.
http://
I first encountered the problem on BS4.3.2, however it still occurs even after upgrading to BS4.4.0.
My other installed packages are:
beautifulsoup4 (4.4.0)
cchardet (0.3.5)
chardet (2.3.0)
lxml (3.4.4)
I use Python 3.4.3 on OSX 10.10.3 (Yosemite).
There are a number of open bugs against Beautiful Soup in which lxml behaves poorly on Mac OS X (bug 1105148) or Windows (bug 1438111, bug 1417011). I can't test on those platforms, and the problems don't occur on Linux, so there's not much I can do except offer advice.
The quick-and-dirty solution may be to run UnicodeDammit.detwingle() on your markup before giving it to lxml. detwingle() can solve a lot of problems with inconsistent encodings, and at the very least it will use REPLACEMENT CHARACTER to make sure the document comes out in UTF-8: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings
I've noticed that lxml will behave oddly or segfault on Mac OS X depending on how it was installed. (See bug 197243 for instance). That's something to look into. See http://lxml.de/FAQ.html#my-application-crashes-on-macos-x for more information.
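To make the idea concrete without depending on Beautiful Soup, here is a rough standard-library sketch of the same kind of cleanup. It is not how detwingle() is implemented; it only approximates the effect described above, under the assumption that stray bytes are Windows-1252. The handler name `cp1252_fallback` is made up for this example.

```python
import codecs


def cp1252_fallback(err):
    """Decode-error handler: reinterpret undecodable bytes as Windows-1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Bytes that Windows-1252 leaves undefined (such as 0x81) fall back
    # to U+FFFD REPLACEMENT CHARACTER instead of raising again.
    return bad.decode("cp1252", errors="replace"), err.end


# Hypothetical handler name, registered only for this illustration.
codecs.register_error("cp1252_fallback", cp1252_fallback)

# The text of ElementB from the report, with the stray 0x80 byte.
raw = b"Before bad character XX\n\x80 BAD\nAfter bad character"
text = raw.decode("utf-8", errors="cp1252_fallback")
print(text)
```

With this handler the 0x80 byte is read as the Windows-1252 euro sign, so the result can be re-encoded as clean UTF-8 before it reaches the parser.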
Assuming that's not the problem, it's theoretically possible there's a bug in the BS lxml treebuilder where it treats the same input events differently on different platforms, but it's much more likely there's a bug in lxml that causes it to send different input events on different platforms. I say this because Beautiful Soup is pure Python and lxml is mainly a wrapper around a C library.
To actually solve the problem I'd advise trying to recreate it using only lxml code and filing a bug against lxml. Normally in this case I'd recommend starting with the diagnose.lxml_trace() function, which explains how lxml handles a document when there is no Beautiful Soup code running. However, just loading your sample markup string into lxml gives an exception:
>>> lxml_trace("<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bs4/diagnose.py", line 86, in lxml_trace
    print("%s, %4s, %s" % (event, element.tag, element.text))
  File "lxml.etree.pyx", line 947, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:44847)
  File "apihelpers.pxi", line 647, in lxml.etree._collectText (src/lxml/lxml.etree.c:20031)
  File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src/lxml/lxml.etree.c:26255)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 24: invalid start byte
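The position in that error message can be checked with plain Python, no lxml or Beautiful Soup involved. This is only a small sketch: it assumes the text of ElementB is exactly the bytes shown in the sample string, and the variable names are just for illustration.

```python
# Text content of ElementB from the sample markup, as raw bytes.
text_bytes = b"Before bad character XX\n\x80 BAD\nAfter bad character"

try:
    text_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    # Record where decoding stopped, why, and which byte caused it.
    failure = (err.start, err.reason, text_bytes[err.start:err.end])
else:
    failure = None

print(failure)
```

Byte 0x80 can only appear in UTF-8 as a continuation byte, never as the first byte of a character, which is why the decoder rejects it at that offset.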
You might learn something by comparing the output of this script to what I have:
from bs4 import UnicodeDammit
xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"
detwingled = UnicodeDammit.detwingle(xml)
unicoded = UnicodeDammit(xml).markup
from bs4.diagnose import lxml_trace
print lxml_trace(detwingled)
print
print lxml_trace(unicoded)
print
My output:
end, elementb, Before bad character XX
€ BAD
After bad character
end, elementc, In element C
end, elementa, None
end, body, None
end, html, None
None
end, elementb, Before bad character XX
€ BAD
After bad character
end, elementc, In element C
end, elementa, None
end, body, None
end, html, None
None