XML+Unicode: Certain input strings fail to fully parse

Bug #1471485 reported by C McNally
This bug affects 1 person
Affects: Beautiful Soup
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Using the BeautifulSoup + XML option on my machine, certain input strings fail to fully parse.

For example:

    from bs4 import BeautifulSoup
    xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"
    soup = BeautifulSoup(xml,"xml")
    soup

Gives incorrect output of:

    <?xml version="1.0" encoding="utf-8"?>
    <ElementA><ElementB/></ElementA>

But if I change just one thing and remove a single character 'X':

    from bs4 import BeautifulSoup
    xml = "<ElementA><ElementB>Before bad character X\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"
    soup = BeautifulSoup(xml,"xml")
    soup

Then I get the more complete output of:

    <?xml version="1.0" encoding="utf-8"?>
    <ElementA><ElementB>Before bad character X
     BAD
    After bad character</ElementB><ElementC>In element C</ElementC></ElementA>

The same problem also occurs with this string (this one contains only valid characters):

    xml = "<ElementA><ElementB>Before bad character XX\n• BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"

Note that I asked about this bug on StackOverflow. I thought I had a workaround, but unfortunately it turns out to cause other problems, so I can no longer use it.
http://stackoverflow.com/questions/31126831/beautifulsoup-with-xml-fails-to-parse-full-unicode-strings

I first encountered the problem on BS4.3.2, however it still occurs even after upgrading to BS4.4.0.

My other installed packages are:
    beautifulsoup4 (4.4.0)
    cchardet (0.3.5)
    chardet (2.3.0)
    lxml (3.4.4)

I use Python 3.4.3 on OSX 10.10.3 (Yosemite).

Revision history for this message
Leonard Richardson (leonardr) wrote :

There are a number of open bugs against Beautiful Soup in which lxml behaves poorly on Mac OS X (bug 1105148) or Windows (bug 1438111, bug 1417011). I can't test on those platforms, and the problems don't occur on Linux, so there's not much I can do except offer advice.

The quick-and-dirty solution may be to run UnicodeDammit.detwingle() on your markup before giving it to lxml. detwingle() can solve a lot of problems with inconsistent encodings, and at the very least it will use REPLACEMENT CHARACTER to make sure the document comes out in UTF-8: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings
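
Roughly, I mean something like this (an untested sketch; detwingle() operates on bytestrings, so encode the string first):

from bs4 import BeautifulSoup, UnicodeDammit

xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"

# detwingle() expects bytes, so encode the str before cleaning it up.
data = xml.encode("utf-8")
cleaned = UnicodeDammit.detwingle(data)
soup = BeautifulSoup(cleaned, "xml")
print(soup.prettify())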

I've noticed that lxml will behave oddly or segfault on Mac OS X depending on how it was installed. (See bug 197243 for instance). That's something to look into. See http://lxml.de/FAQ.html#my-application-crashes-on-macos-x for more information.

Assuming that's not the problem, it's theoretically possible there's a bug in the BS lxml treebuilder where it treats the same input events differently on different platforms, but it's much more likely there's a bug in lxml that causes it to send different input events on different platforms. I say this because Beautiful Soup is pure Python and lxml is mainly a wrapper around a C library.

To actually solve the problem I'd advise trying to recreate it using only lxml code and filing a bug against lxml. Normally in this case I'd recommend starting with the diagnose.lxml_trace() function, which explains how lxml handles a document when there is no Beautiful Soup code running. However, just loading your sample markup string into lxml gives an exception:

>>> lxml_trace("<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bs4/diagnose.py", line 86, in lxml_trace
    print("%s, %4s, %s" % (event, element.tag, element.text))
  File "lxml.etree.pyx", line 947, in lxml.etree._Element.text.__get__ (src/lxml/lxml.etree.c:44847)
  File "apihelpers.pxi", line 647, in lxml.etree._collectText (src/lxml/lxml.etree.c:20031)
  File "apihelpers.pxi", line 1373, in lxml.etree.funicode (src/lxml/lxml.etree.c:26255)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 24: invalid start byte
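
If you want to take lxml_trace() out of the picture entirely, a bare-bones reproduction to file upstream could look something like this (an untested sketch that feeds lxml the UTF-8 bytes directly):

from io import BytesIO
from lxml import etree

xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"

# Print the "end" events lxml generates for the encoded document; if these
# differ between platforms, that is something to report against lxml.
for event, element in etree.iterparse(BytesIO(xml.encode("utf-8"))):
    print(event, element.tag, element.text)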

You might learn something by comparing the output of this script to what I have:

from bs4 import UnicodeDammit

xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"
detwingled = UnicodeDammit.detwingle(xml)
unicoded = UnicodeDammit(xml).markup

from bs4.diagnose import lxml_trace
print lxml_trace(detwingled)
print
print lxml_trace(unicoded)
print

My output:

end, elementb, Before bad character XX
€ BAD
After bad character
end, elementc, In element C
end, elementa, None
end, body, None
end, html, None
None

end, elementb, Before bad character XX
€ BAD
After bad character
end, elementc, In element C
end, elementa, None
end, body, None
end, html, None
None

Changed in beautifulsoup:
status: New → Incomplete
Revision history for this message
C McNally (theconor) wrote : Re: [Bug 1471485] Re: XML+Unicode: Certain input strings fail to fully parse

Hi Leonard, thank you for the detailed reply and helpful advice. I will
try some of this stuff tonight and will report back.

Revision history for this message
C McNally (theconor) wrote :

Hi, I've done some more digging and I'm wondering if it could be a Python 3 problem? I notice that your print statements indicate you are using Python 2.x, and indeed, when I run the same test on Python 2.7 it works for me too. The input string also works fine using lxml on its own.

I have updated the test case to the following:

from bs4 import BeautifulSoup
from bs4 import UnicodeDammit
from lxml import etree
#from bs4.diagnose import lxml_trace  # didn't work for me on Python 3.4; see the replacement below

for i in range(0,64):

    xml = u"<A><B>%s \xe2 BAD</B>" \
          u"<C>next</C></A>"%(i*u"X",)
    xml = UnicodeDammit.detwingle(xml) # Doesn't make any difference

    soup = BeautifulSoup(xml,"xml")
    print("\n",i,"\nsoup: ", soup)

    root = etree.fromstring(xml)
    print("lxml: ", etree.tostring(root, encoding='unicode'))

    if soup.C is None:
        print("\n\nERROR!\n\n")
        lxml_trace(xml, html=False)
        break

On Python 3.4, on my machine, the error happens exactly on iteration #37. All other iterations from 0 to 63 work fine, so could it be something to do with the way the string is chunked when it is passed through to lxml?
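
One way to poke at that idea directly might be lxml's feed-parser interface, feeding the document in two pieces split inside the multi-byte character (an untested sketch; the split point is just a guess at what might be going on):

from lxml import etree

xml = u"<A><B>%s \xe2 BAD</B><C>next</C></A>" % (37 * u"X")
data = xml.encode("utf-8")

parser = etree.XMLParser()
# Split in the middle of the two-byte UTF-8 encoding of '\xe2' to see whether
# the parser loses any text around the seam.
split = data.index(b"\xc3") + 1
parser.feed(data[:split])
parser.feed(data[split:])
root = parser.close()
print(etree.tostring(root, encoding="unicode"))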

I tried running lxml_trace() as you suggested but found it did not work in my Python 3.4 environment. Could there be something wrong with my installation? Anyway, I created a new version as follows:

def lxml_trace(data, html=True, **kwargs):
    """Hacked-up version of bs4.diagnose.lxml_trace for Python 3."""
    from io import BytesIO
    from lxml import etree
    # Encode the str to bytes before handing it to lxml's iterparse().
    for event, element in etree.iterparse(BytesIO(data.encode('utf-8')), html=html, **kwargs):
        print("%s, %4s, %s" % (event, element.tag, element.text))

And got:

end, B, XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX â BAD
end, C, next
end, A, None

So all that looks OK too, though I was unsure about adding the data.encode() part. I'm quite new to Python, so I'm afraid I find this quite complicated. I did find this FAQ entry on lxml that suggests I should have been using bytes all along:
http://lxml.de/FAQ.html#can-lxml-parse-from-file-objects-opened-in-unicode-text-mode

Using bytes was actually the fix I stumbled across on my StackOverflow post. It did look like it was mostly working (it definitely captured all of the elements), but unfortunately a small number of documents had corrupt characters in the output. My plan for dealing with that was to pass every piece of text through detwingle (or actually ftfy) again, which would be a pretty horrible hack.
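
For reference, that bytes workaround was roughly this (simplified from the StackOverflow post):

from bs4 import BeautifulSoup

xml = "<ElementA><ElementB>Before bad character XX\n\x80 BAD\nAfter bad character</ElementB><ElementC>In element C</ElementC></ElementA>"

# Passing bytes instead of a str keeps all of the elements, but some real
# documents still came back with corrupted characters in the text.
soup = BeautifulSoup(xml.encode("utf-8"), "xml")
print(soup.ElementC)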

Finally, I have tried to test this on a Windows machine to see if there are any differences, but have so far been unable to install lxml. I'd be interested to hear whether the test case works on your Linux machine running Python 3.4.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I ran your test code on Linux using Python 3.4.3 and lxml 3.7.0 and did not get any errors. Looking at the changelogs for the past few versions of lxml, I don't see any bug fixes that seem relevant. The only Mac OS X specific change was a fix to a build error.

Revision history for this message
C McNally (theconor) wrote :

Hi Leonard,

Thanks for looking at this. I've actually got a brand-new Mac now, so everything is reinstalled from scratch, yet I still get the same exception at iteration #37 when running the test script. It must be something to do with how the data is chunked before being decoded, but I wouldn't know what.

My current config is:

beautifulsoup4 (4.5.1)
lxml (3.6.4)
libxml2.2
macOS Sierra 10.12.1
Python 3.5.2

Let me know if there's anything else I should check.

Regards,

Conor

Revision history for this message
Isaac Muse (facelessuser) wrote :

I cannot reproduce any of these issues on my Mac. I am on 10.13.6, which is not a particularly new version either.

I suspect that whatever the issue was, it has since been fixed, and that it was specific to a very particular combination of versions. As this issue is years old, I would consider just closing it and seeing if it resurfaces.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Thanks for looking into this, Isaac.

Changed in beautifulsoup:
status: Incomplete → Won't Fix