Beautiful Soup

nzb file parsing failing

Bug #1034883 reported by Andreas Kostyrka on 2012-08-09

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

partimag@andidesk:/tmp$ ./xmltest test3.nzb
BS4 /usr/local/lib/python2.6/dist-packages/beautifulsoup4-4.1.1-py2.6.egg/bs4/__init__.pyc
test3.nzb segment found 3
     1. <segment bytes="84703" number="52"><email address hidden></segment>
     2. <segment bytes="793586" number="6"><email address hidden></segment>
     3. <segment bytes="793332" number="33">1342408408</segment>

xmltest and test3.nzb are attached.

Platform is Python 2.6 on Ubuntu 10.04 LTS.

Revision history for this message

Andreas Kostyrka (andreas-kostyrka) wrote on 2012-08-09:

data file bs411 is borking on. (6.7 KiB, text/xml)

Andreas Kostyrka (andreas-kostyrka) wrote on 2012-08-09:

test driver (388 bytes, text/plain)

Leonard Richardson (leonardr) wrote on 2012-08-09:

To make a long story short, this is a bug in lxml. Specifically, it's bug 963936, which prevents any substantial Unicode XML document from being parsed by lxml through the feed() interface.

Version 4.0.1 of Beautiful Soup worked around this by passing in the document in 512-character chunks. This workaround exposed Beautiful Soup bug 972466: HTML documents whose <meta> tags declared an encoding other than UTF-8 became mangled, due to an unknown bug in lxml--possibly the same as bug 963936, possibly a different bug, but certainly a related one, since both involve Unicode data.
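
For readers unfamiliar with the chunked-feed idea, here is a minimal sketch of it. It is not Beautiful Soup's actual code: it uses the standard library's xml.etree.ElementTree.XMLParser as a stand-in for lxml's feed() interface, it assumes Python 3, and the helper names iter_chunks and parse_in_chunks are made up for illustration.

```python
import xml.etree.ElementTree as ET

CHUNK_SIZE = 512  # the chunk size mentioned in the comment above


def iter_chunks(text, size=CHUNK_SIZE):
    """Yield successive slices of `text`, each at most `size` characters long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def parse_in_chunks(text, size=CHUNK_SIZE):
    """Feed a document to an incremental parser one chunk at a time.

    ElementTree's XMLParser is only a stand-in here; the thread is about
    lxml's feed() interface, which is used in the same feed/close pattern.
    """
    parser = ET.XMLParser()
    for chunk in iter_chunks(text, size):
        parser.feed(chunk)
    return parser.close()  # returns the root element


# A made-up document that is long enough to span several chunks.
doc = "<nzb>" + "".join(
    '<segment number="%d">id%d</segment>' % (i, i) for i in range(100)
) + "</nzb>"

root = parse_in_chunks(doc)
segment_count = len(root.findall("segment"))
```

The point of the pattern is only that the parser never receives the whole document in a single feed() call.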

Your NZB file starts with this line:

<?xml version="1.0" encoding="iso-8859-1" ?>

Replace it with this line and the file will parse:

<?xml version="1.0" encoding="iso-8859-1" ?>

It looks like you've reproduced bug 972466 with an XML document. The 'encoding' attribute in the XML declaration triggers the same underlying lxml bug as does the <meta> tag declaration in an HTML document.

For HTML documents, the solution was to remove the 963936 workaround (in Beautiful Soup 4.0.2). lxml's HTML parser doesn't have 963936. But the XML parser does have 963936, and the bug has yet to be fixed upstream. That bug is much worse than this one, so the Beautiful Soup workaround needs to stay in place.

I recommend parsing your documents as HTML. Both the built-in parser and lxml's HTML parser find all the <segment> tags in the document.
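
As a rough illustration of why an HTML parser copes with this kind of file, the sketch below collects <segment> tags using Python's standard html.parser module directly, with no Beautiful Soup involved. It assumes Python 3; the SegmentCollector class and the sample text inside each segment are hypothetical, since the real message IDs are hidden in the report.

```python
from html.parser import HTMLParser


class SegmentCollector(HTMLParser):
    """Collect the attributes and text of every <segment> element."""

    def __init__(self):
        super().__init__()
        self.segments = []
        self._current = None  # the segment being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "segment":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "segment" and self._current is not None:
            self.segments.append(self._current)
            self._current = None


# Made-up sample in the shape of an NZB file; the segment text is a placeholder.
sample = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>'
    "<nzb><file><segments>"
    '<segment bytes="84703" number="52">part52</segment>'
    '<segment bytes="793586" number="6">part6</segment>'
    "</segments></file></nzb>"
)

collector = SegmentCollector()
collector.feed(sample)
collector.close()
```

An HTML parser treats the XML declaration as an ordinary processing instruction, so the declared encoding never reaches the XML code path discussed above.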

You can also use UnicodeDammit to convert the documents to Unicode, then replace 'encoding="iso-8859-1"' with 'encoding="utf8"'. The lxml XML parser will then parse the documents correctly.
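
A minimal sketch of that second approach follows, under a few assumptions: the source encoding is already known to be iso-8859-1 (so a plain decode() stands in for UnicodeDammit's detection), Python 3 is in use, and the standard library's ElementTree stands in for the lxml XML parser. The function name to_utf8_xml is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute of a leading XML declaration.
DECL_ENCODING = re.compile(
    rb"""^(\s*<\?xml[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2"""
)


def to_utf8_xml(raw, source_encoding="iso-8859-1"):
    """Re-encode `raw` bytes as UTF-8 and make the XML declaration agree.

    Assumes the caller knows the source encoding; UnicodeDammit would be
    the tool to use when it has to be detected.
    """
    utf8 = raw.decode(source_encoding).encode("utf-8")
    return DECL_ENCODING.sub(rb"\g<1>\g<2>utf-8\g<2>", utf8, count=1)


# Made-up sample with one non-ASCII character to show the re-encoding.
raw = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    '<nzb><segment bytes="1" number="1">caf\xe9</segment></nzb>'
).encode("iso-8859-1")

fixed = to_utf8_xml(raw)
root = ET.fromstring(fixed)
segment_text = root.find("segment").text
```

Both steps matter: changing only the declaration would leave ISO-8859-1 bytes labelled as UTF-8, and re-encoding without touching the declaration would leave UTF-8 bytes labelled as ISO-8859-1.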

Leonard Richardson (leonardr) wrote on 2012-08-09:

Correction to my previous comment. That part should read:

Replace it with this line and the file will parse:

<?xml version="1.0" encoding="utf-8" ?>

Andreas Kostyrka (andreas-kostyrka) wrote on 2012-08-09:

Well, parsing as HTML is probably not a solution, because I need a
parse-modify-save workflow, but I think I can just convert the file to
UTF-8 in a rather rough way without too much pain.

Thanks,

Andreas
