nzb file parsing failing

Reported by Andreas Kostyrka on 2012-08-09
This bug affects 1 person
Affects: Beautiful Soup
Importance: Undecided
Assigned to: Unassigned

Bug Description

partimag@andidesk:/tmp$ ./xmltest test3.nzb
BS4 /usr/local/lib/python2.6/dist-packages/beautifulsoup4-4.1.1-py2.6.egg/bs4/__init__.pyc
test3.nzb segment found 3
     1. <segment bytes="84703" number="52"><email address hidden></segment>
     2. <segment bytes="793586" number="6"><email address hidden></segment>
     3. <segment bytes="793332" number="33">1342408408</segment>

xmltest and test3.nzb are attached.

Platform is Py 2.6 on Ubuntu 10.04LTS

Leonard Richardson (leonardr) wrote :

To make a long story short, this is a bug in lxml. Specifically, it's bug 963936, which prevents any substantial Unicode XML document from being parsed by lxml through the feed() interface.

Version 4.0.1 of Beautiful Soup worked around this by passing in the document in 512-character chunks. This workaround exposed Beautiful Soup bug 972466: HTML documents whose <meta> tags declared an encoding other than UTF-8 became mangled, due to an unknown bug in lxml. That bug is possibly the same as bug 963936 and possibly a different one, but certainly a related one, since they both involve Unicode data.

Your NZB file starts with this line:

 <?xml version="1.0" encoding="iso-8859-1" ?>

Replace it with this line and the file will parse:

 <?xml version="1.0" encoding="iso-8859-1" ?>

It looks like you've reproduced bug 972466 with an XML document. The 'encoding' attribute in the XML declaration triggers the same underlying lxml bug as does the <meta> tag declaration in an HTML document.

For HTML documents, the solution was to remove the 963936 workaround (in Beautiful Soup 4.0.2), since lxml's HTML parser isn't affected by bug 963936. The XML parser is affected, though, and the bug has yet to be fixed upstream. Bug 963936 is much worse than this one, so the Beautiful Soup workaround needs to stay in place for XML.

I recommend parsing your documents as HTML. Both the built-in parser and lxml's HTML parser find all the <segment> tags in the document.
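
A minimal sketch of the HTML-parser approach, assuming Beautiful Soup 4 is installed. SAMPLE_NZB and find_segments are hypothetical names used only for illustration. The sample data is modeled on the output in the report, with placeholder message IDs, and is not the attached test3.nzb. Recent Beautiful Soup releases may warn that an XML document is being parsed as HTML; the segments should still be found.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the attached test3.nzb (placeholder message IDs).
SAMPLE_NZB = b"""<?xml version="1.0" encoding="iso-8859-1" ?>
<nzb>
 <file subject="example">
  <segments>
   <segment bytes="84703" number="52">part52@example.invalid</segment>
   <segment bytes="793586" number="6">part6@example.invalid</segment>
   <segment bytes="793332" number="33">1342408408</segment>
  </segments>
 </file>
</nzb>
"""


def find_segments(data):
    """Parse NZB data with Python's built-in HTML parser and return its <segment> tags."""
    soup = BeautifulSoup(data, "html.parser")
    return soup.find_all("segment")


segments = find_segments(SAMPLE_NZB)
for seg in segments:
    # Attribute values and text come back as Unicode strings.
    print(seg["number"], seg["bytes"], seg.get_text())
```

This only reads the document. An HTML parser may normalize the markup (for example by lowercasing tag names), so it is not a good fit for writing the XML back out.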

You can also use UnicodeDammit to convert the documents to Unicode, then replace 'encoding="iso-8859-1"' with 'encoding="utf8"'. The lxml XML parser will then parse the documents correctly.
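
The UnicodeDammit route might look roughly like the sketch below, assuming Beautiful Soup 4 is installed. redeclare_as_utf8 and SAMPLE are hypothetical names, and the regular expression is only one rough way to rewrite the declaration. It writes "utf-8", the spelling used in the corrected declaration later in this thread.

```python
import re

from bs4 import UnicodeDammit

# Matches the encoding="..." attribute inside a leading XML declaration.
_DECL_ENCODING = re.compile(r'^(\s*<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def redeclare_as_utf8(raw_bytes):
    """Decode raw_bytes to Unicode and declare the result as UTF-8.

    UnicodeDammit picks the source encoding (here it should honor the
    iso-8859-1 declaration); the regex then rewrites only the declaration.
    """
    text = UnicodeDammit(raw_bytes).unicode_markup
    return _DECL_ENCODING.sub(r"\g<1>\g<2>utf-8\g<2>", text, count=1)


# Hypothetical sample with one non-ASCII byte (0xE9 is "e acute" in ISO-8859-1).
SAMPLE = b'<?xml version="1.0" encoding="iso-8859-1" ?>\n<nzb><file subject="caf\xe9"/></nzb>\n'

converted = redeclare_as_utf8(SAMPLE)
utf8_bytes = converted.encode("utf-8")  # bytes suitable for saving back to disk
print(converted)
```

The converted string could then be passed to BeautifulSoup(converted, "xml"), which requires lxml, and the modified tree encoded as UTF-8 when saving.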

Leonard Richardson (leonardr) wrote :

Should be:

Replace it with this line and the file will parse:

 <?xml version="1.0" encoding="utf-8" ?>

Well, parsing as HTML is probably not a solution, because I need a
parse-modify-save workflow, but I think I can just convert the file to
UTF-8 in a rather rough way without too much pain.

Thanks,

Andreas

Leonard Richardson (leonardr) wrote :

There's now a fix for bug 963936, but it'll take a while to make it into released versions of lxml.

Changed in beautifulsoup:
status: New → Won't Fix