Strange behaviour in python-bs4

Bug #973129 reported by Michael Pitra
This bug affects 1 person
Affects                  Status   Importance  Assigned to  Milestone
beautifulsoup4 (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

After upgrading from 4.0.1-1 to 4.0.2-1, my python script broke with the following behaviour:

original text (for harvesting):

[...]
<body nof="(MB=(DefaultMasterborder, 65, 60, 150, 10), L=(HomeLayout, 700, 600))" bgcolor="#EAF7F7" text="#000000" link="#0033CC" vlink="#990099" alink="#FF0000" topmargin=0 leftmargin=0 marginwidth=0 marginheight=0>^M
<form method="post" action="">^M <table cellspacing="0" cellpadding="0" width="770" nof="ly">^M
[...]

souped text (printed with prettify()):

[...]
 <body alink="#FF0000" bgcolor="#EAF7F7" leftmargin="0" link="#0033CC" marginheight="0" marginwidth="0" nof="(MB=(DefaultMasterborder, 65, 60, 150, 10), L=(HomeLayout, 700, 600))" text="#000000" topmargin="0" vlink="#990099"> <form action="" method="post">
   <table cellpa="" cellspacing="0">
    d d i n g = " 0 " w i d t h = " 7 7 0 " n o f = " l y " &gt; ^M
[...]

This seems to have started with the change to StringIO in beautifulsoup4 (when using the lxml parser), which reads the document with a fixed chunk size. I'm fairly convinced this is a bug in bs4 itself. Could someone forward it to the upstream team?
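The suspicion about the fixed chunk size can be illustrated with a minimal sketch. It does not use bs4 or lxml; it uses the standard library's html.parser as a stand-in, and the names TagCollector, parse_whole and parse_chunked as well as the chunk size of 16 are assumptions made only for this illustration. The point is the invariant a chunked feed should satisfy: reading the markup through StringIO in fixed-size pieces must yield the same tags and attributes as parsing it in one go, even when a chunk boundary falls in the middle of an attribute such as cellpadding.

```python
from html.parser import HTMLParser
from io import StringIO


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def parse_whole(markup):
    """Feed the complete document to the parser in a single call."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


def parse_chunked(markup, chunk_size=16):
    """Feed the document in fixed-size chunks read from a StringIO."""
    parser = TagCollector()
    stream = StringIO(markup)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser.tags


# Markup taken from the report, without the carriage returns.
MARKUP = (
    '<form method="post" action="">'
    '<table cellspacing="0" cellpadding="0" width="770" nof="ly">'
)

if __name__ == "__main__":
    # Both paths should report the same tags and attributes.
    print(parse_whole(MARKUP) == parse_chunked(MARKUP))
```

If the two results ever differ for some chunk size, the chunking layer (not the markup) is the likely culprit, which is the kind of failure the truncated cellpa attribute in the output above points to.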

Michael Pitra (mortomanos) wrote :

This is related to https://bugs.launchpad.net/beautifulsoup/+bug/972466, so it can be marked as duplicate.

Changed in beautifulsoup4 (Ubuntu):
status: New → Invalid