Strange behaviour in python-bs4
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
beautifulsoup4 (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
After upgrading from 4.0.1-1 to 4.0.2-1, my Python script broke with the following behaviour:
original text (for harvesting):
[...]
<body nof="(MB=
<form method="post" action="">^M <table cellspacing="0" cellpadding="0" width="770" nof="ly">^M
[...]
souped text (printed with prettify()):
[...]
<body alink="#FF0000" bgcolor="#EAF7F7" leftmargin="0" link="#0033CC" marginheight="0" marginwidth="0" nof="(MB=
<table cellpa="" cellspacing="0">
d d i n g = " 0 " w i d t h = " 7 7 0 " n o f = " l y " > ^M
[...]
This seems to have started after beautifulsoup4 switched to reading the input through StringIO with a fixed chunk size (when using the lxml parser), so I'm fairly convinced this is a bug in bs4 itself. Could someone forward this to the upstream team?
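For illustration, here is a minimal sketch of what feeding a parser in fixed-size chunks from a StringIO looks like, and what result I would expect from it. It deliberately uses Python's standard-library html.parser rather than bs4 or lxml, so it does not reproduce the bug. The class name TagCollector, the helper parse_in_chunks and the chunk size of 50 are my own assumptions, chosen so that a chunk boundary falls inside the <table> tag.

```python
import io
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def parse_in_chunks(markup, chunk_size):
    """Feed markup to the parser in fixed-size pieces read from a StringIO."""
    parser = TagCollector()
    stream = io.StringIO(markup)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)
    parser.close()
    return parser.tags


# Shortened version of the markup from this report, including the CR LF line ends.
MARKUP = (
    '<form method="post" action="">\r\n'
    '<table cellspacing="0" cellpadding="0" width="770" nof="ly">\r\n'
    '</table></form>'
)

# Reference result: the whole document delivered in a single chunk.
whole = parse_in_chunks(MARKUP, len(MARKUP))
# Hypothetical chunk size; the first boundary lands inside the <table> tag.
chunked = parse_in_chunks(MARKUP, 50)

# A parser that handles incremental input correctly gives the same tags either way.
print(whole == chunked)
```

With correct incremental parsing the attributes of the <table> tag stay intact no matter where the chunk boundary falls, which is what I would expect from bs4 with lxml as well.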
This is related to https://bugs.launchpad.net/beautifulsoup/+bug/972466, so it can be marked as a duplicate.