Cleaning html file cleans it wrong

Bug #671636 reported by Ravi
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

I am using the default Cleaner, i.e. lxml.html.clean.Cleaner(), to clean an HTML file, the home page of the FSF. The cleaned output is wrong: some of the page structure is removed, and the start and end tags of the style element are stripped while the contents between them remain.
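For readers who want to try the Cleaner settings involved, here is a minimal sketch. The sample markup is hypothetical and is not the FSF page from the attachment, so it is not expected to reproduce the reported behaviour. It assumes lxml is installed; newer lxml releases ship the cleaner as the separate lxml_html_clean package, hence the fallback import. The `style` and `page_structure` keyword arguments are existing Cleaner options.

```python
# Minimal sketch; SAMPLE_HTML is a hypothetical stand-in, not the FSF home page.
try:
    from lxml.html.clean import Cleaner
except ImportError:
    # Newer lxml releases provide the cleaner in the lxml_html_clean package.
    from lxml_html_clean import Cleaner

SAMPLE_HTML = (
    "<html><head><title>Sample</title>"
    "<style>p { color: red; }</style></head>"
    "<body><p>Hello</p></body></html>"
)

# Default settings, as used in the report.
default_output = Cleaner().clean_html(SAMPLE_HTML)

# style=True asks the cleaner to drop <style> elements together with their
# content; page_structure=False keeps <html>, <head> and <title> in place.
strict_output = Cleaner(style=True, page_structure=False).clean_html(SAMPLE_HTML)

print(default_output)
print(strict_output)
```

Comparing the two printed results on the real file may help show whether the leftover CSS text comes from the cleaner options or from how the document was parsed.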

Attached is the tar of the original file and the cleaned version.

Ravi (ra-ravi-rav-gmail) wrote :

The same happens with the rest of the style tags: the content between the start tag and the end tag is not removed.

scoder (scoder) wrote :

This is more likely a problem with the HTML parser in libxml2 than a Cleaner issue.
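One way to check this, sketched under assumptions, is to parse the file without running the Cleaner at all and look at what the parser made of the style elements. The helper name `style_texts_after_parse` and the sample markup are hypothetical; `lxml.html.document_fromstring` and `iter()` are existing lxml calls.

```python
# Sketch: inspect <style> elements right after parsing, with no cleaning step.
import lxml.html


def style_texts_after_parse(html_text):
    """Return the text of every <style> element the parser produced."""
    doc = lxml.html.document_fromstring(html_text)
    return [(el.text or "") for el in doc.iter("style")]


# Hypothetical well-formed sample; substitute the contents of the real file.
SAMPLE = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Hello</p></body></html>"
)

texts = style_texts_after_parse(SAMPLE)
print(texts)
```

If the style elements are already missing or their text has ended up elsewhere in the tree at this point, the parser rather than the Cleaner is the place to look.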

Changed in lxml:
status: New → Invalid