Tag replacement

Bug #1833083 reported by Alexander Lebedev
This bug affects 1 person
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

`Cleaner().clean_html()` replaces the root `html` tag with `div`.

Example code:

In [20]: import lxml

In [21]: import lxml.html

In [22]: from lxml.html.clean import Cleaner

In [23]: text = u"""<html>
    ...: <body>
    ...: <h1>Hello, Parsel!</h1>
    ...: <ul>
    ...: <li><a href="http://example.com">Link 1</a></li>
    ...: <li><a href="http://scrapy.org">Link 2</a></li>
    ...: </ul>
    ...: </body>
    ...: </html>"""

In [24]: html_root = lxml.html.document_fromstring(text)

In [25]: html = Cleaner().clean_html(html_root)

In [26]: lxml.html.tostring(html)
Out[26]: '<div>\n <body>\n <h1>Hello, Parsel!</h1>\n <ul>\n <li><a href="http://example.com">Link 1</a></li>\n <li><a href="http://scrapy.org">Link 2</a></li>\n </ul>\n </body>\n </div>'

Removed by request (removed6476464) wrote :

Hi,

I have noticed the same behavior. The `html` tag is replaced by a `div`, which breaks a lot of things. If you parse a document from a string and serialize it back to a string, a `div` is added for no reason, and that breaks XPath expressions, for example.

If `page_structure` is set to `False`, the `html` tag isn't replaced or removed, but any `div` elements in the `head` tag are still moved into the `body` tag.

Bug or feature?
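
For comparison, here is a minimal sketch of the two behaviors described above, using a shortened version of the document from the report. It assumes `lxml.html.clean` is importable; on newer lxml releases this may require the separate `lxml_html_clean` package. The variable names `default_result` and `kept_result` are illustrative only.

```python
import lxml.html
# Assumption: on newer lxml releases this import may need the
# separate lxml_html_clean package to be installed.
from lxml.html.clean import Cleaner

text = """<html>
<body>
<h1>Hello, Parsel!</h1>
</body>
</html>"""

html_root = lxml.html.document_fromstring(text)

# Default settings (page_structure=True): as reported above,
# the root <html> element comes back renamed to <div>.
default_result = Cleaner().clean_html(html_root)

# page_structure=False: as noted in the comment above, the
# <html> tag is neither replaced nor removed.
kept_result = Cleaner(page_structure=False).clean_html(html_root)

print(default_result.tag)
print(kept_result.tag)
```

Passing an element to `clean_html` returns a cleaned copy, so `html_root` itself is left unchanged and can be reused for both calls.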
