lxml

Tag replacement

Bug #1833083 reported by Alexander Lebedev on 2019-06-17

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
lxml	New	Undecided	Unassigned	(none)

Bug Description

`Cleaner().clean_html()` replaces the root `html` tag with `div`.

Example code:

In [20]: import lxml

In [21]: import lxml.html

In [22]: from lxml.html.clean import Cleaner

In [23]: text = u"""<html>
    ...: <body>
    ...: <h1>Hello, Parsel!</h1>
    ...: <ul>
    ...: <li><a href="http://example.com">Link 1</a></li>
    ...: <li><a href="http://scrapy.org">Link 2</a></li>
    ...: </ul>
    ...: </body>
    ...: </html>"""

In [24]: html_root = lxml.html.document_fromstring(text)

In [25]: html = Cleaner().clean_html(html_root)

In [26]: lxml.html.tostring(html)
Out[26]: '<div>\n <body>\n <h1>Hello, Parsel!</h1>\n <ul>\n <li><a href="http://example.com">Link 1</a></li>\n <li><a href="http://scrapy.org">Link 2</a></li>\n </ul>\n </body>\n </div>'


Alexander Lebedev (woutut) on 2019-10-20

description:	updated


Removed by request (removed6476464) wrote on 2020-06-24:

Hi,

I have noticed the same behavior. The `html` tag is replaced by a `div`, which breaks a lot of things. If you load a document from a string and then serialize it back to a string, a `div` is added for no apparent reason, which breaks XPath expressions, for example.

If `page_structure` is set to `False`, the `html` tag isn't replaced or removed, but any divs inside the `head` tag are still moved into the `body` tag.

Bug or feature?
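
The `page_structure` observation in the comment above can be checked with a short script. This is only a sketch and not part of the original report: it assumes lxml is installed (newer lxml releases ship the cleaner as the separate `lxml_html_clean` package, hence the fallback import), and the names `default_root` and `kept_root` are made up for illustration.

```python
import lxml.html

try:
    # Older lxml releases bundle the cleaner.
    from lxml.html.clean import Cleaner
except ImportError:
    # Assumption: newer releases need the separate lxml_html_clean package.
    from lxml_html_clean import Cleaner

text = """<html>
<body>
<h1>Hello, Parsel!</h1>
<ul>
<li><a href="http://example.com">Link 1</a></li>
<li><a href="http://scrapy.org">Link 2</a></li>
</ul>
</body>
</html>"""

root = lxml.html.document_fromstring(text)

# clean_html() works on a copy when given an element, so `root` can be reused.
default_root = Cleaner().clean_html(root)
kept_root = Cleaner(page_structure=False).clean_html(root)

print(default_root.tag)  # root tag after cleaning with default settings
print(kept_root.tag)     # root tag after cleaning with page_structure=False
```

On the lxml versions this was written against, the script prints `div` and then `html`. That matches the report: with the default `page_structure=True` the cleaner treats `html` as a tag to strip, and because the root element cannot be dropped it appears to be renamed to `div` instead. Passing `page_structure=False` may be an acceptable workaround when the document structure has to survive cleaning.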
