html cleaner assert parent is not None AssertionError

Bug #1851029 reported by guyskk
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
lxml
Triaged
Low
Unassigned

Bug Description

The html cleaner raise AssertionError when both root tag and child tag in kill_tags set.

Reproduce:
```
from lxml.html.clean import Cleaner

html_cleaner = Cleaner(kill_tags=['pre', 'code'])

content = '<pre><code>xxx</code></pre>'
print(html_cleaner.clean_html(content))
```

Output:
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src/lxml/html/clean.py", line 520, in lxml.html.clean.Cleaner.clean_html
  File "src/lxml/html/clean.py", line 394, in lxml.html.clean.Cleaner.__call__
  File "/Users/kk/.pyenv/versions/rssant/lib/python3.7/site-packages/lxml/html/__init__.py", line 339, in drop_tree
    assert parent is not None
AssertionError
```

Expected output:
```
<div></div>
```

I tested in Python 3.7.4, lxml-4.3.3 and lxml-4.4.1.

Python : sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

The bug seems similar to https://bugs.launchpad.net/lxml/+bug/773715, but occurs in different line.

Revision history for this message
scoder (scoder) wrote :

Hmm, yes, it's probably worth reconsidering that assertion. In the end, what you're asking for is to discard the entire "document". That's a valid thing to do, and it should not run into an AssertionError.
OTOH, what should it return in such a case?
Maybe it could be enough to remove all content from the tag (children and inner text), and then still return it?

Changed in lxml:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
guyskk (guyskk) wrote :

It's OK to return empty div in the case.

When I remove the code tag in the input content, it return empty div, I think it's fine.
lxml only raise AssertError when both root tag and child tag in kill_tags set.

```
>>> from lxml.html.clean import Cleaner
>>> html_cleaner = Cleaner(kill_tags=['pre', 'code'])
>>> content = '<pre>xxx</pre>'
>>> print(html_cleaner.clean_html(content))
<div></div>
```

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.