lxml

HTMLParser has useless option "strip_cdata"

Bug #2067707 reported by James Hewitt on 2024-05-31

This bug affects 1 person

| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| lxml | Fix Committed | Low | scoder | lxml 5.3 |

Bug Description

This may very well be PEBCAK, but when I use the HTMLParser, CDATA content is lost.

Unsurprisingly, the XMLParser converts it to text nodes as expected.

The [API doc](https://lxml.de/api/lxml.etree.HTMLParser-class.html) implies that the HTMLParser should do the same, but it doesn't.

```python
#!/usr/bin/env python

from lxml import etree

document = '<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body></html>'

root = etree.fromstring(document, etree.HTMLParser())
print(etree.tostring(root))

root = etree.fromstring(document, etree.XMLParser())
print(etree.tostring(root))
```

```sh
(venv) jammy@ibm007470:~/temp/lxml$ ./cdata.py
b'<html><head><title/></head><body/></html>'
b'<html><head><title>title</title></head><body>body</body></html>'
(venv) jammy@ibm007470:~/temp/lxml$ pip list
Package Version
------- -------
lxml 5.2.2
pip 23.3.2
```

```
Python : sys.version_info(major=3, minor=12, micro=3, releaselevel='final', serial=0)
lxml.etree : (5, 2, 2, 0)
libxml used : (2, 12, 6)
libxml compiled : (2, 12, 6)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)
```

Revision history for this message

scoder (scoder) wrote on 2024-06-02:

"CDATA" is an XML thing. It has no meaning in HTML:
https://developer.mozilla.org/en-US/docs/Web/API/CDATASection#Specifications

Try this with plain libxml2:
```sh
$ echo '<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body></html>' | xmllint --html -
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
^
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title></title></head>
<body></body>
</html>
```
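
If the CDATA text needs to survive an HTML parse, one possible workaround is to rewrite the CDATA sections into escaped plain text before the document reaches the parser. The following is only a minimal sketch using the standard library: the helper name `cdata_to_text` is hypothetical (it is not part of lxml), and the regular expression assumes well-formed, non-nested `<![CDATA[...]]>` sections.

```python
import html
import re

# Matches one CDATA section; non-greedy so adjacent sections stay separate.
_CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_text(document: str) -> str:
    """Replace each CDATA section with its content, escaped as plain text.

    Hypothetical helper for illustration, not an lxml API.
    """
    return _CDATA_RE.sub(
        lambda m: html.escape(m.group(1), quote=False), document
    )


document = (
    "<html><head><title><![CDATA[title]]></title></head>"
    "<body><![CDATA[a < b]]></body></html>"
)
print(cdata_to_text(document))
# <html><head><title>title</title></head><body>a &lt; b</body></html>
```

The rewritten string can then be passed to `etree.fromstring(..., etree.HTMLParser())` as in the original report. Note that this sketch escapes content everywhere, including inside `<script>` or `<style>` elements, where escaping may not be what you want.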

Changed in lxml:
status:	New → Invalid

James Hewitt (jammy) wrote on 2024-06-03:

So a couple of things then:
- In the API, HTMLParser has a strip_cdata option that says it replaces CDATA element by text, which is set to true by default. Should this just be removed from the API?
- The MDN web docs don't say it's not supported; they say "Note: CDATA sections should not be used within HTML they are considered as comments and not displayed." Comments are successfully retained by the HTMLParser and can be accessed in the tree, so why not CDATA?
- Another MDN web doc says it's not supported at all: https://developer.mozilla.org/en-US/docs/Web/API/Document/createCDATASection - "This will only work with XML, not HTML documents (as HTML documents do not support CDATA sections); attempting it on an HTML document will throw NOT_SUPPORTED_ERR."

I expect the right course of action is to treat them as unsupported:
- I've opened this for clarification of the MDN docs: https://github.com/mdn/content/issues/33894
- I think it makes sense to remove the strip_cdata option from the HTMLParser class. WDYT?

Changed in lxml:
status:	Invalid → New

scoder (scoder) wrote on 2024-06-04:

> In the API, HTMLParser has a strip_cdata option that says it replaces CDATA element by text, which is set to true by default. Should this just be removed from the API?

If it doesn't do anything, then deprecation seems a reasonable first step. We shouldn't break working user code by removing it.

I added a DeprecationWarning in
https://github.com/lxml/lxml/commit/fc57ffefe54509e7dbc76290665f4136de15b24a
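
To illustrate the deprecate-before-removing idea, here is a small sketch of one common pattern: warn only when the caller explicitly passes the option, and stay silent otherwise. This is not the code from the commit above; `make_parser` and the `_UNSET` sentinel are hypothetical names chosen for the example.

```python
import warnings

# Sentinel distinguishing "argument omitted" from any explicit value.
_UNSET = object()


def make_parser(*, remove_comments=False, strip_cdata=_UNSET):
    """Hypothetical factory; returns the effective options as a dict."""
    if strip_cdata is not _UNSET:
        # The option is accepted so existing calls keep working,
        # but it is ignored and the caller is told so.
        warnings.warn(
            "'strip_cdata' has no effect for HTML and is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
    return {"remove_comments": remove_comments}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    make_parser()                   # no warning: option not passed
    make_parser(strip_cdata=False)  # warns, but the call still succeeds

print([str(w.message) for w in caught])
```

Existing callers that pass the option keep working and get a hint to drop it, while callers that never passed it see no change.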

summary:	HTMLParser loses CDATA content → HTMLParser has useless option "strip_cdata"
Changed in lxml:
assignee:	nobody → scoder (scoder)
importance:	Undecided → Low
milestone:	none → 5.3
status:	New → Fix Committed

Remote bug watches

auto-github-mdn-content #33894 [closed Content:WebAPI]