HTMLParser has useless option "strip_cdata"

Bug #2067707 reported by James Hewitt
This bug affects 1 person
| Affects | Status | Importance | Assigned to | Milestone |
|---------|--------|------------|-------------|-----------|
| lxml | Fix Committed | Low | scoder | 5.3 |

Bug Description

This may very well be PEBCAK, but when I use the HTMLParser, CDATA content is lost.

Unsurprisingly, the XMLParser converts it to text nodes as expected.

The [API doc](https://lxml.de/api/lxml.etree.HTMLParser-class.html) implies that the HTMLParser should do the same, but it doesn't.

```python
#!/usr/bin/env python

from lxml import etree

document = '<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body></html>'

root = etree.fromstring(document, etree.HTMLParser())
print(etree.tostring(root))

root = etree.fromstring(document, etree.XMLParser())
print(etree.tostring(root))
```

```sh
(venv) jammy@ibm007470:~/temp/lxml$ ./cdata.py
b'<html><head><title/></head><body/></html>'
b'<html><head><title>title</title></head><body>body</body></html>'
(venv) jammy@ibm007470:~/temp/lxml$ pip list
Package Version
------- -------
lxml 5.2.2
pip 23.3.2
```

```
Python : sys.version_info(major=3, minor=12, micro=3, releaselevel='final', serial=0)
lxml.etree : (5, 2, 2, 0)
libxml used : (2, 12, 6)
libxml compiled : (2, 12, 6)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)
```

scoder (scoder) wrote :

"CDATA" is an XML thing. It has no meaning in HTML:
https://developer.mozilla.org/en-US/docs/Web/API/CDATASection#Specifications

Try this with plain libxml2:
"""
$ echo '<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body></html>' | xmllint --html -
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
                    ^
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
                                                          ^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title></title></head>
<body></body>
</html>
"""
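
Since the HTML parser drops CDATA sections, one possible workaround is to turn them into ordinary escaped text before the markup reaches the parser. The following is only a minimal sketch using the Python standard library; the helper name `cdata_to_text` is hypothetical and not part of lxml or libxml2, and the regex approach assumes CDATA markers do not appear inside comments, `<script>`, or `<style>` content.

```python
import html
import re

# Matches one CDATA section; DOTALL lets the payload span multiple lines.
_CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_text(markup: str) -> str:
    """Replace each CDATA section with its content, escaped as HTML text.

    Hypothetical helper for illustration; not an lxml API.
    """
    return _CDATA_RE.sub(
        lambda match: html.escape(match.group(1), quote=False), markup
    )


document = (
    "<html><head><title><![CDATA[title]]></title></head>"
    "<body><![CDATA[a < b]]></body></html>"
)
cleaned = cdata_to_text(document)
print(cleaned)
```

The cleaned string could then be passed to an HTML parser in place of the original document, so the former CDATA content arrives as regular text.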

Changed in lxml:
status: New → Invalid
James Hewitt (jammy) wrote :

So a couple of things then:
- In the API, HTMLParser has a strip_cdata option that says it replaces CDATA element by text, which is set to true by default. Should this just be removed from the API?
- The MDN web docs don't say it's not supported, they say "Note: CDATA sections should not be used within HTML they are considered as comments and not displayed.". Comments are successfully retained using the HTMLParser and can be accessed in the tree, so why not CDATA?
- Another MDN web doc says it's not supported at all: https://developer.mozilla.org/en-US/docs/Web/API/Document/createCDATASection - "This will only work with XML, not HTML documents (as HTML documents do not support CDATA sections); attempting it on an HTML document will throw NOT_SUPPORTED_ERR."

I expect the right course of action is to treat them as unsupported:
- I've opened this for clarification of the MDN docs: https://github.com/mdn/content/issues/33894
- I think it makes sense to remove the strip_cdata option from the HTMLParser class. WDYT?

Changed in lxml:
status: Invalid → New
scoder (scoder) wrote :

> In the API, HTMLParser has a strip_cdata option that says it replaces CDATA element by text, which is set to true by default. Should this just be removed from the API?

If it doesn't do anything, then deprecation seems a reasonable first step. We shouldn't break working user code by removing it.

I added a DeprecationWarning in
https://github.com/lxml/lxml/commit/fc57ffefe54509e7dbc76290665f4136de15b24a
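
For readers unfamiliar with the pattern, deprecating a no-op keyword argument without breaking callers might look roughly like the sketch below. It does not reproduce the linked commit: the class `ToyHTMLParser`, the sentinel `_NOT_SET`, and the warning text are all made up for illustration.

```python
import warnings

# Sentinel meaning "the caller did not pass this argument at all".
_NOT_SET = object()


class ToyHTMLParser:
    """Hypothetical stand-in for a parser class; not lxml's HTMLParser."""

    def __init__(self, *, remove_comments=False, strip_cdata=_NOT_SET):
        self.remove_comments = remove_comments
        if strip_cdata is not _NOT_SET:
            # The option never had an effect, so only warn; do not raise.
            warnings.warn(
                "'strip_cdata' has no effect and will be removed "
                "in a future release",
                DeprecationWarning,
                stacklevel=2,
            )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ToyHTMLParser()                  # option not passed: no warning
    ToyHTMLParser(strip_cdata=True)  # option passed: warns, still constructs

print([str(w.message) for w in caught])
```

Using a sentinel rather than a `True` default means only callers who explicitly pass the option see the warning, and their existing code keeps working.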

summary: - HTMLParser loses CDATA content
+ HTMLParser has useless option "strip_cdata"
Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Low
milestone: none → 5.3
status: New → Fix Committed