HTMLParser has useless option "strip_cdata"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Fix Committed | Low | scoder |
Bug Description
This may very well be PEBCAK, but when I use the HTMLParser, CDATA content is lost.
Unsurprisingly, the XMLParser converts it to text nodes as expected.
The [API doc](https:/
```python
#!/usr/bin/env python
from lxml import etree
document = '<html>
root = etree.fromstrin
print(etree.
root = etree.fromstrin
print(etree.
```
```sh
(venv) jammy@ibm007470
b'<html>
b'<html>
(venv) jammy@ibm007470
Package Version
------- -------
lxml 5.2.2
pip 23.3.2
```
```
Python : sys.version_
lxml.etree : (5, 2, 2, 0)
libxml used : (2, 12, 6)
libxml compiled : (2, 12, 6)
libxslt used : (1, 1, 39)
libxslt compiled : (1, 1, 39)
```
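For contrast, here is a minimal sketch of the XMLParser behaviour that the report describes as expected. The sample document and the `serialize` helper are hypothetical (the report's own input is truncated above); `strip_cdata` is the documented `etree.XMLParser` keyword.

```python
from lxml import etree

# Hypothetical sample input; not the document used in the report.
SAMPLE = "<root><![CDATA[a < b]]></root>"


def serialize(xml_text, strip_cdata):
    """Parse xml_text with the given strip_cdata setting and serialize it back."""
    parser = etree.XMLParser(strip_cdata=strip_cdata)
    root = etree.fromstring(xml_text, parser)
    return etree.tostring(root)


# Default behaviour: the CDATA section is merged into an ordinary text node.
converted = serialize(SAMPLE, strip_cdata=True)

# With strip_cdata=False the CDATA section survives the round trip.
preserved = serialize(SAMPLE, strip_cdata=False)

print(converted)
print(preserved)
```

In both cases the element's `.text` is the same string; only the serialized form differs.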
"CDATA" is an XML thing. It has no meaning in HTML: /developer. mozilla. org/en- US/docs/ Web/API/ CDATASection# Specifications
https:/
Try this with plain libxml2:
"""
$ echo '<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body></html>' | xmllint --html -
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
^
-:1: HTML parser error : htmlParseStartTag: invalid element name
<html><head><title><![CDATA[title]]></title></head><body><![CDATA[body]]></body>
^
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><title></title></head>
<body></body>
</html>
"""