lxml

Bug #2067707
Comment #2

James Hewitt (jammy) wrote on 2024-06-03: Re: HTMLParser loses CDATA content

So a couple of things then:
- In the API, HTMLParser has a strip_cdata option that is documented as replacing CDATA sections with text and that defaults to True. Should this just be removed from the API?
- The MDN web docs don't say it's not supported; they say "Note: CDATA sections should not be used within HTML they are considered as comments and not displayed." Comments are successfully retained by the HTMLParser and can be accessed in the tree, so why not CDATA?
- Another MDN web doc says it's not supported at all: https://developer.mozilla.org/en-US/docs/Web/API/Document/createCDATASection - "This will only work with XML, not HTML documents (as HTML documents do not support CDATA sections); attempting it on an HTML document will throw NOT_SUPPORTED_ERR."
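For reference, a minimal sketch of how the comment and CDATA handling can be compared side by side. The sample markup and the summarize helper are made up for illustration and are not taken from the original report. What happens to the CDATA section is likely to depend on the lxml and libxml2 versions in use, so no expected output is shown.

```python
from lxml import etree

# Hypothetical sample input: one comment and one CDATA section inside a <p>.
HTML = (
    "<html><body><p>before<!-- a comment -->"
    "<![CDATA[cdata text]]>after</p></body></html>"
)


def summarize(strip_cdata):
    """Parse HTML with the given strip_cdata setting.

    Returns the text of every comment node found in the tree and the
    re-serialized document, so both settings can be compared.
    """
    parser = etree.HTMLParser(strip_cdata=strip_cdata)
    root = etree.fromstring(HTML, parser)
    comments = [node.text for node in root.iter(etree.Comment)]
    serialized = etree.tostring(root, encoding="unicode")
    return comments, serialized


if __name__ == "__main__":
    for flag in (True, False):
        comments, serialized = summarize(flag)
        print(f"strip_cdata={flag}: comments={comments!r}")
        print(serialized)
```

If the two runs print the same tree, that would support the point that the option has no visible effect in HTMLParser.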

I expect the right course of action is to treat them as unsupported:
- I've opened this for clarification of the MDN docs: https://github.com/mdn/content/issues/33894
- I think it makes sense to remove the strip_cdata option from the HTMLParser class. WDYT?