lxml backend built against libxml2 2.9.11+ does not strip CDATA anymore
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Triaged | Low | Unassigned |
Bug Description
When lxml is built against libxml2 2.9.11+, the parser behavior changes, which makes bs4 output inconsistent with other parsers. I'm not sure whether this should be considered a feature or a bug.
For example:
$ python -c 'import bs4; print(bs4.
<html><
while the old libxml2 version caused CDATA to be stripped:
$ python -c 'import bs4; print(bs4.
<html><
This causes soupsieve's tests to fail, see: https:/
The little debugging I did suggests that previously the parser did not report CDATA at all, while now it reports it as two data() method calls: the first with content of '<', and the second with '![CDATA[that]]>'.
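For anyone who wants to repeat that observation, here is a minimal sketch of a recording parser target. The class name DataRecorder, the helper record_data_calls and the sample markup are made up for illustration and are not part of bs4. It assumes lxml is installed and uses the feed interface, since the libxml2 change concerns push parsing.

```python
class DataRecorder:
    """Hypothetical parser target that records every data() call it receives."""

    def __init__(self):
        self.chunks = []

    def start(self, tag, attrib):
        pass  # element starts are not interesting here

    def end(self, tag):
        pass

    def data(self, text):
        # Each call from the parser is stored separately, so that split
        # reports (e.g. '<' followed by '![CDATA[that]]>') stay visible.
        self.chunks.append(text)

    def comment(self, text):
        pass

    def close(self):
        return self.chunks


def record_data_calls(markup):
    """Feed markup through lxml's HTML parser and return the recorded chunks.

    Returns None when lxml is not installed.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    recorder = DataRecorder()
    parser = etree.HTMLParser(target=recorder)
    parser.feed(markup)
    parser.close()
    return recorder.chunks


if __name__ == "__main__":
    # Sample markup is an assumption, not the exact input from the report.
    print(record_data_calls("<p><![CDATA[that]]></p>"))
```

The printed list depends on which libxml2 version lxml was built against, so comparing the output of two builds should show whether the CDATA section reaches the target at all.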
The relevant libxml2 commit is:
commit 173a0830dcec769
Author: Nick Wellnhofer <email address hidden>
Date: 2020-07-22 23:15:35 +0200
Fix quadratic runtime when push parsing HTML start tags
Make sure that htmlParseStartTag doesn't terminate on characters for
which IS_CHAR_CH is false like control chars.
In htmlParseTryOrF
starts a valid name. Otherwise, htmlParseStartTag might return without
consuming all characters up to the final '>'.
Found by OSS-Fuzz.
Note that in order to reproduce this you need to build lxml from source, as binary wheels are statically linked to libxml2 2.9.10.
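To check which libxml2 a given lxml build uses, something like the following sketch may help. The helper names in_reported_range and describe_lxml are hypothetical. The 2.9.11 threshold is simply the version named in this report, not a verified boundary.

```python
def in_reported_range(libxml2_version):
    """Return True if the version tuple is 2.9.11 or later (range from this report)."""
    return tuple(libxml2_version) >= (2, 9, 11)


def describe_lxml():
    """Report the libxml2 versions lxml sees, or None if lxml is not installed."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        # Version of libxml2 loaded at runtime.
        "runtime": etree.LIBXML_VERSION,
        # Version of libxml2 that lxml was compiled against.
        "compiled": etree.LIBXML_COMPILED_VERSION,
        "in_reported_range": in_reported_range(etree.LIBXML_VERSION),
    }


if __name__ == "__main__":
    print(describe_lxml())
```

With a binary wheel linked to 2.9.10, in_reported_range would be False, which matches the note that a source build is needed to reproduce the problem.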
Changed in beautifulsoup:
status: New → Triaged
importance: Undecided → Low
This could be good news. BS4 has been passing strip_cdata=False into lxml for a very long time (see https://bugs.launchpad.net/beautifulsoup/+bug/1275085) but CDATA blocks were always stripped anyway.
However, the way the data is passed from lxml to Beautiful Soup might make it impossible to recognize the CDATA _as_ a CDATA block rather than regular markup.
Can you try replacing strip_cdata=False with strip_cdata=True in bs4/builder/_lxml.py and see if that makes a difference?