lxml backend built against libxml2 2.9.11+ does not strip CDATA anymore

Bug #1930164 reported by Michał Górny
This bug affects 1 person
Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Triaged  Low         Unassigned

Bug Description

When lxml is built against libxml2 2.9.11 or later, the parser's behavior appears to change, making bs4 output inconsistent with other parsers. I'm not sure whether this should be considered a feature or a bug.

For example:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<body><![CDATA[that]]></body>", "lxml"))'
<html><body>&lt;![CDATA[that]]&gt;</body></html>

whereas with older libxml2 versions the same command stripped the CDATA section:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<body><![CDATA[that]]></body>", "lxml"))'
<html><body></body></html>

This causes soupsieve's tests to fail; see https://github.com/facelessuser/soupsieve/issues/220. I am not sure whether this is something that can or should be fixed in bs4, lxml, or libxml2 itself. The parser is a bit beyond my comprehension, so I figured I'd ask here first.
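To see the inconsistency between parsers side by side, something like the following could be used. It is only a sketch: it assumes both bs4 and lxml are installed, render_with is a hypothetical helper name, and the "lxml" output will differ depending on the libxml2 version lxml was built against.

```python
import bs4

# The same markup used in the report above.
MARKUP = "<body><![CDATA[that]]></body>"


def render_with(parsers=("html.parser", "lxml")):
    """Parse MARKUP with each named bs4 parser and return the serialized results."""
    return {name: str(bs4.BeautifulSoup(MARKUP, name)) for name in parsers}


results = render_with()
for name, output in results.items():
    # The lxml line varies with the libxml2 version in use.
    print(f"{name}: {output}")
```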

The little debugging I did suggests that the parser previously did not report the CDATA section at all, whereas now it reports it through two calls to the data method: first with the content '<', and then with '![CDATA[that]]>'.
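The data calls can be observed directly with a small recording target. This is a minimal sketch, assuming lxml's parser target interface driven through feed() and close() (which, as far as I understand, is how bs4's lxml builder drives the parser); RecordingTarget and record_events are hypothetical names, and which data events show up depends on the libxml2 version.

```python
from lxml import etree


class RecordingTarget:
    """Hypothetical parser target that logs every callback lxml makes."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        self.events.append(("data", text))

    def comment(self, text):
        self.events.append(("comment", text))

    def close(self):
        return self.events


def record_events(markup):
    """Feed markup to an HTML parser and return the recorded callbacks."""
    target = RecordingTarget()
    parser = etree.HTMLParser(target=target)
    parser.feed(markup)
    parser.close()
    return target.events


events = record_events("<body><![CDATA[that]]></body>")
for event in events:
    print(event)
```

On an affected build one would expect ("data", ...) entries carrying the CDATA text; on an older build, none.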

The relevant libxml2 commit is:

commit 173a0830dcec769a5f12c5c55ef4ab424b388efb
Author: Nick Wellnhofer <email address hidden>
Date: 2020-07-22 23:15:35 +0200

    Fix quadratic runtime when push parsing HTML start tags

    Make sure that htmlParseStartTag doesn't terminate on characters for
    which IS_CHAR_CH is false like control chars.

    In htmlParseTryOrFinish, only switch to START_TAG if the next character
    starts a valid name. Otherwise, htmlParseStartTag might return without
    consuming all characters up to the final '>'.

    Found by OSS-Fuzz.

Note that in order to reproduce this you need to build lxml from source, as binary wheels are statically linked to libxml2 2.9.10.
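To check which libxml2 a given lxml build actually uses, the version tuples lxml exposes can be inspected. A minimal sketch follows; the helper names are hypothetical, and the 2.9.11 threshold is simply the version named in this report.

```python
from lxml import etree


def libxml2_versions():
    """Return the libxml2 versions lxml runs with and was compiled against."""
    return {
        "runtime": etree.LIBXML_VERSION,
        "compiled": etree.LIBXML_COMPILED_VERSION,
    }


def may_show_new_behavior(version=None):
    """True if the libxml2 version is 2.9.11 or later, per this report."""
    if version is None:
        version = etree.LIBXML_VERSION
    return tuple(version) >= (2, 9, 11)


print(libxml2_versions())
print("2.9.11 or later:", may_show_new_behavior())
```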

Leonard Richardson (leonardr) wrote :

This could be good news. BS4 has been passing strip_cdata=False into lxml for a very long time (see https://bugs.launchpad.net/beautifulsoup/+bug/1275085), but CDATA blocks were always stripped anyway.

However, the way the data is passed from lxml to Beautiful Soup might make it impossible to recognize the CDATA _as_ a CDATA block rather than regular markup.

Can you try replacing strip_cdata=False with strip_cdata=True in bs4/builder/_lxml.py and see if that makes a difference?

Michał Górny (mgorny) wrote :

> Can you try replacing strip_cdata=False with strip_cdata=True in bs4/builder/_lxml.py and see if that makes a difference?

It doesn't make any difference.

Michał Górny (mgorny) wrote :

Reported to lxml: https://bugs.launchpad.net/lxml/+bug/1930224

I've experimented a bit with the libxml2 C API, and at least with htmlSAXParseDoc() I seem to get the old behavior, so perhaps lxml is enabling something that triggers the new behavior.

Changed in beautifulsoup:
status: New → Triaged
importance: Undecided → Low