Beautiful Soup

lxml backend built against libxml2 2.9.11+ does not strip CDATA anymore

Bug #1930164 reported by Michał Górny on 2021-05-30

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Triaged  Low         Unassigned

Bug Description

When lxml is built against libxml2 2.9.11+, the parser behavior appears to change, which makes bs4's output with the lxml backend inconsistent with other parsers. I'm not sure whether this should be considered a feature or a bug.

For example:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<body><![CDATA[that]]></body>", "lxml"))'
<html><body><![CDATA[that]]></body></html>

while the old libxml2 version caused CDATA to be stripped:

$ python -c 'import bs4; print(bs4.BeautifulSoup("<body><![CDATA[that]]></body>", "lxml"))'
<html><body></body></html>

This causes soupsieve's tests to fail, see: https://github.com/facelessuser/soupsieve/issues/220. I am not sure whether this is something that can or should be fixed in bs4, lxml or libxml2 itself. The parser is a bit beyond my comprehension, so I figured I'd ask here first.

The little debugging I did suggests that previously the CDATA section was not reported by the parser at all, while now it is reported as two calls to the data method: first with the content '<', and then with '![CDATA[that]]>'.
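For anyone who wants to repeat that debugging, here is a minimal sketch of a recording parser target. It assumes lxml is installed; EventRecorder and record_events are hypothetical names made up for this illustration, not part of bs4 or lxml. The markup goes through feed() and close(), since the libxml2 commit in question concerns push parsing. No expected output is given because it depends on the libxml2 version lxml is linked against.

```python
class EventRecorder:
    """Hypothetical lxml parser target that records each callback it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # On affected builds the CDATA section reportedly arrives here as text.
        self.events.append(("data", text))

    def comment(self, text):
        self.events.append(("comment", text))

    def close(self):
        return self.events


def record_events(markup):
    """Push-parse markup with lxml's HTML parser and return the recorded events."""
    from lxml import etree  # imported lazily so the class is usable without lxml

    recorder = EventRecorder()
    parser = etree.HTMLParser(target=recorder, recover=True)
    parser.feed(markup)
    parser.close()
    return recorder.events


if __name__ == "__main__":
    try:
        for event in record_events("<body><![CDATA[that]]></body>"):
            print(event)
    except ImportError:
        print("lxml is not installed")
```

If the description above holds, an affected build should print two consecutive data events for the CDATA section, while an older build should print none.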

The relevant libxml2 commit is:

commit 173a0830dcec769a5f12c5c55ef4ab424b388efb
Author: Nick Wellnhofer <email address hidden>
Date: 2020-07-22 23:15:35 +0200

    Fix quadratic runtime when push parsing HTML start tags

    Make sure that htmlParseStartTag doesn't terminate on characters for
    which IS_CHAR_CH is false like control chars.

    In htmlParseTryOrFinish, only switch to START_TAG if the next character
    starts a valid name. Otherwise, htmlParseStartTag might return without
    consuming all characters up to the final '>'.

    Found by OSS-Fuzz.

Note that in order to reproduce this you need to build lxml from source, as binary wheels are statically linked to libxml2 2.9.10.
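A quick way to check which libxml2 a given lxml install is using is sketched below. It assumes lxml is installed; has_cdata_change is a hypothetical helper written for this illustration, and its 2.9.11 threshold is taken from the title of this report, not verified against later libxml2 releases.

```python
# Threshold taken from this bug report's title; treat it as an assumption.
CDATA_CHANGE_VERSION = (2, 9, 11)


def has_cdata_change(libxml_version):
    """Return True if the version tuple is at or past the reported threshold."""
    return tuple(libxml_version)[:3] >= CDATA_CHANGE_VERSION


if __name__ == "__main__":
    try:
        from lxml import etree
    except ImportError:
        print("lxml is not installed")
    else:
        # LIBXML_VERSION is the library loaded at runtime;
        # LIBXML_COMPILED_VERSION is the one lxml was compiled against.
        print("runtime libxml2: ", etree.LIBXML_VERSION)
        print("compiled against:", etree.LIBXML_COMPILED_VERSION)
        print("CDATA change expected:", has_cdata_change(etree.LIBXML_VERSION))
```

With a binary wheel linked to 2.9.10, the last line should report False, which matches the note above about needing a source build to reproduce.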

Leonard Richardson (leonardr) wrote on 2021-05-30:

This could be good news. BS4 has been passing strip_cdata=False into lxml for a very long time (see https://bugs.launchpad.net/beautifulsoup/+bug/1275085) but CData blocks were always stripped anyway.

However the way the data is passed from lxml to Beautiful Soup might make it impossible to recognize the CDATA _as_ a CDATA block rather than regular markup.

Can you try replacing strip_cdata=False with strip_cdata=True in bs4/builder/_lxml.py and see if that makes a difference?

Michał Górny (mgorny) wrote on 2021-05-31:

> Can you try replacing strip_cdata=False with strip_cdata=True in bs4/builder/_lxml.py and see if that makes a difference?

It doesn't make any difference.

Michał Górny (mgorny) wrote on 2021-05-31:

Reported to lxml: https://bugs.launchpad.net/lxml/+bug/1930224

I've experimented a bit with the libxml2 C API and at least using htmlSAXParseDoc() I seem to get the old behavior, so maybe they're switching some magic that triggers this.

Leonard Richardson (leonardr) on 2021-10-25

Changed in beautifulsoup:
status:	New → Triaged
importance:	Undecided → Low
