HTMLParser handling of <![CDATA[...]]> changed w/ libxml2 2.9.11+

Bug #1930224 reported by Michał Górny
This bug affects 2 people
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

Python : sys.version_info(major=3, minor=9, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 12)
libxml compiled : (2, 9, 12)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

It seems that the handling of <![CDATA[...]]> inside HTMLParser changes when lxml is built against libxml2 2.9.11 or later. I'm currently trying to figure out whether this is a regression/behavior change in libxml2 itself or a bug in lxml. However, I wasn't able to reproduce it easily using the C API, and the Cython code in lxml is above my pay grade.

I'm attaching a trivial reproducer using lxml.etree.HTMLParser. The reproducer parses the following string:

  b"<html><body><![CDATA[test]]></body></html>"

With older libxml2, the result is:

  start html
  start body
  end body
  end html

(i.e. CDATA is ignored). With newer libxml2, the result is:

  start html
  start body
  data <
  data ![CDATA[test]]>
  end body
  end html

(i.e. CDATA is reported raw as data() method calls)
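
The attachment itself is not reproduced on this page. A minimal sketch of such a reproducer might look like the following. It assumes the document is pushed through the parser's feed() interface with a custom target object; the class name EventLogger is hypothetical and not taken from the attachment. Which data events appear, if any, depends on the libxml2 version lxml is running against.

```python
from lxml import etree


class EventLogger:
    """Hypothetical parser target that records every callback it receives."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, data):
        self.events.append(("data", data))

    def close(self):
        # Whatever close() returns is handed back by parser.close().
        return self.events


parser = etree.HTMLParser(target=EventLogger())
parser.feed(b"<html><body><![CDATA[test]]></body></html>")
events = parser.close()

# Show which libxml2 is in use, then one line per recorded callback.
print("libxml used :", etree.LIBXML_VERSION)
for kind, value in events:
    print(kind, value)
```

Running this against two lxml builds linked to different libxml2 versions should make the difference in the data lines visible side by side.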

This breaks the assumptions made by beautifulsoup4 and soupsieve. I've reported the problem there previously to get some pointers:

  https://bugs.launchpad.net/beautifulsoup/+bug/1930164
  https://github.com/facelessuser/soupsieve/issues/220

I've also bisected libxml2 and found out that the following commit causes the behavior change:

commit 173a0830dcec769a5f12c5c55ef4ab424b388efb
Author: Nick Wellnhofer <email address hidden>
Date: 2020-07-22 23:15:35 +0200

    Fix quadratic runtime when push parsing HTML start tags

    Make sure that htmlParseStartTag doesn't terminate on characters for
    which IS_CHAR_CH is false like control chars.

    In htmlParseTryOrFinish, only switch to START_TAG if the next character
    starts a valid name. Otherwise, htmlParseStartTag might return without
    consuming all characters up to the final '>'.

    Found by OSS-Fuzz.

I can also file a bug against libxml2, but I'm going to need help getting a trivial reproducer there. I've tried using htmlSAXParseDoc(), but I can't reproduce the new behavior with it: the CDATA section is not reported at all, neither via the cDataBlock callback nor via the characters callback.

Leonard Richardson (leonardr) wrote :

As the author of Beautiful Soup let me say that I would probably prefer the new behavior. I haven't been able to get CDATA sections from lxml the way I have been from html.parser and html5lib.

I've been using the strip_cdata=False argument mentioned here:
https://lxml.de/api.html#cdata

But in the context in which I'm using it, it's never worked:
https://bugs.launchpad.net/beautifulsoup/+bug/1275085

I say I'd _probably_ prefer the new behavior because the way in which the CDATA section is being sent over -- as chunked data blocks -- means I don't think I can recognize it as CDATA and create a special CData object on my side. But I'd definitely rather have the data than not.
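
For illustration only, and not something proposed in this thread: one conceivable way to recognize the chunked form is to merge adjacent data events and then look for a complete <![CDATA[...]]> wrapper. The sketch below works on plain (kind, value) tuples; the function names coalesce and mark_cdata and the "cdata" event kind are hypothetical. It cannot distinguish a real CDATA section from escaped literal text that decodes to the same characters, so it is a heuristic at best.

```python
import re

# Matches a data string that consists of exactly one CDATA wrapper.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def coalesce(events):
    """Merge runs of adjacent ("data", text) events into a single event."""
    merged = []
    for kind, value in events:
        if kind == "data" and merged and merged[-1][0] == "data":
            merged[-1] = ("data", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged


def mark_cdata(events):
    """Relabel merged data events that are wholly a CDATA wrapper."""
    out = []
    for kind, value in coalesce(events):
        match = CDATA_RE.fullmatch(value) if kind == "data" else None
        if match:
            out.append(("cdata", match.group(1)))
        else:
            out.append((kind, value))
    return out


# The chunked events reported for the newer libxml2 in the bug description.
chunked = [
    ("start", "body"),
    ("data", "<"),
    ("data", "![CDATA[test]]>"),
    ("end", "body"),
]
print(mark_cdata(chunked))
```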

Stefano Rivera (stefanor) wrote :

Finished your minimal reproducer and filed https://gitlab.gnome.org/GNOME/libxml2/-/issues/312 with upstream libxml2.
