XML parsing is not converting CDATA sections to bs4.element.CData

Bug #1275085 reported by Nate Thelen
Affects: Beautiful Soup
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Versions:
Python: 2.7.5
BeautifulSoup: 4.3.2
lxml: 3.2.5

> xml = '<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'
> soup = BeautifulSoup(xml, ['lxml', 'xml'])
> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<root>http://site.com/?a=b&amp;c=d</root>'
> type(soup.root.string)
bs4.element.NavigableString
> xml == str(soup)
False
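
A note on why the comparison above is False: the two strings differ character for character, but they encode the same text node. A minimal sketch using only the standard library's xml.etree.ElementTree (not the parser used in this report, just a neutral way to check) parses both forms and compares the resulting text:

```python
import xml.etree.ElementTree as ET

# The document from the report (CDATA form) and the re-serialized output
# (entity-escaped form), written as bytes so the encoding declaration applies.
cdata_form = b'<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'
escaped_form = b'<?xml version="1.0" encoding="utf-8"?>\n<root>http://site.com/?a=b&amp;c=d</root>'

# Parse each form and pull out the text content of <root>.
cdata_text = ET.fromstring(cdata_form).text
escaped_text = ET.fromstring(escaped_form).text

print(cdata_form == escaped_form)  # compares the serialized bytes
print(cdata_text == escaped_text)  # compares the parsed text content
```

So the information survives the round trip; what is lost is only the CDATA markup itself, which matters when a consumer insists on that exact serialization.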

If I force the string back to a CData (imported from bs4):

> soup.root.string = CData(soup.root.string)

everything is now correct:

> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'
> type(soup.root.string)
bs4.element.CData
> xml == str(soup)
True
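
The manual fix above can be generalized into a small helper. This is only a sketch: `restore_cdata` is a hypothetical name, not part of Beautiful Soup, and it assumes you already know which tags are supposed to hold CDATA. The demo uses the built-in 'html.parser' purely so it runs without lxml; with a real XML document you would build the soup as in the report.

```python
from bs4 import BeautifulSoup, CData


def restore_cdata(soup, tag_names):
    """Re-wrap the single string child of each named tag as CData.

    Hypothetical helper: tags with no single string child are skipped.
    """
    for tag in soup.find_all(tag_names):
        text = tag.string
        if text is not None and not isinstance(text, CData):
            text.replace_with(CData(text))
    return soup


# Start from the escaped form that the parser produced in the report.
soup = BeautifulSoup('<root>http://site.com/?a=b&amp;c=d</root>', 'html.parser')
restore_cdata(soup, ['root'])
result = str(soup)
print(result)
```

Because CData strings are written out without entity substitution, the ampersand is emitted unescaped inside the CDATA markers.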

I have also tried it without lxml by using:

> soup = BeautifulSoup(xml, 'xml')

with the same results.

Tags: cdata lxml xml
Leonard Richardson (leonardr) wrote:

Unfortunately, this is what lxml does. (When you specify 'xml' as the parser, Beautiful Soup looks for the best available XML parser, which is lxml.)

http://lxml.de/api.html#cdata

"By default, lxml's parser will strip CDATA sections from the tree and replace them by their plain text content. As real applications for CDATA are rare, this is the best way to deal with this issue.

However, in some cases, keeping CDATA sections or creating them in a document is required to adhere to existing XML language definitions. For these special cases, you can instruct the parser to leave CDATA sections in the document:"

Beautiful Soup uses the strip_cdata=False argument mentioned in that page, but I've never seen it actually work.
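
For reference, here is a minimal sketch of the option described on that lxml page, used with lxml directly rather than through Beautiful Soup. It assumes lxml is installed. It only shows what lxml's own tree and serializer do with strip_cdata=False; it says nothing about what Beautiful Soup's tree builder receives from the parser.

```python
from lxml import etree

# Bytes input, so the encoding declaration is honored.
xml = b'<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'

# Default parser: the CDATA section is replaced by plain text.
default_root = etree.fromstring(xml)
default_out = etree.tostring(default_root)

# Parser told to keep CDATA sections in the tree.
keep_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(xml, keep_parser)
kept_out = etree.tostring(kept_root)

print(default_out)
print(kept_out)
print(kept_root.text == default_root.text)  # text content is the same either way
```

If exact CDATA round-tripping is the requirement, working with lxml's tree directly may be one way around the limitation, at the cost of giving up the Beautiful Soup API for that document.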

Changed in beautifulsoup:
status: New → Invalid