Beautiful Soup

XML parsing is not converting CDATA fields to bs4.element.CData

Bug #1275085 reported by Nate Thelen on 2014-01-31


Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

Versions:
Python: 2.7.5
BeautifulSoup: 4.3.2
lxml: 3.2.5

> from bs4 import BeautifulSoup, CData
> xml = '<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'
> soup = BeautifulSoup(xml, ['lxml', 'xml'])
> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<root>http://site.com/?a=b&c=d</root>'
> type(soup.root.string)
bs4.element.NavigableString
> xml == str(soup)
False

If I force the string back to a CData:

> soup.root.string = CData(soup.root.string)

Everything is now correct:

> str(soup)
'<?xml version="1.0" encoding="utf-8"?>\n<root><![CDATA[http://site.com/?a=b&c=d]]></root>'
> type(soup.root.string)
bs4.element.CData
> xml == str(soup)
True

I have also tried it without lxml by using:

> soup = BeautifulSoup(xml, 'xml')

with the same results.

Leonard Richardson (leonardr) wrote on 2015-06-26:

Unfortunately this is what lxml does. (When you specify 'xml' as the parser, Beautiful Soup looks for the best available XML parser, which is lxml.)

http://lxml.de/api.html#cdata

"By default, lxml's parser will strip CDATA sections from the tree and replace them by their plain text content. As real applications for CDATA are rare, this is the best way to deal with this issue.

However, in some cases, keeping CDATA sections or creating them in a document is required to adhere to existing XML language definitions. For these special cases, you can instruct the parser to leave CDATA sections in the document:"

Beautiful Soup uses the strip_cdata=False argument mentioned on that page, but I've never seen it actually work.
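
For reference, here is a minimal sketch of the strip_cdata option used with lxml directly, outside Beautiful Soup. It is written for Python 3 rather than the 2.7.5 in the report, and it only shows what lxml's own serializer emits. It does not exercise Beautiful Soup's tree builder. The helper name serialize, its keep_cdata flag and the SAMPLE constant are made up for illustration. The import guard is there only so the snippet loads where lxml is not installed.

```python
# Sketch only: lxml used directly, outside Beautiful Soup (Python 3 syntax).
try:
    from lxml import etree  # third-party package
except ImportError:
    etree = None  # lets the snippet load even where lxml is missing


def serialize(xml_bytes, keep_cdata):
    """Parse xml_bytes with lxml and serialize the root element again.

    `serialize` and `keep_cdata` are illustrative names, not lxml API.
    """
    if etree is None:
        raise RuntimeError("lxml is not installed")
    # strip_cdata=False asks lxml to leave CDATA sections in the tree;
    # the default (True) replaces them with their plain text content.
    parser = etree.XMLParser(strip_cdata=not keep_cdata)
    root = etree.fromstring(xml_bytes, parser)
    return etree.tostring(root)


# Same payload as in the report, without the XML declaration.
SAMPLE = b"<root><![CDATA[http://site.com/?a=b&c=d]]></root>"

if etree is not None:
    print(serialize(SAMPLE, keep_cdata=False))  # CDATA replaced by escaped text
    print(serialize(SAMPLE, keep_cdata=True))   # CDATA section kept
```

Whether the CDATA marker survives the route Beautiful Soup takes through lxml is a separate question that this sketch does not answer.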

Changed in beautifulsoup:
status:	New → Invalid
