XML parsing is not converting CDATA fields to bs4.element.CData
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
Versions:
Python: 2.7.5
BeautifulSoup: 4.3.2
lxml: 3.2.5
> xml = '<?xml version="1.0" encoding=
> soup = BeautifulSoup(xml, ['lxml', 'xml'])
> str(soup)
'<?xml version="1.0" encoding=
> type(soup.
bs4.element.
> xml == str(soup)
False
If I force the string back to a CData:
> soup.root.string = CData(soup.
Everything is now correct:
> str(soup)
'<?xml version="1.0" encoding=
> type(soup.
bs4.element.CData
> xml == str(soup)
True
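A minimal, self-contained sketch of this workaround is shown here for reference. The sample document is hypothetical (it is not the XML from this report), and the sketch assumes Beautiful Soup 4 and lxml are installed:

```python
from bs4 import BeautifulSoup
from bs4.element import CData

# Hypothetical sample document, used only for illustration.
xml = "<root><![CDATA[a < b]]></root>"

# The 'xml' feature selects an XML parser (lxml, when it is installed).
soup = BeautifulSoup(xml, "xml")

# Wrap the existing text in CData so it is written back out
# as a CDATA section instead of escaped text.
soup.root.string = CData(soup.root.string)

print(type(soup.root.string))
print(str(soup))
```

Note that this only restores the CDATA wrapper on an element you already know contained one; the parsed tree itself does not record which strings came from CDATA sections.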
I have also tried it without lxml by using:
> soup = BeautifulSoup(xml, 'xml')
with the same results.
Unfortunately, this is what lxml does. (When you specify 'xml' as the parser, Beautiful Soup looks for the best available XML parser, which is lxml.)
http://lxml.de/api.html#cdata
"By default, lxml's parser will strip CDATA sections from the tree and replace them by their plain text content. As real applications for CDATA are rare, this is the best way to deal with this issue.
However, in some cases, keeping CDATA sections or creating them in a document is required to adhere to existing XML language definitions. For these special cases, you can instruct the parser to leave CDATA sections in the document:"
Beautiful Soup uses the strip_cdata=False argument mentioned in that page, but I've never seen it actually work.
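For reference, here is a minimal sketch of the behavior described in the quoted lxml documentation, using lxml directly rather than through Beautiful Soup. The sample document is hypothetical, and the sketch assumes lxml is installed; it says nothing about how Beautiful Soup passes the option along.

```python
from lxml import etree

# Hypothetical sample document, used only for illustration.
xml = b"<root><![CDATA[a < b]]></root>"

# Default parser: the CDATA section is replaced by its plain text content.
default_root = etree.fromstring(xml)
default_out = etree.tostring(default_root)

# Parser instructed to leave CDATA sections in the tree.
keep_parser = etree.XMLParser(strip_cdata=False)
kept_root = etree.fromstring(xml, keep_parser)
kept_out = etree.tostring(kept_root)

print(default_out)
print(kept_out)
```

In both cases `root.text` is the plain string; the difference only shows up when the tree is serialized again.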