BS4 drops '?' from processing instructions

Bug #1504383 reported by Andrew Mercer
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
Beautiful Soup  Fix Released  Undecided   Unassigned

Bug Description

Running the following script:
    import bs4
    xml = '<?xml version="1.0" encoding="utf-8"?><p>Test xml with PI <?dtall break="line"?>a..z<?dtall break="line"?></p>'
    soup = bs4.BeautifulSoup(xml, 'xml')
    str(soup)

Produces the following output:
    '<?xml version="1.0" encoding="utf-8"?>\n<p>Test xml with PI <?dtall break="line">a..z<?dtall break="line"></p>'

The closing ? has been stripped from the <?dtall ... ?> processing instructions.
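A quick way to spot the symptom in serialized output, sketched with the standard library only. The helper name `unterminated_pis` is hypothetical, and the pattern assumes a processing instruction contains no `>` before its end, which holds for the sample above but not for XML in general.

```python
import re

# Assumption: a PI is everything from "<?" up to the next ">".
# Real XML allows ">" inside a PI, so this is only a rough check.
PI_PATTERN = re.compile(r"<\?[^>]*>")


def unterminated_pis(markup):
    """Return the PIs in `markup` that do not end with the XML terminator '?>'."""
    return [pi for pi in PI_PATTERN.findall(markup) if not pi.endswith("?>")]


# The serialized output quoted in this report.
output = ('<?xml version="1.0" encoding="utf-8"?>\n'
          '<p>Test xml with PI <?dtall break="line">a..z<?dtall break="line"></p>')

print(unterminated_pis(output))
```

Run against the original input string, the same helper should return an empty list, since every processing instruction there ends with `?>`.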

OS: Windows 7
Python 3.4.3
Parser lxml 3.4.4

Andrew Mercer (akmercer) wrote :

I believe the following needs to be changed in element.py:

class ProcessingInstruction(PreformattedString):

    PREFIX = '<?'
    SUFFIX = '>'

should be

class ProcessingInstruction(PreformattedString):

    PREFIX = '<?'
    SUFFIX = '?>'

Leonard Richardson (leonardr) wrote :

SGML (HTML) processing instructions look like this: <?foo>
XML processing instructions look like this: <?foo?>

Beautiful Soup handles SGML processing instructions correctly but not XML processing instructions. Revision 399 makes sure that when a document is parsed as XML the processing instructions are treated as XML processing instructions.
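The distinction can be illustrated with a minimal, self-contained sketch. These classes are simplified stand-ins, not the Beautiful Soup source: the report does not show what revision 399 contains, so the subclass name `XMLProcessingInstruction`, the `output_ready` method as written here, and the `make_pi` helper are assumptions made for illustration.

```python
class PreformattedString(str):
    """Simplified stand-in: a string serialized with a fixed prefix and suffix."""

    PREFIX = ""
    SUFFIX = ""

    def output_ready(self):
        return self.PREFIX + str(self) + self.SUFFIX


class ProcessingInstruction(PreformattedString):
    """SGML (HTML) style: <?foo>"""

    PREFIX = "<?"
    SUFFIX = ">"


class XMLProcessingInstruction(ProcessingInstruction):
    """XML style: <?foo?> (illustrative name, see the note above)."""

    PREFIX = "<?"
    SUFFIX = "?>"


def make_pi(content, is_xml):
    """Hypothetical helper: pick the PI class based on how the document was parsed."""
    cls = XMLProcessingInstruction if is_xml else ProcessingInstruction
    return cls(content)


print(make_pi('dtall break="line"', is_xml=True).output_ready())
print(make_pi('dtall break="line"', is_xml=False).output_ready())
```

Keeping the SGML class unchanged and adding a separate XML variant, rather than changing SUFFIX globally as first proposed, leaves HTML output as it was while XML documents get the `?>` terminator.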

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released