lxml.html.fragment_fromstring with create_parent fails

Bug #511252 reported by Tes
Affects: lxml
Status: Fix Released
Importance: Low
Assigned to: scoder

Bug Description

When fragment_fromstring is fed some broken HTML, it may break with a ParserError: Multiple elements found.

For example:

import lxml.html

s = '<i>This wil</i></div><b>fail</b>'
lxml.html.fragment_fromstring(s, create_parent='div')

I expected this to work, and just drop the </div>.
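To make the expected behaviour concrete, here is a minimal sketch of "drop the end tag that closes nothing". It uses only the standard library's html.parser and is not how lxml or libxml2 recover from errors. The names StrayEndTagDropper and drop_stray_end_tags are hypothetical, and void elements such as <br> are not special-cased.

```python
from html import escape
from html.parser import HTMLParser


class StrayEndTagDropper(HTMLParser):
    """Re-serialise markup, skipping end tags that close nothing."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # pieces of the cleaned markup

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)
        # Reuse the start tag exactly as it appeared in the input.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray end tag, such as the lone </div>: drop it
        # Close everything up to and including the matching open tag.
        while self.open_tags:
            name = self.open_tags.pop()
            self.out.append('</%s>' % name)
            if name == tag:
                break

    def handle_data(self, data):
        # Character references arrive decoded, so escape them again.
        self.out.append(escape(data, quote=False))


def drop_stray_end_tags(markup):
    parser = StrayEndTagDropper()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return ''.join(parser.out)


cleaned = drop_stray_end_tags('<i>This wil</i></div><b>fail</b>')
print(cleaned)
```

With the stray </div> gone, the remaining markup is two sibling elements, which is exactly the case create_parent is meant to wrap.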

I propose the following (not tested):

in lxml/html/html5parser.py:

def fragment_fromstring(html, create_parent=False,
                        guess_charset=False, parser=None):
    """Parses a single HTML element; it is an error if there is more than
    one element, or if anything but whitespace precedes or follows the
    element.

    If create_parent is true (or is a tag name) then a parent node
    will be created to encapsulate the HTML in a single element.
    """
    if not isinstance(html, _strings):
        raise TypeError('string required')

    children = fragments_fromstring(html, True, guess_charset, parser)
    if not children:
        raise etree.ParserError('No elements found')
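    if create_parent:
        # create_parent may be True or a tag name; use 'div' when it is True.
        tag = create_parent if isinstance(create_parent, _strings) else 'div'
        container = etree.Element(tag)
        for element in children:
            if isinstance(element, _strings):
                if len(container) == 0:
                    container.text = (container.text or '') + element
                else:
                    # tail is None by default, so guard before concatenating.
                    container[-1].tail = (container[-1].tail or '') + element
            else:
                container.append(element)
        # Return the wrapper itself, not its first child.
        return container
    if len(children) > 1:
        raise etree.ParserError('Multiple elements found')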

    result = children[0]
    if result.tail and result.tail.strip():
        raise etree.ParserError('Element followed by text: %r' % result.tail)
    result.tail = None
    return result
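
The text and tail bookkeeping in the proposal can be tried in isolation. The following is a minimal sketch using the standard library's xml.etree.ElementTree, whose text/tail model lxml.etree follows. The helper name wrap_children is hypothetical and is not part of lxml.

```python
import xml.etree.ElementTree as ET


def wrap_children(children, tag='div'):
    """Wrap a mixed list of strings and elements in one parent element."""
    container = ET.Element(tag)
    for child in children:
        if isinstance(child, str):
            if len(container) == 0:
                # No element appended yet: the text belongs to the parent.
                container.text = (container.text or '') + child
            else:
                # Text after an element is stored as that element's tail.
                # tail is None by default, so guard before concatenating.
                last = container[-1]
                last.tail = (last.tail or '') + child
        else:
            container.append(child)
    return container


i = ET.Element('i')
i.text = 'This wil'
b = ET.Element('b')
b.text = 'fail'

wrapped = wrap_children(['lead ', i, ' and ', b])
print(ET.tostring(wrapped, encoding='unicode'))
```

The `or ''` guard matters because a freshly created element has tail set to None, so a plain `+=` on it would raise a TypeError.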

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Low
milestone: none → 2.3
status: New → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 2.3alpha1.

Changed in lxml:
status: Fix Committed → Fix Released