lxml.html.fragment_fromstring() strips an enclosing body when present

Bug #1665936 reported by Chris Jerdonek
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Undecided    Unassigned

Bug Description

There seems to be an undocumented inconsistency in lxml.html's fragment_fromstring() when the fragment is enclosed in a body element.

Specifically, it looks like fragment_fromstring() strips any enclosing body tag, so the returned element is the same whether or not the input was wrapped in <body>.

You can observe this as follows:

    from lxml.html import fragment_fromstring

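    def parse(html):
        # Parse a single fragment and print its root element and children.
        element = fragment_fromstring(html, create_parent=False)
        print(element, list(element))

    # Both calls print a <p> root with one <i> child; the <body> is gone:
    # <Element p at 0x10cadf9f8> [<Element i at 0x10cacc3b8>]
    # <Element p at 0x10cadfb38> [<Element i at 0x10cadf9f8>]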
    parse('<p><i>foo</i></p>')
    parse('<body><p><i>foo</i></p></body>')

This appears to come from the body manipulation that lxml does inside fragments_fromstring(), which fragment_fromstring() calls. There is even a FIXME about this in the source:
https://github.com/lxml/lxml/blob/95dc3640dd3df477c3bbb6179b96a28df8970045/src/lxml/html/__init__.py#L786
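
As a caller-side workaround, one could inspect the raw input before parsing to find out whether the body that fragment_fromstring() drops was there in the first place. The following is only a minimal sketch: has_enclosing_body() is a hypothetical helper, not part of lxml, and the regular expression is a heuristic on the leading tag rather than real HTML parsing.

```python
import re

# Hypothetical helper, not part of lxml. Heuristic: optional leading
# whitespace, then "<body" followed by whitespace, ">" or "/".
_BODY_OPEN = re.compile(r"\s*<body[\s>/]", re.IGNORECASE)


def has_enclosing_body(html):
    """Return True if the fragment text starts with a <body> tag."""
    return _BODY_OPEN.match(html) is not None


print(has_enclosing_body('<p><i>foo</i></p>'))               # False
print(has_enclosing_body('<body><p><i>foo</i></p></body>'))  # True
```

A caller could run this check on the string before handing it to fragment_fromstring() and record the result, since the parsed element alone does not reveal it.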

Here is the requested information about my system (on Mac OS X):

Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 6, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

Could be considered a bug, but it might not be all that easy to fix, given the way fragment_fromstring() is supposed to work.

That leaves me torn whether this should be fixed. I think I would accept a pull request with a reasonable and properly tested solution.

Changed in lxml:
status: New → Triaged