lxml

lxml.html.fragment_fromstring() strips an enclosing body when present

Bug #1665936 reported by Chris Jerdonek on 2017-02-18

This bug affects 1 person

Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Undecided    Unassigned

Bug Description

There seems to be an undocumented inconsistency / asymmetry with lxml.html's fragment_fromstring() when the fragment is enclosed in a body.

Specifically, it looks like fragment_fromstring() will strip any enclosing body tag.

You can observe this as follows:

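    from lxml.html import fragment_fromstring

    def parse(html):
        # Parse a single fragment and print the root element with its children.
        element = fragment_fromstring(html, create_parent=False)
        print(element, list(element))

    # Outputs (both calls yield a <p> root; the enclosing <body> is dropped):
    # <Element p at 0x10cadf9f8> [<Element i at 0x10cacc3b8>]
    # <Element p at 0x10cadfb38> [<Element i at 0x10cadf9f8>]
    parse('<p><i>foo</i></p>')
    parse('<body><p><i>foo</i></p></body>')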

This looks like a consequence of the body manipulation that lxml does inside fragments_fromstring(), which fragment_fromstring() calls. There is even a FIXME around this issue here:
https://github.com/lxml/lxml/blob/95dc3640dd3df477c3bbb6179b96a28df8970045/src/lxml/html/__init__.py#L786
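To illustrate the suspected mechanism, here is a minimal sketch. It assumes that fragments_fromstring() wraps the input in '<html><body>…</body></html>' before parsing whenever the input does not look like a full document, and then returns the children of the single resulting body. The wrapping string below is my approximation of that step, not code copied from lxml.

```python
from lxml.html import document_fromstring

fragment = '<body><p><i>foo</i></p></body>'

# Assumption: approximate the wrapping applied to inputs that are not
# full documents. This yields a <body> nested directly inside a <body>.
wrapped = '<html><body>%s</body></html>' % fragment

doc = document_fromstring(wrapped)

# The HTML parser keeps only one <body>, so the caller's own <body>
# cannot be told apart from the wrapper that was added around it.
bodies = doc.findall('body')
child_tags = [child.tag for child in bodies[0]]
print(len(bodies), child_tags)
```

If that assumption holds, the parser merges the nested body into the wrapper, which would explain why the enclosing body never reaches the caller.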

Here is the requested information about my system (on Mac OS X):

Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
lxml.etree : (3, 6, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2017-08-12:

Could be considered a bug, but it might not be all that easy to fix, given the way fragment_fromstring() is supposed to work.

That leaves me torn whether this should be fixed. I think I would accept a pull request with a reasonable and properly tested solution.

Changed in lxml:
status:	New → Triaged
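Until there is a fix, a caller-side workaround might look like the sketch below. The helper name fragment_keep_body and the regular expression are hypothetical and not part of lxml; the idea is simply to detect an explicit enclosing body up front and parse that input as a document instead, so the body element is returned rather than silently dropped.

```python
import re

from lxml.html import document_fromstring, fragment_fromstring

# Hypothetical check: does the input start with an explicit <body> tag?
_BODY_START = re.compile(r'^\s*<body[\s>]', re.IGNORECASE)

def fragment_keep_body(html):
    """Hypothetical helper, not part of lxml.

    Return the <body> element when the input is explicitly enclosed
    in one; otherwise defer to fragment_fromstring().
    """
    if _BODY_START.match(html):
        # Parse as a whole document and hand back its <body> element.
        return document_fromstring(html).body
    return fragment_fromstring(html, create_parent=False)

for source in ('<p><i>foo</i></p>', '<body><p><i>foo</i></p></body>'):
    element = fragment_keep_body(source)
    print(element.tag, [child.tag for child in element])
```

This only covers str input with a leading body tag; bytes input and the other fragment helpers would need their own handling, and a real fix inside lxml would likely look different.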
