lxml.html.fragment_fromstring suppresses attribute URL-decoding

Bug #1487738 reported by Chris Jerdonek
Affects  Status   Importance  Assigned to  Milestone
lxml     Invalid  Undecided   Unassigned

Bug Description

I found that attribute values are not properly URL-decoded when creating an HtmlElement using `lxml.html.fragment_fromstring()`. For example, the following code:

    import sys
    from lxml import etree
    import lxml.html
    from lxml.html import builder as E

    print("%-20s: %s" % ('Python', sys.version_info))
    print("%-20s: %s" % ('lxml.etree', etree.LXML_VERSION))
    print("%-20s: %s" % ('libxml used', etree.LIBXML_VERSION))
    print("%-20s: %s" % ('libxml compiled', etree.LIBXML_COMPILED_VERSION))
    print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_VERSION))
    print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_COMPILED_VERSION))

    def display(el):
        print()
        assert type(el).__name__ == 'HtmlElement'
        assert el.tag == 'img'
        print("tostring: {0}".format(lxml.html.tostring(el, encoding='unicode')))
        print("     src: {0}".format(el.attrib['src']))

    # Correct behavior. Shows--
    # tostring: <img src="abcd%C3%A9">
    # src: abcdé
    el = E.IMG(src=u"abcdé")
    display(el)

    # Incorrect behavior. Shows--
    # tostring: <img src="abcd%C3%A9">
    # src: abcd%C3%A9
    html = """<img src="abcd%C3%A9">"""
    el = lxml.html.fragment_fromstring(html)
    display(el)

Yields--

    Python : sys.version_info(major=3, minor=4, micro=3, releaselevel='final', serial=0)
    lxml.etree : (3, 4, 4, 0)
    libxml used : (2, 9, 2)
    libxml compiled : (2, 9, 2)
    libxslt used : (1, 1, 28)
    libxslt compiled : (1, 1, 28)

    tostring: <img src="abcd%C3%A9">
         src: abcdé

    tostring: <img src="abcd%C3%A9">
         src: abcd%C3%A9

Chris Jerdonek (chris-jerdonek) wrote :

Actually, it's possible that I've mislabeled which case is behaving "correctly."

In any case, the two cases should be yielding the same "src" value since the tostring() values for both are the same.
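
For anyone who needs the two cases to compare equal in the meantime, here is a minimal sketch using only the standard library. It does not change lxml's behavior. It assumes the stored attribute value is either the raw string or its percent-encoded UTF-8 form, as in the output above, and `normalize_src` is a hypothetical helper name, not part of lxml.

```python
from urllib.parse import quote, unquote


def normalize_src(value):
    """Percent-decode an attribute value so both cases compare equal.

    Hypothetical helper for illustration; not an lxml API.
    """
    return unquote(value)


raw = "abcdé"           # what el.attrib['src'] held in the first case
escaped = "abcd%C3%A9"  # what el.attrib['src'] held in the second case

# The escaped form is the UTF-8 percent-encoding of the raw form.
assert quote(raw) == escaped

# After normalization, both cases yield the same value.
assert normalize_src(raw) == normalize_src(escaped)
```

Note that `unquote` decodes every percent-escape, including ones that are meaningful in a real URL (for example `%2F`), so this is only suitable for comparison or display, not for rewriting the attribute.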

scoder (scoder) wrote :

Not sure if this is incorrect. There are certain rules for how to treat a URN in src="". It depends on the actual content and on whether it has the form of a URL or not.
I'm closing this as it's a) probably not a bug and b) not done by lxml but by the parser in libxml2.

Changed in lxml:
status: New → Invalid