lxml.html.fragment_fromstring suppresses attribute URL-decoding
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
I found that attribute values are not properly URL-decoded when creating an HtmlElement using `lxml.html.
import sys
from lxml import etree
import lxml.html
from lxml.html import builder as E
print("%-20s: %s" % ('Python', sys.version_info))
print("%-20s: %s" % ('lxml.etree', etree.LXML_
print("%-20s: %s" % ('libxml used', etree.LIBXML_
print("%-20s: %s" % ('libxml compiled', etree.LIBXML_
print("%-20s: %s" % ('libxslt used', etree.LIBXSLT_
print("%-20s: %s" % ('libxslt compiled', etree.LIBXSLT_
def display(el):
    print()
    assert type(el).__name__ == 'HtmlElement'
    assert el.tag == 'img'
    print(" src: {0}".format(
# Correct behavior. Shows--
# tostring: <img src="abcd%C3%A9">
# src: abcdé
el = E.IMG(src=u"abcdé")
display(el)
# Incorrect behavior. Shows--
# tostring: <img src="abcd%C3%A9">
# src: abcd%C3%A9
html = """<img src="abcd%
el = lxml.html.
display(el)
Yields--
Python : sys.version_
lxml.etree : (3, 4, 4, 0)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
tostring: <img src="abcd%C3%A9">
src: abcdé
tostring: <img src="abcd%C3%A9">
src: abcd%C3%A9
Actually, it's possible that I've mislabeled which case is behaving "correctly."
In any case, the two cases should be yielding the same "src" value since the tostring() values for both are the same.
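One possible reading of the output above is that identical tostring() values do not imply identical stored attribute values: a serializer that percent-encodes non-ASCII characters but leaves existing "%" escapes alone maps both "abcdé" and "abcd%C3%A9" to the same text. The sketch below illustrates this with only the standard library. The helper `escape_like_serializer` is hypothetical and is not lxml's or libxml2's actual code; its escaping rule is an assumption made for illustration.

```python
from urllib.parse import quote, unquote

# The two attribute values from the report: raw text vs. already percent-encoded.
raw_value = "abcd\u00e9"        # "abcdé", as passed to E.IMG(src=...)
encoded_value = "abcd%C3%A9"    # as written in the parsed HTML string


def escape_like_serializer(value):
    """Hypothetical stand-in for an HTML serializer's URL-attribute escaping.

    Assumption: non-ASCII characters are percent-encoded as UTF-8 and
    existing '%' escapes are left untouched.
    """
    return quote(value, safe="%")


# Both stored values serialize to the same text under this assumption.
print(escape_like_serializer(raw_value))      # abcd%C3%A9
print(escape_like_serializer(encoded_value))  # abcd%C3%A9

# Decoding the percent-encoded form recovers the raw text.
print(unquote(encoded_value) == raw_value)    # True
```

If this is what happens, the "src" values differ because the inputs differ, and calling `unquote()` on the parsed attribute value would be one way to obtain the decoded form.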