html.parse() truncates attribute value
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
Running the following code with the attached repro.html in the current working directory shows that the title attribute of the last img element on the page is truncated.
#!/usr/bin/env python
from re import search
from lxml import html
page = open("repro.html", encoding=
img = search('<img id="_426"[^>]+.', page).group(0)
print("SERIALIZED IMG ELEMENT:", img)
title = search(
print("SERIALIZED TITLE ATTRIBUTE:", title)
for node in html.parse(
if node.get("id") == "_426":
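To check whether the truncation is specific to lxml/libxml2, the same attribute can be read with a different parser and compared. The following is only a sketch using the standard library's html.parser. The names AttributeGrabber, attribute_via_stdlib and sample are hypothetical, and the inline markup is a stand-in for repro.html, so it does not reproduce the bug by itself.

```python
from html.parser import HTMLParser


class AttributeGrabber(HTMLParser):
    """Record one attribute of the first element carrying a given id."""

    def __init__(self, element_id, attribute):
        super().__init__(convert_charrefs=True)
        self.element_id = element_id
        self.attribute = attribute
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if self.value is None and attrs.get("id") == self.element_id:
            self.value = attrs.get(self.attribute)


def attribute_via_stdlib(markup, element_id, attribute):
    """Return the attribute value as seen by html.parser, or None."""
    grabber = AttributeGrabber(element_id, attribute)
    grabber.feed(markup)
    grabber.close()
    return grabber.value


# Hypothetical stand-in for the contents of repro.html.
sample = '<p><img id="_426" src="x.png" title="a fairly long title, kept intact"></p>'
print(attribute_via_stdlib(sample, "_426", "title"))
```

Passing the text of repro.html instead of sample and comparing the result with node.get("title") from lxml would show whether both parsers agree on the attribute value.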
Requested environment information:
Python : sys.version_
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
That's on macOS. Also reproducible on Linux and Windows with Python 3.8.0 and lxml 4.5.0.
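For anyone trying to reproduce this on another setup, the environment information above can be gathered with a snippet along these lines. This is a sketch: the helper name collect_versions is hypothetical, and the fallback for a missing lxml installation is an addition for convenience. The version constants are the ones exposed by lxml.etree.

```python
import sys


def collect_versions():
    """Return (label, value) pairs in the same order as the report above."""
    info = [("Python", tuple(sys.version_info))]
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed, so only the interpreter version is reported.
        return info
    info += [
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return info


for label, value in collect_versions():
    print("%-20s: %s" % (label, value))
```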