lxml

html.parse() truncates attribute value

Bug #1866647 reported by Bob Kline on 2020-03-09

Affects    Status    Importance    Assigned to    Milestone
lxml       New       Undecided     Unassigned

Bug Description

If you run the following code with the attached repro.html in the current working directory, you will see from the output that the title attribute of the last img element on the page is truncated.

#!/usr/bin/env python

from re import search
from lxml import html

page = open("repro.html", encoding="utf-8").read()
img = search('<img id="_426"[^>]+.', page).group(0)
print("SERIALIZED IMG ELEMENT:", img)
title = search('title="([^"]+)"', img).group(1)
print("SERIALIZED TITLE ATTRIBUTE:", title)

for node in html.parse("repro.html").iter("img"):
    if node.get("id") == "_426":
        print("PARSED (AND TRUNCATED) TITLE ATTRIBUTE:", node.get("title"))
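
As a cross-check that does not depend on a regular expression, the same attribute could be read with Python's standard-library html.parser and compared with the value lxml returns. The following is only a sketch under that assumption: the names AttrGrabber and attr_via_stdlib are hypothetical and are not part of lxml or of the original script.

```python
from html.parser import HTMLParser


class AttrGrabber(HTMLParser):
    """Record one attribute from the first matching element that has it."""

    def __init__(self, tag, element_id, attr):
        super().__init__(convert_charrefs=True)
        self.tag = tag
        self.element_id = element_id
        self.attr = attr
        self.value = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased
        # and values arrive with character references already decoded.
        if self.value is None and tag == self.tag:
            found = dict(attrs)
            if found.get("id") == self.element_id:
                self.value = found.get(self.attr)


def attr_via_stdlib(markup, tag, element_id, attr):
    """Return the attribute value as seen by html.parser, or None."""
    grabber = AttrGrabber(tag, element_id, attr)
    grabber.feed(markup)
    grabber.close()
    return grabber.value
```

With the attached repro.html loaded into `page`, `attr_via_stdlib(page, "img", "_426", "title")` could be compared against `node.get("title")` from lxml. The standard-library parser recovers from malformed markup differently than libxml2 does, so a mismatch is a hint about where the value is cut off rather than proof of which parser is right.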

Requested environment information:

Python : sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)
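
For anyone adding their own environment to this report, a listing in the same layout can be produced from the version constants that lxml.etree exposes. This is a minimal sketch: the function name collect_versions is hypothetical, and the ImportError guard is an addition so the snippet still runs where lxml is not installed.

```python
import sys


def collect_versions():
    """Return (label, value) pairs describing the Python and lxml builds."""
    report = [("Python", sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed; report only the interpreter version.
        return report
    report += [
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return report


for label, value in collect_versions():
    print("%-20s: %s" % (label, value))
```

Comparing the "used" and "compiled" rows shows whether the libxml2 loaded at run time differs from the one lxml was built against, which matters here because HTML parsing is done by libxml2.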

The environment listed above is macOS. The problem is also reproducible on Linux and Windows, Python 3.8.0, lxml 4.5.0.