html.parse() truncates attribute value

Bug #1866647 reported by Bob Kline
This bug affects 1 person
Affects   Status   Importance   Assigned to   Milestone
lxml      New      Undecided    Unassigned

Bug Description

Run the following script with the attached repro.html in the current working directory. The output shows that html.parse() truncates the title attribute of the last img element on the page (id="_426"), while the same attribute is complete in the serialized markup.

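#!/usr/bin/env python

from re import search
from lxml import html

# Read the raw markup so the serialized attribute can be compared with the parsed one.
with open("repro.html", encoding="utf-8") as fp:
    page = fp.read()

# Pull the img start tag straight out of the source text, bypassing the parser.
img = search(r'<img id="_426"[^>]+.', page).group(0)
print("SERIALIZED IMG ELEMENT:", img)
title = search(r'title="([^"]+)"', img).group(1)
print("SERIALIZED TITLE ATTRIBUTE:", title)

# Parse the same file with lxml and print the title attribute of the same element.
for node in html.parse("repro.html").iter("img"):
    if node.get("id") == "_426":
        print("PARSED (AND TRUNCATED) TITLE ATTRIBUTE:", node.get("title"))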
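To narrow down where the value is cut, it may help to compare the serialized and parsed strings programmatically rather than by eye. The following is only a sketch: describe_truncation is a hypothetical helper, not part of lxml, and it assumes the only legitimate difference between the two strings is entity decoding, which it approximates with the standard library's html.unescape.

```python
from html import unescape


def describe_truncation(serialized, parsed):
    """Report how a parsed attribute value differs from its serialized form.

    `serialized` is the raw text taken from the markup (entities still
    encoded); `parsed` is what the parser returned, or None if absent.
    """
    # Decode entities so that "&amp;" vs "&" is not reported as a difference.
    expected = unescape(serialized)
    if parsed is None:
        return "attribute missing after parsing"
    if parsed == expected:
        return "values match"
    if expected.startswith(parsed):
        cut = len(parsed)
        next_char = expected[cut]
        return "truncated at offset %d of %d, before %r (U+%04X)" % (
            cut, len(expected), next_char, ord(next_char))
    return "values differ, but not by simple truncation"


if __name__ == "__main__":
    # Made-up sample strings, not taken from repro.html.
    print(describe_truncation("first part, second part", "first part"))
```

In the script above, this could be called with title and node.get("title"). The character reported at the cut point may hint at what the parser stumbled on. Note that both modules are named html, so one of the two imports would need an alias in a combined script.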

Requested environment information:

Python : sys.version_info(major=3, minor=7, micro=6, releaselevel='final', serial=0)
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

That's on macOS. Also reproducible on Linux and Windows, Python 3.8.0, lxml 4.5.0.
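For anyone confirming the bug on another platform, version information in the same layout can be gathered with a few lines of Python. This is a sketch, not necessarily the script used for the listing above: collect_versions is a hypothetical helper name, and the fallback for a missing lxml installation is an addition for convenience. The version constants themselves are attributes of lxml.etree.

```python
import sys


def collect_versions():
    """Return (label, value) pairs for Python and, if importable, lxml."""
    rows = [("Python", sys.version_info)]
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed; report only the interpreter version.
        return rows
    rows += [
        ("lxml.etree", etree.LXML_VERSION),
        ("libxml used", etree.LIBXML_VERSION),
        ("libxml compiled", etree.LIBXML_COMPILED_VERSION),
        ("libxslt used", etree.LIBXSLT_VERSION),
        ("libxslt compiled", etree.LIBXSLT_COMPILED_VERSION),
    ]
    return rows


if __name__ == "__main__":
    for label, value in collect_versions():
        print("%-20s: %s" % (label, value))
```

The "used" and "compiled" values can differ when lxml is linked dynamically against a system libxml2, so both are worth including when comparing platforms.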
