iter(tag) finds nothing when feed parser HTMLParser() used

Bug #1014290 reported by Marty on 2012-06-17
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder
Milestone: 3.3

Bug Description

Calling the iter() method with the "tag" argument yields no elements when the tree was built through the HTMLParser() feed parser interface. It does yield the expected element when the tree comes from the HTML() function or the parse() function. The xpath() method still works, so it can serve as a workaround.
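Besides xpath(), another possible workaround is to filter the unfiltered iter() by tag name in Python, since the output in this report shows that iter() without arguments still returns every element. The following is only a minimal sketch: the helper name iter_by_tag is hypothetical, it relies only on the iter() method and the .tag attribute, and it has not been verified against the affected lxml versions.

```python
def iter_by_tag(root, tag):
    """Yield root and its descendants whose .tag equals `tag`.

    Hypothetical helper: walks the plain, unfiltered iter() and compares
    tag names as Python strings instead of passing `tag` to iter().
    """
    for element in root.iter():
        # Comments and processing instructions have a non-string .tag,
        # so the comparison simply evaluates to False for them.
        if element.tag == tag:
            yield element
```

With the feed parser root from the test program, list(iter_by_tag(root, "body")) would be expected to return the single body element, in line with the xpath("//body") result.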

Test program:

def test(root):
    print("Root element:", root)
    i = root.iter("body")
    print('List of <body> elements via iter("body"):', list(i))
    print("List of all elements via iter():", list(root.iter()))
    print('xpath("//body"):', root.xpath("//body"))

markup = "<html><body></body></html>"

print("==== Using string parser HTML()")
from lxml.etree import HTML
root = HTML(markup)
test(root)

print("==== Using feed parser HTMLParser()")
from lxml.etree import HTMLParser
parser = HTMLParser()
parser.feed(markup)
root = parser.close()
test(root)

Output:

==== Using string parser HTML()
Root element: <Element html at 0x7f65e9edfe60>
List of <body> elements via iter("body"): [<Element body at 0x7f65e9edfeb0>]
List of all elements via iter(): [<Element html at 0x7f65e9edfe60>, <Element body at 0x7f65e9edfeb0>]
xpath("//body"): [<Element body at 0x7f65e9edff00>]
==== Using feed parser HTMLParser()
Root element: <Element html at 0x7f65e9edfeb0>
List of <body> elements via iter("body"): []
List of all elements via iter(): [<Element html at 0x7f65e9edfeb0>, <Element body at 0x7f65e9edff50>]
xpath("//body"): [<Element body at 0x7f65e9edfe60>]

Arch Linux:
Python 3.2.3, releaselevel='final', serial=0
lxml.etree (2, 3, 4, 0)
libxml2 2.7.8-1, libxslt 1.1.26-2

Also happens with Python 2.7.3 (same lxml.etree version) on Arch Linux
Originally happened with Python 3.2.3, lxml.etree (2, 3, 2, 0) on Ubuntu

scoder (scoder) wrote :

Thanks for the report, I can reproduce this.

Changed in lxml:
status: New → Confirmed

Dr. Dénes Vadász (python2) wrote :

After a few hours of investigation we have discovered that the tags of the nodes in the tree produced by libxml2 are strings that are not in the document dictionary.

A next step could be to reproduce this in a small C program that uses libxml2 directly.

scoder (scoder) wrote :
Changed in lxml:
assignee: nobody → scoder (scoder)
importance: Undecided → Medium
milestone: none → 3.3
status: Confirmed → Fix Committed

scoder (scoder) wrote :

Fixed in lxml 3.3.1.
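Code that has to run on both older and newer installations could guard on the version before relying on iter(tag). The sketch below assumes the version tuple has the same shape as the one printed in the report, e.g. (2, 3, 4, 0), which is what lxml.etree.LXML_VERSION provides. The function name has_iter_tag_fix is hypothetical.

```python
# First release that contains the fix, according to the comment above.
FIXED_IN = (3, 3, 1)

def has_iter_tag_fix(version):
    """Return True if `version` is at or past lxml 3.3.1.

    `version` is a tuple such as lxml.etree.LXML_VERSION; only the first
    three components (major, minor, patch) are compared.
    """
    return tuple(version[:3]) >= FIXED_IN
```

A caller could pass lxml.etree.LXML_VERSION and fall back to xpath("//body") when the function returns False.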

Changed in lxml:
status: Fix Committed → Fix Released