lxml.html interface for iterparse?

Bug #1932486 reported by danny0838
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Triaged   Low          Unassigned

Bug Description

I am parsing an HTML document using iterparse. However, the element returned in each iteration does not implement the lxml.html interface (https://lxml.de/lxmlhtml.html#html-element-methods), such as .classes and .text_content(), even when the html=True parameter is passed to iterparse.

Is there a way to parse HTML iteratively, as iterparse does, with the lxml.html interface supported?

Tags: html
danny0838 (danny0838)
description: updated
scoder (scoder) wrote :

Seems a reasonable addition that should be trivial to add. PR welcome. Tests can go into a new file "test_parser.py" in "src/lxml/html/tests/".

Changed in lxml:
importance: Undecided → Low
status: New → Triaged
danny0838 (danny0838) wrote :

What should the right interface for it be? Should it be lxml.html.iterparse()?

scoder (scoder) wrote :

> lxml.html.iterparse()?

Yes.

danny0838 (danny0838) wrote :

OK. Unfortunately I'm not familiar with C programming and probably can't contribute the code.

scoder (scoder) wrote :
danny0838 (danny0838) wrote :

Unlike etree.fromstring(), etree.iterparse() does not support the "parser" argument, so lxml.html.iterparse() cannot be implemented the way lxml.html.fromstring() is, which simply builds on the passed "parser" argument. It seems that some changes to etree.pyx or iterparse.pxi are required, which need at least some Python C module programming knowledge.
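
To illustrate the "parser" mechanism mentioned above, here is a minimal sketch. It assumes only the public etree.HTMLParser and lxml.html.HTMLParser classes; the sample markup and variable names are made up for the example. The point is that the element class depends on the parser handed to fromstring(), which is the hook iterparse() does not expose.

```python
from lxml import etree
import lxml.html

# Hypothetical sample document, used only for this illustration.
data = b"<html><body><p class='note'>Hello <b>world</b></p></body></html>"

# A plain etree HTML parser yields generic etree elements.
plain_root = etree.fromstring(data, parser=etree.HTMLParser())

# lxml.html.HTMLParser carries the HTML element class lookup, so the
# same etree.fromstring() call now returns lxml.html elements.
html_root = etree.fromstring(data, parser=lxml.html.HTMLParser())

plain_p = plain_root.find(".//p")
html_p = html_root.find(".//p")

print(type(plain_p).__name__, type(html_p).__name__)
print(html_p.text_content())
```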

scoder (scoder) wrote :

You can add a new keyword argument "element_class_lookup" to iterparse():
https://github.com/lxml/lxml/blob/master/src/lxml/iterparse.pxi

I trust that a little bit of Python knowledge is sufficient for this.

Tests can go into src/lxml/html/tests/test_basic.py, but it seems worth adding a new unittest test class for it there. Or a new doctest file, if you prefer that.
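
Until such a keyword exists, something close can probably be done from user code. The following is only a sketch under stated assumptions, not the proposed lxml.html.iterparse(): it uses etree.HTMLPullParser with set_element_class_lookup() and lxml.html.HtmlElementClassLookup, and the helper name iter_html_events() is hypothetical. The "element_class_lookup" keyword suggested above is not used, since it is only a proposal in this thread.

```python
from lxml import etree
import lxml.html


def iter_html_events(chunks, events=("end",), tag=None):
    """Yield (event, element) pairs from an iterable of byte chunks.

    Hypothetical helper, not part of lxml. The elements are expected to be
    lxml.html elements because the HTML class lookup is set on the parser.
    """
    parser = etree.HTMLPullParser(events=events, tag=tag)
    parser.set_element_class_lookup(lxml.html.HtmlElementClassLookup())
    for chunk in chunks:
        parser.feed(chunk)
        # Hand out whatever events this chunk completed.
        yield from parser.read_events()
    parser.close()
    # Closing the parser may complete further events.
    yield from parser.read_events()


# Hypothetical input, split into two chunks to mimic incremental reading.
chunks = [
    b"<html><body><p class='a b'>Hello <b>wor",
    b"ld</b></p><p>Second</p></body></html>",
]

paragraphs = [el for _event, el in iter_html_events(chunks, tag="p")]
for el in paragraphs:
    print(sorted(el.classes), el.text_content())
```

Unlike etree.iterparse(), this helper takes data chunks rather than a file name or file object, so reading the source in pieces is left to the caller.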
