Some text missing when parsing nested nodes

Bug #1942757 reported by qian jia huan
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
lxml      Invalid   Undecided    Unassigned

Bug Description

description:
When I parse an HTML document, some text is missing when a node contains nested nodes. Is this a bug or the expected result?

version info:
Python : sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 5)
libxml compiled : (2, 9, 5)
libxslt used : (1, 1, 30)
libxslt compiled : (1, 1, 30)

code:
from lxml import etree
src = '<html><body><p>test1<strong>test2</strong>test3<strong>test4</strong>test5</p></body></html>'
doc = etree.HTML(src, etree.HTMLParser())
for tag in doc.iter():
    # tag.text holds only the text before the element's first child
    if tag.text is None:
        continue
    print(tag.text)

expect: test1
        test2
        test3
        test4
        test5

output: test1
        test2
        test4

scoder (scoder) wrote :
Changed in lxml:
status: New → Invalid