lxml.etree returns an incomplete node

Bug #1592580 reported by rick
This bug affects 2 people
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

###BUG
from lxml import etree
import urllib2  # Python 2 only

index = 'http://www.uukanshu.com/b/18652/'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = urllib2.Request(url=index, headers={'User-Agent': user_agent})
response = urllib2.urlopen(req)
page = response.read()
print(page)  # Original HTML: 922 links match the XPath expression below
element = etree.HTML(page)
res = etree.tostring(element, pretty_print=True, method="html")
print(res)  # Processed HTML: incomplete node (ignore the encoding problem)
s = element.xpath("/html/body//ul[@id='chapterList']/child::li/a/attribute::href", smart_strings=True)
print(len(s))
###
The code above is runnable. Let's put the encoding problem aside.
The original HTML (read from the response) contains about 922 links under /html/body//ul[@id='chapterList']/child::li.
However, after the HTML is processed with the etree.HTML() function,
only 403 links remain. 519 links are lost for some reason.
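
One way to narrow this down is to compare the number of anchors in the raw bytes with the number the parsed tree returns, and to look at the parser's error log for the places where libxml2 had to recover from invalid markup. The following is only a diagnostic sketch in Python 3: the inline SAMPLE stands in for the downloaded page, and the helper name count_links is made up for illustration. It does not identify the cause of the missing links on the real page.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page (the real page is much larger).
SAMPLE = b"""<html><body><ul id="chapterList">
<li><a href="/b/1/">one</a></li>
<li><a href="/b/2/">two</a></li>
<li><a href="/b/3/">three</a></li>
</ul></body></html>"""


def count_links(raw):
    """Parse raw HTML bytes; return (number of hrefs found, parser error log entries)."""
    parser = etree.HTMLParser()  # explicit parser so its error_log can be inspected
    root = etree.fromstring(raw, parser)
    hrefs = root.xpath("//ul[@id='chapterList']/li/a/@href")
    return len(hrefs), list(parser.error_log)


raw_count = SAMPLE.count(b"<a ")  # crude count of anchors in the raw bytes
parsed_count, errors = count_links(SAMPLE)
print(raw_count, parsed_count)

# Each entry points at a spot where the parser recovered from bad markup.
for entry in errors:
    print(entry.line, entry.message)
```

If the two counts differ on the real page, the line numbers in the error log are a reasonable first place to look in the original HTML.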

Pablo Cabrera (pablo-rocka) wrote :

I have a similar bug using element.xpath.

To reproduce, I have a few different XML files produced by equipment from two different companies. These files comply with the 3GPP standard and are valid as checked with xmllint. The XML uses namespaces.

This is more or less how some specific tags look (the complete structure is much more complex):

<xn:meContext id="1234">
    <xn:attributes>
        <xn:meContextId>1234</xn:meContextId>
        <xn:dnPrefix>DC=cn</xn:dnPrefix>
    </xn:attributes>
    <xn:ManagedElement id="1234">
        <xn:attributes>
            <xn:userDefinedState>2</xn:userDefinedState>
            <xn:vendorName>vendor_here</xn:vendorName>
            <xn:locationName>location</xn:locationName>
            <xn:managedElementType>ABC123</xn:managedElementType>
            <xn:userLabel>SOMENAME</xn:userLabel>
            <xn:swVersion>sw version 1.2.3.4</xn:swVersion>
        </xn:attributes>
        <xn:VsDataContainer id="1234">
            <xn:some_more_tags></xn:some_more_tags>
        </xn:VsDataContainer>
    </xn:ManagedElement>
</xn:meContext>

I'm doing event-driven parsing with iterparse, and somewhere in the XML there are many of these meContext tags. The text I'm interested in extracting and printing is xn:managedElementType; however, for some instances, only some of the elements under xn:attributes are returned.

Here is a script that reproduces my problem:

#!/usr/bin/env python3

from lxml import etree

tags = ["{http://www.3gpp.org/ftp/specs/archive/32_series/32.625#genericNrm}meContext"]
NSMAP = {
    "xn": 'http://www.3gpp.org/ftp/specs/archive/32_series/32.625#genericNrm'
}
qry = "xn:ManagedElement/xn:attributes/*"

context = etree.iterparse("test.xml", events=("start", "end"), tag=tags, huge_tree=True)
# also tried without huge_tree
# context = etree.iterparse("test.xml", events=("start", "end"), tag=tags)

for event, element in context:
    if event == "start":
        children = element.xpath(qry, namespaces=NSMAP)
        print("Id: {} Children: {}".format(element.get("id"), children))

In the printout I can see the children of xn:ManagedElement/xn:attributes/, and in some cases only the first child is returned.

I have examined the XML in BaseX and in a text editor, and the fields are there. The strangest thing is that if I indent the file using xmllint --format, it works fine, as it does if I manipulate the file by removing some tags.
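
A possible explanation, not confirmed in this report: the lxml documentation notes that when a "start" event is delivered, the text, tail and children of the element are not necessarily present yet, and only the "end" event guarantees that the element has been parsed completely. That would be consistent with the result changing when the file is reformatted, because how much of the subtree has already been read at that point depends on how the input is chunked. The following is a minimal sketch assuming that is the cause; the inline SAMPLE document and the helper name collect_attribute_tags are made up for illustration.

```python
import io

from lxml import etree

NS = "http://www.3gpp.org/ftp/specs/archive/32_series/32.625#genericNrm"
NSMAP = {"xn": NS}
TAG = "{%s}meContext" % NS
QUERY = "xn:ManagedElement/xn:attributes/*"

# Hypothetical, heavily reduced stand-in for the real 3GPP file.
SAMPLE = ("""<root xmlns:xn="%s">
  <xn:meContext id="1234">
    <xn:ManagedElement id="1234">
      <xn:attributes>
        <xn:vendorName>vendor_here</xn:vendorName>
        <xn:managedElementType>ABC123</xn:managedElementType>
      </xn:attributes>
    </xn:ManagedElement>
  </xn:meContext>
</root>""" % NS).encode("utf-8")


def collect_attribute_tags(source):
    """Map each meContext id to the local names of its attribute children."""
    results = {}
    # Only "end" events: the subtree of the element is complete at that point.
    for _event, element in etree.iterparse(source, events=("end",), tag=TAG):
        children = element.xpath(QUERY, namespaces=NSMAP)
        results[element.get("id")] = [etree.QName(c).localname for c in children]
        element.clear()  # release the subtree once the data has been extracted
    return results


found = collect_attribute_tags(io.BytesIO(SAMPLE))
print(found)
```

If the "start" event is needed for other reasons, the query can still be deferred to the matching "end" event of the same element.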

The files contain sensitive information, so I can't attach them publicly, but if someone needs to reproduce the issue to assess whether it is a bug, I can share a file privately.

python 3.6.6
lxml==4.2.5
