lxml.etree returns an incomplete node
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | New | Undecided | Unassigned | | |
Bug Description
###BUG
from lxml import etree,html
import urllib2
index = 'http://
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
req = urllib2.Request(url = index,headers = { 'User-Agent' : user_agent })
response = urllib2.
html = response.read()
print(html) #Original HTML - 922 links under xpath expression as described below
element = etree.HTML(html)
res = etree.tostring(
print(res) #Processed HTML - incomplete node - ignore encoding problem
s = element.
print(len(s))
###
The code above is runnable. Let's put the encoding problem aside.
The original HTML (read from the response) contains about 922 links under /html/body/
However, after processing the HTML with the etree.HTML() function,
only 403 links remain. We lose 519 links for some reason.
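One way to double-check the 922 figure independently of lxml is to count the anchors in the raw markup with the standard library. This is only a sketch: it is written for Python 3 (the report above uses Python 2's urllib2), the names LinkCounter and count_links are made up for illustration, and it counts every <a> tag with an href in the whole document, which is an assumption because the XPath expression in the report is cut off.

```python
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Counts <a> start tags that carry an href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.count += 1


def count_links(markup):
    """Return the number of <a href=...> tags found in a markup string."""
    counter = LinkCounter()
    counter.feed(markup)
    counter.close()
    return counter.count


# Tiny hypothetical sample: two real links and one anchor without href.
sample = (
    '<html><body><a href="/a">a</a>'
    '<p><a href="/b">b</a><a name="x">no href</a></body></html>'
)
print(count_links(sample))
```

Running count_links on the decoded response body and comparing the result with the count obtained after etree.HTML() would show whether the links are really missing from the parsed tree or only from the XPath result.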
I have a similar bug using element.xpath
To reproduce, I have a few different XML files produced by equipment from two different companies. These files comply with the 3GPP standard and are valid according to xmllint. The XML uses namespaces.
This is more or less how the relevant tags look (the complete structure is much more complex):
<xn:meContext id="1234">
  <xn:attributes>
    <xn:meContextId>1234</xn:meContextId>
    <xn:dnPrefix>DC=cn</xn:dnPrefix>
  </xn:attributes>
  <xn:ManagedElement id="1234">
    <xn:attributes>
      <xn:userDefinedState>2</xn:userDefinedState>
      <xn:vendorName>vendor_here</xn:vendorName>
      <xn:locationName>location</xn:locationName>
      <xn:managedElementType>ABC123</xn:managedElementType>
      <xn:userLabel>SOMENAME</xn:userLabel>
      <xn:swVersion>sw version 1.2.3.4</xn:swVersion>
    </xn:attributes>
    <xn:VsDataContainer id="1234">
      <xn:some_more_tags></xn:some_more_tags>
    </xn:VsDataContainer>
  </xn:ManagedElement>
</xn:meContext>
I'm doing event-driven parsing with iterparse, and somewhere in the XML there are many of these meContext tags. The text I'm interested in extracting and printing is xn:managedElementType; however, for some instances, only some of the elements under xn:attributes are returned.
Here is a script that reproduces my problem:
#!/usr/bin/env python3
from lxml import etree
tags = ["{http://www.3gpp.org/ftp/specs/archive/32_series/32.625#genericNrm}meContext"]
NSMAP = {
    "xn": 'http://www.3gpp.org/ftp/specs/archive/32_series/32.625#genericNrm'
}
qry = "xn:ManagedElement/xn:attributes/*"
context = etree.iterparse("test.xml", events=("start", "end"), tag=tags, huge_tree=True)
# also tried without huge_tree
# context = etree.iterparse("test.xml", events=("start", "end"), tag=tags)
for event, element in context:
    if event == "start":
        children = element.xpath(qry, namespaces=NSMAP)
        print("Id: {} Children: {}".format(element.get("id"), children))
In the printout, I can see the children of xn:ManagedElement/xn:attributes/, and in some cases only the first child is returned.
I have examined the XML in BaseX and in a text editor, and the fields are there. The strangest thing is that if I indent the file using xmllint -format, it works fine, and it also works if I manipulate the file by removing some tags.
The files contain sensitive information, so I can't attach them publicly, but if someone needs to reproduce the issue to assess whether it is a bug or not, I can share a file privately.
python 3.6.6
lxml==4.2.5
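A possible angle on the iterparse script above, offered as an assumption rather than a diagnosis: at a "start" event, only the part of the element that the parser has already read is available, so a query run at that point can see fewer children than the same query run at the "end" event. The sketch below uses the standard library's xml.etree.ElementTree.XMLPullParser as a stand-in for lxml, with a made-up namespace and document. The split point `cut` is chosen by hand to imitate a read-chunk boundary falling inside xn:attributes.

```python
import xml.etree.ElementTree as ET

NS = "urn:example:ns"  # hypothetical namespace, not the 3GPP one
doc = (
    '<root xmlns:xn="urn:example:ns">'
    '<xn:meContext id="1234">'
    '<xn:ManagedElement id="1234">'
    '<xn:attributes>'
    '<xn:userDefinedState>2</xn:userDefinedState>'
    '<xn:vendorName>vendor_here</xn:vendorName>'
    '<xn:managedElementType>ABC123</xn:managedElementType>'
    '</xn:attributes>'
    '</xn:ManagedElement>'
    '</xn:meContext>'
    '</root>'
)
qry = "xn:ManagedElement/xn:attributes/*"
nsmap = {"xn": NS}
target = "{%s}meContext" % NS

# Pretend the parser's input chunk ends right before managedElementType.
cut = doc.index("<xn:managedElementType>")

parser = ET.XMLPullParser(events=("start", "end"))
seen = {}  # event name -> number of children matched at that event


def drain():
    """Record how many children the query matches for each meContext event."""
    for event, element in parser.read_events():
        if element.tag == target:
            seen[event] = len(element.findall(qry, nsmap))


parser.feed(doc[:cut])   # first chunk: meContext has started but is incomplete
drain()
parser.feed(doc[cut:])   # second chunk: the rest of the document
parser.close()
drain()

print(seen)
```

If this is what happens in the reported case, it would also fit the observation that re-indenting the file or removing tags changes the result, since either one moves the chunk boundaries. Running the query only on "end" events would be the first thing to try.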