iterparse does not correctly parse a DTB content file
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | Invalid | Undecided | Unassigned |
Bug Description
I am trying to emulate the interface of QXmlStreamReader with iterparse() in order to parse the content file (AreYouReadyV3.xml) of the DTB found here (http://
I am using lxml.etree v3.2.3 on openSUSE 12.3 (64-bit). A similar parser that uses libxml2's xmlTextReader directly does not produce this error. Below is the script I use:
class XmlStreamReader:
    __it = object()
    __name = ""
    __attrib= dict()
    __foundText = 0 #1 - text, 2 - tail
    __text = ""
    __tail = ""
    NODE_
    NODE_
    NODE_TYPE_TEXT = int(1)
    NODE_
    NODE_
    def __toUtf8(self, val):
        if isinstance(val, unicode):
            return val.encode('utf-8')
        elif isinstance(val, str):
            return val
        else:
            return ""
    def setFileName(self, fn):
        self.__it = etree.iterparse(fn, events = ("start", "end"), remove_
    def next(self):
        try:
            if 0 == self.__foundText:
                e = self.__it.next()
            else: #return the text or tail
                if 2 == self.__foundText:
        except:
            return self.NODE_
        if "start" == e[0]:
            #get name
            #get attributes
            if len(e[1].attrib):
                for k in e[1].keys():
            #get text
            if None != e[1].text:
            else:
                return self.NODE_
        elif "end" == e[0]:
            #get name
            #check for tail
            val = e[1].tail
            if None != e[1].tail:
            else:
                return self.NODE_
        return self.NODE_
    def name(self):
        return self.__name
    def attributes(self):
        return self.__attrib
    def text(self):
        return self.__text
    #needed to be called from C++
    def attributesCount
        return len(self.
    def attributeKey(self, index):
        return self.__
    def attributeValue(
        return self.__
Here are more details about my system:
Python : sys.version_
lxml.etree : (3, 2, 3, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Seems to work for me:
--------------
Python 2.7.5+ (default, Sep 19 2013, 13:48:49)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree as et
>>> t=et.parse('Are_you_ready_z3986/AreYouReadyV3.xml')
>>> t.xpath('//*[@id="dtb210"]')
[<Element {http://www.daisy.org/z3986/2005/dtbook/}sent at 0x27896e0>]
>>> it=et.iterparse('Are_you_ready_z3986/AreYouReadyV3.xml',
... remove_blank_text=True, huge_tree=True, load_dtd=True)
>>> all(it)
True
>>> it.root.xpath('.//*[@id="dtb210"]')
[<Element {http://www.daisy.org/z3986/2005/dtbook/}sent at 0x27896e0>]
>>> it=et.iterparse('Are_you_ready_z3986/AreYouReadyV3.xml',
... remove_blank_text=True, huge_tree=True, load_dtd=True)
>>> any(1 for ev,el in it if el.get('id') == 'dtb210')
True
--------------
Your code looks somewhat unwieldy and complicated, though. If you could clean it up a bit, it might help you understand the problem better. Obvious antipatterns are: comparison to None with "!=" instead of "is" (in this case, better use "if el.text:" instead), index access instead of tuple unpacking into named variables, bare except clauses (I guess you want "except StopIteration" instead), etc. At least in this specific case, the "huge_tree" option isn't necessary and should thus be avoided for security reasons.
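To illustrate the cleanups suggested above, here is a minimal sketch rather than a rewrite of the reported script. The sample document and the helper names `next_event` and `collect_ids` are made up for illustration, and the fallback to `xml.etree.ElementTree` is only there so the sketch runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this report
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical sample input, not taken from AreYouReadyV3.xml.
SAMPLE = b"<book><sent id='dtb210'>Hello</sent><sent>World</sent></book>"


def next_event(events):
    """Return the next (event, element) pair, or None when the input is exhausted."""
    try:
        return next(events)
    except StopIteration:  # catch only the expected exception, not a bare except
        return None


def collect_ids(source):
    """Collect the id attribute of every element that carries one."""
    events = iter(etree.iterparse(source, events=("start", "end")))
    ids = []
    while True:
        pair = next_event(events)
        if pair is None:  # identity test instead of "!= None"
            break
        event, element = pair  # tuple unpacking instead of e[0] / e[1]
        if event == "start" and element.get("id") is not None:
            ids.append(element.get("id"))
    return ids


print(collect_ids(io.BytesIO(SAMPLE)))
```

Attributes are read on the "start" event here, which is fine; only the text content needs more care. Catching `StopIteration` alone also means that a real parse error is no longer silently reported as end of input.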
Note that you are reading the ".text" attribute while handling the "start" event in iterparse. This is documented to be problematic because the text content of the tag may not have been parsed yet, so you are relying on undefined behaviour.
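A hedged sketch of the safer pattern: use the "start" event only for the tag name and attributes, and read ".text" when the matching "end" event arrives. The sample document and the function name `sentence_texts` are assumptions made for illustration, and the stdlib fallback is only there so the example runs without lxml.

```python
import io

try:
    from lxml import etree  # the library discussed in this report
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical sample input, not taken from AreYouReadyV3.xml.
SAMPLE = (
    b"<book>"
    b"<sent id='dtb210'>Are you ready?</sent>"
    b"<sent id='dtb211'>Yes.</sent>"
    b"</book>"
)


def sentence_texts(source):
    """Map each element's id to its text, reading the text only on 'end'."""
    texts = {}
    for event, element in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            # Tag and attributes are available here, but the text may not
            # have been parsed yet, so it is deliberately not read.
            continue
        # On "end" the element's own content has been parsed.
        if element.text:
            texts[element.get("id")] = element.text
    return texts


print(sentence_texts(io.BytesIO(SAMPLE)))
```

A reader that must report text immediately after the start tag, as QXmlStreamReader does, would have to buffer the "start" event and emit the text later, once the "end" event has arrived.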
Also, IMHO, the mailing list seems like a better place to discuss this problem than a bug tracker.