iterparse does not correctly parse a DTB content file

Bug #1249254 reported by Bogdan Cristea
This bug affects 1 person
Affects    Status     Importance    Assigned to    Milestone
lxml       Invalid    Undecided     Unassigned

Bug Description

I am trying to emulate the interface of QXmlStreamReader with iterparse() in order to parse the content file (AreYouReadyV3.xml) of the DTB found here (http://www.daisy.org/sample-content#t4). However, iterparse() seems to lose some content, for example the content of the element sent with attribute id="dtb210".

I am using lxml.etree v3.2.3 on 64-bit openSUSE 12.3. A similar parser that uses libxml2 directly through xmlTextReader does not produce this error. Below is the script I use:

from lxml import etree  # import needed by setFileName() below (Python 2 code)

class XmlStreamReader:
    __it = object()
    __name = ""
    __attrib= dict()
    __foundText = 0 #1 - text, 2 - tail
    __text = ""
    __tail = ""
    NODE_TYPE_UNKNOWN = int(-1)
    NODE_TYPE_ELEMENT = int(0)
    NODE_TYPE_TEXT = int(1)
    NODE_TYPE_END_ELEMENT = int(2)
    NODE_TYPE_END_FILE = int(3)
    def __toUtf8(self, val):
        if isinstance(val, unicode):
            return val.encode('utf-8')
        elif isinstance(val, str):
            return val
        else:
            return ""
    def setFileName(self, fn):
        self.__it = etree.iterparse(fn, events = ("start", "end"), remove_blank_text=True, huge_tree = True, load_dtd = True)
    def next(self):
        try:
            if 0 == self.__foundText:
                e = self.__it.next()
            else: #return the text or tail
                if 2 == self.__foundText:
                    self.__text = self.__tail
                self.__foundText = 0
                return self.NODE_TYPE_TEXT
        except:
            return self.NODE_TYPE_END_FILE
        if "start" == e[0]:
            #get name
            self.__name = e[1].tag.split('}')[-1]
            #get attributes
            self.__attrib = dict()
            if len(e[1].attrib):
                for k in e[1].keys():
                    attrValue = e[1].attrib[k]
                    self.__attrib[k.split('}')[-1]] = self.__toUtf8(attrValue)
            #get text
            if None != e[1].text:
                self.__text = self.__toUtf8(e[1].text)
                self.__foundText = 1
            else:
                self.__text = ""
            return self.NODE_TYPE_ELEMENT
        elif "end" == e[0]:
            #get name
            self.__name = e[1].tag.split('}')[-1]
            #check for tail
            val = e[1].tail
            if None != e[1].tail:
                self.__tail = self.__toUtf8(e[1].tail)
                self.__foundText = 2
            else:
                self.__tail = ""
            return self.NODE_TYPE_END_ELEMENT
        return self.NODE_TYPE_UNKNOWN
    def name(self):
        return self.__name
    def attributes(self):
        return self.__attrib
    def text(self):
        return self.__text
    #needed to be called from C++
    def attributesCount(self):
        return len(self.__attrib.keys())
    def attributeKey(self, index):
        return self.__attrib.keys()[index]
    def attributeValue(self, index):
        return self.__attrib[self.__attrib.keys()[index]]

Here are more details about my system:
Python : sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (3, 2, 3, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 0)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

Seems to work for me:

--------------
Python 2.7.5+ (default, Sep 19 2013, 13:48:49)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree as et

>>> t=et.parse('Are_you_ready_z3986/AreYouReadyV3.xml')
>>> t.xpath('//*[@id="dtb210"]')
[<Element {http://www.daisy.org/z3986/2005/dtbook/}sent at 0x27896e0>]

>>> it=et.iterparse('Are_you_ready_z3986/AreYouReadyV3.xml',
... remove_blank_text=True, huge_tree=True, load_dtd=True)
>>> all(it)
True
>>> it.root.xpath('.//*[@id="dtb210"]')
[<Element {http://www.daisy.org/z3986/2005/dtbook/}sent at 0x27896e0>]

>>> it=et.iterparse('Are_you_ready_z3986/AreYouReadyV3.xml',
... remove_blank_text=True, huge_tree=True, load_dtd=True)
>>> any(1 for ev,el in it if el.get('id') == 'dtb210')
True
--------------

Your code looks somewhat unwieldy and complicated, though. If you could clean it up a bit, it might help you understand the problem better. Obvious antipatterns are: comparison to None with "!=" instead of "is not" (in this case, better use "if el.text:" instead), index access instead of tuple unpacking into named variables, bare except clauses (I guess you want "except StopIteration" instead), etc. At least in this specific case, the "huge_tree" option isn't necessary and should thus be avoided for security reasons.
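
As a rough illustration of the cleanup suggested above, here is a minimal sketch of a pull-style reader. It is not the reporter's class: the name StreamReader, the read_next() method, the START/END/EOF constants and the inline sample document are all hypothetical. It assumes Python 3 and falls back to the standard library's ElementTree (which offers the same iterparse events) when lxml is not installed. It uses tuple unpacking, catches only StopIteration, compares against None with "is not", and reads text only on "end" events, where the element is fully parsed.

```python
import io

try:
    from lxml import etree  # the library discussed in this report
except ImportError:
    # Assumption: the stdlib API is close enough for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical node-type markers for the sketch.
START, END, EOF = "start", "end", "eof"


class StreamReader:
    """Simplified pull reader built on iterparse (illustrative only)."""

    def __init__(self, source):
        self._events = etree.iterparse(source, events=("start", "end"))
        self.name = ""
        self.attributes = {}
        self.text = ""

    def read_next(self):
        try:
            # Tuple unpacking instead of e[0] / e[1] indexing.
            event, elem = next(self._events)
        except StopIteration:
            # Only the end of input is swallowed; parse errors still propagate.
            return EOF
        # Strip the "{namespace}" prefix from the tag, as the original script does.
        self.name = elem.tag.split("}")[-1]
        if event == "start":
            # Attributes are already available on "start"; text may not be.
            self.attributes = {k.split("}")[-1]: v for k, v in elem.attrib.items()}
            self.text = ""
        else:
            # On "end" the element has been parsed completely.
            self.text = elem.text if elem.text is not None else ""
        return event


# Hypothetical stand-in for the DTB content file.
sample = io.BytesIO(b'<doc><sent id="dtb210">Are you ready?</sent></doc>')
reader = StreamReader(sample)
seen = []
while True:
    kind = reader.read_next()
    if kind == EOF:
        break
    seen.append((kind, reader.name, reader.text))
print(seen)
```

The trade-off in this sketch is that text is reported together with the closing tag rather than right after the opening tag, so a consumer that expects QXmlStreamReader's ordering would need to be adapted.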

Note that you are reading the ".text" attribute while handling the "start" event in iterparse. This is documented to be problematic because the text content of the tag may not have been parsed yet, so you are relying on undefined behaviour.
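
To make the point about event timing concrete, the following sketch collects the text of elements only when their "end" event arrives. The collect_text() helper and the inline sample document are hypothetical stand-ins (the real AreYouReadyV3.xml is not reproduced here), and the code assumes Python 3, falling back to the standard library's ElementTree when lxml is unavailable.

```python
import io

try:
    from lxml import etree  # the library discussed in this report
except ImportError:
    # Assumption: the stdlib iterparse behaves the same for this sketch.
    import xml.etree.ElementTree as etree

# Hypothetical sample with mixed content, standing in for the DTB file.
SAMPLE = b'<book><sent id="dtb210">Hello <em>big</em> world</sent></book>'


def collect_text(source):
    """Map each id attribute to the element's full text, read on "end" events."""
    texts = {}
    for event, elem in etree.iterparse(source, events=("end",)):
        elem_id = elem.get("id")
        if elem_id is not None:
            # itertext() walks the element's own text plus the text and
            # tails of its descendants, all of which precede the end tag.
            texts[elem_id] = "".join(elem.itertext())
    return texts


result = collect_text(io.BytesIO(SAMPLE))
print(result)  # {'dtb210': 'Hello big world'}
```

Listening only for "end" events sidesteps the undefined behaviour described above, at the cost of seeing each element after its children rather than before them.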

Also, IMHO, the mailing list seems like a better place to discuss this problem than a bug tracker.

Bogdan Cristea (cristeab) wrote :

All right, I should have posted on the discussion list first. However, the issue is that I cannot get the text of that element with iterparse; your tests don't check for this.

scoder (scoder) wrote :

Closing, works as documented.

Changed in lxml:
status: New → Invalid