Python crashes when setting elem.text during etree.iterparse

Bug #1743420 reported by danny0838 on 2018-01-15
This bug affects 1 person
Affects: lxml
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am running an XML parser that sets elem.text during an iterparse, and this seems to crash Python every time it parses certain files (see the attached file for an illustration).

The crash can be reproduced on at least 2 computers running Windows 7 SP1.

Removing the assignment to elem.text (lines 17-18 of script.py in the attached file) seems to stop the crash.

--
script.py (same as the attached file)
--
#!/usr/bin/env python3
import os
import platform
import traceback
import re
import lxml.etree as etree

def main():
    fsrc = 'data.xml'

    for event, elem in etree.iterparse(fsrc, events=('start', 'end')):
        print(event, elem.tag, elem.attrib, elem.text, elem.tail)
        if event == 'start':
            tag_name = elem.tag
            if re.search(r'^_.*_$', tag_name):
                tag_name = tag_name[1:-1]
                if elem.text is not None:
                    elem.text = re.sub(r'^\n', r'', elem.text)

        elif event == 'end':
            tag_name = elem.tag
            if re.search(r'^_.*_$', tag_name):
                tag_name = tag_name[1:-1]
                if elem.tail is not None:
                    elem.tail = re.sub(r'^\n', r'', elem.tail)

            elem.clear()

if __name__ == "__main__":
    if platform.system() == 'Windows' and not 'PROMPT' in os.environ:
        try:
            main()
        except Exception:
            traceback.print_exc()
        os.system('pause')
    else:
        main()

--
Python : sys.version_info(major=3, minor=6, micro=4, releaselevel='final', serial=0)
lxml.etree : (4, 1, 1, 0)
libxml used : (2, 9, 5)
libxml compiled : (2, 9, 5)
libxslt used : (1, 1, 30)
libxslt compiled : (1, 1, 30)

danny0838 (danny0838) wrote :
description: updated
danny0838 (danny0838) on 2018-01-15
description: updated
summary: - Python crashes when setting elem.text or elem.tail during
- etree.iterparse
+ Python crashes when setting elem.text during etree.iterparse
scoder (scoder) wrote :

I agree that it shouldn't crash, but this is difficult to prevent and your usage example is explicitly forbidden in the docs.

http://lxml.de/parsing.html#modifying-the-tree
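
For illustration, here is a minimal sketch of one way to stay within what the linked documentation allows: listen only for 'end' events and edit elem.text there, when the element's own text has been fully parsed. The sample input and the function name strip_leading_newlines are hypothetical (the original data.xml is not shown in this report), the ElementTree fallback is only there so the sketch runs without lxml installed, and the elem.clear() call from the original script is left out for brevity. This is an assumption about a workable pattern, not a confirmed fix for the crash.

```python
import io
import re

try:
    from lxml import etree  # the library discussed in this report
except ImportError:
    # Fallback so the sketch still runs where lxml is not installed.
    import xml.etree.ElementTree as etree

# Hypothetical sample input; the reporter's data.xml is not available here.
SAMPLE = b"<root><_item_>\nhello</_item_><other>\nkeep</other></root>"


def strip_leading_newlines(source):
    """Return (tag, text) pairs, editing elem.text only on 'end' events."""
    results = []
    for event, elem in etree.iterparse(source, events=("end",)):
        # At 'end', the element's own text and descendants are complete,
        # so changing elem.text here does not race with the parser.
        if re.search(r"^_.*_$", elem.tag) and elem.text is not None:
            elem.text = re.sub(r"^\n", "", elem.text)
        results.append((elem.tag, elem.text))
    return results


if __name__ == "__main__":
    for tag, text in strip_leading_newlines(io.BytesIO(SAMPLE)):
        print(tag, repr(text))
```

The tail handling from the original script is deliberately not reproduced: an element's tail follows its closing tag, so it may not be available yet when that element's own 'end' event fires.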

Changed in lxml:
status: New → Won't Fix