iterparse cannot parse gzip compressed files

Bug #1843193 reported by Funky Future
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

The parsing tutorial states that gzip-compressed files can be fed to the parser. However, iterparse fails on such a file:

# BEGIN console replay
u@h:/tmp/lxml-gz $ pip install lxml
Collecting lxml
  Using cached https://files.pythonhosted.org/packages/e7/a8/40115c84414c017e1a293f331709eb7534303d3ccd11ef805ac09b1481e7/lxml-4.4.1-cp37-cp37m-manylinux1_x86_64.whl
Installing collected packages: lxml
Successfully installed lxml-4.4.1
u@h:/tmp/lxml-gz $ echo "<root/>" > test.xml
u@h:/tmp/lxml-gz $ gzip test.xml
u@h:/tmp/lxml-gz $ file text.xml.gz
text.xml.gz: gzip compressed data, was "text.xml", last modified: Sun Sep 8 18:51:23 2019, from Unix, original size 8
u@h:/tmp/lxml-gz $ python
>>> for _, el in etree.iterparse("./test.xml.gz"):
...     print(el)
...
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    for _, el in etree.iterparse("./test.xml.gz"):
  File "src/lxml/iterparse.pxi", line 209, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 194, in lxml.etree.iterparse.__next__
  File "src/lxml/iterparse.pxi", line 229, in lxml.etree.iterparse._read_more_events
  File "src/lxml/parser.pxi", line 1364, in lxml.etree._FeedParser.feed
  File "src/lxml/parser.pxi", line 592, in lxml.etree._ParserContext._handleParseResult
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "./test.xml.gz", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
>>>
u@h:/tmp/lxml-gz $ gunzip test.xml.gz
u@h:/tmp/lxml-gz $ python
>>> from lxml import etree
>>> for _, el in etree.iterparse("./test.xml"):
...     print(el)
...
<Element root at 0x7f2c530996c8>
# END console replay

# BEGIN versions info
Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)
# END versions info

There's also this question on SO: https://stackoverflow.com/questions/12902756/lxml-cant-parse-gzipped-xml
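
A possible workaround, sketched here rather than confirmed as the intended fix: decompress the file in Python with the standard gzip module and hand the resulting file object to iterparse instead of the filename. The helper name iterparse_gzip is hypothetical, and the fallback to the standard library's ElementTree is only there so the sketch runs where lxml is not installed.

```python
import gzip

try:
    from lxml import etree
except ImportError:
    # Assumption for illustration only: fall back to the stdlib parser,
    # whose iterparse() also accepts a binary file object.
    import xml.etree.ElementTree as etree


def iterparse_gzip(path, events=("end",)):
    """Yield (event, element) pairs from a gzip-compressed XML file.

    Hypothetical helper: decompression is done by gzip.open(), so the
    parser only ever sees plain XML bytes.
    """
    with gzip.open(path, "rb") as stream:
        # The file stays open while the caller consumes the generator.
        yield from etree.iterparse(stream, events=events)
```

Called as `for _, el in iterparse_gzip("./test.xml.gz"): print(el)`, this should yield the root element without relying on the parser's own gzip handling.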

scoder (scoder)
summary: - gzip compressed files aren't parsed
+ iterparse cannot parse gzip compressed files