read_csv on bzip2 file unzips only the first block

Bug #1683428 reported by Darko Veberic
This bug affects 2 people
Affects             Status     Importance  Assigned to  Milestone
pandas (Ubuntu)     Invalid    Undecided   Unassigned
python2.7 (Ubuntu)  Won't Fix  Undecided   Unassigned

Bug Description

It seems that read_csv() suffers from the same problem as, e.g., early Boost implementations; see https://svn.boost.org/trac/boost/ticket/3853 for details. A bz2 file can be composed of many concatenated bz2 streams (blocks), which have to be treated as one continuous stream.

How to test: create a large CSV file, much larger than 900k. Compress it with pbzip2 (which compresses chunks in parallel and writes each one as its own bz2 stream). Alternatively, create many such CSV files, bzip2 them individually, and then run cat *.bz2 >joined.bz2

read_csv() will uncompress and read only the first block.
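
The same layout can be reproduced in memory without pbzip2. The following is only a minimal sketch using the standard-library bz2 module; the sample data and variable names are made up for illustration, and concatenating two bz2.compress() results is assumed to be equivalent to the cat recipe above. It shows a single decompressor object stopping at the end of the first stream and leaving the rest untouched.

```python
import bz2

# Two small CSV fragments, compressed separately and concatenated,
# mimicking `cat a.csv.bz2 b.csv.bz2 > joined.bz2`.
part_a = b"x,y\n1,2\n"
part_b = b"3,4\n5,6\n"
joined = bz2.compress(part_a) + bz2.compress(part_b)

# One BZ2Decompressor object decodes exactly one stream. Bytes that
# follow the end of that stream are not decoded; they are kept,
# still compressed, in `unused_data`.
decomp = bz2.BZ2Decompressor()
first_only = decomp.decompress(joined)
leftover = decomp.unused_data

print(first_only)     # only the first fragment
print(len(leftover))  # size of the still-compressed second stream
```

Any reader that stops here and ignores the leftover bytes silently truncates the file, which matches the symptom described in this report.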

Note that this is a severe bug, since parallel bzip2 is becoming increasingly common on multi-core systems.

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: python-pandas 0.17.1-3ubuntu2
ProcVersionSignature: Ubuntu 4.8.0-42.45-generic 4.8.17
Uname: Linux 4.8.0-42-generic x86_64
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
CurrentDesktop: XFCE
Date: Mon Apr 17 18:42:52 2017
InstallationDate: Installed on 2014-10-21 (909 days ago)
InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
PackageArchitecture: all
SourcePackage: pandas
UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)

Darko Veberic (darko-veberic-kit) wrote :
summary: - read_csv on bzip2 file unzips only the first bucket
+ read_csv on bzip2 file unzips only the first block
Darko Veberic (darko-veberic-kit) wrote :

Sorry, in fact this seems to be a bug in the Python bz2 module rather than in pandas itself...

Darko Veberic (darko-veberic-kit) wrote :

Furthermore, according to https://bugs.python.org/issue20781 this is, in their opinion, "not a bug", i.e. won't fix. Unfortunately, the bz2 container clearly allows for multiple concatenated streams (blocks), and therefore IMHO this is a bug: a validly formatted bz2 file is not read correctly and is truncated after the first block.
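
Until the module handles this itself, one possible workaround is to decode the streams one at a time. This is a hedged sketch, not an official recipe: decompress_all_streams is a hypothetical helper name, it holds the whole file in memory, and it relies only on bz2.BZ2Decompressor and its unused_data attribute.

```python
import bz2


def decompress_all_streams(data):
    """Decompress every concatenated bz2 stream found in `data` (bytes).

    Hypothetical helper: a fresh decompressor is created for each stream,
    and the bytes the previous one left over are fed to the next.
    """
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        # Whatever follows the stream just decoded (empty at the end).
        data = decomp.unused_data
    return b"".join(chunks)


# Usage with made-up sample data: two streams joined as with `cat`.
sample = bz2.compress(b"x,y\n1,2\n") + bz2.compress(b"3,4\n")
restored = decompress_all_streams(sample)
```

The restored bytes can then be wrapped in a file-like object (for example io.BytesIO) and passed on to a CSV reader. Trailing bytes that are not a valid bz2 stream will make the decompressor raise an error rather than be skipped.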

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in pandas (Ubuntu):
status: New → Confirmed
Changed in python2.7 (Ubuntu):
status: New → Confirmed
Matthias Klose (doko) wrote :

won't diverge from upstream

Changed in python2.7 (Ubuntu):
status: Confirmed → Won't Fix
Rebecca Palmer (rebecca-palmer) wrote :

('invalid' = 'not our (pandas') bug')

As noted in the linked upstream bug, this only affects Python 2 (python-pandas), not current Python 3 (python3-pandas).
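
For comparison, a minimal sketch of the Python 3 behaviour, assuming Python 3.3 or later (the release in which the bz2 documentation says multi-stream support was added). The sample data and variable names are made up for illustration, and the standard-library csv module stands in for read_csv().

```python
import bz2
import csv
import io

# Two separately compressed CSV fragments, joined as with `cat`.
part_a = b"x,y\n1,2\n"
part_b = b"3,4\n"
joined = bz2.compress(part_a) + bz2.compress(part_b)

# bz2.open accepts an already-open binary file object as well as a path.
# In text mode it decodes the bytes so that csv.reader can consume them.
with bz2.open(io.BytesIO(joined), mode="rt", newline="") as handle:
    rows = list(csv.reader(handle))

print(rows)
```

Here the rows from both streams come back as one continuous table.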

Changed in pandas (Ubuntu):
status: Confirmed → Invalid