read_csv on bzip2 file unzips only the first block
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pandas (Ubuntu) | Invalid | Undecided | Unassigned |
python2.7 (Ubuntu) | Won't Fix | Undecided | Unassigned |
Bug Description
It seems that read_csv() suffers from the same symptom as, e.g., the early Boost implementations; see https:/
How to test: create a large CSV file, much larger than 900 kB (the largest bzip2 block size). Compress it with pbzip2 (each process writes its block as a separate bz2 stream). Alternatively, create many such CSV files, bzip2 them individually, and then run cat *.bz2 >joined.bz2
read_csv() will decompress and read only the first block; the rest of the file is silently dropped.
Note that this is a severe bug, since parallel bzip2 is becoming increasingly common on multi-core systems.
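The symptom can also be reproduced without pandas or large files. The following is a minimal sketch for Python 3, assuming that two in-memory bz2 streams concatenated back to back behave like the `cat *.bz2 >joined.bz2` case above; the names `part_a`, `part_b` and `joined` are made up for illustration. A single `bz2.BZ2Decompressor` stops at the end of the first stream and leaves the remaining bytes untouched:

```python
import bz2

# Two small CSV fragments, each compressed as its own bz2 stream
# (hypothetical stand-ins for the individually compressed files).
part_a = b"id,value\n1,10\n2,20\n"
part_b = b"3,30\n4,40\n"

# Byte-wise concatenation, equivalent to `cat a.bz2 b.bz2 > joined.bz2`.
joined = bz2.compress(part_a) + bz2.compress(part_b)

# One decompressor object handles exactly one stream.
decomp = bz2.BZ2Decompressor()
first_only = decomp.decompress(joined)

# The second stream is not decoded; it is handed back as raw bytes.
leftover = decomp.unused_data

print(first_only == part_a)            # only the first fragment came out
print(bz2.decompress(leftover) == part_b)  # the rest is still compressed
```

Any reader that uses one decompressor and ignores `unused_data` will show the truncation described in this report.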
ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: python-pandas 0.17.1-3ubuntu2
ProcVersionSign
Uname: Linux 4.8.0-42-generic x86_64
ApportVersion: 2.20.3-0ubuntu8.2
Architecture: amd64
CurrentDesktop: XFCE
Date: Mon Apr 17 18:42:52 2017
InstallationDate: Installed on 2014-10-21 (909 days ago)
InstallationMedia: Ubuntu 14.04.1 LTS "Trusty Tahr" - Release amd64 (20140722.2)
PackageArchitec
SourcePackage: pandas
UpgradeStatus: Upgraded to yakkety on 2016-10-20 (179 days ago)
Sorry, in fact this seems to be a bug in the Python bz2 module and not a pandas issue itself...
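A possible workaround, as a sketch only and not something taken from pandas or from the bz2 module: decompress the file stream by stream before parsing it. The helper name `decompress_multistream` is hypothetical. It relies only on `bz2.BZ2Decompressor` and its `unused_data` attribute, and it assumes the whole compressed file fits in memory. Detection of a truncated final stream is deliberately left out.

```python
import bz2


def decompress_multistream(data):
    """Decompress all concatenated bz2 streams found in `data` (bytes).

    Hypothetical helper: starts a fresh decompressor for every stream
    and continues with whatever bytes the previous one left unused.
    """
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()      # one object per stream
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data           # start of the next stream, if any
    return b"".join(chunks)


if __name__ == "__main__":
    a = b"id,value\n1,10\n"
    b = b"2,20\n"
    joined = bz2.compress(a) + bz2.compress(b)
    print(decompress_multistream(joined) == a + b)
```

The returned bytes can be wrapped in `io.BytesIO` and passed to read_csv() in place of the file name. On Python 3.3 and later this should not be needed, since `bz2.decompress()` and `bz2.BZ2File` accept multi-stream input there.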