Processing time issues with bzip2 archives
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pbzip2 | New | Undecided | Unassigned |
Bug Description
Version used: 1.1.6 on Gentoo Linux, 64-bit, 8-core machine, 144 GB RAM.
pbunzip2 works great on archives that were compressed with pbzip2.
Archives that were compressed with plain bzip2, however, show a major problem: pbunzip2 burns a lot of CPU time on them without getting any faster.
Consider the bzip2-packed XML dump of the German Wikipedia, deu-20120116.
Unpacking the archive with bunzip2 takes, on my system:
real 11m55.422s
user 10m5.700s
sys 0m13.809s
Repacking the archive with pbzip2 and unpacking it with pbunzip2 takes:
real 2m50.423s
user 14m41.283s
sys 1m42.309s
So far so good (pbunzip2 takes around 650% CPU on the 8-core machine).
Even bunzip2 on the repacked archive is OK:
real 18m44.606s
user 10m10.242s
sys 0m13.589s
The real culprit is when pbunzip2 has to decompress a bzip2 packed archive:
real 18m32.003s
user 95m37.038s
sys 13m46.703s
In other words, it is as slow as bunzip2. That would be acceptable if there were technically no other way, but as you can see from the user time, it STILL hogs around 650% CPU to no avail. I wonder what the CPUs are doing in that case.
My hope now is either to convince the Wikipedia folks to pack their archives with pbzip2, or to convince you to fix this bug.
Both would be great, of course. ;-)
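For anyone who wants to reproduce the comparison, the CPU load can also be derived from the `time` output itself as (user + sys) / real. The following is only a minimal sketch; the helper names `parseDuration` and `coresBusy` are made up for illustration and are not part of pbzip2. It assumes durations in the `<minutes>m<seconds>s` form shown above.

```cpp
#include <stdexcept>
#include <string>

// Convert a `time`-style duration such as "2m50.423s" into seconds.
// Hypothetical helper; assumes the "<minutes>m<seconds>s" format only.
double parseDuration(const std::string& text) {
    const std::size_t m = text.find('m');
    if (m == std::string::npos || text.back() != 's') {
        throw std::invalid_argument("expected <minutes>m<seconds>s: " + text);
    }
    const double minutes = std::stod(text.substr(0, m));
    // Take everything between the 'm' and the trailing 's'.
    const double seconds = std::stod(text.substr(m + 1, text.size() - m - 2));
    return minutes * 60.0 + seconds;
}

// Average number of cores kept busy over the whole run: (user + sys) / real.
double coresBusy(const std::string& real,
                 const std::string& user,
                 const std::string& sys) {
    const double wall = parseDuration(real);
    if (wall <= 0.0) {
        throw std::invalid_argument("real time must be positive");
    }
    return (parseDuration(user) + parseDuration(sys)) / wall;
}
```

Called as `coresBusy("18m32.003s", "95m37.038s", "13m46.703s")` for the slow case, this gives the load averaged over the entire run, so it can come out somewhat lower than a peak value read off a process monitor.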
regards,
Richard
Looking at pbzip2.cpp of version 1.1.6-1 (precise), around line 4470 there's this bit:
    // start reading in data for decompression
    ret = producer_decompress(hInfile, InFileSize, fifo);
    if (ret == -99)
    {
        // only 1 block detected, use single threaded code to decompress
        noThreads = 1;
Looking through producer_decompress(), I can't see how it ever returns -99 when only one bzip2 stream has been found. Was there, in the past, some code early in producer_decompress() that checked for a second header and returned -99 if there wasn't one, before any data was loaded into RAM and added to the queue? I'm guessing that storing a large single-stream file in RAM for a single thread to read is far less efficient than using directdecompress().
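To make the question concrete, here is a rough sketch of the kind of early check I have in mind. It is not pbzip2's actual code: `isStreamStart`, `probeStreams` and `kSingleStream` are hypothetical names, and the -99 value is simply borrowed from the snippet above. It assumes that concatenated streams start on byte boundaries, so a second stream shows up as another "BZh" header (level digit 1-9) followed by the block magic 0x314159265359. The same ten bytes could in principle occur by chance inside compressed data, so this is a heuristic only.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sentinel, mirroring the -99 test in the quoted snippet.
constexpr int kSingleStream = -99;

// True if a bzip2 stream header ("BZh" + level '1'..'9') immediately
// followed by the block magic 0x314159265359 starts at data[pos].
bool isStreamStart(const std::string& data, std::size_t pos) {
    static const unsigned char blockMagic[6] = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    if (pos + 10 > data.size()) return false;
    if (data.compare(pos, 3, "BZh") != 0) return false;
    if (data[pos + 3] < '1' || data[pos + 3] > '9') return false;
    return std::memcmp(data.data() + pos + 4, blockMagic, 6) == 0;
}

// Probe a buffer (for example a prefix of the input file) before queuing
// anything for the worker threads.
// Returns 0 if the buffer does not start with a bzip2 stream,
// kSingleStream if no second stream header is found in the buffer,
// and 2 as soon as a second header is seen (meaning "at least two").
int probeStreams(const std::string& data) {
    if (!isStreamStart(data, 0)) return 0;
    for (std::size_t pos = 1; pos + 10 <= data.size(); ++pos) {
        if (isStreamStart(data, pos)) return 2;
    }
    return kSingleStream;
}
```

If a probe like this returned `kSingleStream` for the start of the file, the caller could fall back to the single-threaded path right away instead of buffering the whole stream first. How much of the file would need to be probed depends on the block size used when compressing, which this sketch does not try to guess.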