Processing time issues with bzip2 archives

Bug #922804 reported by PetaMem
This bug affects 1 person
Affects: pbzip2
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Version used: pbzip2 1.1.6 on Gentoo Linux, 64-bit, 8-core machine, 144 GB RAM

pbunzip2 works great on archives that were compressed with pbzip2.

Archives that were compressed with bzip2, however, show a major problem in CPU usage at runtime.
Consider the bzip2-compressed XML dump of the German Wikipedia, deu-20120116.xml.bz2: it is around 2.5 GB in size, and the unpacked data is around 10 GB.

Unpacking the archive with bunzip2 on my system takes:

real 11m55.422s
user 10m5.700s
sys 0m13.809s

Repacking the archive with pbzip2 and unpacking it with pbunzip2 takes:

real 2m50.423s
user 14m41.283s
sys 1m42.309s

So far so good (pbunzip2 uses around 650% CPU on the 8-core machine).
Even bunzip2 on the repacked archive is OK:

real 18m44.606s
user 10m10.242s
sys 0m13.589s

The real problem appears when pbunzip2 has to decompress a bzip2-compressed archive:

real 18m32.003s
user 95m37.038s
sys 13m46.703s

In other words, it is as slow as bunzip2. That would be OK if there were technically no other way, but as the user time shows, it STILL consumes around 650% CPU to no avail. I wonder what the CPUs are doing in that case.
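
As a rough cross-check of the CPU figure, the average utilization over a whole run can be derived from the `time` output as (user + sys) / real. The sketch below only does that arithmetic. `parseTime` and `cpuUtilization` are hypothetical helper names introduced for illustration, and the result is an average over the run, so it need not match a momentary reading from a tool such as top.

```cpp
#include <string>

// Hypothetical helper: convert a `time`-style value such as "18m32.003s" to seconds.
// Assumes the "<minutes>m<seconds>s" layout used in the timings in this report.
double parseTime(const std::string &text) {
    const std::string::size_type m = text.find('m');
    if (m == std::string::npos) {
        return std::stod(text);  // no minutes part, e.g. "12.5s"
    }
    const double minutes = std::stod(text.substr(0, m));
    // std::stod stops parsing at the trailing 's'.
    const double seconds = std::stod(text.substr(m + 1));
    return minutes * 60.0 + seconds;
}

// Average number of CPUs kept busy during the run: (user + sys) / real.
double cpuUtilization(const std::string &real,
                      const std::string &user,
                      const std::string &sys) {
    return (parseTime(user) + parseTime(sys)) / parseTime(real);
}
```

For the slow case, `cpuUtilization("18m32.003s", "95m37.038s", "13m46.703s")` comes out at roughly 5.9, that is, close to six cores busy on average while the wall-clock time is no better than single-threaded bunzip2.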

My hope is now either to convince the Wikipedia people to pack their archives with pbzip2, or to convince you to fix this bug.
Both would be great, of course. ;-)

regards,
 Richard

Andrew McCarthy (andrewmccarthy) wrote :

Looking at pbzip2.cpp of version 1.1.6-1 (precise), around line 4470 there's this bit:

4470
4471 // start reading in data for decompression
4472 ret = producer_decompress(hInfile, InFileSize, fifo);
4473 if (ret == -99)
4474 {
4475 // only 1 block detected, use single threaded code to decompress
4476 noThreads = 1;

Looking through producer_decompress(), I can't see how it ever returns -99 when only one bzip2 stream has been found. Was there, in the past, some code early in producer_decompress() that checked for a second header and returned -99 if there wasn't one, before any data was loaded into RAM and added to the queue? I'm guessing that storing a large single-stream file in RAM for a single thread to read is far less efficient than using directdecompress().
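
To illustrate the kind of early check being asked about, here is a minimal sketch that counts byte-aligned candidates for a bzip2 stream start in a buffer. It relies only on the bzip2 format itself: a stream begins with "BZh" and a level digit '1' to '9', and the first block starts with the magic 0x314159265359. The function name `countStreamHeaders` is hypothetical and this is not pbzip2's code. It also has known limits: the same ten bytes can occur by chance inside compressed data, and a stream produced from empty input has no block magic after its header.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Count byte-aligned candidates for a bzip2 stream start:
// "BZh", a level digit '1'..'9', then the 6-byte block magic 0x314159265359.
// Sketch only: compressed data may contain the same bytes by chance.
std::size_t countStreamHeaders(const std::vector<unsigned char> &data) {
    static const unsigned char blockMagic[6] = {0x31, 0x41, 0x59, 0x26, 0x53, 0x59};
    const std::size_t headerSize = 10;  // 4 header bytes + 6 magic bytes
    std::size_t count = 0;
    for (std::size_t i = 0; i + headerSize <= data.size(); ++i) {
        const bool signature =
            data[i] == 'B' && data[i + 1] == 'Z' && data[i + 2] == 'h';
        const bool level = data[i + 3] >= '1' && data[i + 3] <= '9';
        if (signature && level && std::memcmp(&data[i + 4], blockMagic, 6) == 0) {
            ++count;
        }
    }
    return count;
}
```

A result of 1 for the whole input would correspond to the single-stream case, where handing the file to one decoder directly is the cheaper option.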

Yavor Nikolov (yavor-nikolov) wrote :

Can I have a link to a file for which I can test this?

Indeed, there used to be such a check returning -99. It has since been removed, so we obviously have an unreachable section of code there (good point).
But even with multi-threaded processing, it shouldn't hog much CPU when only a single-stream file is decompressed. The file is also supposed to be processed in smaller portions, not loaded whole into RAM.

* I would recommend using 1.1.7 (available here: https://launchpad.net/pbzip2). It won't fix the observed issue, but it patches some other issues.

Andrew McCarthy (andrewmccarthy) wrote :

Hi,

The best guess I had with the CPU hogging is that the "spare" threads aren't sleeping properly in consumer_decompress() at the safe_cond_wait() (pbzip2.cpp v.1.1.7, line 1666), but I'm not seeing anything obviously wrong with the mutex handling or anything.

producer_decompress() just takes the output of bz2StreamScanner.getNextStream() and adds it to the queue. For a single-stream file, this is the entire file as I understand it. Do you think it's possible to split a single stream into smaller chunks and share it between threads?

Cheers,

Andrew

Yavor Nikolov (yavor-nikolov) wrote :

Thanks for your comments Andrew,
I found a wiki dump (dewiki-20120116-pages-articles.xml.bz2, 2.4 GB compressed) and tested pbzip2 decompression on it: just one of the CPUs was heavily loaded, the others being pretty idle. Real and user time were pretty close.

bz2StreamScanner.getNextStream() chunks the data into pieces if the stream is big. There is a size limit, something like 1 MB; if the stream is larger, it is split into many pieces, each with a sequence number attached, and the last one has isLastInSequence=true.
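
The chunking described above can be sketched roughly as follows. This is an illustration, not the actual bz2StreamScanner code: `StreamPiece` and `splitStream` are hypothetical names, the numbering starting at 1 is an assumption, and the piece limit is passed in as a parameter rather than taken from pbzip2.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical descriptor for one piece of a stream; not pbzip2's actual type.
struct StreamPiece {
    std::size_t offset;     // byte offset of the piece within the stream
    std::size_t length;     // number of bytes in the piece
    int sequenceNumber;     // 1-based position within the stream (assumed)
    bool isLastInSequence;  // true only for the final piece
};

// Cut a stream of streamSize bytes into pieces of at most maxPieceSize bytes.
// Returns an empty vector for an empty stream or a zero piece size.
std::vector<StreamPiece> splitStream(std::size_t streamSize, std::size_t maxPieceSize) {
    std::vector<StreamPiece> pieces;
    if (maxPieceSize == 0) {
        return pieces;
    }
    std::size_t offset = 0;
    int sequenceNumber = 1;
    while (offset < streamSize) {
        StreamPiece piece;
        piece.offset = offset;
        piece.length = std::min(maxPieceSize, streamSize - offset);
        piece.sequenceNumber = sequenceNumber++;
        piece.isLastInSequence = (offset + piece.length == streamSize);
        pieces.push_back(piece);
        offset += piece.length;
    }
    return pieces;
}
```

Pieces cut at byte offsets like this are not independently decodable, so they would still have to be fed to a single decoder in sequence order. That is consistent with only one CPU being busy for a single-stream file.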

Sharing work between threads when decompressing a single stream is implemented in another utility, lbzip2 (which has more complicated logic for splitting streams).
I also think lbzip2 is able to compress in parallel while producing a single stream.

Andrew McCarthy (andrewmccarthy) wrote :

I can't reproduce the problem under Ubuntu 12.04 LTS beta. I generated a 100MB random file, and compressed it with bzip2 to produce testfile-bzip2.bz2, and again with pbzip2 1.1.7 to produce testfile-pbzip2.bz2. Results at the end.

After searching around I found this thread, http://lists.denx.de/pipermail/eldk/2009-September/000982.html, which suggests that pthread_cond_wait() can cause 100% CPU usage if a program isn't compiled with the right pthread library. Since the original bug report mentioned Gentoo, I'd be curious to know how the original package was built.
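
For reference, this is the usual shape of a consumer that is meant to sleep rather than spin: it re-checks its predicate in a loop around pthread_cond_wait(), which releases the mutex and blocks until signalled. The sketch is a generic pattern under stated assumptions, not pbzip2's consumer_decompress(); `WorkQueue` and its members are hypothetical names. If pthread_cond_wait() returned immediately every time, as the linked thread suggests can happen with a mismatched pthread library, the same loop would degrade into a busy loop that burns CPU.

```cpp
#include <pthread.h>
#include <deque>

// Hypothetical work queue for illustration; not pbzip2's actual data structure.
class WorkQueue {
public:
    WorkQueue() : closed_(false) {
        pthread_mutex_init(&mutex_, nullptr);
        pthread_cond_init(&notEmpty_, nullptr);
    }
    ~WorkQueue() {
        pthread_cond_destroy(&notEmpty_);
        pthread_mutex_destroy(&mutex_);
    }
    WorkQueue(const WorkQueue &) = delete;
    WorkQueue &operator=(const WorkQueue &) = delete;

    // Producer side: add one item and wake one waiting consumer.
    void push(int item) {
        pthread_mutex_lock(&mutex_);
        items_.push_back(item);
        pthread_cond_signal(&notEmpty_);
        pthread_mutex_unlock(&mutex_);
    }

    // Producer side: no more items will arrive; wake all waiting consumers.
    void close() {
        pthread_mutex_lock(&mutex_);
        closed_ = true;
        pthread_cond_broadcast(&notEmpty_);
        pthread_mutex_unlock(&mutex_);
    }

    // Consumer side: blocks while the queue is empty and still open.
    // Returns false once the queue is closed and drained.
    bool pop(int &item) {
        pthread_mutex_lock(&mutex_);
        while (items_.empty() && !closed_) {
            // Releases mutex_ while blocked; re-check the predicate on wakeup.
            pthread_cond_wait(&notEmpty_, &mutex_);
        }
        const bool haveItem = !items_.empty();
        if (haveItem) {
            item = items_.front();
            items_.pop_front();
        }
        pthread_mutex_unlock(&mutex_);
        return haveItem;
    }

private:
    pthread_mutex_t mutex_;
    pthread_cond_t notEmpty_;
    std::deque<int> items_;
    bool closed_;
};
```

Spare consumer threads calling `pop()` on an empty, open queue should show up as sleeping with near-zero CPU; seeing them at 100% would point at the wait itself not blocking.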

$ time ./pbzip2 -d -c > /dev/null testfile-pbzip2.bz2
real 0m6.478s
user 0m24.982s
sys 0m0.412s

$ time bzip2 -d -c > /dev/null testfile-pbzip2.bz2
real 0m15.225s
user 0m15.149s
sys 0m0.040s

$ time ./pbzip2 -d -c > /dev/null testfile-bzip2.bz2
real 0m15.079s
user 0m15.197s
sys 0m0.072s

$ time bzip2 -d -c > /dev/null testfile-bzip2.bz2
real 0m15.167s
user 0m15.097s
sys 0m0.032s

PetaMem (info-petamem) wrote :

> Since the original bug report mentioned gentoo, I'd be curious to know how the original package was built.

I can help ;-)

# ldd /usr/bin/pbzip2
        linux-vdso.so.1 => (0x00007fff89b16000)
        libbz2.so.1 => /lib64/libbz2.so.1 (0x00007f98afa90000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f98af873000)
        libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/libstdc++.so.6 (0x00007f98af573000)
        libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.5.3/libgcc_s.so.1 (0x00007f98af35d000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f98aefce000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f98afca1000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f98aed4b000)

# ll /lib64/libpthread.so.0
lrwxrwxrwx 1 root root 20 Jan 31 20:33 /lib64/libpthread.so.0 -> libpthread-2.14.1.so

One certainly cannot speak of a missing pthreads library. As for the version:
it comes from glibc 2.14.1, which should not be that obsolete.

Richard
