mksquashfs hangs

Bug #722168 reported by Steve Newcomb
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
squashfs-tools (Ubuntu)
New
Undecided
Unassigned

Bug Description

Binary package hint: squashfs-tools

# lsb_release -rd
Description: Ubuntu 10.10
Release: 10.10

# apt-cache policy squashfs-tools
squashfs-tools:
  Installed: 1:4.0-8
  Candidate: 1:4.0-8
  Version table:
 *** 1:4.0-8 0
        500 http://us.archive.ubuntu.com/ubuntu/ maverick/main amd64 Packages
        100 /var/lib/dpkg/status

I attempted to make a series of squashfs filesystems. (Many times!). I'm using The Transparent Archivist (ftp://ftp.flaterco.com/ta/) to organize the making of the filesystems; this tool calls mksquashfs repeatedly as it divides a directory-to-be-archived into CD- or DVD-sized squashfs filesystems. I don't think Transparent Archivist is implicated in the problem, whatever it is. As far as mksquashfs is concerned, it is being invoked in the usual way as an independent subprocess.

Sometimes -- not very often, maybe once in 30-60 invocations -- mksquashfs simply hangs. Having made some of the filesystem it's supposed to be making, It stops making the filesystem for no apparent reason. It continues to use a wee bit of processing power after that -- very occasionally rising to the top of the "top" display -- but it makes no observable progress on the filesystem that it's supposed to be making. Once I left it running for 4 days; no progress, and no resumption of work occurred during all that time.

Reading about mksquashfs's earlier bugs led me to edit the Transparent Archivist's source code in such a way as to add '-no-sparse' to the invocation parameters of the call to mksquashfs. This change did not fix the problem; the occasional hanging continued just as before.

I also read Philip Lougher's remarks in Launchpad about problems in earlier Ubuntu distros with mixing kernel module versions and mksquashfs versions incorrectly, with respect to lzma support/defaulting. One fix for that earlier "bug" was to invoke mksquashfs with -no-lzma. The version of mksquashfs in my 10.10 distro does not recognize that argument, so there's no way to try that fix.

It's important that you understand that, while I can reliably reproduce the error, it often takes many invocations of mksquashfs to do so, and the number of invocations cannot be predicted, EVEN FOR EXACTLY THE SAME DATA. When mksquashfs hangs, and I abort it manually, I can restart mksquashfs with exactly the source data and invocation parameters, and so far it has always worked just fine when I do that.

You're about to ask me, "Is your hardware stable?". The answer is "Yes". If this is a hardware problem, it is one that appears ONLY during mksquashfs processing and at no other time. No other applications are affected by it, and no rebooting is necessary even when it does happen. Everything else works perfectly.

Realizing that mksquashfs uses all 8 threads of my i7-950 processor, I have checked its temperature during maximum mksquashfs-mediated loads, and it always stays below 55 C, and usually below 45C. And all other processes continue to work without a hitch. So I don't think this is a hardware problem.

Could "ionice -c 3" be implicated? It's something to think about, anyway. I need to use that in order to keep the computer useful for small interactive tasks during these giant, days-long archiving jobs. However, the hanging problem was occurring before I started using ionice.

Revision history for this message
Phillip Lougher (phillip-lougher) wrote :

-no-sparse isn't going to have any effect here, this was a workaround for some sparse file handling bugs that were fixed for Mksquashfs 3.4 (i.e. sometime before Mksquashfs 4.0 which you're using). LikewSise, the -no-lzma mess was due to a mismatch between Ubuntu's patched Mksquashfs (inherited from Debian) and Ubuntu's kernel Squashfs code which not being derived from Debian lacked their lzma patches. Again these are not relevant here because it is not a squashfs-tools/kernel code interoperability problem, and in any case that problem went away a couple of Ubuntu releases ago.

So what is the problem? From your description it sounds like a multi-threading synchronisation problem. One extremely rare synchronisation bug has come to light since Mksquashfs 4.0, plus some other bugs have come to light since Mksquashfs 4.0, which could possibly cause Mksquashfs to get sufficiently confused to hang. On the other hand this is the first Mksquashfs hang reported against Mksquashfs 4.0 in nearly two years since it's release... and on that basis the threading code seems to be very stable and almost bug free. You may of course be extremely unlucky and have a hardware/source filesystem combination that's triggered the bugs fixed in Mksquashfs 4.1, or an unknown bug.

I would suggest your first step is to download squashfs-tools 4.1 from squashfs.sourceforge.net, and see if the problem still occurs.

If the problem still occurs then your second step should be to raise a bug on the Squashfs bug tracker (squashfs.sourceforge.net).

Revision history for this message
Phillip Lougher (phillip-lougher) wrote :

"As far as mksquashfs is concerned, it is being invoked in the usual way as an independent subprocess."

What is the exact command line passed to Mksquashfs? (from looking at the Transparent Archivist tool's source code it doesn't seem to be passing any non-standard options).

"Sometimes -- not very often, maybe once in 30-60 invocations -- mksquashfs simply hangs."

Is this 30-60 invocations of Mksquashfs (by the Transparent Archivist tool) or 30-60 invocations of the Transparent Archivist tool (which is expected to invoke Mksquashfs itself multiple times)?

"Having made some of the filesystem it's supposed to be making, It stops making the filesystem for no apparent reason. It continues to use a wee bit of processing power after that -- very occasionally rising to the top of the "top" display -- but it makes no observable progress on the filesystem that it's supposed to be making. Once I left it running for 4 days; no progress, and no resumption of work occurred during all that time."

With difficult to track down bugs (as this will be if it is a threading issue), even the smallest amount of clues can be useful. The fact that Mksquashfs when hung occasionally rises to the top of the top display suggests it is doing something (even though it is making no progress). This in turns suggests the hang is a "livelock" issue rather than a "deadlock" issue (in which case Mksquashfs will consume no CPU). When Mksquashfs rises to the top of the top display how much CPU is it using? and how often does it rise to the top of the top display?

As you've proved to be comfortable with modifying the Transparent Archivist tool's source code to alter the Mksquashfs invocation parameters, it would to useful to add the following options

-no-progress -info

This addresses two issues:

First it gets rid of the progress bar (with the twirling progress indicator), which as this is handled by a separate thread could be completely responsible for Mksquashfs occasionally appearing in the top of top - this is unlikely unless your system is doing absolutely nothing else, but -no-progress eliminates that potential red herring.

Second it makes Mksquashfs display files as they are processed, in particular it will show the files processed immediately before Mksquashfs hangs. If over a number of hangs, the files processed before the hang are identical it will strongly suggest there is something about these files which are causing Mksquashfs to hang. This information will be extremely useful.

Revision history for this message
Steve Newcomb (srn-coolheads) wrote : Re: [Bug 722168] Re: mksquashfs hangs
Download full text (4.9 KiB)

Dear Mr. Lougher,

(1) Here's something else for you to think about; it may be relevant to
the solution of the mystery. Trying to make a workaround for the
problem, I wrote a Python tool (syscall.py) that I could call from the
Transparent Archivist to call mksquashfs with a timeout and an automatic
restart after timeout. It worked, but at first there was a problem:
when syscall.py killed mksquashfs (because time ran out) and then
restarted mksquashfs, mksquashfs complained that it couldn't find the
superblock, and it advised that I should use -noappend to simply
overwrite the new squashfs filesystem instead of attempting to append to
the one that was incompletely created in the previous (interrupted)
attempt. So of course I added -noappend to Transparent Archivist's
invocation of mksquashfs:

     case squashfs:
       {
         char *args[] = {
             "/usr/local/ch-tools/syscall.py",
             "-timeout", "3600",
             "-exec", "mksquashfs", before.aschar(),
tempimgfilename.aschar(), "-noappend",
             NULL
       };
     sys (args);

...and so far the hanging problem has not recurred. So maybe the
hanging problem has only to do with updating an already-existing
squashfs filesystem. Or maybe not; I've tried only once so far. I
don't yet have a statistically significant sample of trials using -noappend.

I should have mentioned before that Transparent Archivist's approach to
packing an archive onto the minimum number of CDs and DVDs involves
successive approximation, in which mksquashfs (or mkisofs, etc.) is
called repeatedly as the size of the portion of the archive is adjusted
up or down. In view of that, and assuming that mksquashfs runs reliably
either way (ignoring the fact that my Ubuntu 10.10 amd64 version
evidently doesn't run reliably when appending), shouldn't Transparent
Archivist just use -noappend anyway? Or is mksquashfs likely to run
faster, and to produce an equally-compact image, when appending to the
fs produced in the previous iteration, in such a
successive-approximation application? Your advice on this point would
be valuable to know, and I'll be glad to pass it on to Dave Flater, the
source of Transparent Archivist.

Thanks for your suggestion to try 4.1. I'll try that when I get to it.
One experimental variable at a time!

(2) I'm very impressed by your attention to detail, and by your overall
dedication to praxis. On behalf of people everywhere who don't like to
lose their data, please let me say, "THANK YOU FOR YOUR OUTSTANDING
CONTRIBUTION of squashfs !!!!" It is a very beautiful thing, really,
and much needed for all kinds of purposes.

(3) If it would be of any help to you in figuring out what's going on,
I'm prepared to contribute to you the use of the desktop of my archiving
machine. I see no reason not to trust you, personally, with such access
to my system and data. Your contributions to civilization are your
credentials, as far as I'm concerned. My archives are pretty demanding,
with a single DVD filesystem often containing hundreds of thousands of
small files, along with large files that are often around 500 Mb. If
you want to ta...

Read more...

Revision history for this message
Steve Newcomb (srn-coolheads) wrote :
Download full text (3.5 KiB)

Answers to your questions follow:

On 02/21/2011 12:08 AM, Phillip Lougher wrote:
> "As far as mksquashfs is concerned, it is being invoked in the usual way
> as an independent subprocess."
>
> What is the exact command line passed to Mksquashfs? (from looking at
> the Transparent Archivist tool's source code it doesn't seem to be
> passing any non-standard options).
Right. No options were being passed during these tests, just the source
directory and the target filepath. Then I tried -no-sparse which is, as
you noted, irrelevant. Then I tried -noappend, which may or may not
have solved the immediate problem.
> "Sometimes -- not very often, maybe once in 30-60 invocations --
> mksquashfs simply hangs."
>
> Is this 30-60 invocations of Mksquashfs (by the Transparent Archivist
> tool) or 30-60 invocations of the Transparent Archivist tool (which is
> expected to invoke Mksquashfs itself multiple times)?
Invocations of mksquashfs by TA. TA does successive approximations
until it's satisfied that it has put as much into a given squashfs
filesystem as will fit on a DVD or CD or whatever.
> "Having made some of the filesystem it's supposed to be making, It stops
> making the filesystem for no apparent reason. It continues to use a wee
> bit of processing power after that -- very occasionally rising to the
> top of the "top" display -- but it makes no observable progress on the
> filesystem that it's supposed to be making. Once I left it running for 4
> days; no progress, and no resumption of work occurred during all that
> time."
>
> With difficult to track down bugs (as this will be if it is a threading
> issue)

(ain't that the truth!)

> , even the smallest amount of clues can be useful. The fact that
> Mksquashfs when hung occasionally rises to the top of the top display
> suggests it is doing something (even though it is making no progress).
> This in turns suggests the hang is a "livelock" issue rather than a
> "deadlock" issue (in which case Mksquashfs will consume no CPU). When
> Mksquashfs rises to the top of the top display how much CPU is it using?
> and how often does it rise to the top of the top display?
Not much CPU, but enough to be noticed on the top display maybe once a
minute when the computer is otherwise idle. I don't know the % figure;
I watched it flash by a couple of times and wasn't impressed by it.
Maybe 5% or less for one top iteration, among many others in which it
was effectively 0%.
> As you've proved to be comfortable with modifying the Transparent
> Archivist tool's source code to alter the Mksquashfs invocation
> parameters, it would to useful to add the following options
>
> -no-progress -info
>
> This addresses two issues:
>
> First it gets rid of the progress bar (with the twirling progress
> indicator), which as this is handled by a separate thread could be
> completely responsible for Mksquashfs occasionally appearing in the top
> of top - this is unlikely unless your system is doing absolutely nothing
> else, but -no-progress eliminates that potential red herring.
Ah. Good idea. Thanks!
> Second it makes Mksquashfs display files as they are processed, in
> particular it will show the files p...

Read more...

Revision history for this message
Steve Newcomb (srn-coolheads) wrote :

Well, I'm trying a run with -info and -no-progress , but, gosh, -info really seems to slow the mksquashfs process down to a small fraction of its usual speed (I'm not sure why, or whether it's my fault somehow). At this rate, it could be weeks before I finish a single run, so don't hold your breath for a report.

Meanwhile, on another machine, I've just started running a similar but not-identical archiving job without -info being passed to mksquashfs.

Maybe, one of these days, I'll finish one of these archiving tasks, or one of them will break down. Either way, we'll know something. I have a hunch that -noappend will turn out to have done the trick. At some point I'll try mksquashfs 4.1, too.

Revision history for this message
Phillip Lougher (phillip-lougher) wrote :

-info makes Mksquashfs write to stdout for every file it squashes, and so yes, -info will probably slow things down if you've got lots of small files, and stdout is throttled in some way (i.e. mksquashfs waits on stdout). Where is the output of mksquashfs going? If it's going to the console then redirecting stdout to a file will probably speed things up. Unfortunately at the moment -info is the only way to discover what Mksquashfs was doing prior to it hanging [1].

The -noappend option changing things is interesting but distinctly perplexing. If the Transparent Archivist deleted the output Squashfs file before running Mksquashfs again then -noappend shouldn't make any difference! -noappend will only make a difference if Mksquashfs is being run with a pre-existing file (i.e. the Transparent Archivist hasn't deleted the previous output file). But if that is happening it is easy to see why things are going wrong, Squashfs will simply append to the file, making it grow and grow, which is probably the exact opposite behaviour expected. This isn't the first time that the default Mksquashfs behaviour of Mksquashfs to append to the output file if it exists has caused unexpected problems [2].

Until you discover otherwise, I'll assume -noappend fixes the problem....

[1] I have long wanted to add a "post mortem" dump facility to Mksquashfs, so in the case of a hung Mksquashfs, it will print out exactly what Mksquashfs was doing when ^C is hit. Maybe this year...

[2] This behaviour is entirely defensive and arguably the correct behaviour - the alternative behaviour is to have Mksquashfs always delete the output file if it exists (i.e. -noappend is default). Unfortunately such an default action leads to all sorts of user mistakes, where they specify the wrong file, and get their important archived data deleted ...

Revision history for this message
Steve Newcomb (srn-coolheads) wrote :

On 02/22/2011 11:08 PM, Phillip Lougher wrote:
> Where is the output of mksquashfs going?
To a pipe, and thence to a file. But I suspect that I'm flushing it
frequently, which may contribute greatly to the problem. I'll have to
look into that.
> The -noappend option changing things is interesting but distinctly
> perplexing. If the Transparent Archivist deleted the output Squashfs
> file before running Mksquashfs again then -noappend shouldn't make any
> difference!
Well, having taken an admittedly cursory glance at ta.cc, it appears to
me that TA does delete the squashfs file before generating the next
iteration of it (via the checkedUnlink() function). So, I agree with
you, it would be perplexing if -noappend fixed the problem. I still
don't know whether it does or not.

Alas, I have only one machine with enough disk space (available at a
reasonable speed -- USB 2.0 disks simply aren't fast enough for this) to
make these experiments. I set up another machine with 4.1, but its
disks are too congested -- with unsaved archives! -- to do the
experiment. "There's a hole in the bucket, dear Charlie, dear Charlie..."
> This behaviour is entirely defensive and arguably the correct
> behaviour - the alternative behaviour is to have Mksquashfs always
> delete the output file if it exists (i.e. -noappend is default).
> Unfortunately such an default action leads to all sorts of user
> mistakes, where they specify the wrong file, and get their important
> archived data deleted ...
>
I can't argue with your logic, here. It makes perfect sense to me.

Revision history for this message
Steve Newcomb (srn-coolheads) wrote :

I took Philip Lougher's advice to heart. I downloaded mksquashfs 4.1 source from Sourceforge, compiled it, and my hanging problem has not occurred since.

I therefore urge that the squashfs-tools package in Ubuntu be upgraded from 4.0 to 4.1. When that happens, this bug will be history.

Revision history for this message
Steve Newcomb (srn-coolheads) wrote :

There is still a halting problem in mksquashfs 4.1, but it's MUCH rarer than it was in 4.0. I'm guessing one hang in 300 invocations.

I've written a (horrible) way around the problem, which is to call mksquashfs indirectly via a Python program that kills mksquashfs and re-invokes mksquashfs, invisibly to the calling program (Transparent Archivist), if it doesn't finish within 1 hour. (See attached kluges.)

Phillip, I'd like to comply with your request to run -info each time, but that makes the thing so slow that I just don't have time to do it. The *real* problem with your code is that it works so well -- so well that I need to USE it.

If you ever give this thing a post-mortem facility, it would be good. I could kill it and save the post-mortem output, and then keep right on going.

Revision history for this message
Cody A.W. Somerville (cody-somerville) wrote :
Download full text (11.4 KiB)

I believe that I'm also running into this bug except in my case mksquashfs (from squashfs-tools 1:4.0-8) is being invoked by live-helper (now known as live-build). Occasionally, mksquashfs will hang at seemingly 100%. I've captured the following debug information:

  `-lh_build,16803 /usr/bin/lh_build
      `-lh_binary,7979 /usr/bin/lh_binary
          `-lh_binary_rootf,9495 /usr/bin/lh_binary_rootfs
              `-mksquashfs,9596 chroot filesystem.squashfs
                  |-{mksquashfs},9598
                  |-{mksquashfs},9599
                  |-{mksquashfs},9600
                  |-{mksquashfs},9601
                  |-{mksquashfs},9602
                  |-{mksquashfs},9603
                  |-{mksquashfs},9604
                  |-{mksquashfs},9605
                  |-{mksquashfs},9606
                  `-{mksquashfs},9607

strace of mksquashfs,9596:
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
sched_yield() = 0
....

strace of {mksquashfs},9598;
futex(0x92c6998, FUTEX_WAIT_PRIVATE, 5745, NULL

strace of {mksquashfs},9599:

strace of {mksquashfs},9600:
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73393, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73396, NULL

strace of {mksquashfs},9601:
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 852, NULL

strace of {mksquashfs},9602:
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73396, NULL

strace of {mksquashfs},9603:
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 851, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 852, NULL

strace of {mksquashfs},9604:
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73394, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73396, NULL

strace of {mksquashfs},9605:
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 849, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 852, NULL

strace of {mksquashfs},9606:
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73395, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92c6100, FUTEX_WAIT_PRIVATE, 73396, NULL

strace of {mksquashfs},9607:
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 850, NULL) = -1 EAGAIN (Resource temporarily unavailable)
futex(0x92cb368, FUTEX_WAIT_PRIVATE, 852, NULL

strace of mksquashfs,9596 with -f option (to follow children):
...
[pid 9596] sched_yield() = 0
[pid 9596] sched_yield() = 0
[pid 9596] --- SIGALRM (Alarm clock) @ 0 (0) ---
[pid 9596] sigreturn() = ? (mask now [])
[pid 9596] sched_yield() = 0
[pid 9596] sched_yield() = 0
...
pid 9596] sched_yield() = 0
[pid 9599] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 9599] gettimeofday({1302110049, 888547}, NULL) = 0
[pid 9599] futex(0x80e17a0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 9599] clock_gettime(CLOCK_REALTIME, {1302110049, 888637707}) = 0
[pid 9599] futex(0x8363de4, FUTEX_WAIT_PRIVATE, 4655307, {0, 249909293} <unfinished ...>
[pid 9598] <... futex resumed> ) = ? ERESTARTSYS (To be restarted)
[pid 9596] --- SIGALRM (Alarm clock) @ 0 (0) ---
[pid 9598] futex(0x92c6998, FUTEX_WAIT_PRIVATE, 5745, NULL <unfinished ...>
[pid 9596] sigreturn() = ? (mask now [])
[pid 9596] sched_yield() = 0
[pid 9596] sched_yield() = 0
...

(...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.