Bug #571707 “fsck progress stalls at boot, plymouthd/mountall ea...” : Bugs : plymouth package : Ubuntu

mikbini (mikbini) wrote on 2010-04-29:

#1

Dependencies.txt (1.5 KiB, text/plain; charset="utf-8")

Ernst (ernst-blaauw) wrote on 2010-04-29:

#2

As four people are affected, I set the status to confirmed.

I experience this behavior on 64 bit. The 'problematic' area starts at 74%. I'm running a fully up to date Lucid.

Changed in mountall (Ubuntu):
status:	New → Confirmed

mikbini (mikbini) on 2010-04-29

description:

updated

D J Eddyshaw (david-eddyshaw) wrote on 2010-04-29:

#3

I get this on a Dell Mini 9; the slowdown starts at 70% and gets even worse at 90%, to the point that I eventually just shut down the machine.

The odd behaviour from 70% on has been present ever since I installed the beta. Initially there was a complete hang at 70%; this was "fixed" (#554737) inasmuch as the machine no longer locked up, but there was clearly something still wrong; fsck got to 70% and stopped and then the login screen came up.

Over the past couple of days something has changed; fsck proceeds beyond 70% but impossibly slowly.

Fabio Marzocca (thesaltydog) wrote on 2010-04-29:

#4

I have been waiting 7 minutes now for fsck to finish while booting my other PC... still at 95%, very, very slow.

Fabio Marzocca (thesaltydog) wrote on 2010-04-29:

#5

This is not happening only on ext4: on ext3 it is even worse.

Scott James Remnant (Canonical) (canonical-scott) wrote on 2010-04-29:

#6

Thanks for the report, guys.

It _sounds_ like this could be a plymouth issue, but I'd like to collect some more information to be sure.

Could you install the "bootchart" package, reboot (with fsck forced, I guess), and attach the resulting image from /var/log/bootchart?

Thanks

Changed in mountall (Ubuntu):
importance:	Undecided → High

D J Eddyshaw (david-eddyshaw) wrote on 2010-04-29:

#7

mini-lucid-20100429-1.tgz (8.1 MiB, application/x-tar)

Bootchart logs

Barry Drake (b-drake) wrote on 2010-04-30:

#8

Dell Inspiron Mini 10v with an 8 GB SSD, running Lucid 10.04 with the latest updates.
I have exactly the same fault. As with the other reports, three or so updates back, the boot process froze completely. Taking out quiet splash showed that it froze after fsck had completed. The last two updates have stopped it freezing, but fsck now takes in excess of 20 mins to complete on this netbook.

My suggestion for now would be to kill plymouth when /forcefsck is detected. I don't think this would be a popular suggestion, but for me, it would be nice!
Barry.

Barry Drake (b-drake) wrote on 2010-04-30:

#9

Forgot to say: I'm running an ext2 partition!!!

Martin Erik Werner (arand) wrote on 2010-04-30:

#10

I'm sorry if this ends up as a hijack, but I'm assuming this is the same issue...

This seems to be a very common thing.
I would suspect plymouth for the problem, since if you do jump out to a TTY during this then the boot seems to complete nicely.

In fact, if you jump to a tty and then jump back to plymouth, it's noticeable that it has managed to get much further in the percentage than it would have if you had just stayed on the plymouth screen...

Each time I jump to tty the normal "fsck...clean...non-contiguous #%" message is repeated (one extra each jump) which presumably indicates that fsck has finished, and plymouth (or something else) is messing about with other things...

letstrynl (letstry-deactivatedaccount) wrote on 2010-04-30:

#11

bootlogs + pics (1.2 MiB, application/x-tar)

srv-1-lucid-20100430-1.tgz and srv-1-lucid-20100430-1.png
'quiet splash' set in grub kernel commandline
takes 6:30 to complete

srv-1-lucid-20100430-2.tgz and srv-1-lucid-20100430-2.png
*NO* 'quiet splash' set in grub kernel commandline
takes 00:26 to complete
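
For scale, the two boot times above (6:30 with 'quiet splash', 00:26 without) can be compared with a little shell arithmetic. This is only a sketch; the helper name to_seconds is made up for illustration.

```shell
# Hypothetical helper: convert a M:SS duration into seconds.
to_seconds() {
    # awk treats "00" and "26" as plain decimal numbers
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

with_splash=$(to_seconds 6:30)      # boot with 'quiet splash'
without_splash=$(to_seconds 00:26)  # boot without 'quiet splash'

# Integer ratio of the two boot times
echo $(( with_splash / without_splash ))   # prints 15
```

In other words, the boot with the splash screen took about 15 times as long as the one without it.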

Martin Erik Werner (arand) wrote on 2010-04-30:

#12

Also notable is that if I boot with quiet and splash disabled, everything is fine and I'm up in less than a minute.

One thing worth noticing is that no fsck progress is shown when booting without the splash.

All my testing done on a virtualbox instance of Lucid

Barry Drake (b-drake) wrote on 2010-04-30:

#13

netbook-lucid-20100430-2.png (297.8 KiB, image/png)

Here is the picture from bootchart. Again, boot process completed after some 20mins or so. Bootchart also produces a compressed archive containing some logs - do you need this as well?

Barry Drake (b-drake) wrote on 2010-04-30:

#14

Just to confirm - with quiet splash removed from grub, my netbook boots in seconds rather than minutes when fsck is forced.

Anders Kaseorg (andersk) wrote on 2010-04-30:

#15

balanced-tree-lucid-20100430-1.png (1.2 MiB, image/png)

I saw this too, and can reproduce with touch /forcefsck; reboot. Here’s a bootchart; it shows mountall spinning at 100% CPU for about 15s after the first fsck finishes, and again for over 200s after the last fsck finishes. I wonder what it’s doing with all that CPU…

Anders Kaseorg (andersk) wrote on 2010-04-30:

#16

balanced-tree-lucid-20100430-1.tgz (3.8 MiB, application/x-tar)

Martin Erik Werner (arand) on 2010-04-30

description:

updated

Anders Kaseorg (andersk) wrote on 2010-05-01:

#17

balanced-tree-lucid-20100430-3.png (791.0 KiB, image/png)

I think removing ‘quiet splash’ is a red herring. I can reproduce the problem by creating /forcefsck, whether or not ‘quiet splash’ is in the boot flags. Here is a bootchart without ‘quiet splash’ that demonstrates the same problem (mountall spins at 100% CPU for 200 seconds after all the fscks are complete).

description:

updated

Martin Erik Werner (arand) on 2010-05-01

description:

updated

Martin Erik Werner (arand) wrote on 2010-05-01:

#18

@Anders Kaseorg:
Sorry, I was in the process of updating the bug description and inadvertently overwrote your changes.

I am however most definitely able to work around the issue removing quiet and splash...

Maybe we're even bunching two or more separate bugs here...

Martin Erik Werner (arand) wrote on 2010-05-01:

#19

Yup, just tested now, and disabling quiet and splash makes this virtualbox able to boot with no problem...

I'm trying to figure out the arrow-out workaround now... it seems to be very fickle.

description:

updated

Martin Erik Werner (arand) on 2010-05-01

description:	updated
description:	updated

Anders Kaseorg (andersk) wrote on 2010-05-01:

#20

mountall-backtrace (3.4 KiB, text/plain)

Here’s a backtrace from mountall while it is spinning. It starts with
#0 0x00007ff69c9fc6e3 in ply_list_find_node (list=0x7ff69e0af860,
data=0x7ff69faf6460) at ply-list.c:105

Martin Erik Werner (arand) wrote on 2010-05-01:

#21

Comparing letstrynl's and Anders Kaseorg's bootcharts, it seems like there are two separate issues here.
On letstrynl's bootchart it's plymouthd that's eating the CPU, whereas on Anders Kaseorg's it's mountall.

This could account for our disagreement as to the workarounds.

We should maybe split off the plymouthd instance into a new bug, to avoid confusion.

Martin Erik Werner (arand) wrote on 2010-05-01:

#22

arand_bootcharts.tar.gz (4.1 MiB, application/x-tar)

Okay. Let me first start out retracting pretty much everything I've said so far... there, now let's start anew:

(Using a virtualbox Lucid 32bit guest on 32bit Karmic host)

PROBLEM

When a disk check is performed, the progress stalls somewhere around 70% and will then take a very long time finishing the remaining percent (in my case, around an hour).

TEST CASE:

(sudo aptitude install bootchart)
sudo touch /forcefsck && sudo reboot

WORKAROUNDS

1. Removing "quiet" and "splash" from the kernel boot line

2. When the progress has stalled, switch away from the splash screen using the left arrow key (presumably any arrow key works).

* Both of these approaches speed up the boot process to ~1 minute instead.
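
To make the first workaround persistent rather than editing the boot line by hand each time, something like the following could be used. This is a rough sketch, not a tested recipe: the function name strip_quiet_splash is made up, it assumes GNU sed, and it assumes the options live in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub (the usual GRUB 2 layout on Lucid).

```shell
# Hypothetical helper: remove "quiet" and "splash" from the
# GRUB_CMDLINE_LINUX_DEFAULT line of the given file (assumes GNU sed).
strip_quiet_splash() {
    sed -i -E '/^GRUB_CMDLINE_LINUX_DEFAULT=/ {
        # drop the two words wherever they appear on this line
        s/\<(quiet|splash)\>//g
        # tidy up spaces left just inside the quotes
        s/" +/"/
        s/ +"/"/
        # collapse any remaining runs of spaces
        s/ +/ /g
    }' "$1"
}

# Possible usage (needs root; keep a backup first):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   sudo sh -c '. ./this-script.sh; strip_quiet_splash /etc/default/grub'
#   sudo update-grub
```

Other options on the same line are left alone, and restoring the backup followed by another update-grub undoes the change.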

OBSERVATIONS

The fsck message "somethingsomething non-contiguous somethingsomething", which I assume indicates the end of the fsck, is printed in the Virtual Terminal (not plymouth) at around 70% + ~10-20 seconds.

Disk activity is null from this point on (presumed end of fsck above).

Bootchart crashes if trying to catch the whole boot at once with plymouth (at least for my 1h boot).

This problem seems to occur in both plymouthd and mountall, semi-simultaneously:
If you are in the plymouth screen, plymouthd is the CPU-gobbler; if you switch away from it using the arrow keys, mountall takes over the CPU-eating instead.
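
To see which of the two processes is busy at a given moment, something like the following could be run from a VT during the stall. This is a sketch only: the filter name boot_cpu is hypothetical, and it assumes a procps-style ps that accepts -eo pcpu,comm.

```shell
# Hypothetical filter: read "pcpu comm" lines on stdin and print only
# the plymouthd and mountall entries as "name cpu-percent".
boot_cpu() {
    awk '$2 == "plymouthd" || $2 == "mountall" { print $2, $1 }'
}

# Possible usage while the progress is stalled:
#   ps -eo pcpu,comm | boot_cpu
```

Note that ps reports CPU usage averaged over each process's lifetime, so the bootchart remains the better record of when the load shifts from one process to the other.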

BOOTCHARTS
(attached along with complete bootchart log as arand_bootcharts.tar.gz)

0arand_clean
######
Reference clean boot, with plymouth and no fsck.

1arand_switch_to_vt_early
######
In this boot I switched to VT (arrow key out from splash) quite early, as seen in that the shift in CPU-hogging from plymouthd to mountall is early. mountall takes a little over 100 seconds to finish.

2arand_switch_to_vt_later
######
In this boot I switched to VT later on.
It might be noteworthy that the time that mountall cpu-hogs is approximately the same (100s)

3arand_no_quiet_splash
#####
mountall still hogs the cpu, but for a considerably shorter time, overall boot finishes much faster.

Please do tell if there is anything else useful I could provide.