Regression: quotas+lvm prevent yakkety from booting (boots once every ~4 attempts)

Bug #1635023 reported by Sergio Callegari
Affects Status Importance Assigned to Milestone
quota (Ubuntu)
Incomplete
Undecided
Unassigned

Bug Description

After an upgrade from xenial to yakkety, the system does not boot reliably. Only about one boot in three or four succeeds.

Quite often, the system seems to hang during boot and then drops into an emergency prompt.

Pressing Esc during boot to show what is going on reveals processes waiting for the disks to become ready.

This is likely related to the specific system configuration: a RAID array of two disks managed by dmraid (an Intel fake RAID) with lvm2 on top of it. Most data lives there in logical volumes, except for the boot partition, which is on a fast SSD, and the root filesystem, which is on lvm2 with a physical volume spanning the rest of the SSD.

The very same configuration worked just fine on xenial.

To make matters worse, it is impossible to boot in "recovery mode" to get a root prompt, because after you log in as root something suddenly times out and the screen gets messed up (not a graphics card issue, but pieces of messages popping up here and there, the recovery mode menu suddenly reappearing, etc.).

Please give this bug appropriately high priority, because it prevents servers from coming up after power failures.

ProblemType: Bug
DistroRelease: Ubuntu 16.10
Package: systemd 231-9git1
ProcVersionSignature: Ubuntu 4.8.0-22.24-generic 4.8.0
Uname: Linux 4.8.0-22-generic x86_64
ApportVersion: 2.20.3-0ubuntu8
Architecture: amd64
Date: Wed Oct 19 21:34:16 2016
EcryptfsInUse: Yes
MachineType: Dell Inc. Precision WorkStation T5400
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.8.0-22-generic root=/dev/mapper/vg_ssd-root ro quiet splash mtrr_gran_size=2g mtrr_chunk_size=2g resume=/dev/sdc3
SourcePackage: systemd
UpgradeStatus: Upgraded to yakkety on 2016-10-18 (1 days ago)
dmi.bios.date: 04/30/2012
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A11
dmi.board.name: 0RW203
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 7
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA11:bd04/30/2012:svnDellInc.:pnPrecisionWorkStationT5400:pvr:rvnDellInc.:rn0RW203:rvr:cvnDellInc.:ct7:cvr:
dmi.product.name: Precision WorkStation T5400
dmi.sys.vendor: Dell Inc.

Revision history for this message
Sergio Callegari (callegar) wrote :

The issue is also present after moving to mdadm (which is possible for my fake RAID, since it uses Intel metadata).

When it happens, the boot hangs for a minute and a half on

A start job is running for dev-disk-by...

with all the disks appearing in turn on that line. After that, the machine drops to an emergency prompt.
Analyzing the systemd journal shows the corresponding jobs timing out.

Interestingly:

1) *All* the items mentioned in /etc/fstab get shown here. Namely, when the issue occurs, the error is shown even for the swap partition (which is on a plain SSD partition).

2) It is not a problem with /etc/fstab but some race: every few attempts the system succeeds in booting.

3) When you get to the emergency prompt, all the disks on which systemd hangs appear to be working just fine.

4) Removing from fstab the entries corresponding to data on the slower fake RAID makes the boot reliable again (I think).

How can I debug this? How can I force systemd to wait before it tries to access the disks?

Should I open a bug on systemd?

Can someone provide pointers to some systemd documentation that might be relevant to this issue? systemd seems so complex that I do not know where to start.
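[Editor's note, not part of the original comment: one knob that may be relevant to "force systemd to wait" is the per-mount device timeout that systemd honors as an fstab option. The device path and mount point below are made up for illustration.]

```
# /etc/fstab -- illustrative entry only; device and mount point are hypothetical.
# x-systemd.device-timeout= raises how long systemd waits for the device to
# appear before failing the mount (the default is 90 s, matching the
# "start job is running" hang seen above).
/dev/mapper/vg_raid-data  /data  ext4  defaults,x-systemd.device-timeout=300s  0  2
```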

Revision history for this message
Sergio Callegari (callegar) wrote : Re: Regression: yakkety does not boot (boots once every ~4 attempts)
summary: - yakkkety boot is erratic
+ Regression: yakkety does not boot (boots once every 4 attempts)
summary: - Regression: yakkety does not boot (boots once every 4 attempts)
+ Regression: yakkety does not boot (boots once every ~4 attempts)
Revision history for this message
Sergio Callegari (callegar) wrote :

Got it! Sort of...

Disabling quotas makes the system boot reliably.

No clue why. It may not be a race; rather, the initial quota check is perhaps supposed to happen only in certain cases.

No idea. Debugging boot with systemd is beyond my expertise.

Masking the systemd quota.service unit (the initial quota check) seems to be sufficient for a reliable boot. Still, I see failures on shutdown related to disabling the entries in fstab.
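[Editor's note, a hedged sketch not from the original thread: masking a unit is implemented as a symlink to /dev/null in /etc/systemd/system, which is why it reliably overrides the packaged unit. The real command is `sudo systemctl mask quota.service`; the snippet below only simulates the symlink in a temp dir so it runs without root or systemd.]

```shell
# What "systemctl mask quota.service" creates, simulated in a temp dir:
unitdir=$(mktemp -d)
ln -s /dev/null "$unitdir/quota.service"   # the mask symlink
readlink "$unitdir/quota.service"          # prints /dev/null
```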

summary: - Regression: yakkety does not boot (boots once every ~4 attempts)
+ Regression: quotas prevent yakkety from booting (boots once every ~4
+ attempts)
Changed in systemd:
status: Unknown → New
Revision history for this message
Sergio Callegari (callegar) wrote : Re: Regression: quotas prevent yakkety from booting (boots once every ~4 attempts)

The systemd developers suggest that the bug is downstream (Ubuntu), since they do not ship the quota.service unit (whose removal makes the system boot again) at all.

Revision history for this message
Martin Pitt (pitti) wrote :

I think you misunderstood this. Felipe said that "quota.service is not shipped by systemd" -- this means *not* that we forgot to ship that unit downstream, but that it is not related to the systemd project at all.

In this case, quota.service is shipped by the "quota" package in /lib/systemd/system/quota.service.

Revision history for this message
Sergio Callegari (callegar) wrote :

Sorry, I'm not understanding your comment.

What I understood is that the systemd developers said "It is not our responsibility if Ubuntu does not boot; this is not a systemd bug! In fact, the boot issue is caused by some quota-related units, shipped by Ubuntu, that are not part of systemd." Then I saw that the faulty systemd units are shipped with the "quota" package, just as you say. Yet the upstream quota package does not contain these units: the quota-related systemd units and scripts are a downstream Ubuntu (maybe Debian) addition (in fact, they live in the debian dir of the package).

This is why I was reporting that the systemd developers suggest that this is a downstream bug (a bug in the Ubuntu distro), not an upstream bug (a bug in systemd 231).

I never implied that Ubuntu forgot to ship some units. I am only saying that the quota units shipped with Ubuntu evidently get something wrong, because when they are masked the system boots, and otherwise it does not.

Possibly, the issue is only triggered with quotas + lvm2 + fake or soft RAID. This may explain why the mistake went unnoticed when it slipped in.

I suppose that this bug should be fixed by the maintainers of the quota package in ubuntu, possibly with the help of the systemd maintainers.

affects: systemd (Ubuntu) → quota (Ubuntu)
Revision history for this message
Martin Pitt (pitti) wrote :

Yes, that's exactly what I tried to say. Sorry for any confusion!

Revision history for this message
Sergio Callegari (callegar) wrote :

Good! Thanks! As you can see, I have accordingly changed the package from systemd to quota.

Changed in systemd:
status: New → Fix Released
no longer affects: systemd
Revision history for this message
Sergio Callegari (callegar) wrote :

Can this be bug 930551 biting again?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi Sergio,
I'm currently passing through bugs that have been dormant for too long, and this is one.
I beg your pardon: while the server team was interested in this, priorities didn't allow us to work on it more in the last months.

I wanted to check with you if you have resolved the issue some way already.
Also, given that yakkety is EOL now, an obvious question is whether you upgraded and whether that changed the behavior.

Finally, I must admit that I haven't seen any similar reports come by, nor have I seen this on any of my machines in the meantime. I wonder if together we could create a better reproducer that is available to devs - e.g. a virtual machine setup where the issue shows up. We could set up RAID there; not sure yet.

Revision history for this message
Sergio Callegari (callegar) wrote :

Hi,

my only option, so far, has been to disable quotas and manually check that homes do not get too large, which is inconvenient.

Unfortunately, the systemd developers could not help, since they said that the unit logic for lvm and quotas is not laid down by them.

Even more unfortunately, trying to debug systemd on this (which likely involves a race and certainly dynamically generated units) is beyond my current abilities, and beyond the time I have available to acquire such abilities by thoroughly reading the systemd debugging documentation.

In a nutshell, the problem is still open. When I upgrade to 17.10, I can tentatively try to re-enable quotas and see what happens.

summary: - Regression: quotas prevent yakkety from booting (boots once every ~4
+ Regression: quotas+lvm prevent yakkety from booting (boots once every ~4
attempts)
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Can you share your /etc/fstab file, and also "sudo fdisk -l", so that we can get an idea what your filesystem layout is?

To try to reproduce this, I would create a VM with two disks, raid1 them, then put lvm on top, create some LVs and use quotas on them. I can't do that with yakkety nowadays, but I would try zesty (17.04).
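[Editor's note: the reproduction recipe Andreas describes could be sketched roughly as below. Device names, sizes, and the VG name are assumptions; the destructive commands need root and spare disks, so they are left commented out, and only the quota-enabled fstab line is actually generated.]

```shell
# Hypothetical setup on a VM with two spare disks (/dev/vdb, /dev/vdc):
#   mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/vdb /dev/vdc
#   pvcreate /dev/md0 && vgcreate vg_test /dev/md0
#   lvcreate -L 2G -n home vg_test && mkfs.ext4 /dev/vg_test/home
# The fstab entry that turns on the boot-time quota check would look like:
printf '%s\n' '/dev/mapper/vg_test-home /home ext4 defaults,usrquota,grpquota 0 2' > fstab.sample
grep -c usrquota fstab.sample   # -> 1
```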

Have you upgraded to zesty (17.04) already?

Changed in quota (Ubuntu):
status: New → Incomplete