xfs left inconsistent after reboot, causing grub to fail

Bug #1103187 reported by Péter Prőhle
This bug affects 3 people
Affects          Status         Importance   Assigned to   Milestone
grub2 (Ubuntu)   Invalid        Medium       Unassigned
linux (Ubuntu)   Fix Released   High         Unassigned

Bug Description

take a 12.10 with XFS root partition containing the /boot/ tree
dpkg-reconfigure grub-pc ; sync
reboot (either by command, or by menu)
diverse error messages + grub rescue + /boot/ is partially accessible

boot UBUNTU 12.10 pendrive + choose "try Ubuntu" + open a terminal
sudo -s; mkdir foo; mount [the XFS root partition] foo; umount foo
reboot into the "original" Ubuntu is NOW successful

It appears that the kernel xfs filesystem driver leaves the fs in an inconsistent state even after a sync. That is corrected by a journal replay, done either by having the kernel mount the filesystem or by running fsck, but grub does not use the xfs journal.
---
ApportVersion: 2.6.1-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: prohlep 1969 F.... pulseaudio
 /dev/snd/controlC0: prohlep 1969 F.... pulseaudio
 /dev/snd/seq: timidity 1533 F.... timidity
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
DistroRelease: Ubuntu 12.10
InstallationDate: Installed on 2012-10-19 (98 days ago)
InstallationMedia: Ubuntu 12.10 "Quantal Quetzal" - Release amd64 (20121017.5)
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
MachineType: System manufacturer System Product Name
MarkForUpload: True
Package: linux 3.5.0.23.29
PackageArchitecture: amd64
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.5.0-23-generic root=UUID=39082c0a-8360-4ade-8f2e-f557a8a65035 ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 3.5.0-23.35-generic 3.5.7.2
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory /home/prohlep not ours.
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.5.0-23-generic N/A
 linux-backports-modules-3.5.0-23-generic N/A
 linux-firmware 1.95
RfKill:

Tags: quantal package-from-proposed
Uname: Linux 3.5.0-23-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 07/18/2011
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2301
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M4A89GTD-PRO/USB3
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2301:bd07/18/2011:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM4A89GTD-PRO/USB3:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Péter Prőhle (prohlep) wrote :

UpgradeStatus: No upgrade log present (probably fresh install)

No, this is an old install from October, and I update daily, but because the system disk is a non-server-grade SSD, I chose in a dialog not to record the history.

Because of the user-friendly interface, I did not realize that this could switch off important system logs as well.

For this particular bug, the upgrade log would be the most essential piece of information.

I just wonder why virtually nobody else reports this bug, when I suffer from it on 3 machines with really different boot setups. It is very unlikely that this bug depends on the hardware or on the installed packages.

Each day I am scared that a package will be automatically updated and I will have to repair 3 installations.

It is especially bad if I have to repair my laptop in front of the gathering audience, waiting for the late start of my university lecture, just because the laptop pulled an update right before I left my office to give the lecture. On the other hand, I do not wish to switch off automatic updates, as a matter of principle.

Phillip Susi (psusi) wrote :

Run sudo dpkg-reconfigure grub-pc and make sure it is correctly configured to install grub to your boot drive. If it is not, that would tend to lead to some of the errors you mentioned after grub is updated.

Changed in grub2 (Ubuntu):
status: New → Incomplete
Péter Prőhle (prohlep) wrote :

Over the last 6 hours I ran quite a few experiments. The result is that neither the first 1 MB of the boot device nor the /boot/ tree ever changed: I checked the tree with diff -q -r /boot/ ~/boot-reference/ and saved the first MB with dd if=/dev/sda of=first-mega.bin bs=512 count=2048.
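
A minimal sketch of the two checks described above, wrapped in shell functions so that they can be repeated after each experiment. The function names are made up for illustration and are not part of any tool; the device and the reference paths are passed as arguments instead of being hard-coded.

```shell
#!/bin/sh
# Illustrative helpers (the names are hypothetical) for the two checks above.

# Succeeds if the first 1 MiB (2048 sectors of 512 bytes) of DEVICE still
# matches a copy saved earlier with:
#   dd if=DEVICE of=REFERENCE bs=512 count=2048
first_mib_unchanged() {
    reference=$1
    device=$2
    dd if="$device" bs=512 count=2048 2>/dev/null | cmp -s - "$reference"
}

# Succeeds if the two directory trees have identical contents.
tree_unchanged() {
    diff -q -r "$1" "$2" >/dev/null
}
```

Possible use, as root: first_mib_unchanged first-mega.bin /dev/sda && tree_unchanged /boot/ ~/boot-reference/ && echo "no change". Note that diff reads the tree through the mounted filesystem, so it reflects the kernel's view of /boot/ and not necessarily what a boot loader reading raw blocks would find.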

Each of the following three attempts was enough to fall into grub rescue at the next boot:

        dpkg-reconfigure grub-pc

        dpkg -P grub-common grub-gfxpayload-lists grub-pc grub-pc-bin grub2-common
        rm -fr /boot/grub/
        apt-get install grub-pc

        grub-install /dev/sda ; update-grub

Today I was never able to boot by hand from the grub rescue prompt. However, simply booting the installed Ubuntu through the "Boot an existing Linux system installed on the disk" menu item of System-Rescue-CD 3.1.2 (written to a pendrive), without any repair attempt, was enough to make the reboots after this rescue boot succeed. Surprisingly, neither the first 1 MB nor the /boot/ tree changed.

One thing changed from time to time, and I found no correlation with the method used to break the boot: the complaint shown above the grub rescue prompt:

Error: invalid arch-independent ELF magic.

Error: symbol `g Glib-CR' not found.

Error: file `/boot/grub/i386-pc/normal.mod' not found.

Endless rows of "not a correct XFS node", and then the usual normal.mod not found message.

Each time, the set command at the rescue prompt showed the correct prefix and root, and ls was able to show the /boot/ tree. However, sometimes /boot/grub/i386-pc was reported as NOT a directory!

Regardless, today I was not able to load /boot/grub/i386-pc/normal.mod with insmod, and the linux and initrd commands were not available.

tags: removed: apport-bug
Péter Prőhle (prohlep) wrote :

A possible reason why this bug is not reported by dozens of others is that ALL of my personal installations keep the /boot/ tree in the root partition, and my root partitions are of type XFS.

On the other hand, even after falling to the grub rescue prompt, this reduced grub can handle the XFS filesystem correctly in all but a few cases (2 or 3) out of several dozen test cases, i.e. below 10%.

Phillip Susi (psusi) wrote :

It does sound like there may be a problem with XFS, can you try putting /boot on an ext4 partition and see if that resolves it? Also can you run the boot info script and attach the results? And I assume you have tried fscking the filesystem?

http://sourceforge.net/projects/bootinfoscript/

Péter Prőhle (prohlep) wrote :

Thanks for the http://sourceforge.net/projects/bootinfoscript/ suggestion.

Today I ran quite a few experiments, and the result in brief is:

no reboot problem after "dpkg-reconfigure grub-pc", nor after the other two ways of tampering with the boot setup, in the 3 cases below:

        /boot/ tree is in a separate ext4 filesystem, root is xfs

        /boot/ tree is in a separate xfs filesystem, root is xfs

        /boot/ tree is in the root filesystem, which is ext4

deterministically appearing reboot problem in the original case:

        /boot/ tree is in the root filesystem, which is xfs

However, I cannot predict which reboot error will appear; today I saw 2 additional kinds of error messages:

        "error: attempt to read or write outside of partition."

        "error: file `/boot/grub/i386-pc/normal.mod' not found."

At the grub rescue prompt, the command "ls /boot/grub/i386-pc/normal.mod" tends to give various error messages, such as

        "error: not a correct XFS node."

Sometimes /boot/grub even appears to be empty, or one can see the entry i386-pc in it but cannot list it with ls /boot/grub/i386-pc, and so on.

With the earlier missing initrd.img problem I could boot by hand, but this reboot error gives no opportunity to do so.

However, if I boot with SYSRESCCD, do nothing else (or work for a while) and reboot, then this second reboot is guaranteed to be successful.

I think SYSRESCCD does not tamper with the boot system; at least the bootinfoscript output is identical before the unsuccessful boot and after the successful boot.

I will now try to attach the 4 results.txt files corresponding to the 4 cases, according to whether the /boot/ tree is in a separate partition, and whether the partition containing the /boot/ tree is xfs or ext4.

Comment: to test the case where /boot is in a root partition of type ext4, I installed a new Ubuntu on another drive that I swapped into my linux box.

Phillip Susi (psusi) wrote :

It sounds like there is something wrong with your filesystem. Did you ever try the root and boot on xfs with a fresh filesystem, or fsck the old one?

Péter Prőhle (prohlep) wrote :

Thanks for your time and suggestions. I think we are closer to narrowing down the source of the problem. The phenomenon appears on another box as well, and I also did a fresh install with an XFS root containing the /boot/ tree. The deterministically repeatable phenomenon is:

        take a 12.10 with XFS root partition containing the /boot/ tree

        dpkg-reconfigure grub-pc ; sync

        reboot (either by command, or by menu)

        diverse error messages + grub rescue + /boot/ is partially accessible

        boot UBUNTU 12.10 pendrive + choose "try Ubuntu" + open a terminal

        sudo -s; mkdir foo; mount [the XFS root partition] foo; umount foo

        reboot into the "original" Ubuntu is NOW successful

Key information: in "try Ubuntu", xfs_check on my XFS root partition reported "error: the filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log".

This gave me the idea that what matters is not booting into my Ubuntu with SYSRESCCD as such, but its side effect: the root partition gets mounted, so the pending changes are applied, so the intention of "dpkg-reconfigure grub-pc" and the reality in the physical blocks on the drive become coherent with each other, and hence all later reboots are successful!

That is why I gave up booting my Ubuntu with the help of SYSRESCCD; instead I simply get into a live linux somehow (say with "try Ubuntu") and mount the partition in question. And it does work, on all 3 of my boxes, and even on the fresh install YOU SUGGESTED. Hence the problem is somewhere around the temporary incoherency caused by the delayed writes of the journaled filesystem in question.
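
The workaround can be sketched as a small script to be run from a live session. This is only a sketch under assumptions: the function names, the default mount point and the DRY_RUN switch are invented for illustration, and the device name has to be supplied by the user. Outside dry-run mode it needs root.

```shell
#!/bin/sh
# Sketch: replay a pending XFS log by mounting and unmounting the partition.
# DRY_RUN=1 prints the commands instead of executing them.

run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# replay_xfs_log DEVICE [MOUNTPOINT]
# DEVICE is the XFS root partition; the default mount point is arbitrary.
replay_xfs_log() {
    device=$1
    mnt=${2:-/mnt/xfsroot}
    if [ -z "$device" ]; then
        echo "usage: replay_xfs_log DEVICE [MOUNTPOINT]" >&2
        return 2
    fi
    run mkdir -p "$mnt" || return
    # Mounting lets the kernel replay the pending log entries.
    run mount -t xfs "$device" "$mnt" || return
    # Unmount again so the partition is left cleanly unmounted.
    run umount "$mnt"
}
```

With DRY_RUN=1 set, replay_xfs_log /dev/sdXN foo only prints the three commands, which makes it easy to check the device name before running it for real.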

Yet another error message: ELF sections outside core.

It is not clear how much "dpkg-reconfigure grub-pc" is responsible, since perhaps it should be prepared for journaled filesystems.

Putting the /boot/ tree into a NON-root partition appears to be a reliable workaround, probably because the shutdown process can bring down a NON-root partition coherently, even if there are delayed transactions in its journal.

In the case of the root partition, perhaps as a final stage the shutdown process should migrate to a RAM-rooted system, so that it can bring down the original root partition the same way as all the other NON-root partitions.

Péter Prőhle (prohlep) wrote :

Further experiment: after "dpkg-reconfigure grub-pc" I waited 30 minutes, and then the reboot was almost successful; only one complaint flashed for a moment, namely "... font file format error ...", and otherwise grub could start the boot process.

But when I broke the boot again with a new "dpkg-reconfigure grub-pc", the error was:

        error: ELF header smaller than expected.

and this error did NOT disappear by just mounting the root filesystem in a live Ubuntu. Interestingly, it was enough to boot my Ubuntu with the help of SYSRESCCD, and the next reboot immediately after that was successful.

Hence, mounting the root filesystem is not a rock-solid solution; it only helps often, almost always.

Anyway, I conducted quite a few WAITING TIME experiments, and my impression is that, except for the one case above, the observable difference is which part of the /boot/ tree is incoherent. If I wait longer before the reboot, then less vital or influential parts are incoherent. However, this is not clear, only an impression.

Somehow grub should tell the filesystem to flush the delayed or postponed transactions; otherwise the block information gathered by grub will not be coherent at the next boot, as grub works BEFORE the filesystem is mounted, and it is the mount that would ensure the coherency so desperately missing while grub is active.

CORRECTION: after spending two extra hours, I no longer see any reasonable correlation between the waiting time and the kind of error message that appears above the rescue prompt.

Péter Prőhle (prohlep) wrote :

New progress in confirming that "grub-install /dev/sda" writes block information into the boot area, while the contents of those blocks are updated much later, at the latest at the next mount of the XFS root filesystem.

"dpkg-reconfigure grub-pc" runs first "grub-install /dev/sda" and then "update-grub".

My separate experiments show that the second, "update-grub", causes no reboot problem, while "grub-install /dev/sda" is guaranteed to make the reboot fail.

I found an interesting new error, occurring very late in the boot process: I could already see the graphical workspace when a small window popped up informing me that

        Sorry, Ubuntu 12.10 experienced an internal error.

There was a facility to send a report of it automatically, and I sent it.

Now the real progress is that I understand the OTHER PROBLEMS described in the very first post:

        "booting the new kernel hangs at the missing initial ram disk"

Knowing much more about the incoherency problem, it is now clear that the initial ram disk is NOT actually missing; instead, because of the incoherency, the boot process cannot see the new initial ram disk of the new kernel!

Just after the previous post, I received the new kernel 3.5.0-23; this is the 6th kernel update since the release of 12.10. To my surprise, this time there was NO reboot problem. Hence I decided to remove the previous kernel, 3.5.0-22, and immediately I got a reboot problem, but slightly different from the previous ones:

namely, I got a NORMAL grub prompt, but in exchange linux.mod could not be loaded, even though it could be listed,

and the most instructive fact was that I could reach the DELETED kernel, ramdisk, etc., while the new ramdisk was missing, although I had been able to boot with it at the previous boot!

This means that when dpkg purged the previous kernel, the filesystem journaled what to do, but had not actually done it at the block level before the reboot.

Now I see that the diverse kinds of problems in the last 2 1/2 months were NOT diverse, but only this one incoherency problem.

As a result, sometimes I got a normal grub but the initrd was missing, or it was not missing but the grub menu was missing, etc. Other times grub stopped in its rescue mode due to the apparent inaccessibility of normal.mod, etc. These gave the impression of diverse problems instead of a single one.

Phillip Susi (psusi) wrote :

Ok, I think I know what is going on here. Grub does not know how to replay the XFS journal and your reboot must not be unmounting the fs, leaving it in a damaged state. Mounting or fscking the filesystem from a livecd replays the journal and corrects the errors.

The sync should make sure the filesystem is consistent though, so it sounds like there is a bug in the kernel filesystem driver.

summary: - automatic updates tend to reboot and die into grub rescue
+ xfs left inconcistent after reboot, causing grub to fail
Phillip Susi (psusi)
description: updated
Changed in grub2 (Ubuntu):
importance: Undecided → Medium
status: Incomplete → Triaged
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1103187

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Phillip Susi (psusi) wrote : Re: xfs left inconcistent after reboot, causing grub to fail

Seems upstream just found this bug and has a patch for it:

http://oss.sgi.com/archives/xfs/2012-12/msg00324.html

tags: added: bot-stop-nagging patch
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in grub2 (Ubuntu):
status: Triaged → Invalid
Péter Prőhle (prohlep) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected package-from-proposed
description: updated
Péter Prőhle (prohlep) wrote : BootDmesg.txt

apport information

Péter Prőhle (prohlep) wrote : CurrentDmesg.txt

apport information

Péter Prőhle (prohlep) wrote : Dependencies.txt

apport information

Péter Prőhle (prohlep) wrote : Lspci.txt

apport information

Péter Prőhle (prohlep) wrote : Lsusb.txt

apport information

Péter Prőhle (prohlep) wrote : ProcCpuinfo.txt

apport information

Péter Prőhle (prohlep) wrote : ProcInterrupts.txt

apport information

Péter Prőhle (prohlep) wrote : ProcModules.txt

apport information

Péter Prőhle (prohlep) wrote : UdevDb.txt

apport information

Péter Prőhle (prohlep) wrote : UdevLog.txt

apport information

Péter Prőhle (prohlep) wrote : WifiSyslog.txt

apport information

Péter Prőhle (prohlep) wrote : Re: xfs left inconcistent after reboot, causing grub to fail

I executed "apport-collect 1103187" (after installing the missing python-launchpadlib),

and now I am trying "to change the status of the bug to 'Confirmed'",

... but I found that the linux task (which probably means the kernel) is already marked Confirmed.

Péter Prőhle (prohlep) wrote :

http://oss.sgi.com/archives/xfs/2012-12/msg00324.html looks promising: "There is a logic inversion in xfssyncd_worker() which means that the log is not periodically forced or idled correctly. This means that metadata changes aggregated in memory do not get flushed in a timely manner, and hence if filesystem is not cleanly unmounted those changes can be lost. This loss can manifest itself even hours after the changes were made if the filesystem is left to idle without a sync() occurring between the last modification and the crash/shutdown occuring."

The kernel fix is only a negation of a condition in xfssyncd_worker() in fs/xfs/xfs_sync.c :

- if (!(mp->m_super->s_flags & MS_ACTIVE) &&

+ if ((mp->m_super->s_flags & MS_ACTIVE) &&
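
To see why a single negation matters, the flag test can be modeled outside the kernel. The sketch below is not kernel code: it mimics only the part of the condition shown in the diff, using shell arithmetic; the function names are invented, and the rest of the condition (which continues after the && above) is left out. The value of MS_ACTIVE (bit 30) is an assumption for illustration.

```shell
#!/bin/sh
# Toy model of the flag test in the diff above; not kernel code.
# Assumption: MS_ACTIVE is bit 30 of the superblock flags.
MS_ACTIVE=$((1 << 30))

# Before the fix: the periodic log work passed the test only when
# MS_ACTIVE was NOT set.
flush_allowed_before_fix() {
    [ $(( $1 & MS_ACTIVE )) -eq 0 ]
}

# After the fix: the periodic log work passes the test only when
# MS_ACTIVE IS set.
flush_allowed_after_fix() {
    [ $(( $1 & MS_ACTIVE )) -ne 0 ]
}

# Print one row of the truth table for a given flags value.
describe() {
    if flush_allowed_before_fix "$1"; then before=yes; else before=no; fi
    if flush_allowed_after_fix "$1"; then after=yes; else after=no; fi
    echo "before_fix=$before after_fix=$after"
}
```

describe "$MS_ACTIVE" prints "before_fix=no after_fix=yes", and describe 0 prints the opposite. If, as the upstream description implies, the flag is set while the filesystem is in normal use, then before the fix the periodic log force was skipped exactly when it was needed.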

Dave Chinner (19 Dec 2012): "If people agree the fix is correct, I'll post it to the -stable list for inclusion..."

(1) I am ready to experiment with the new, fixed kernel. I compiled kernels in the mid 90's, but I slowly gave up this habit at around 1988...2002. I think it is worthwhile to check whether this fix solves the XFS/grub-install/reboot problem.

(2) How will I notice (or be notified) when this patch arrives among the Ubuntu updates?

Péter Prőhle (prohlep) wrote :

XFS "periodic log flushing" http://oss.sgi.com/archives/xfs/2013-01/msg00117.html

Ben Myers (8 Jan 2013): "I agree that this is important for stable kernels 3.5 - 3.7. It looks good to me."

(3) Will the Ubuntu kernel team contact the maintainers of XFS?

Péter Prőhle (prohlep) wrote :

I have just dist-upgraded from kernel 3.5.0-23 to 3.5.0-24, and the reboot was successful, because I now have a separate ext4 partition for the /boot tree, while the / root tree is still on an XFS partition.

Because the semester is beginning, I cannot currently afford the time and risk of trying out what happens if the /boot tree moves back onto the XFS partition. I hope I can do the experiment next weekend, not this one.

Steffen (satlank) wrote :

I have just updated from 3.5.0-23 to 3.5.0-24 and encountered the problem Peter described. The reboot ends up in the grub console; manually getting to the initrd and mounting the fs replays the log, after which booting works fine again. I also have /boot on an XFS partition.

summary: - xfs left inconcistent after reboot, causing grub to fail
+ xfs left inconsistent after reboot, causing grub to fail
penalvch (penalvch) wrote :

as per your :
[ 0.000000] Your BIOS doesn't leave a aperture memory hole
[ 0.000000] Please enable the IOMMU option in the BIOS setup
[ 0.000000] This costs you 64 MB of RAM
...
[ 0.338295] mtrr: your CPUs had inconsistent variable MTRR settings
[ 0.338296] mtrr: probably your BIOS does not setup all CPUs.

Hence, as per http://support.asus.com/download.aspx?SLanguage=en&m=M4A89GTD%20PRO/USB3&os=29 an update is available for your BIOS (3029). If you update to this following https://help.ubuntu.com/community/BiosUpdate , does it change anything? If it doesn't, could you please both specify what happened, and just provide the output of the following terminal command:
sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date

For more on BIOS updates and linux, please see https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette .

Thank you for your understanding.

tags: added: bios-outdated-3029
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Phillip Susi (psusi) wrote :

Wrong issue Christopher. This bug is about xfs becoming corrupt when not cleanly unmounted ( and Ubuntu regularly fails to unmount the root fs on shutdown/reboot ).

Changed in linux (Ubuntu):
importance: Low → High
status: Incomplete → Triaged
tags: removed: bios-outdated-3029
penalvch (penalvch) wrote :

Phillip Susi, thank you for your comments. Everyone is aware of the scope of the Bug Description. Despite this, as it would appear you find not having an updated BIOS a good idea, could you please provide a technical detail on why having a buggy/more buggy BIOS would be preferred here?

tags: added: bios-outdated-3029
Phillip Susi (psusi) wrote :

It isn't preferred, it is simply irrelevant, and updating it is not necessary to resolve the bug.

tags: removed: bios-outdated-3029
penalvch (penalvch) wrote :

Phillip Susi, thank you for your comments. Regarding them :
>"It isn't preferred, it is simply irrelevant, and updating it is not necessary to resolve the bug."

Could you please provide a more thorough analysis on what is necessary to resolve the bug (ex. upstream commit #s tested to fix this, reproducible on multiple pieces of hardware specifically, mailinglist discussion, technical hunch, etc.)?

As well, please stop removing the tracker tag bios-outdated-3029. Whether or not this is resolved or predicated on a BIOS update has nothing to do with why the tag is used.

tags: added: bios-outdated-3029
Phillip Susi (psusi) wrote :

I just checked the Ubuntu kernel git trees and this fix ( the one linked to in comment #27 Christopher ) is there, so this has been resolved.

Changed in linux (Ubuntu):
status: Triaged → Fix Released