Race condition with network and NFS mounts causes boottime hang

Bug #1118447 reported by Rüdiger Kupper on 2013-02-07

This bug affects 17 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Confirmed	Medium	Unassigned
	nfs-utils (Ubuntu)	Incomplete	Undecided	Unassigned

Bug Description

I seem to be experiencing a race condition during boot of my Ubuntu 12.04 server: in approximately one of seven boots, the server hangs during bootup.
This is what I see on the screen:

After the line

* Starting configure network device

there is a short delay of about 1 second, then messages continue. I see

* Starting Mount network filesystems [ OK ]
* Starting set sysctls from /etc/sysctl.conf [ OK ]
* Starting configure network device [ OK ]
* Stopping Mount network filesystems [ OK ]
* Stopping set sysctls from /etc/sysctl.conf [ OK ]
* Starting Block the mounting event for NFS filesytems until statd is running [ OK ]
* Stopping Block the mounting event for NFS filesytems until statd is running [ OK ]
* Starting Block the mounting event for NFS filesytems until statd is running [ OK ]
* Stopping Block the mounting event for NFS filesytems until statd is running [ OK ]

The last two messages repeat several times, and then the boot process hangs.
In 6 of 7 cases, I wait for about a minute, and after that bootup continues.

But in approximately 1 of 7 cases, the system hangs at this point forever. The machine does not respond to Ctrl-Alt-Del; I have to reboot it using the SysRq keys.

WORKAROUND: Setting the NFS entries in fstab to "noauto" completely removes the problem:
There is no timeout during boot, and no lockup any more. The machine boots smoothly with the NFS shares unmounted. After the machine is up, we can manually mount the NFS shares without a problem.
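
Applying the noauto workaround by hand is easy to get wrong when fstab has several NFS entries. The following is a minimal sketch, not part of the original report: a hypothetical shell function add_noauto that prints an fstab-format file with noauto appended to the options of every nfs/nfs4 entry that lacks it. It writes to stdout only, so the result can be reviewed before it replaces /etc/fstab.

```shell
# Hypothetical helper (not from the original report): print an fstab-format
# file with "noauto" appended to the options field of every nfs/nfs4 entry
# that does not already have it. Output goes to stdout only.
add_noauto() {
    awk '
        # Keep comment lines and blank lines unchanged.
        $1 ~ /^#/ || NF == 0 { print; next }
        # Field 3 is the filesystem type, field 4 the mount options.
        ($3 == "nfs" || $3 == "nfs4") && ("," $4 ",") !~ /,noauto,/ {
            # Assigning $4 makes awk rebuild the line with single spaces.
            $4 = $4 ",noauto"
        }
        { print }
    ' "$1"
}

# Example (not executed here):
#   add_noauto /etc/fstab > /tmp/fstab.new && diff /etc/fstab /tmp/fstab.new
```

Lines that are modified lose their original column alignment, which fstab does not care about but which is worth knowing before comparing the files.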

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-37-generic 3.2.0-37.58
ProcVersionSignature: Ubuntu 3.2.0-37.58-generic 3.2.35
Uname: Linux 3.2.0-37-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
AplayDevices: aplay: device_list:252: no soundcards found ...
ApportVersion: 2.0.1-0ubuntu17.1
Architecture: amd64
ArecordDevices: arecord: device_list:252: no soundcards found ...
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC1', '/dev/snd/hwC1D0', '/dev/snd/pcmC1D3p', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/pcmC0D1p', '/dev/snd/pcmC0D2c', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
CurrentDmesg: [ 85.200104] lxcbr0: no IPv6 routers present
Date: Thu Feb 7 15:50:40 2013
HibernationDevice: RESUME=UUID=6c172536-57cc-4deb-867a-0718d572f23e
IwConfig:
lo no wireless extensions.

eth0 no wireless extensions.

lxcbr0 no wireless extensions.
MachineType: To be filled by O.E.M. To be filled by O.E.M.
MarkForUpload: True
ProcEnviron:
LANGUAGE=de:en
TERM=xterm
PATH=(custom, no user)
LANG=de_DE.UTF-8
SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-37-generic root=/dev/mapper/lvmvg-root ro debug splash vt.handoff=7
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
linux-restricted-modules-3.2.0-37-generic N/A
linux-backports-modules-3.2.0-37-generic N/A
linux-firmware 1.79.1
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to precise on 2012-04-28 (285 days ago)
dmi.bios.date: 07/04/2012
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0302
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: M5A97 EVO R2.0
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: To Be Filled By O.E.M.
dmi.chassis.version: To Be Filled By O.E.M.
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0302:bd07/04/2012:svnTobefilledbyO.E.M.:pnTobefilledbyO.E.M.:pvrTobefilledbyO.E.M.:rvnASUSTeKCOMPUTERINC.:rnM5A97EVOR2.0:rvrRev1.xx:cvnToBeFilledByO.E.M.:ct3:cvrToBeFilledByO.E.M.:
dmi.product.name: To be filled by O.E.M.
dmi.product.version: To be filled by O.E.M.
dmi.sys.vendor: To be filled by O.E.M.

Rüdiger Kupper (ruediger.kupper) wrote on 2013-02-07:

AcpiTables.txt (173.4 KiB, text/plain; charset="utf-8")
AlsaDevices.txt (660 bytes, text/plain; charset="utf-8")
BootDmesg.txt (67.0 KiB, text/plain; charset="utf-8")
Card0.Amixer.info.txt (1.1 KiB, text/plain; charset="utf-8")
Card0.Amixer.values.txt (1.0 KiB, text/plain; charset="utf-8")
Card0.Codecs.codec.0.txt (15.5 KiB, text/plain; charset="utf-8")
Card1.Amixer.info.txt (1.1 KiB, text/plain; charset="utf-8")
Card1.Amixer.values.txt (1.0 KiB, text/plain; charset="utf-8")
Card1.Codecs.codec.0.txt (1.1 KiB, text/plain; charset="utf-8")
Dependencies.txt (2.0 KiB, text/plain; charset="utf-8")
Lspci.txt (18.6 KiB, text/plain; charset="utf-8")
Lsusb.txt (806 bytes, text/plain; charset="utf-8")
PciMultimedia.txt (1.2 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (8.4 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt (5.6 KiB, text/plain; charset="utf-8")
ProcModules.txt (4.4 KiB, text/plain; charset="utf-8")
UdevDb.txt (127.4 KiB, text/plain; charset="utf-8")
UdevLog.txt (277.9 KiB, text/plain; charset="utf-8")
WifiSyslog.txt (430.8 KiB, text/plain; charset="utf-8")

Rüdiger Kupper (ruediger.kupper) on 2013-02-07

summary:

- Race condition causes boottime hang
+ Possible race condition causes boottime hang

Brad Figg (brad-figg) wrote on 2013-02-07: Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

Joseph Salisbury (jsalisbury) wrote on 2013-02-07: Re: Possible race condition causes boottime hang

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.8 kernel[0] (not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.8-rc6-raring/

Changed in linux (Ubuntu):
importance:	Undecided → Medium
status:	Confirmed → Incomplete

Rüdiger Kupper (ruediger.kupper) wrote on 2013-02-18:

I should add that there are NFS3 mounts in our /etc/fstab. (Hence the boot messages about NFS you see above.)
This may be part of the race condition.

I will try to test the upstream kernel, but these are LTSP servers in a production environment. It may take a while until I have the chance.

Rüdiger Kupper (ruediger.kupper) on 2013-02-19

summary:

- Possible race condition causes boottime hang
+ Possible race condition with NFS mounts causes boottime hang

Rüdiger Kupper (ruediger.kupper) wrote on 2013-02-19: Re: Possible race condition with NFS mounts causes boottime hang

We tested the upstream kernel (3.8.0-030800rc7). The problem persists with the upstream kernel and is even worse: the machines hang *every time* during bootup.
After some timeout, I see
"Waiting for network configuration", then
"Waiting up to 60 more seconds for network configuration",
after that the system boots with network down.

I am pretty sure the problem is related to the NFS mounts in our fstab.

tags:	added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Rüdiger Kupper (ruediger.kupper) wrote on 2013-02-28:

We have now confirmed that the problem is the NFS mounts in /etc/fstab.
Setting the NFS entries in fstab to "noauto" completely removes the problem:
There is no timeout during boot, and no lockup any more. The machine boots smoothly with the NFS shares unmounted. After the machine is up, we can manually mount the NFS shares without a problem.

This is definitely a race condition during bootup:
the NFS shares seem to be mounted at a time when the network is not yet up, and waiting for the network to come up locks up the boot process.

summary:

- Possible race condition with NFS mounts causes boottime hang
+ Race condition with network and NFS mounts causes boottime hang

David Edwards (purple52) wrote on 2013-03-25:

Sounds similar to #525154, but that is marked as fixed.

Rüdiger Kupper (ruediger.kupper) wrote on 2013-03-25:

Thanks for the pointer. It's not exactly #525154, as our /var/lib/nfs is not on an NFS share. But #525154 shows that quite a lot of people have phenomenologically related problems...

Steve Langasek (vorlon) wrote on 2013-03-25:

Please attach /etc/fstab and /var/log/boot.log from the affected system.

Changed in nfs-utils (Ubuntu):
status:	New → Incomplete

Rüdiger Kupper (ruediger.kupper) wrote on 2013-03-25:

#10

/etc/fstab of the affected system (1.4 KiB, text/plain)

This is fstab. Adding "noauto" to the mount options prevents the hang, and the NFS shares can be mounted manually later (this is the workaround we use now).

Getting boot.log will take a few days. Since this is a server in a production system and the bug will literally hang the machine, I will need to be physically near the server to reboot...

Steve Langasek (vorlon) wrote on 2013-03-26:

#11

The boot.log from an existing successful boot would probably also be useful here.

As for it "hanging the machine", I see that you have 'splash' on the kernel command line. Do you have a graphical splash installed? Is there any interaction from mountall before the hang, notifying you of problems with the mounts?

Rüdiger Kupper (ruediger.kupper) wrote on 2013-03-27:

#12

boot.log from a successful boot with "noauto" option set for NFS shares (3.9 KiB, text/plain)

This is /var/log/boot.log from the last successful boot. All NFS shares had the "noauto" option set in /etc/fstab, so mountall did not try to mount them during boot. Instead, the shares get mounted from a script in /etc/rc.local at the end of the boot process. This works fine.
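
The mounting script in /etc/rc.local was not attached to the bug. As an illustration only, here is a sketch of what such a script might contain, assuming the shares to mount are exactly the nfs/nfs4 entries marked noauto in /etc/fstab; the function names noauto_nfs_dirs and mount_noauto_nfs are hypothetical.

```shell
# Hypothetical helper: print the mount point (field 2) of every nfs/nfs4
# entry in an fstab-format file whose options (field 4) include "noauto".
noauto_nfs_dirs() {
    awk '
        $1 ~ /^#/ { next }
        ($3 == "nfs" || $3 == "nfs4") && ("," $4 ",") ~ /,noauto,/ { print $2 }
    ' "$1"
}

# Hypothetical helper for /etc/rc.local: mount each such entry by its mount
# point; mount(8) takes the device and options from /etc/fstab. Mount points
# containing spaces (written as \040 in fstab) are not handled.
mount_noauto_nfs() {
    for dir in $(noauto_nfs_dirs /etc/fstab); do
        mount "$dir" || echo "mounting $dir failed" >&2
    done
}
```

Calling mount_noauto_nfs before the final "exit 0" in /etc/rc.local would mount the shares. This still assumes the network is usable by the time rc.local runs, which later comments in this thread show is not guaranteed.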

Rüdiger Kupper (ruediger.kupper) wrote on 2013-03-27:

#13

> As for it "hanging the machine", I see that you have 'splash' on the kernel command line. Do you have a graphical splash installed?

Yes, this is a Linux Terminal Server, so it has the full edubuntu-desktop metapackage installed. This includes the edubuntu plymouth splash. I saw no reason to remove it.

> Is there any interaction from mountall before the hang, notifying you of problems with the mounts?
Please see the bug description at the top of this page. There's no message from mountall, but the lines
* Starting Block the mounting event for NFS filesytems until statd is running [ OK ]
* Stopping Block the mounting event for NFS filesytems until statd is running [ OK ]
keep repeating.

Steve Langasek (vorlon) wrote on 2013-03-27: Re: [Bug 1118447] Re: Race condition with network and NFS mounts causes boottime hang

#14

On Wed, Mar 27, 2013 at 02:52:24PM -0000, Rüdiger Kupper wrote:
> This is /var/log/boot.log from the last successful boot. All NFS shares
> had the "noauto" option set in /etc/fstab, so mountall did not try to
> mount them during boot. Instead, the shares get mounted from a script in
> /etc/rc.local at the end of the boot process. This works fine.

Oh, right - if you were working around this with 'noauto', I guess that's
not as useful as I would hope for debugging :/

Rüdiger Kupper (ruediger.kupper) wrote on 2013-04-11:

#15

Hello Steve, today I had the opportunity to reboot our servers with NFS shares mounted from fstab. I tried to get a boot.log of a failing boot -- but there is no boot.log.

Wherever the hang occurs, it is before the log files get written (?):
When the machine hangs, I reboot into recovery mode (=root fs mounted ro), and find that no /var/log/boot.log, no /var/log/syslog, and no /var/log/dmesg have been written during the last boot.

Is there some magic I need to do to get a copy of the boot messages? (bootlogd is enabled in /etc/default/bootlogd, but does not produce a boot.log. I could still take a photograph of the screen, if that helps.)

penalvch (penalvch) on 2013-08-05

description:	updated
tags:	added: bios-outdated-1903
tags:	added: kernel-bug-exists-upstream-v3.8-rc6 needs-upstream-testing regression-potential removed: kernel-bug-exists-upstream

Zygmunt Krynicki (zyga) wrote on 2013-10-01:

#16

Hey. I seem to be experiencing this bug as well. It definitely feels like a race condition somewhere. I have to reboot my machine 2-5 times each morning to finally get it to boot. I don't have any issues mounting things after bootup is done; it certainly feels like a bug around upstart/mountall to me. This is on 12.04.3, with all the updates, etc.

Arjen Verweij (arjen-verweij) wrote on 2013-10-07:

#17

This bug has been around for years, and we have worked around it by mounting NFS from rc.local. It's not pretty, but rebooting your system and finding it hanging gets old fast. For some reason the NFS mounts won't wait until networking has completed (i.e. until the interface has been brought up).

I doubt it is a kernel bug, but I'm happy to test. It's not just Ubuntu, but also Mint, so I guess all of Debian and derivatives.

Josep Manel Andrés (titansmc) wrote on 2013-12-11:

#18

The workaround of mounting the NFS shares through rc.local doesn't work reliably for me: sometimes it mounts them and sometimes not. Could it be that rc.local gets executed at the end of the systemd scripts but not at the end of the upstart jobs?

Josep Manel Andrés (titansmc) wrote on 2013-12-12:

#19

I found what's going on. It looks to me like there is another race condition between upstart and systemd scripts: sometimes rc.local manages to mount some of my NFS shares, but other times it doesn't. The first thing I tried was a loop that waited until ifconfig showed a valid address, but that didn't work either: by the time the network card got its IP address, the route was not yet present. So I wrote a loop that waits for the route to be ready and then mounts the volumes:

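# Wait until the route via the gateway 192.168.40.1 shows up in the
# routing table, checking once per second instead of spinning the CPU.
until route -n | grep -qF "192.168.40.1"; do
    sleep 1
done
# Keep a copy of the routing table in the log file, as the original loop did.
route -n > /tmp/network

echo "mounting home" >> /tmp/network
mount /home 2>> /tmp/network

echo "mounting data" >> /tmp/network
mount /data 2>> /tmp/network

echo "mounting usr" >> /tmp/network
mount /usr/local 2>> /tmp/network
exit 0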

The corresponding /etc/fstab entries are set not to mount at startup, for example:

fourier:/export/data /data nfs vers=3,rsize=32768,wsize=32768,nosuid,soft,noauto 0 0

But this is a very quick and dirty solution. Is there a gentler workaround?
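
One possible refinement of the route-waiting loop above, offered as a sketch rather than a tested fix: a hypothetical wait_until helper that retries a check once per second but gives up after a fixed number of waits, so an unreachable gateway cannot stall rc.local indefinitely. The 60-second limit in the example is an arbitrary assumption.

```shell
# Hypothetical helper: run the given command once per second until it
# succeeds. Gives up after LIMIT one-second waits (roughly LIMIT seconds plus
# the time the checks take) and returns 1, so the caller is never blocked
# forever.
wait_until() {
    wu_limit=$1
    shift
    wu_tries=0
    until "$@"; do
        if [ "$wu_tries" -ge "$wu_limit" ]; then
            return 1
        fi
        sleep 1
        wu_tries=$((wu_tries + 1))
    done
    return 0
}

# Example (not executed here), reusing the gateway address from the script above:
#   wait_until 60 sh -c 'route -n | grep -qF 192.168.40.1' && mount /home
```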

penalvch (penalvch) wrote on 2014-01-03:

#20

Rüdiger Kupper, as per your https://launchpadlibrarian.net/130634701/BootDmesg.txt :
[ 0.000000] Your BIOS doesn't leave a aperture memory hole
[ 0.000000] Please enable the IOMMU option in the BIOS setup
...
[ 13.747117] EDAC amd64: DRAM ECC disabled.
[ 13.747128] EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
[ 13.747129] Either enable ECC checking or force module loading by setting 'ecc_enable_override'.

hence, as per http://support.asus.com/download.aspx?SLanguage=en&p=1&s=24&m=M5A97+EVO+R2.0&os=&hashedid=QsleSfiMgdBr9241 an update is available for your BIOS (2201). If you update to it following https://help.ubuntu.com/community/BiosUpdate , does it change anything? If it doesn't, could you please specify what happened and provide the output of the following terminal command:
sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date

For more on BIOS updates and linux, please see https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette .

Thank you for your understanding.

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete

Andrew Radke (andrew-radke) on 2014-01-04

tags:

removed: bios-outdated-1903

Andrew Radke (andrew-radke) wrote on 2014-01-04:

#21

I can confirm this bug on different hardware with completely up to date BIOS and firmware.

The race condition we are seeing is being reported from varied sources on different hardware over a significant period of time, so I think it is very unlikely to be affected by BIOS or firmware versions. I believe the status should still be Confirmed.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Andrew Radke (andrew-radke) wrote on 2014-01-04:

#22

A cleaner way of mounting the NFS shares than the one Josep provided is something like what I have below. Rather than waiting for the default route, it checks whether the NFS server responds to ping, and retries up to 3 times, each attempt taking a few seconds. It also checks all NFS mount points listed in /etc/fstab, which largely restores the functionality of that file.

In any case, this is a workaround, and fixing the original bug would be preferable to continued effort in this direction.

sed 's/#.*//' /etc/fstab | grep nfs | while read -r LINE
do
        HOST=$(echo "$LINE" | cut -f1 -d:)
        DIR=$(echo "$LINE" | awk '{print $2}')
        # -x -F: match the mount point as a whole, literal line
        if ! mount -l -t nfs | awk '{print $3}' | grep -qxF -- "$DIR"
        then
                echo -n "Mounting '$DIR'"
                COUNT=0
                STOP=0
                while [ $STOP -eq 0 ] && [ $COUNT -lt 3 ]
                do
                        if ping -qAc 3 "$HOST" > /dev/null 2>&1
                        then
                                mount "$DIR"
                                STOP=1
                                echo " done."
                        else
                                # ping fails immediately while there is no route to
                                # the host yet, so wait before the next attempt
                                sleep 3
                                COUNT=$((COUNT + 1))
                                echo -n "."
                        fi
                done
                if [ $STOP -eq 0 ]
                then
                        echo " failed. $HOST not responding."
                fi
        else
                echo "$DIR already mounted"
        fi
done
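
The script above selects entries with "grep nfs", which also matches any non-NFS line that merely contains the string "nfs" (for example a local disk mounted on a path such as /srv/nfs). A sketch of a stricter selection, assuming standard whitespace-separated fstab fields: the hypothetical function nfs_entries picks entries by the filesystem-type field and prints the server name and mount point, which could replace the grep/cut/awk lines at the top of the loop.

```shell
# Hypothetical helper: print "HOST MOUNTPOINT" for each nfs/nfs4 entry in an
# fstab-format file, selecting entries by the filesystem-type field (field 3)
# instead of grepping whole lines for "nfs".
nfs_entries() {
    awk '
        $1 ~ /^#/ { next }
        $3 == "nfs" || $3 == "nfs4" {
            host = $1
            sub(/:.*/, "", host)   # "fourier:/export/data" -> "fourier"
            print host, $2
        }
    ' "$1"
}

# Example (not executed here):
#   nfs_entries /etc/fstab | while read -r HOST DIR; do echo "$HOST $DIR"; done
```

It assumes a host:/path device field and does not handle bracketed IPv6 server addresses.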

penalvch (penalvch) wrote on 2014-01-04:

#23

Andrew Radke, thank you for your comments. Regarding them https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1118447/comments/21 :
>"I can confirm this bug on different hardware with completely up to date BIOS and firmware. The race condition we are seeing is being reported from varied sources on different hardware over a significant period of time, so I think it is very unlikely to be affected by BIOS or firmware versions. I believe the status should still be confirmed."

Despite it being replicated on disparate hardware, it is strongly preferred that folks with an outdated BIOS update to the latest version from their vendor. This need not be a showstopper, but please do not remove the tag, as it's used for tracking purposes.

In any event, could you please test the latest upstream kernel available (not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.13-rc6

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags:

added: bios-outdated-1903

Andrew Radke (andrew-radke) wrote on 2014-01-04:

#24

Unfortunately I'm not going to be able to test as requested in the short term as the systems involved are in production. I will see if I can replicate it on some virtual machines but again it could be a while until this is possible.

While I understand the preference for an up-to-date BIOS, it should be used with care as grounds for removing a Confirmed bug status when only one reporter's BIOS is known to be out of date. We have multiple people reporting the problem, and it should not be assumed that no one has an up-to-date BIOS.

Please also don't take the following as me trying to be rude, it is merely meant to be an observation of the history of the bug:

In the past we have also been asked to update our kernels, and at no stage has this improved the situation. Is there anything to indicate that it is likely to help on this occasion, or that it is a kernel bug and not a problem somewhere in the system startup scripts? It seems quite possible that it is a bug where upstart believes that the network is up when it's not.

I'm not an upstart expert so my observations here may be way off.
/etc/init/mountall-net.conf is set to start on net-device-up
/etc/init/network-interface.conf emits net-device-up but doesn't necessarily configure anything other than the loopback interface
/etc/init/networking.conf also emits net-device-up

Why is it that mountall-net (Mount network filesystems) is happening before either network-interface (configure network device) or networking (configure virtual network devices)?
And what would be the implications of removing "emits net-device-up" from /etc/init/network-interface.conf so that we only get net-device-up once ifup -a has been run rather than just a ne...
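
Whether a net-device-up event corresponds to a real interface or only to loopback can be checked from a script. The following sketch, using a hypothetical function non_loopback_ipv4, reads the one-line-per-address output of "ip -o -4 addr show" and prints every non-loopback interface that has an IPv4 address; an empty result means only lo is configured so far. As Josep found, an address alone does not guarantee that the route exists, so this is at best one part of a readiness check.

```shell
# Hypothetical helper: read "ip -o -4 addr show" output on stdin (one line per
# address) and print "INTERFACE ADDRESS" for each interface other than lo.
# Field 2 is the interface name, field 3 "inet", field 4 the address/prefix.
non_loopback_ipv4() {
    awk '$3 == "inet" && $2 != "lo" { print $2, $4 }'
}

# Example (not executed here): succeed only if a non-loopback address exists.
#   [ -n "$(ip -o -4 addr show | non_loopback_ipv4)" ]
```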

A "grep network /var/log/boot.log" on my system (which uses the noauto option and mounts via /etc/rc.local) shows the following:
 * Starting configure network device used by iSCSI root                                            [ OK ]
 * Starting configure network device security                                                      [ OK ]
 * Starting configure network device security                                                      [ OK ]
 * Starting Mount network filesystems                                                              [ OK ]
 * Stopping Mount network filesystems                                                              [ OK ]
 * Starting configure network device                                                               [ OK ]
 * Starting Mount network filesystems                                                              [ OK ]
 * Stopping Mount network filesystems                                                              [ OK ]
 * Starting configure network device                                                               [ OK ]
 * Starting configure network device security                                                      [ OK ]
 * Starting configure virtual network devices                                                      [ OK ]
 * Stopping configure virtual network devices                                                      [ OK ]
 * Starting Serial port to network proxy ser2net                                                   [ OK ]

Rüdiger Kupper (ruediger.kupper) wrote on 2014-01-04:

#25

Christopher, thank you for your comment on the outdated BIOS version. I completely see your point about good quality of bug reports.
It may be some time until I can update our BIOSes, however. These are LTSP servers powering a whole school with 700 everyday users, so it is essential that they are stable. Since we are teachers by profession, running a network is a tough job for us: our superior authorities grant us almost no time for it. It's like having a full-time job that keeps you busy 120% of your time while being asked to run the company network in your free time, with full responsibility.
I rely heavily on sources like Launchpad for that job, but I simply cannot always run tests on that system in a timely way. Still, I want to give back a little by writing bug reports and helping to resolve bugs where I can.

I will certainly find opportunity to upgrade our BIOSes sooner or later, but as others confirm the bug, I would not want to let this block the bug resolution process. Thanks to all for your help and effort!

Steve Langasek (vorlon) wrote on 2014-01-08:

#26

This bug very clearly has nothing at all to do with the BIOS. Christopher, it is inappropriate to ask bug submitters to test with a new BIOS for bugs like this - the BIOS is entirely unrelated to the network filesystem layer, and if upgrading the BIOS did have any effect, it would be *irrelevant* to the bug at hand.

tags:

removed: bios-outdated-1903

penalvch (penalvch) wrote on 2014-01-08:

#27

Steve Langasek, thank you for your comment. While it is always appreciated when a developer steps in and advises on a bug report, a portion of your comments don't make sense (which I'm happy to take off the report, but they are relevant here). Regarding your comments:
>"This bug very clearly has nothing at all to do with the BIOS."

This wouldn't be clear to everyone, and just saying so doesn't clear this up for folks.

>"Christopher, it is inappropriate to ask bug submitters to test with a new BIOS for bugs like this - the BIOS is entirely unrelated to the network filesystem layer,..."

Ok, but the BIOS does have a hand in a well-functioning system, on which the network filesystem layer depends. No?

>"...if upgrading the BIOS did have any effect, it would be *irrelevant* to the bug at hand."

Wouldn't the updated BIOS having a positive effect be the exact reason why one would want to update, in order to eliminate collateral damage from a buggy BIOS?

Thank you again for your comment, and understanding.

penalvch (penalvch) on 2014-01-08

tags:

added: bios-outdated-1903

Steve Langasek (vorlon) wrote on 2014-01-09:

#28

On Sat, Jan 04, 2014 at 07:28:12AM -0000, Andrew Radke wrote:
> In the past we have also been asked to update our kernels and at no
> stage has this improved the situation. Is there anything to indicate
> that it is likely to help on this occasion or that it is a kernel bug
> and not a problem somewhere in the system startup scripts. It seems
> quite possible that it is a bug where upstart believes that the network
> is up when it's not.

If this is the case, it's necessarily a bug in either the kernel or the
networking layer, not in upstart. Upstart itself (mountall) only tries to
mount the network filesystems when it's received a notification that a
network interface is up. It may not be the interface *needed* for the
mount, but in this case it's expected to fail immediately and upstart will
wait for the next network interface before retrying.

If this is a kernel bug, it's an obscure one, and I don't see any point in
having you retest with newer upstream kernels.

Steve Langasek (vorlon) wrote on 2014-01-09:

#29

On Wed, Jan 08, 2014 at 06:47:03PM -0000, Christopher M. Penalver wrote:

> Steve Langasek, thank you for your comment. While it is always appreciated
> when a developer steps in and advises to a bug report, a portion of your
> comments don't make sense (which I'm happy to take off report, but they
> are relevant here). Regarding your comments:

> >"This bug very clearly has nothing at all to do with the BIOS."

> This wouldn't be clear to everyone, and just saying so doesn't clear
> this up for folks.

It wouldn't be clear to everyone, but it *should* be clear to anyone who's
involved in triaging kernel bug reports. If you're going to be asking bug
submitters to take time to test with a newer BIOS, you should have a very
good reason for it; otherwise you're wasting the valuable resource of our
bug submitters' time, and filling the bug report with irrelevancies that
do nothing to help fix the user's issue or improve Ubuntu.

And in some cases, causing bugs to expire out of the system for invalid
reasons.

>> "Christopher, it is inappropriate to ask bug submitters to test with a
>> new BIOS for bugs like this - the BIOS is entirely unrelated to the
>> network filesystem layer,..."

> Ok, but the BIOS does have a hand in a well functioning system, of which
> the network filesystem layer depends on. No?

No. The network filesystem layer *is unrelated to the BIOS*.

And if you don't know this, you should not be following up to bugs asking
users to do tests with a newer BIOS.

Almost none of what the kernel does depends on the BIOS. There are only a
few exceptions: suspend/resume support, power management, some aspects of
video setup, hotkeys. Outside of this, the kernel is directly responsible
for the hardware, and any bugs are kernel bugs, not BIOS bugs.

>> "...if upgrading the BIOS did have any effect, it would be *irrelevant*
>> to the bug at hand."

> Wouldn't the updated BIOS having a positive affect, be the exact reason
> why one would want to update, in order to eliminating collateral damage
> from a buggy BIOS?

No. Any effect here is *not* because the BIOS is buggy, it's because modern
kernels and firmware are complex and changing the BIOS could
*coincidentally* work around this bug for one user. But it doesn't fix the
bug, and doesn't help other users who are experiencing this bug.

On Wed, Jan 08, 2014 at 06:47:03PM -0000, Christopher M. Penalver wrote:

> >"This bug very clearly has nothing at all to do with the BIOS."

> This wouldn't be clear to everyone, and just saying so doesn't clear
> this up for folks.

It wouldn't be clear to everyone, but it *should* be clear to anyone who's
involved in triaging kernel bug reports.  If you're going to be asking bug
submitters to take time to test with a newer BIOS, you should have a very
good reason for it; otherwise you're wasting the valuable resource of our
bug submitters' time, and filling the bug report with irrelevancies that
do nothing to help fix the user's issue or improve Ubuntu.

And in some cases, causing bugs to expire out of the system for invalid
reasons.

>> "Christopher, it is inappropriate to ask bug submitters to test with a
>> new BIOS for bugs like this - the BIOS is entirely unrelated to the
>> network filesystem layer,..."

> Ok, but the BIOS does have a hand in a well functioning system, of which
> the network filesystem layer depends on. No?

No.  The network filesystem layer *is unrelated to the BIOS*.

And if you don't know this, you should not be following up to bugs asking
users to do tests with a newer BIOS.

Thomas Antepoth (ta-ubuntu-antepoth) wrote on 2014-07-20:

#30

I'm currently about 300 km away from my server and have noticed that the machine fails to boot.

Symptoms are:

Open ports on the server:
Not shown: 997 closed ports
PORT    STATE SERVICE
111/tcp open  rpcbind
139/tcp open  netbios-ssn
445/tcp open  microsoft-ds
MAC Address: 50:E5:49:92:FB:3F (Giga-byte Technology Co.)

samba is up - and I can connect to the shares.

But listing a directory on an NFS-mounted share on the server shows nothing, which means the machine is still waiting for an NFS mount to complete:

smb: \> dir
. D 0 Tue Jun 19 14:27:16 2012
.. D 0 Sat Jul 12 22:06:24 2014

42459 blocks of size 16777216. 4075 blocks available
smb: \>
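One plausible reading of the empty listing above: Samba exports the mount point directory, and because the NFS mount never completed, clients see the empty local directory underneath it. A minimal Python sketch of that check, to be run on the server itself once it is reachable; `share_state` is a hypothetical helper, not an existing tool.

```python
import os


def share_state(path: str) -> str:
    """Report whether `path` currently has a filesystem mounted on it.

    Returns "missing" if the directory does not exist, "mounted" if it is a
    mount point, and "not mounted" otherwise. In the last case, anything
    that lists `path` (including a Samba share exporting it) sees the
    contents of the underlying local directory, typically just . and ..
    """
    if not os.path.isdir(path):
        return "missing"
    # os.path.ismount compares `path` with its parent directory via stat().
    # Caveat: on a hard-mounted NFS share whose server is unreachable, the
    # stat calls themselves may block.
    if os.path.ismount(path):
        return "mounted"
    return "not mounted"


if __name__ == "__main__":
    import sys

    for p in sys.argv[1:]:
        print(f"{p}: {share_state(p)}")
```

Run it with the expected mount point(s) as arguments; "not mounted" for a directory that should hold the NAS share would match the symptom described here.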

The NFS daemon on the hanging machine is not running properly:

showmount -e 192.168.186.199
clnt_create: RPC: Program not registered

while the NAS from which the server gets the aforementioned share is up and running fine:

showmount -e 192.168.186.240

Export list for 192.168.186.240:
/Qweb
/Qusb
/Qrecordings
/Qmultimedia
/Qdownload
/Public
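The `clnt_create: RPC: Program not registered` error from `showmount` against 192.168.186.199 typically means that rpcbind is answering (port 111 is open in the scan above) but `mountd` has not registered with it. A minimal Python sketch that parses `rpcinfo -p <host>` output and lists which NFS-related RPC programs are absent; the helper names and the chosen set of programs are assumptions for illustration.

```python
# Well-known ONC RPC program numbers used by the NFS stack.
NFS_PROGRAMS = {
    100003: "nfs",
    100005: "mountd",    # the service showmount talks to
    100021: "nlockmgr",
    100024: "status",    # rpc.statd
}


def registered_programs(rpcinfo_output: str) -> set:
    """Collect program numbers from `rpcinfo -p` output."""
    programs = set()
    for line in rpcinfo_output.splitlines():
        fields = line.split()
        # Data rows begin with a numeric program number; the header does not.
        if fields and fields[0].isdigit():
            programs.add(int(fields[0]))
    return programs


def missing_nfs_programs(rpcinfo_output: str) -> list:
    """Names of NFS-related programs that rpcbind does not list."""
    present = registered_programs(rpcinfo_output)
    return [name for number, name in sorted(NFS_PROGRAMS.items())
            if number not in present]


if __name__ == "__main__":
    import subprocess
    import sys

    host = sys.argv[1] if len(sys.argv) > 1 else "localhost"
    result = subprocess.run(["rpcinfo", "-p", host],
                            capture_output=True, text=True, check=False)
    print(missing_nfs_programs(result.stdout))
```

Parsing text output keeps the sketch dependency-free; it does not speak the RPC protocol itself.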

Because the boot hangs at such an early stage, I cannot log in via ssh, and I have to get remote hands (on a Sunday) to get things running again.

When the remote hands are available I'll post some additional information about this issue.

Christian Mertes (mertes) wrote on 2016-10-16:

#31

Hey,

Here at our institute we use VMware ESXi to deploy Ubuntu 14.04 machines running:

Linux hostname 3.13.0-98-generic
#145-Ubuntu SMP Sat Oct 8 20:13:07 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

And we are experiencing the same bug with the NFS race condition. Is there a bug fix yet, after 2 years? Or do we still need to mount the NFS shares manually after boot?

I'm happy to share test results or logs if needed.

Christian González (droetker) wrote on 2017-10-02:

#32

This may not help much here, but it *could* be that the problem has nothing to do with the network being up and is instead about DNS resolution.
I had the problem on one computer (I don't know why) and spotted the following line in /var/log/syslog:

mount[859]: mount.nfs: Failed to resolve server kerberos: Temporary failure in name resolution

So I tested a bit, and found that if I put "kerberos" with the right IP into /etc/hosts, it works perfectly.

So perhaps the problem is that name resolution still does not work at boot time, when the NFS shares are supposed to be mounted.
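If this DNS hypothesis is right, a mount step that waits for the server name to resolve, within a bounded timeout, would avoid the "Temporary failure in name resolution" error. A minimal Python sketch of such a wait loop; `wait_for_resolution` is a hypothetical helper, not part of mount.nfs or Upstart, and the timeout values are assumptions.

```python
import socket
import time


def wait_for_resolution(host, timeout=30.0, interval=1.0,
                        resolve=socket.getaddrinfo):
    """Poll until `host` resolves or `timeout` seconds elapse.

    Returns True as soon as a lookup succeeds, False on timeout.
    `resolve` is injectable so the logic can be exercised without DNS.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            resolve(host, None)
            return True
        except socket.gaierror:
            # e.g. EAI_AGAIN, "Temporary failure in name resolution"
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

An /etc/hosts entry like the one described above makes `getaddrinfo` succeed without DNS, assuming `files` precedes `dns` on the `hosts:` line of /etc/nsswitch.conf, which would explain why that workaround helps.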

Rüdiger Kupper (ruediger.kupper) wrote on 2017-10-03:

#33

Hi Christian,
thanks for the suggestion, but we used IP addresses, not host names, for the NFS mounts (see fstab in #10).
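The distinction drawn here can be checked mechanically: an NFS entry whose server is an IP literal needs no name lookup to locate the server at mount time. A minimal Python sketch that lists the NFS servers in an fstab that are *not* IP literals; `nfs_servers_needing_dns` is a hypothetical helper, and it only inspects the server field, not other options that might involve names.

```python
import ipaddress


def nfs_servers_needing_dns(fstab_text: str) -> list:
    """Return NFS server names in fstab text that are not IP literals."""
    needing = []
    for raw in fstab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 3 or fields[2] not in ("nfs", "nfs4"):
            continue
        # NFS sources have the form server:/export (IPv6 as [addr]:/export).
        server, sep, _export = fields[0].partition(":/")
        if not sep:
            continue
        try:
            ipaddress.ip_address(server.strip("[]"))
        except ValueError:
            needing.append(server)
    return needing


if __name__ == "__main__":
    with open("/etc/fstab") as f:
        print(nfs_servers_needing_dns(f.read()))
```

An empty list for a given fstab means the DNS explanation from #32 cannot apply to that machine's server addresses.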
