live cd from nfsroot breaks the nfs mount during bootup

Bug #268005 reported by Peter Cordes
Affects                   Status  Importance  Assigned to  Milestone
casper (Ubuntu)           New     Undecided   Unassigned
network-manager (Ubuntu)  New     Undecided   Unassigned

Bug Description

Binary package hint: network-manager

I'm half guessing that this is N-M's bug, so it may need to be reassigned.

 I've been trying to boot the Intrepid Desktop i386 (alpha5) LiveCD from a NFS/TFTP/DHCP server (i.e. PXE).

 I can netboot the Intrepid live CD, accessing filesystem.squashfs over nfs (mounted by the initramfs, so not nfsroot by all definitions). The problem is that after the init scripts are done, the NFS mount is dead, and all I/O hangs forever. cat is usually in the cache, so I can even cat /var/log/kern.log, and files in /proc, like /proc/mounts, but if tab-completion accesses anything from filesystem.squashfs, the tty is tied up permanently. /etc/init.d/NetworkManager runs late in the boot sequence, so it's possible that all the scripts after it are already cached locally. Or that it doesn't break the network until later. Or maybe it's not N-M's fault at all.

 So you only get 6 tries after switching away from X (which manages to start up far enough to show a blank orange screen and start spinning the cursor). Running things in the background in case they hang works, but tab-completion will still hang your shell. /proc/mounts doesn't indicate any problems, and the NFS mount looks up. I don't know what to cat in /proc or /sys to duplicate the info I'd get from ifconfig. (Neither it nor ip(8) is in the cache.) This is a really hard problem to debug... TORAM=yes might help, but it doesn't seem to do anything. (There is code that looks for that env var, even if toram isn't parsed by /init.) Probably the thing to do would be to mount a local filesystem somewhere and run binaries from it.
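
One possible way to get ifconfig-like link information without running any external binary is to read sysfs with the shell's `read` builtin. The sketch below assumes the kernel exposes `operstate`, `carrier` and `address` under /sys/class/net/<iface>; the function name `iface_state` and the `SYSFS_NET` override are made up for illustration and do not come from casper or N-M.

```shell
# Print link state for one interface using only shell builtins
# (no cat, ifconfig or ip needed, so nothing has to be fetched over NFS).
# SYSFS_NET is a hypothetical override for testing; the default is the real sysfs path.
iface_state() (
    dir="${SYSFS_NET:-/sys/class/net}/$1"
    [ -d "$dir" ] || { echo "$1: no such interface"; exit 1; }
    for attr in operstate carrier address; do
        value=""
        if [ -r "$dir/$attr" ]; then
            # Reading an attribute can fail (carrier may be unreadable while
            # the interface is down); in that case value stays empty.
            read -r value 2>/dev/null < "$dir/$attr" || :
        fi
        [ -n "$value" ] || value="?"
        echo "$1 $attr=$value"
    done
)
```

Calling `iface_state eth0` before and after N-M's init script runs should show whether the link state changes. Routing information would still have to come from somewhere else, for example /proc/net/route.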

 It's definitely an init script that breaks the NFS mount, because booting with init=/bin/bash drops me to a shell after the initramfs does its thing. Then I can run find over the whole squashfs filesystem with no problems.

 My client machine (holly) PXE boots, and my pxelinux.cfg/default looks like

DEFAULT menu.c32
SAY press return for menu
prompt 1

LABEL intrepid-i386
        kernel intrepid-i386/vmlinuz
        append boot=casper netboot=nfs nfsroot=10.0.0.17:/mnt/GP1TB/srv/intrepid-a5-i386 initrd=intrepid-i386/initrd.gz --
# other useful args: text to not start gdm

...other LABELs

Everything is set up properly so it loads vmlinuz and initrd.gz that were unpacked from /casper in the intrepid iso image. I've been netbooting Debian installers and whatnot for years, and I'm sure I didn't get that part wrong. BTW, I had to search for a while and eventually read /init to figure out the right boot args, if those even are all the boot args needed. I found lots of older docs, e.g. for Feisty and Gutsy. I guess I should have looked at initramfs(8), although it doesn't say what combination you need exactly for NFS root to work.

10.0.0.17:/... is where I unpacked the whole ISO (with 7z x intrepid-desktop-i386.iso). casper/filesystem.squashfs exists under that. /mnt/GP1TB is exported to the subnet that the client is on. I unpacked the CD instead of exporting a loopback mount of the iso image to rule out any possibility of problems on the NFS server side.

 Unsurprisingly, I get the same results if I boot the same kernel and initrd with the same args, but from syslinux on a USB stick instead of pxelinux.

 I've tried this on two different clients with identical results: a Dell PE1950 server at work (dual bnx2 gigE onboard), and an Asus K8V (Marvell Yukon gigE onboard, no other NICs) at home. So it's not an eth0/eth1 confusion problem, because my home machine only has an eth0. I use NFS all the time between my home machines, so I'm confident there's nothing wrong with the network or my NFS setup. (The NFS server (tesla) is running Ubuntu's 2.6.24-21.42-generic kernel; the boot server (llama) is running dnsmasq 2.43-1~bpo40+1 on Debian Etch. Etch's pxelinux, too, except Hardy's at work and when booting syslinux from USB...)

 LiveCD netbooting used to work as late as Gutsy. See https://wiki.ubuntu.com/LiveCDNetboot. It doesn't work with Hardy, either, but I haven't tried as hard to see if maybe a different combination of kernel params would work. I haven't managed to get Hardy as far as NFS mounting the squashfs, so its problems aren't due to any init scripts.

Revision history for this message
Peter Cordes (peter-cordes) wrote :

I thought of a way to debug this: hit alt+sysrq+e to send a sigterm to all tasks while init scripts are running, but before NetworkManager starts.

 I tried it after booting with TORAM=yes, since it seemed to be hung there. It actually works, and I'm running from a filesystem.squashfs that was loaded to tmpfs. So I manually ran some sudo /etc/rc2.d/whatever start, and after starting dbus then NetworkManager, I see that ifconfig shows eth0 go down for a short while after N-M's start script runs.

 For netbooting the livecd to work, we need a way to prevent N-M from running, or at least from doing this. Or we need NFS mounts that survive the interface going down then up. In fact I'm a bit surprised it doesn't seem to survive. (since I booted with TORAM=yes, I didn't have any NFS mounts when I was running the init scripts.)

Revision history for this message
Peter Cordes (peter-cordes) wrote :

The Hardy netboot issues can be debugged by booting with break=mount debug=y,
then running
t=/dev/tty2; sh <$t >$t 2>$t & # repeat for tty3 if you want
exit

 boot will continue, and hang because the network is down. You can debug by switching to another console and looking at /casper.log (where stdout and stderr are redirected).

Some script tries to switch to manual mode when nfsroot booting, so ifup doesn't screw it up later. I guess if you don't put ip=something (other than dhcp) on the commandline, Hardy won't nfsroot boot.

 This is not the NetworkManager issue, and is not present in Intrepid. I only mention it because Hardy is an LTS release, so people may be trying to mess around with it on servers for a while...

 Hmm, Hardy might be failing because it tries to use eth0, while the network cable is plugged in to eth1, according to its detection order. (I haven't tried netbooting Hardy at home, only on a Dell PE1950 with dual bnx2 NICs.) Intrepid detects the ports in the opposite order, with eth0 as the port marked GbE 1 (of 2) on the back, and which has the lower MAC address. I plugged my cables into GbE 1 since the BIOS defaults to netbooting from that interface, but not the other one, among other reasons. So anyway, maybe it's just Hardy's bad luck, and Intrepid would have the same problem if it needed to use eth1.

Revision history for this message
Peter Cordes (peter-cordes) wrote :

I have a workaround for nfsroot booting Intrepid:
boot with break=init
touch /cow/etc/init.d/NetworkManager
exit

 You will boot normally, but N-M won't run (because files in /cow take precedence over files in filesystem.squashfs, so its init script is empty).
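
The same workaround can be wrapped in a small helper for anyone who wants to script it from the break=init shell. This is only a sketch: the function name `mask_init_script` and the `COW_ROOT` override are hypothetical, and it assumes the writable overlay is mounted at /cow as in the steps above.

```shell
# Shadow an init script with an empty file in the writable overlay, so the
# copy in filesystem.squashfs is hidden.
# COW_ROOT is a hypothetical override for testing; /cow is the path used above.
mask_init_script() (
    target="${COW_ROOT:-/cow}/etc/init.d/$1"
    mkdir -p "${target%/*}" || exit 1
    # Create an empty file. Unlike touch, this also truncates a file that
    # already exists in the overlay.
    : > "$target"
)
```

Running `mask_init_script NetworkManager` followed by `exit` would then continue the boot with an empty N-M init script.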

Revision history for this message
Peter Cordes (peter-cordes) wrote :

Any way of preventing NetworkManager from starting (or from bringing your interface down and then up) is sufficient. Or maybe just pre-caching the files it will try to access while the network is down, if N-M gets stuck trying to access some files while the network is down.

Revision history for this message
Peter Cordes (peter-cordes) wrote :

Hardy does not suffer from this bug. The problems I was having with Hardy were
https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/182940, because as I said, Hardy's kernel detected interfaces in opposite order to how they're labeled on the chassis.

 I netboot Hardy at home on my single-NIC desktop with no problems.

Revision history for this message
Peter Cordes (peter-cordes) wrote :

I marked this as also affecting casper, since it's with boot=casper, not boot=nfs. Also, most nfsroot systems probably wouldn't have N-M installed.

In Hardy and Intrepid, scripts/casper-bottom/23networking builds a /etc/network/interfaces with
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet manual

auto eth1
iface eth1 inet manual

In Hardy, this apparently prevents N-M from bringing the interface down then up. nm-applet's dropdown just shows "manual..."

In Intrepid, this doesn't stop N-M. So unless there's some other way of asking N-M to back off (which casper should use), this is a regression in N-M in Intrepid.

 N-M's changelog shows that 21_manual_means_always_online was dropped or not needed anymore. Maybe that's when the regression happened. Or maybe not, but I'm not going to debug this further myself.
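
For reference, a file of the shape shown above could be generated along these lines. This is an illustrative sketch, not casper's actual 23networking code: the function name `gen_manual_interfaces` and the `SYSFS_NET` override are invented here, and it assumes wired interfaces appear as eth* under /sys/class/net.

```shell
# Emit an /etc/network/interfaces body that marks every ethN device "manual".
# Illustration only; not taken from scripts/casper-bottom/23networking.
# SYSFS_NET is a hypothetical override for testing.
gen_manual_interfaces() (
    printf 'auto lo\niface lo inet loopback\n'
    for path in "${SYSFS_NET:-/sys/class/net}"/eth*; do
        [ -e "$path" ] || continue      # the glob matched nothing
        name="${path##*/}"
        printf '\nauto %s\niface %s inet manual\n' "$name" "$name"
    done
)
```

Redirecting the output, e.g. `gen_manual_interfaces > /root/etc/network/interfaces`, would write the file; the /root prefix is an assumption about where the target filesystem is mounted inside the initramfs.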

Revision history for this message
Ansgar Pflipsen (ansgar) wrote :

afaics it is a problem with dhclient3.

Had the same issue in a netboot scenario. As soon as dhclient (I think it was version 3.01) gets called,
the network interface shuts down. (Found this after upgrading an older Debian-based netboot fs.)

Unfortunately the call to dhclient isn't found in /etc. The documented way to prevent N-M from handling
eth0 is to remove the auto entry from /etc/network/interfaces.

Unfortunately, none of the other required initialisation will be done then. The other way is to set eth0 to manual,
which works fine for me.

Other possibilities are:
- use the old-style init scripts
- add a (virtual) auto ethX interface to keep N-M running.
  (loopback is not enough to start network services, nfs ...)
- dhclient
  (necessary if you don't have fixed IP addresses and need to refresh the leases...)
- fix N-M

IMHO, an "already-up" option for interfaces is needed to prevent initialisation in netboot environments.
The same goes for kernel parameters (e.g. ip=up instead of ip=dhcp). This would prevent unnecessary (repeated)
DHCP calls.

Revision history for this message
Colin Watson (cjwatson) wrote :

This is probably related to bug 256054; if not for that bug, casper's workaround would be good enough.

Revision history for this message
Peter Cordes (peter-cordes) wrote :

I had thought this was bug 92338, but that was fixed by Hardy's N-M's support for
iface eth0 inet manual

The fact that that no longer works is a different bug (this one, in fact). On bug 92338,
---
 Alexander Sack wrote on 2008-09-22: (permalink)
since gutsy we dont manage interfaces configure in /etc/network/interfaces anymore. ifupdown and network-manager are mutually exclusive. So given that nfsroot uses something in /etc/network/interfaces, all should be fine for you.
---

 So some people think casper's workaround should still be enough, and I agree it looks like bug 256054. Maybe removing the auto eth0, and just keeping iface eth0 would do the trick? No idea.

Ansgar, casper doesn't parse any ip= options. boot=nfs does, but not boot=casper.

 How would ip=up work? You want Linux to get the DHCP lease from the pxelinux that booted it? I don't think there's any mechanism for that, other than encoding the ip, netmask, gateway, and dns into kernel command line options for either Linux itself or the initramfs to parse. If you want Linux to parse it, probably your ethernet drivers need to be compiled into the kernel. The trend is always to move things out of the kernel and into user-space, so parsing complex kernel params is probably not going to get added, and if it already exists it will probably get removed sometime!
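
To make the "encode it on the command line" option concrete, here is a sketch of how an initramfs script might split a static ip= argument into its fields. The field order follows the kernel's documented syntax, ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>. The function name `parse_ip_arg` is hypothetical, and as noted above casper itself does not parse ip=. PXELINUX also documents an IPAPPEND directive that appends a string of this shape to the kernel command line, though that has not been tried in this report.

```shell
# Extract the ip= argument from a kernel command line and split it into fields.
# Field order follows the kernel's documented ip= syntax:
#   ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>
parse_ip_arg() (
    for arg in $1; do
        case "$arg" in
            ip=*)
                # Empty fields (two colons in a row) are kept as empty strings.
                IFS=: read -r client server gateway netmask hostname device autoconf <<EOF
${arg#ip=}
EOF
                echo "client=$client server=$server gateway=$gateway netmask=$netmask device=$device"
                exit 0
                ;;
        esac
    done
    exit 1   # no ip= argument present
)
```

On a running system the input could come from /proc/cmdline, for example `read -r cmdline < /proc/cmdline; parse_ip_arg "$cmdline"`.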

Revision history for this message
Saivann Carignan (oxmosys) wrote :

I've been able to boot LiveCD from nfs mount with intrepid and jaunty alpha 3 without any problem, so I think that this bug is fixed (or that a workaround is now enabled). Should this bug be closed?

If yes, https://wiki.ubuntu.com/LiveCDNetboot needs to be updated.

Revision history for this message
TJ (tj) wrote :

I've tested this with Hardy, Intrepid, and Jaunty and cannot reproduce the issue.

The PXE configuration is detailed at:

http://tjworld.net/wiki/Linux/Ubuntu/NetbootPxeLiveCDMultipleReleases

Revision history for this message
Colin Watson (cjwatson) wrote :

OK, given recent confirmation that this appears to be fixed, I think we can safely call this a duplicate of bug 256054.
