Boot process hangs in Ubuntu 11.10 server after upgrade

Bug #885909 reported by Owen Duffy
This bug affects 1 person
Affects: ifupdown (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

The boot process is delayed, with the on-screen message "ubuntu 11.10 booting system without full network configuration".

Some processes did not start (e.g. nmbd, mediatomb). They appear to be waiting for an event; in nmbd's case: start on (local-filesystems and net-device-up IFACE!=lo)

Some status checks:

root@u01:~# status network-interface INTERFACE=lo
network-interface (lo) start/running
root@u01:~# status network-interface INTERFACE=eth0
network-interface (eth0) start/running
root@u01:~# status nmbd
nmbd stop/waiting

So the prerequisites for nmbd appear to be running, yet nmbd is still waiting to start.

Looking at the code in network-interface.conf:

pre-start script
    if [ "$INTERFACE" = lo ]; then
        # bring this up even if /etc/network/interfaces is broken
        ifconfig lo 127.0.0.1 up || true
        initctl emit -n net-device-up \
            IFACE=lo LOGICAL=lo ADDRFAM=inet METHOD=loopback || true
    fi
    mkdir -p /var/run/network
    exec ifup --allow auto $INTERFACE
end script

It emits an event when lo starts, but what emits an event to signal that eth0 has started?

Owen

Owen Duffy (owwn) wrote :

In searching the net, it has become obvious that Ubuntu 11.10 is severely bug-ridden, that bugs which prevent the system from fully starting are trivialised, and that there doesn't look to be a fix anytime soon.

My system is 11.10 server with one external network interface, eth0, using onboard hardware. There has never been any doubt that the network interface is actually starting and working fine. This cannot be an unusual configuration.

That said, I have looked at why failsafe.conf is delaying things. It looks like it provides a safety net for other failures:

rc-sysinit.conf contains: start on (filesystem and static-network-up) or failsafe-boot

So perhaps it is filesystem or static-network-up that is the problem?

network-interface.conf is the only job that contains an "emits static-network-up" stanza, but it doesn't seem to actually emit the event.

I discovered two things: the pre-start script appears to exit ungracefully, and it doesn't seem to emit net-device-up (needed by nmbd, for instance) or static-network-up (seemingly needed to allow the boot to complete).

If some of this seems uncertain, it is because upstart is totally new to me, the system is short on documentation, and it is not my objective to be an Ubuntu sysprog.

But, this might help others to develop a fix to allow their system to boot fully.

Owen

Owen Duffy (owwn) wrote :

This turns out to have been a result of the following line in /etc/network/interfaces:

up flush-mail

Removing this line allows the ifup command to complete and fully configure the eth0 interface, which does emit the necessary events to allow the boot process to continue.

Messages reporting the failure of ifup are hidden by the wonder of the upstart scripting, so there is no warning that the network interface (which seemed to work fine) was only partially configured/started.

At some point, it seems, the flush-mail command was removed by an upgrade; the up command then fails, ifup ends prematurely, and there is no warning.

Perhaps hiding all the detail of startup by writing to /dev/null is not such a clever idea?
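One way to catch this class of problem before rebooting is to scan the interfaces file for hook commands that no longer exist. The following is a minimal sketch, not part of ifupdown: the function name check_hooks and its "missing:" output format are made up for illustration, and it only inspects the first word of each hook line, so hooks that begin with a variable assignment or other shell syntax may be reported falsely.

```shell
#!/bin/sh
# Sketch: list hook commands in an interfaces file whose executable is not found.
# check_hooks is a hypothetical helper name, not an ifupdown tool.
# Usage: check_hooks [FILE]   (FILE defaults to /etc/network/interfaces)
check_hooks() {
    file=${1:-/etc/network/interfaces}
    # Hook lines have the form "up COMMAND ARGS..." (likewise pre-up, post-up,
    # down, pre-down, post-down). read strips the leading indentation.
    while read -r keyword cmd _rest || [ -n "$keyword" ]; do
        case "$keyword" in
            up|pre-up|post-up|down|pre-down|post-down) ;;
            *) continue ;;
        esac
        [ -n "$cmd" ] || continue
        # command -v succeeds for builtins and for anything found on PATH.
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing: $keyword $cmd"
        fi
    done < "$file"
}
```

Sourcing the snippet and running check_hooks /etc/network/interfaces on the system described here would be expected to flag the "up flush-mail" line, provided flush-mail really is absent from PATH.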

Owen

affects: ubuntu → ifupdown (Ubuntu)
Owen Duffy (owwn) wrote :

In summary, the root cause of the problem was a line in /etc/network/interfaces which tried to run a nonexistent command (up flush-mail).

ifup silently fails in that case. Although it has done almost everything needed to start the interface, and the interface is fully functional, it neither creates the semaphore that signals completion (which ifdown relies on) nor emits the relevant upstart events. ifconfig reports normal operation of the interface, while ifdown says the interface does not exist and exits.
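The mismatch between ifconfig and ifdown can be checked directly by looking for the interface in ifup's state file. This is a rough sketch under assumptions: the default path /run/network/ifstate and the "physical=logical" line format (e.g. eth0=eth0) are assumed here and may differ between releases, and iface_recorded is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether ifup recorded IFACE as configured in its state file.
# iface_recorded is a hypothetical helper name.
# Usage: iface_recorded IFACE [STATEFILE]
iface_recorded() {
    iface=$1
    # ASSUMPTION: state file location; pass the real path if it differs.
    statefile=${2:-/run/network/ifstate}
    # ASSUMPTION: entries look like "physical=logical", e.g. "eth0=eth0".
    # Compare the part before "=" exactly against the interface name.
    if cut -d= -f1 "$statefile" 2>/dev/null | grep -qxF -- "$iface"; then
        echo "$iface: recorded as configured"
    else
        echo "$iface: not recorded (ifup may have exited early)"
        return 1
    fi
}
```

If ifconfig shows the interface up but iface_recorded eth0 reports it as not recorded, that would be consistent with ifup having stopped partway through, as described above.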

upstart's network-interface job silently ends when ifup fails. Why are events for the lo interface issued from the network-interface job directly, whereas for other interfaces they are emitted from the subsidiary ifup command's post script? This smacks of half-baked patching on the fly.

The whole design philosophy of scripts that fail silently, without useful exception reporting, contributes to the unreliability evidenced by the large number of bug reports pertaining to startup functions migrated to upstart.

The time for robust logging to be included in upstart was before migration, not afterwards. The greatest benefit of adequate logging would have been during migration, in quickly identifying and rectifying problems.

I have solved my problem by removing the "up flush-mail" command, but the design philosophy / reliability issues are embedded in upstart / Linux.

Owen

Stéphane Graber (stgraber) wrote :

Ok, I'm marking the bug in ifupdown as invalid because there isn't anything to fix in ifupdown.

As for boot logging issues with upstart, it's a known issue. upstart 1.4, which will ship in 12.04, has a feature called "job logging" that will populate /var/log/boot.log and /var/log/upstart/<job>.log with whatever output the upstart job produced.

Changed in ifupdown (Ubuntu):
status: New → Invalid