"lxc-execute: Failed to find an lxc-init" Error during lxc smoke tests

Bug #1144873 reported by Para Siva
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
lxc (Ubuntu)
Fix Released
High
Serge Hallyn

Bug Description

The following error occurred during raring smoke tests of lxc, with 20130304 imagages (both amd64 and i386)

   dpkg-preconfigure: unable to re-open stdin:
    lxc-execute: Failed to find an lxc-init
    lxc-execute: invalid sequence number 1. expected 4
    lxc-execute: failed to spawn 'test-lucid'
    FAIL

    ======================================================================
    FAIL: runTest (__main__.SimpleLxcTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "lxctest.py", line 187, in runTest
        "\nError: %s" % (r, test, e))
    AssertionError: Failed with release lucid at test lxc-execute
    Error: Failed executing /bin/true in test-lucid

The steps to reproduce:
1. utah is used for automated installation, the how to is given in http://utah.readthedocs.org/en/latest/introduction.html#how-to-start-running-tests)

2. The preseed and the run list are as follows and the test could be run using the following command,

      sudo -u utah -i run_utah_tests.py -p lp:ubuntu-test-cases/server/preseeds/lxc.preseed -i /var/cache/utah/iso/raring-server-amd64.iso lp:ubuntu-test-cases/server/runlists/lxc.run -d -n -x /etc/utah/bridged-network-vm.xml

The post test result yaml file and the d-i syslog of the host are attached.

Related branches

Revision history for this message
Para Siva (psivaa) wrote :
Revision history for this message
Para Siva (psivaa) wrote :
Revision history for this message
Para Siva (psivaa) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 1144873] Re: "lxc-execute: Failed to find an lxc-init" Error during lxc smoke tests

Thanks for reporting this bug.

 assignee serge-hallyn
 importance high
 status confirmed

Changed in lxc (Ubuntu):
assignee: nobody → Serge Hallyn (serge-hallyn)
importance: Undecided → High
status: New → Confirmed
Changed in lxc (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Oops - the command to actually install lxc-init got dropped (purely my fault). I believe it's now fixed, but test is still running.

Revision history for this message
Para Siva (psivaa) wrote :

The tests appear to hang with the revno of 86. Unable to extract any logs since the logs wont be written to any files until the tests finish executing.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Yes, thanks. That happened to me last night too, but the way it hung
made me wonder if utah or kvm was crashing.

But I see what is happening - I think the lxc.postinst which sets up
dnsmasq may be messing up the host's networking. I think I do still
need to set /usr/sbin/policy-rc.d.

I've just pushed that fix to lp, but due to uds I can't test it right
now.

Revision history for this message
Para Siva (psivaa) wrote :

The tests have still been running for more than 3 hours with the revno of 87. I am able to ssh to the host and I get 'test-lucid' returned when I run both these commands,
              sudo lxc-ls --active
              sudo lxc-ls --running

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Parameswaran Sivatharman (<email address hidden>):
> The tests have still been running for more than 3 hours with the revno of 87. I am able to ssh to the host and I get 'test-lucid' returned when I run both these commands,
> sudo lxc-ls --active
> sudo lxc-ls --running

Hmm, so you're saying the lucid container never shuts down?

Revision history for this message
Para Siva (psivaa) wrote :

Yes, it never shuts down.

and /var/log/lxc/test-lucid.log in the host contains only the following,
lxc-execute 1362505584.513 ERROR lxc_conf - No such file or directory - stat(/proc/2296/root/dev//console)

utah@utah-5629-raring-server-amd64:~$ sudo lxc-info --name test-lucid
[sudo] password for utah:
state: RUNNING
pid: 2296

utah@utah-5629-raring-server-amd64:~$ sudo lxc-execute --name test-lucid -- top
lxc-execute: failed (98) to create the command service point /var/lib/lxc/test-lucid/command
lxc-execute: ##
lxc-execute: # The container appears to be already running!
lxc-execute: ##
lxc-execute: failed to initialize the container

utah@utah-5629-raring-server-amd64:~$ sudo lxc-ps -n test-lucid -- -ef
CONTAINER UID PID PPID C STIME TTY TIME CMD

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@psivaa,

I can't reproduce this any more. Are you saying that happened to you in the last 15 hours (since my last commit to the utah server testsuite), or that that is what happened to you at comment #8?

If you havent re-run since comment #8, can you please do so? The tests all pass for me here now.

Changed in lxc (Ubuntu):
status: In Progress → Incomplete
Revision history for this message
Para Siva (psivaa) wrote :

Yes at present the tests in amd64 are hanging. I ran it a couple of times and every time the tests with amd64 installations hang indefinitely (until I forcefully abort the tests). The test-lucid was still running after more than 3 hours.

The tests in with i386 installations seem to pass after the latest fixes (revno 87)

Changed in lxc (Ubuntu):
status: Incomplete → In Progress
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Parameswaran Sivatharman (<email address hidden>):
> Yes at present the tests in amd64 are hanging. I ran it a couple of
> times and every time the tests with amd64 installations hang

Ok I really can't reproduce this now. Can you leave it running for, say
twice as long as it takes to complete on i386, then long in as user utah
into the kvm vm and show the results of each of:

ps -ef
lxc-ls --fancy
lxc-ps -n test-lucid -- -ef
dmesg|tail

 status incomplete

Changed in lxc (Ubuntu):
status: In Progress → Incomplete
Revision history for this message
Para Siva (psivaa) wrote :

The following are the results of from a an amd64 vm from a test which was running for more than 12 hours and still running
1) utah@utah-5695-raring-server-amd64:~$ ps -ef
   (attached as a text file)

2) utah@utah-5695-raring-server-amd64:~$ sudo lxc-ls --fancy
Cannot open network namespace: No such file or directory
Cannot open network namespace: No such file or directory
NAME STATE IPV4 IPV6 AUTOSTART
------------------------------------------
test-lucid RUNNING - - NO

3) utah@utah-5695-raring-server-amd64:~$ lxc-ps -n test-lucid -- -ef
CONTAINER UID PID PPID C STIME TTY TIME CMD

4) utah@utah-5695-raring-server-amd64:~$ dmesg | tail
[ 516.604721] lxcbr0: port 1(vethwO3MhE) entered forwarding state
[ 516.604725] lxcbr0: port 1(vethwO3MhE) entered forwarding state
[ 516.611279] lxcbr0: port 1(vethwO3MhE) entered disabled state
[ 516.611850] device vethwO3MhE left promiscuous mode
[ 516.611853] lxcbr0: port 1(vethwO3MhE) entered disabled state
[ 516.630860] init: Failed to spawn network-interface (vethwO3MhE) pre-start process: unable to change root directory: No such file or directory
[ 516.635524] init: Failed to spawn network-interface-security (network-interface/vethwO3MhE) pre-start process: unable to change root directory: No such file or directory
[ 516.637734] init: Failed to spawn network-interface (vethwO3MhE) post-stop process: unable to change root directory: No such file or directory
[ 516.649299] init: Failed to spawn network-interface-security (network-interface/vethxypKbH) pre-start process: unable to change root directory: No such file or directory
[ 516.650879] init: Failed to spawn network-interface (vethxypKbH) post-stop process: unable to change root directory: No such file or directory

Revision history for this message
Para Siva (psivaa) wrote :

@hallyn,
This has happened in i386 test as well; a test is running for more than 2 hours where 30 mins was the time it took when it passed in the previous runs. All the outputs for the above commands are similar.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Is it possible for me to get access to a machine where it is failing?
Are these on raring hosts?

 status incomplete

Revision history for this message
Para Siva (psivaa) wrote :

The host is an amd64 installation of 12.04.1. Ill give the information about accessing it separately. Thanks

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

On your host (which is a raring qemu VM, bridged to hardware nic, on a precise host) if I do

while [ 1 ]; do
   mv -f outout outout2
   lxc-execute -n r1 -l info -o outout -- ps
done

after some time, lxc-execute hangs. But the debug output file shows both the good and bad runs made it to at least the very last debug statement.

On my raring laptop I cannot reproduce that. That suggests this is a race triggered when the kernel is slower to do <something> with the veth nics on cleanup.

Still looking...

Changed in lxc (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Indeed the hang is at the end of lxc-execute. It seems to be accompannied with

run 26
  PID TTY TIME CMD
    1 ? 00:00:00 lxc-init
    2 ? 00:00:00 ps
lxc-execute: No such file or directory - stat(/proc/25055/root/dev/lxc/console)

that error message, however on bare metal when that message shows up, lxc-execute does complete. The debug output file suggests

    lxc-execute 1362758203.088 INFO lxc_conf - running to reset 0 nic names
    lxc-execute 1362758203.088 WARN lxc_conf - failed to remove interface '(null)'

did not happen.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Oh but I'm being silly - lxc-init is defunct, that is where the bug is :)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

While I have identified a possible problem in the lxc start code (still
to test), I note that there are several other unrelated defunct
processes in some those VMs. For instance

lp 19479 19478 1 17:41 ? 00:00:00 [python3] <defunct>
lp 19496 19478 6 17:41 ? 00:00:00 [python3] <defunct>
utah 19590 19221 0 17:41 ? 00:00:00 sshd: utah@pts/4
utah 19592 19590 0 17:41 pts/4 00:00:00 -bash
lp 19595 19478 0 17:41 ? 00:00:00 [foomatic] <defunct>

So I'm not sure there isn't either a problem with your host, or with
the kernel (or with qemu for that matter).

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

The problem is that src/lxc/start.c:lxc_poll, which checks signalfd for
sigchld from container init, can start after the container init has
exited. It then hangs trying to talk over the monitor. The container
init mainwhile waits to be reaped (by the fn which called lxc_poll,
which does waitpid after lxc_poll completes).

We can pretty easily make the race shorter, but I'd like to get rid
of the race altogether and am not sure yet of the cleanest way to do
that.

 status: inprogress

Changed in lxc (Ubuntu):
status: Triaged → In Progress
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lxc - 0.9.0~alpha3-0ubuntu3

---------------
lxc (0.9.0~alpha3-0ubuntu3) raring; urgency=low

  * 0001-fix-race-with-fast-init: Before starting lxc_mainloop, check whether
    lxc-init has already exited. If it has, return immediately to reap it.
    (LP: #1134923) (LP: #1144873)
 -- Serge Hallyn <email address hidden> Mon, 11 Mar 2013 10:14:39 -0500

Changed in lxc (Ubuntu):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.