Error calling ''lxd forkstart juju-machine-2-lxd-0 /var/lib/lxd/containers /var/log/lxd/juju-machine-2-lxd-0/lxc.conf'': err=''exit status 1''

Bug #1577524 reported by Aaron Bentley
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Katherine Cox-Buday

Bug Description

Deploying lxd containers is sometimes failing with "Error calling ''lxd forkstart juju-machine-2-lxd-0 /var/lib/lxd/containers
            /var/log/lxd/juju-machine-2-lxd-0/lxc.conf'': err=''exit status 1''"

As seen in http://reports.vapour.ws/releases/issue/575edc63749a565c740f08d1

Also seen in our industrial test run:
http://juju-ci.vapour.ws/job/industrial-test-joyent/172/consoleFull

(Remember to log in if you get a 404)

Copious logs are attached to the Jenkins build.

tags: added: juju-release-support
Revision history for this message
Michael Reed (mreed8855) wrote :

I am seeing this on an arm64 system

Revision history for this message
Michael Reed (mreed8855) wrote :

I am using the following lxd version

sudo apt-cache policy lxd
lxd:
  Installed: 2.0.0-0ubuntu4~ubuntu16.04.1~juju1
  Candidate: 2.0.1-0ubuntu1~16.04.1
  Version table:
     2.0.1-0ubuntu1~16.04.1 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial-proposed/main arm64 Packages
     2.0.0-0ubuntu4 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial/main arm64 Packages
 *** 2.0.0-0ubuntu4~ubuntu16.04.1~juju1 500
        500 http://ppa.launchpad.net/juju/enablement/ubuntu xenial/main arm64 Packages
        100 /var/lib/dpkg/status

Revision history for this message
Michael Reed (mreed8855) wrote :

I am using an older version of juju as it has been stable on our systems.

sudo apt-cache policy juju
juju:
  Installed: 2.0-beta5-0ubuntu1~16.04.1~juju1
  Candidate: 2.0-beta5-0ubuntu1~16.04.1~juju1
  Version table:
 *** 2.0-beta5-0ubuntu1~16.04.1~juju1 500
        500 http://ppa.launchpad.net/juju/enablement/ubuntu xenial/main arm64 Packages
        100 /var/lib/dpkg/status
     2.0~beta6-0ubuntu1.16.04.1 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial-updates/main arm64 Packages
     2.0~beta4-0ubuntu2 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial/main arm64 Packages

Changed in juju-core:
milestone: 2.0.0 → 2.0-beta8
Changed in juju-core:
milestone: 2.0-beta8 → 2.0-beta9
Revision history for this message
Cheryl Jennings (cherylj) wrote :

Sent an email to tych0 asking for some advice. A quick search for this forkstart error points me to some results from late 2015.

Revision history for this message
Tycho Andersen (tycho-s) wrote :

Hi guys,

I don't know that I have a login for the juju-ci, but this is a generic error that basically just means the container failed to boot for some reason. The contents of /var/log/lxd/juju-machine-2-lxd-0 would be helpful in diagnosing it, if they're present.

Revision history for this message
Aaron Bentley (abentley) wrote :

I have passed credentials on internal IRC. Copious logs are attached to the build, including syslog, but not /var/log/lxd/juju-machine-2-lxd-0. I expect it would be easy to add to future runs.

Revision history for this message
Tycho Andersen (tycho-s) wrote :

Ok, thanks. A copy of /var/log/lxd for a failing run would be enough to diagnose the issue I think.

I looked in the logs/2/ subdirectory at the syslogs there and didn't see anything interesting (although it's not clear to me if that's the right place to look for one of the failures).

Anyway, the last thing I noticed is that this has "joyent" in the URL. Is this running on Joyent's weird Linux zone on SmartOS thing? I wouldn't expect LXD to function very well there, although it would still be interesting to see the failure.

Revision history for this message
Aaron Bentley (abentley) wrote :

Yes, that run is on Joyent, but we are also seeing this message on Azure:
http://juju-ci.vapour.ws/job/industrial-test-azure/122/consoleFull

Revision history for this message
Aaron Bentley (abentley) wrote :

Here are artifacts from a recent run on Azure that also include /var/logs/lxd/juju-*
http://juju-ci.vapour.ws/job/industrial-test-azure/124/artifact/logs/

Revision history for this message
Tycho Andersen (tycho-s) wrote :

So r.e. the failure on azure, I see,

            lxc 1464787476.182 INFO lxc_conf - conf.c:run_script_argv:362 - Executing script '/usr/bin/lxd callhook /var/lib/lxd 2 start' for container 'juju-machine-0-lxd-0', config section 'lxc'
            lxc 1464787476.186 INFO lxc_start - start.c:lxc_check_inherited:240 - closed inherited fd 3
            lxc 1464787476.186 INFO lxc_start - start.c:lxc_check_inherited:240 - closed inherited fd 8
            lxc 1464787476.195 INFO lxc_monitor - monitor.c:lxc_monitor_sock_name:178 - using monitor sock name lxc/d78a9d7e97b4b375//var/lib/lxd/containers
            lxc 1464787506.205 ERROR lxc_conf - conf.c:run_buffer:342 - Script exited with status 1
            lxc 1464787506.205 ERROR lxc_start - start.c:lxc_init:450 - failed to run pre-start hooks for container 'juju-machine-0-lxd-0'.
            lxc 1464787506.205 ERROR lxc_start - start.c:__lxc_start:1172 - failed to initialize the container

The 30s time gap in the time stamps there means the `callhook` command likely timed out. I looked, but couldn't find the actual LXD log (/var/log/lxd/lxd.log) in the logs above, is it there? That may give more clues as to why LXD itself timed out on this request.

Revision history for this message
Tycho Andersen (tycho-s) wrote :

Can you guys also post your liblxc versions? I think it may be worth updating those, as I believe there was a bug like this (i.e. callhook hanging) fixed a while ago.

Revision history for this message
Aaron Bentley (abentley) wrote :

Those were brand-new instances created this morning and apt-get upgraded when they were launched. I can't get you the library versions they used, because the instances were terminated after their segment of the test completed.

Revision history for this message
Tycho Andersen (tycho-s) wrote : Re: [Bug 1577524] Re: Error calling ''lxd forkstart juju-machine-2-lxd-0 /var/lib/lxd/containers /var/log/lxd/juju-machine-2-lxd-0/lxc.conf'': err=''exit status 1''

On Thu, Jun 02, 2016 at 08:59:35PM -0000, Aaron Bentley wrote:
> Those were brand-new instances created this morning and apt-get upgraded
> when they were launched. I can't get you the library versions they
> used, because the instances were terminated after their segment of the
> test completed.

Ok. Is this on trusty by chance?

Revision history for this message
Aaron Bentley (abentley) wrote :

Yes, it was on Trusty.

Revision history for this message
Aaron Bentley (abentley) wrote :

Per your IRC request, here's a run with /var/log/lxd/lxd.log included.
http://juju-ci.vapour.ws/job/industrial-test-azure/126/artifact/logs/

Aaron Bentley (abentley)
description: updated
Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta9 → 2.0-beta10
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta10 → 2.0-beta11
Revision history for this message
Tycho Andersen (tycho-s) wrote :

Hi,

Sorry, but I still don't quite understand how the logs are organized here. I looked in a few different numbered subdirectories and didn't see anything that obviously looked like the failure we're talking about here with LXD logs. How can I tell?

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Hey Tycho - Here's a clearer run with logs: http://reports.vapour.ws/releases/4096/job/azure-arm-deploy-bundle-lxd/attempt/191

Note that the files in .gz are actually ASCII, so you'll get an error from the archive manager, but you can just view them with less / your editor of choice.

Revision history for this message
Tycho Andersen (tycho-s) wrote :

Hmm. Are these machines heavily loaded? I see the failure:

            lxc 20160624003431.482 INFO lxc_conf - conf.c:run_script_argv:367 - Executing script '/usr
/bin/lxd callhook /var/lib/lxd 2 start' for container 'juju-c803ae-1-lxd-1', config section 'lxc'
            lxc 20160624003431.487 INFO lxc_start - start.c:lxc_check_inherited:251 - closed inherited
 fd 3
            lxc 20160624003431.487 INFO lxc_start - start.c:lxc_check_inherited:251 - closed inherited
 fd 8
            lxc 20160624003431.488 INFO lxc_monitor - monitor.c:lxc_monitor_sock_name:178 - using moni
tor sock name lxc/d78a9d7e97b4b375//var/lib/lxd/containers
            lxc 20160624003501.508 ERROR lxc_conf - conf.c:run_buffer:347 - Script exited with status 1
            lxc 20160624003501.508 ERROR lxc_start - start.c:lxc_init:464 - failed to run pre-start hooks for container 'juju-c803ae-1-lxd-1'.
            lxc 20160624003501.508 ERROR lxc_start - start.c:__lxc_start:1312 - failed to initialize the container

and the subsequent GET:

t=2016-06-24T00:34:31+0000 lvl=info msg=handling method=GET url=/internal/containers/2/onstart ip=@

So it seems like something in the container's start took a long time. Since you're using the directory storage backend, I wouldn't expect anything to take that long, but it does do some I/O (writing the generated apparmor profile, loading it, applying templates, etc.). Is it possible that it's just working but going very slowly?

Revision history for this message
Tycho Andersen (tycho-s) wrote :

It may also be useful to start LXD in debug mode for these tests, although it'll be a bit more verbose. You can do this by just adding a `--debug` to its command line somewhere.

Revision history for this message
Aaron Bentley (abentley) wrote :

Would adding --debug require a change to tools/lxd/client.go? If so, can you provide an example?

I am unsure whether the machine in Cheryl's example is overloaded. The machines in my example were starting 8 lxd instances.

Regarding decoding the logs,

Basically, two attempts are run simultaneously, they are named "parallel-azure-old" and "parallel-azure-new" for historical reasons and they are given odd and even numbers respectively.

Combining the fact that 0/lxd/2 failed for the first attempt, we can say that the files linked here describe a forkstart failure:
http://juju-ci.vapour.ws/job/industrial-test-azure/126/artifact/logs/0/parallel-azure-old/machine-0/

Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → nobody
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta11 → 2.0-beta12
Changed in juju-core:
milestone: 2.0-beta12 → 2.0-beta13
Revision history for this message
Aaron Bentley (abentley) wrote :

We now have an example of lxd forkstart failing on bootstrap instead of when adding machines:
http://reports.vapour.ws/releases/4143/job/functional-jes/attempt/1232

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta13 → 2.0-beta14
Revision history for this message
Martin Packman (gz) wrote :

See also logs from oil testing in bug 1605741 comment 1.

tags: added: oil oil-2.0
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-beta14 → 2.0-beta15
Changed in juju-core:
milestone: 2.0-beta15 → 2.0-beta16
Changed in juju-core:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Tim Penhey (thumper) wrote :

We have a retry fix that landed seven days ago, which we think will address this issue.

Marking fix committed for now, and if it comes back, we should re-look at it.

Changed in juju-core:
status: Triaged → Fix Committed
assignee: Alexis Bruemmer (alexis-bruemmer) → Katherine Cox-Buday (cox-katherine-e)
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.