LXC was not created, no errors, no logs -> pending state.

Bug #1354027 reported by David Britton
This bug affects 3 people
Affects | Status | Importance | Assigned to | Milestone
juju-core | Fix Released | Critical | Ian Booth |
1.20 | Fix Released | Critical | Ian Booth |

Bug Description

I deployed 9 lxcs to a node. Number 5 (zero-indexed, so the 6th one) failed to be created. I can find no mention of it in the machine-0 log, and I don't see a directory for it. It's just not there.

Please add more logging around the creation of LXCs and better catches for the commands used to launch them.

root@velvety:/var/lib/lxc# ll
total 44
drwx------ 11 root root 4096 Aug 7 04:16 ./
drwxr-xr-x 49 root root 4096 Aug 7 06:53 ../
drwxr-xr-x 3 root root 4096 Aug 7 04:15 juju-machine-0-lxc-0/
drwxr-xr-x 3 root root 4096 Aug 7 04:15 juju-machine-0-lxc-1/
drwxr-xr-x 3 root root 4096 Aug 7 04:15 juju-machine-0-lxc-2/
drwxr-xr-x 3 root root 4096 Aug 7 04:15 juju-machine-0-lxc-3/
drwxr-xr-x 3 root root 4096 Aug 7 04:15 juju-machine-0-lxc-4/
drwxr-xr-x 3 root root 4096 Aug 7 04:16 juju-machine-0-lxc-6/
drwxr-xr-x 3 root root 4096 Aug 7 04:16 juju-machine-0-lxc-7/
drwxr-xr-x 3 root root 4096 Aug 7 04:16 juju-machine-0-lxc-8/
drwxr-xr-x 3 root root 4096 Aug 7 04:14 juju-trusty-lxc-template/
root@velvety:/var/lib/lxc# lxc-ls --fancy
NAME STATE IPV4 IPV6 AUTOSTART
---------------------------------------------------------------
juju-machine-0-lxc-0 RUNNING 172.16.1.24 - YES
juju-machine-0-lxc-1 RUNNING 172.16.1.25 - YES
juju-machine-0-lxc-2 RUNNING 172.16.1.26 - YES
juju-machine-0-lxc-3 RUNNING 172.16.1.27 - YES
juju-machine-0-lxc-4 RUNNING 172.16.1.28 - YES
juju-machine-0-lxc-6 RUNNING 172.16.1.29 - YES
juju-machine-0-lxc-7 RUNNING 172.16.1.30 - YES
juju-machine-0-lxc-8 RUNNING 172.16.1.31 - YES
juju-trusty-lxc-template STOPPED - - NO
root@velvety:/var/lib/lxc#
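
The gap in the listing above can also be detected mechanically. The following is a minimal sketch in Go, not anything taken from Juju itself: the function name `missingContainers` is hypothetical, and it assumes container names follow the `juju-machine-0-lxc-<index>` pattern seen in the output and that the expected container count is known.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// missingContainers (hypothetical helper) returns every index in
// [0, expected) for which names contains no "<prefix><index>" entry.
// Names that do not match the prefix, such as the template container,
// are ignored.
func missingContainers(names []string, prefix string, expected int) []int {
	seen := make(map[int]bool)
	for _, name := range names {
		if !strings.HasPrefix(name, prefix) {
			continue
		}
		idx, err := strconv.Atoi(strings.TrimPrefix(name, prefix))
		if err != nil {
			continue // suffix is not a plain number
		}
		seen[idx] = true
	}
	var missing []int
	for i := 0; i < expected; i++ {
		if !seen[i] {
			missing = append(missing, i)
		}
	}
	return missing
}

func main() {
	// Directory names copied from the /var/lib/lxc listing above.
	names := []string{
		"juju-machine-0-lxc-0",
		"juju-machine-0-lxc-1",
		"juju-machine-0-lxc-2",
		"juju-machine-0-lxc-3",
		"juju-machine-0-lxc-4",
		"juju-machine-0-lxc-6",
		"juju-machine-0-lxc-7",
		"juju-machine-0-lxc-8",
		"juju-trusty-lxc-template",
	}
	fmt.Println(missingContainers(names, "juju-machine-0-lxc-", 9)) // prints [5]
}
```

A check like this only reports that a container is absent; it says nothing about why it was never created.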

1.20.7 fix: https://github.com/juju/juju/pull/650

David Britton (dpb) wrote :
tags: added: cloud-installer landscape
Ian Booth (wallyworld)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.21-alpha1
Chad Smith (chad.smith) wrote :

This bug has the same texture as the errors I saw earlier in lp:1348813: no LXC or cloud-init logs for the missing lxc (lxc-2 in my case). So it's something that was likely in 1.20.2 as well. Additional logging would help in getting closer to the root of this problem. Right now all is fairly quiet with Juju's lxc creation; we have a parse syskig for veth* creation to get a feel for when lxcs are being created.

Chad Smith (chad.smith) wrote :

typo:
   we have a parse syskig
  ---
   we have to parse syslog

Björn Tillenius (bjornt) wrote :

FWIW, I hit this bug too, and oddly enough, the same LXC is missing for me as well: juju-machine-0-lxc-5.

Adam Collard (adam-collard) wrote :

Yet again, another hit for this bug with our favourite juju-machine-0-lxc-5

Tim Penhey (thumper) wrote :

WTH? And nothing in the logs at all? That is just screwy.

Tim Penhey (thumper) wrote :

How are you going about creating the machines?

Björn Tillenius (bjornt) wrote :

The LXCs are all on the bootstrap node, so that machine is created by
"juju bootstrap <maas_machine>". After that we use the API to do the
equivalent of 'juju deploy --to lxc:0' for each of the services.

Before deploying to the LXC containers, we deploy a service (neutron-gateway)
directly to the bootstrap node. We wait for the neutron-gateway unit to be in
the installed state before starting the LXC deploys (because in the past, creating
LXCs at the same time a unit got installed could break)

Ian Booth (wallyworld) wrote :

Was there anything in the /var/lib/juju/containers/ directory for the missing container?

Mark Ramm (mark-ramm)
Changed in juju-core:
importance: High → Critical
Mark Ramm (mark-ramm) wrote :

If Juju fails to create an LXC container this is a blocker for our cloud-installer.

We should be handling errors as close to the source as possible, and not passing them up the stack unless absolutely necessary.

Two reasons for this:

1) it makes no sense to handle the error only in landscape, since that is just one of many possible juju users.
2) In the case of LXC, Juju itself is the infrastructure provider and it needs to detect and retry this sort of problem.

I expect that to get better at this we need improved logging, and to have Juju track that it's been asked to bring up the container, and at the very least to provide a clear error message when the container fails to start.

It's possible that we can automate retries, but that requires that we set limits on retries to deal with cases where resources are
exhausted, or other systematic issues are preventing the creation of containers, but we should ALWAYS make sure we report up the failure.

And since this is at least somewhat reproducible, my bet is that we can find and solve the underlying issue and get this working without need for retries.

Ian Booth (wallyworld) wrote : Re: [Bug 1354027] Re: LXC was not created, no errors, no logs -> pending state.

On 22/08/14 01:22, Mark Ramm wrote:
> If Juju fails to create an LXC container this is a blocker for our
> cloud-installer.
>

Agreed. We have been working on it but have not been able to reproduce.

> We should be handling errors as close to the source as possible, and not
> passing them up the stack unless absolutely necessary.
>

Yes, agreed. There's work scheduled to better recognise and handle errors that
occur in cloud init and from lxc itself.

> Two reasons for this:
>
> 1) it makes no sense to handle the error only in landscape, since that is just one of many possible juju users.
> 2) In the case of LXC, Juju itself is the infrastructure provider and it needs to detect and retry this sort of problem.
>
> I expect that to get better at this we need improved logging, and to
> have Juju track that it's been asked to bring up the container, and
> at the very least to provide a clear error message when the container
> fails to start.
>

Agreed. Juju does already provide an error via Juju status when the container
fails to start due to a lxc issue where lxc fails and reports the error. But
what's happening here appears to be that lxc is not reporting any failure to
Juju but is also not doing what was asked of it. In this case, it's very
difficult for Juju to detect what may have happened and to know how to react.

Because we have not been able to reproduce, we need to rely on receiving
information about the state of the environment where the failure was observed. I
think the next step is for Juju devs to hopefully be able to ssh in to the
affected system and poke around to try and see what's going on.

> It's possible that we can automate retries, but that requires that we set limits on retries to deal with cases where resources are
> exhausted, or other systematic issues are preventing the creation of containers, but we should ALWAYS make sure we report up the failure.
>

I think we can and should extend the current provisioning retry mechanism used
for cloud instances to also handle container startup - at least that way there's
an option to manually recover if a human decides they know that's viable.

> And since this is at least somewhat reproducible, my bet is that we can
> find and solve the underlying issue and get this working without need
> for retries.
>

If only we Juju devs could reproduce it :-)

Ian Booth (wallyworld) wrote :

I have to mark this back to High as while it stays Critical, it blocks landings to 1.20. We are working on it but also need to be able to land other fixes while this is in progress.

Changed in juju-core:
importance: Critical → High
Curtis Hovey (sinzui) wrote :

This bug remains critical. It doesn't block landing because it is not tagged as a regression.

Changed in juju-core:
importance: High → Critical
tags: added: deploy lxc
Curtis Hovey (sinzui) wrote :

Is this issue the same as bug 1350008?

Ian Booth (wallyworld) wrote :

On the surface, it is not the same as bug 1350008. In that bug, the LXC container starts up and then fails running cloud-init. In this bug, the LXC container appears to not even be started in the first place.

David Britton (dpb) wrote :

Hey! I reproduced it (finally)... Attached are all the requested logs (I'm afraid it may not be too interesting).

Here is the juju status output as well

http://paste.ubuntu.com/8197336/

Ian Booth (wallyworld) wrote :

Yay for the extra logs. With debug turned on, it could be seen that the provisioner task was being notified of containers added to machine 2, which the provisioner then started to process. However, at this time, the machine records were not necessarily all written to the database yet. It turned out that for one container (number 8), its status document was not yet written, so the provisioner saw an error and ignored that container forever. A couple of seconds after this happened, the status record was written.

So unfortunately, we have no transactions, so there's no guarantee of a consistent view of the state model. I've made a change which records churning machines (ones where status comes back as not found), and allows the provisioner to retry after 5 seconds (and so on if there are still failures). This will (hopefully) solve the currently observed issue.

Ian Booth (wallyworld)
Changed in juju-core:
assignee: nobody → Ian Booth (wallyworld)
status: Triaged → Fix Committed
Aaron Bentley (abentley)
description: updated
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released