juju-core

Bug #1354027
Comment #12

Ian Booth (wallyworld) wrote on 2014-08-21: Re: [Bug 1354027] Re: LXC was not created, no errors, no logs -> pending state.

On 22/08/14 01:22, Mark Ramm wrote:
> If Juju fails to create an LXC container this is a blocker for our
> cloud-installer.
>

Agreed. We have been working on it but have not been able to reproduce.

> We should be handling errors as close to the source as possible, and not
> passing them up the stack unless absolutely necessary.
>

Yes, agreed. There's work scheduled to better recognise and handle errors that
occur in cloud-init and from lxc itself.

> Two reasons for this:
>
> 1) It makes no sense to handle the error only in Landscape, since that is just one of many possible Juju users.
> 2) In the case of LXC, Juju itself is the infrastructure provider and it needs to detect and retry this sort of problem.
>
> I expect that to get better at this we need improved logging, and to
> have Juju track that it's been asked to bring up the container, and
> at the very least to provide a clear error message when the container
> fails to start.
>

Agreed. Juju does already report an error via juju status when the container
fails to start because lxc itself fails and reports the error. But
what's happening here appears to be that lxc is not reporting any failure to
Juju but is also not doing what was asked of it. In this case, it's very
difficult for Juju to detect what may have happened and to know how to react.

Because we have not been able to reproduce, we need to rely on receiving
information about the state of the environment where the failure was observed. I
think the next step is for Juju devs to hopefully be able to ssh in to the
affected system and poke around to try and see what's going on.

> It's possible that we can automate retries, but that requires that we set limits on retries to deal with cases where resources are
> exhausted, or other systematic issues are preventing the creation of containers, but we should ALWAYS make sure we report up the failure.
>

I think we can and should extend the current provisioning retry mechanism used
for cloud instances to also handle container startup - at least that way there's
an option to manually recover if a human decides they know that's viable.
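
To make the bounded-retry idea concrete, here is a minimal sketch in Go. It is
not Juju's actual provisioner code: retryStart, errTransient, and the attempt
and delay parameters are all hypothetical names chosen for illustration. The
point it shows is that the number of attempts is capped and that the last
failure is always returned to the caller, so it can be reported instead of
leaving the machine pending.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTransient is a hypothetical stand-in for a container start failure.
var errTransient = errors.New("container did not start")

// retryStart calls start up to maxAttempts times, sleeping for delay between
// attempts. It returns nil on the first success. If every attempt fails, it
// returns an error wrapping the last failure so the caller can report it.
func retryStart(maxAttempts int, delay time.Duration, start func() error) error {
	if maxAttempts < 1 {
		return errors.New("maxAttempts must be at least 1")
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = start()
		if lastErr == nil {
			return nil
		}
		// Only wait if another attempt will follow.
		if attempt < maxAttempts {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	calls := 0
	// Simulated start function: fails twice, then succeeds.
	err := retryStart(3, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTransient
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

A fixed cap like this only helps when lxc actually returns an error. For the
silent failure seen in this bug, the start function would also need some way
to notice that the container never came up.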

> And since this is at least somewhat reproducible, my bet is that we can
> find and solve the underlying issue and get this working without need
> for retries.
>

If only we Juju devs could reproduce it :-)
