juju deployed services to lxc containers stuck in pending

Bug #1317197 reported by Ryan Harper
30
This bug affects 6 people
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
High
Andrew Wilkins
1.18
Fix Released
High
Andrew Wilkins

Bug Description

Using juju 1.18, machines running trusty, deploying additional services to machines with lxc stay stuck in pending.
Looking at the one of the hosting machine logs (/var/log/juju/machine-2.log), lxc packages fail to install with:

2014-05-07 16:34:28 INFO juju.cmd supercommand.go:302 running juju-1.18.2-trusty-amd64 [gc]
2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:127 machine agent machine-1 start (1.18.2-trusty-amd64 [gc])
2014-05-07 16:34:28 DEBUG juju.agent agent.go:384 read agent config, format "1.18"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "api"
2014-05-07 16:34:28 INFO juju apiclient.go:114 state/api: dialing "wss://10.245.0.185:17070/"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "termination"
2014-05-07 16:34:28 INFO juju apiclient.go:124 state/api: connection established
2014-05-07 16:34:28 DEBUG juju.agent agent.go:523 writing configuration file
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrader"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrade-steps"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "machiner"
2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:458 upgrade to 1.18.2-trusty-amd64 already completed.
2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:445 upgrade to 1.18.2-trusty-amd64 completed.
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "logger"
2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:35 initial log config: "<root>=DEBUG"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "machineenvironmentworker"
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "rsyslog"
2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:60 logger setup
2014-05-07 16:34:28 DEBUG juju.worker.machineenvironment machineenvironmentworker.go:70 write system files: true
2014-05-07 16:34:28 DEBUG juju.worker.rsyslog worker.go:76 starting rsyslog worker mode 1 for "machine-1" ""
2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "authenticationworker"
2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
2014-05-07 16:34:30 ERROR juju.utils.apt apt.go:133 apt-get command failed: exit status 100
args: []string{"apt-get", "--option=Dpkg::Options::=--force-confold", "--option=Dpkg::options::=--force-unsafe-io", "--assume-yes", "--quiet", "install", "lxc", "cloud-image-utils"}
E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)
E: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?

2014-05-07 16:34:30 ERROR juju.provisioner container_initialisation.go:90 starting container provisioner for lxc: setting up container dependnecies on host machine: apt-get failed: exit status 100
2014-05-07 16:34:30 ERROR juju runner.go:220 worker: exited "1-container-watcher": setting up container dependnecies on host machine: apt-get failed: exit status 100

Here's the juju status:
environment: oil-slave-1
machines:
  "0":
    agent-state: started
    agent-version: 1.18.2
    dns-name: lambert.oil
    instance-id: /MAAS/api/1.0/nodes/node-f645c4dc-ae16-11e3-9074-00163efc5068/
    series: precise
  "1":
    agent-state: started
    agent-version: 1.18.2
    dns-name: reading.oil
    instance-id: /MAAS/api/1.0/nodes/node-f64f188e-ae16-11e3-b194-00163efc5068/
    series: trusty
    containers:
      1/lxc/0:
        instance-id: pending
        series: trusty
  "2":
    agent-state: started
    agent-version: 1.18.2
    dns-name: suangi.oil
    instance-id: /MAAS/api/1.0/nodes/node-f665a2ca-ae16-11e3-9074-00163efc5068/
    series: trusty
    containers:
      2/lxc/0:
        instance-id: pending
        series: trusty
  "3":
    agent-state: started
    agent-version: 1.18.2
    dns-name: varaha.oil
    instance-id: /MAAS/api/1.0/nodes/node-f67d24e0-ae16-11e3-b194-00163efc5068/
    series: trusty
    containers:
      3/lxc/0:
        instance-id: pending
        series: trusty
  "4":
    agent-state: started
    agent-version: 1.18.2
    dns-name: gulon.oil
    instance-id: /MAAS/api/1.0/nodes/node-0f67ca28-ae17-11e3-b194-00163efc5068/
    series: trusty
    containers:
      4/lxc/0:

Tags: landscape oil

Related branches

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1317197] [NEW] juju deployed services to lxc containers stuck in pending
Download full text (5.1 KiB)

I would have thought this would be a dupe of bug #1302820 or bug #1311676
but both of those were considered fixed in 1.18.2 which it appears you are
using.
Which would indicate we have another bug hiding out for multiple-series
support.

On Wed, May 7, 2014 at 9:05 PM, Ryan Harper <email address hidden>wrote:

> Public bug reported:
>
> Using juju 1.18, machines running trusty, deploying additional services to
> machines with lxc stay stuck in pending.
> Looking at the one of the hosting machine logs
> (/var/log/juju/machine-2.log), lxc packages fail to install with:
>
> 2014-05-07 16:34:28 INFO juju.cmd supercommand.go:302 running
> juju-1.18.2-trusty-amd64 [gc]
> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:127 machine agent
> machine-1 start (1.18.2-trusty-amd64 [gc])
> 2014-05-07 16:34:28 DEBUG juju.agent agent.go:384 read agent config,
> format "1.18"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "api"
> 2014-05-07 16:34:28 INFO juju apiclient.go:114 state/api: dialing "wss://
> 10.245.0.185:17070/"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "termination"
> 2014-05-07 16:34:28 INFO juju apiclient.go:124 state/api: connection
> established
> 2014-05-07 16:34:28 DEBUG juju.agent agent.go:523 writing configuration
> file
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrader"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrade-steps"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "machiner"
> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:458 upgrade to
> 1.18.2-trusty-amd64 already completed.
> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:445 upgrade to
> 1.18.2-trusty-amd64 completed.
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "logger"
> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:35 initial log
> config: "<root>=DEBUG"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start
> "machineenvironmentworker"
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "rsyslog"
> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:60 logger setup
> 2014-05-07 16:34:28 DEBUG juju.worker.machineenvironment
> machineenvironmentworker.go:70 write system files: true
> 2014-05-07 16:34:28 DEBUG juju.worker.rsyslog worker.go:76 starting
> rsyslog worker mode 1 for "machine-1" ""
> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start
> "authenticationworker"
> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:45 reconfiguring
> logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
> 2014-05-07 16:34:30 ERROR juju.utils.apt apt.go:133 apt-get command
> failed: exit status 100
> args: []string{"apt-get", "--option=Dpkg::Options::=--force-confold",
> "--option=Dpkg::options::=--force-unsafe-io", "--assume-yes", "--quiet",
> "install", "lxc", "cloud-image-utils"}
> E: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily
> unavailable)
> E: Unable to lock the administration directory (/var/lib/dpkg/), is
> another process using it?
>
> 2014-05-07 16:34:30 ERROR juju.provisioner container_initialisation.go:90
> starting container provisioner for lxc: setting up container dependnecies
> on host...

Read more...

Revision history for this message
John A Meinel (jameinel) wrote :
Download full text (5.5 KiB)

this might be a dupe of
https://bugs.launchpad.net/juju-core/+bug/1306537which I thought was
also fixed in 1.18.2 but apparently the change didn't
get backported in time.

On Thu, May 8, 2014 at 3:05 PM, John Meinel <email address hidden> wrote:

> I would have thought this would be a dupe of bug #1302820 or bug #1311676
> but both of those were considered fixed in 1.18.2 which it appears you are
> using.
> Which would indicate we have another bug hiding out for multiple-series
> support.
>
>
>
> On Wed, May 7, 2014 at 9:05 PM, Ryan Harper <email address hidden>wrote:
>
>> Public bug reported:
>>
>> Using juju 1.18, machines running trusty, deploying additional services
>> to machines with lxc stay stuck in pending.
>> Looking at the one of the hosting machine logs
>> (/var/log/juju/machine-2.log), lxc packages fail to install with:
>>
>> 2014-05-07 16:34:28 INFO juju.cmd supercommand.go:302 running
>> juju-1.18.2-trusty-amd64 [gc]
>> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:127 machine agent
>> machine-1 start (1.18.2-trusty-amd64 [gc])
>> 2014-05-07 16:34:28 DEBUG juju.agent agent.go:384 read agent config,
>> format "1.18"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "api"
>> 2014-05-07 16:34:28 INFO juju apiclient.go:114 state/api: dialing "wss://
>> 10.245.0.185:17070/"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "termination"
>> 2014-05-07 16:34:28 INFO juju apiclient.go:124 state/api: connection
>> established
>> 2014-05-07 16:34:28 DEBUG juju.agent agent.go:523 writing configuration
>> file
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrader"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "upgrade-steps"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "machiner"
>> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:458 upgrade to
>> 1.18.2-trusty-amd64 already completed.
>> 2014-05-07 16:34:28 INFO juju.cmd.jujud machine.go:445 upgrade to
>> 1.18.2-trusty-amd64 completed.
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "logger"
>> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:35 initial log
>> config: "<root>=DEBUG"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start
>> "machineenvironmentworker"
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start "rsyslog"
>> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:60 logger setup
>> 2014-05-07 16:34:28 DEBUG juju.worker.machineenvironment
>> machineenvironmentworker.go:70 write system files: true
>> 2014-05-07 16:34:28 DEBUG juju.worker.rsyslog worker.go:76 starting
>> rsyslog worker mode 1 for "machine-1" ""
>> 2014-05-07 16:34:28 INFO juju runner.go:262 worker: start
>> "authenticationworker"
>> 2014-05-07 16:34:28 DEBUG juju.worker.logger logger.go:45 reconfiguring
>> logging from "<root>=DEBUG" to "<root>=WARNING;unit=DEBUG"
>> 2014-05-07 16:34:30 ERROR juju.utils.apt apt.go:133 apt-get command
>> failed: exit status 100
>> args: []string{"apt-get", "--option=Dpkg::Options::=--force-confold",
>> "--option=Dpkg::options::=--force-unsafe-io", "--assume-yes", "--quiet",
>> "install", "lxc", "cloud-image-utils"}
>> E: Could not get lock /var/lib/dpk...

Read more...

Revision history for this message
Ryan Harper (raharper) wrote :

I don't think it's a dupe of either. It appears that since the container doesn't contain the lxc tools (where as in 1.16 the cloud-init always installed lxc tools I'm told -- note we didn't see this issue when using 1.16), juju then needs to install lxc tools, but that's racing with whatever charm is running. in our instance we're installing openstack bits which download lots of packages and is likely what's holding the lock.

Revision history for this message
Curtis Hovey (sinzui) wrote :

Juju or cloud-init needs to install juju-local. lxc is just one on many dependencies that juju requires to run the local-provider

Changed in juju-core:
status: New → Triaged
importance: Undecided → Low
importance: Low → High
milestone: none → 1.18.3
Revision history for this message
Ian Booth (wallyworld) wrote :

The approach we will take is to lazily install lxc/kvm as needed when the first lxc/kvm container is requested on a machine. cloud-init is a bad place to install this type of dependency because it gets messy quickly and only works for provider provisioned machines, and fails for manually added machines etc. We will tie this in with improved juju status so that it shows something better than just "pending". This will require a bit of thought and is something we will look to do for the next stable release rather than 1.18. Until then the workaround is to apt-get install juju-local on instances where deployment to lxc is required.

Changed in juju-core:
milestone: 1.18.3 → next-stable
Revision history for this message
Ian Booth (wallyworld) wrote :

On second thought, moving back to 1.18 since we should try and at least get the core issue fixed, even if the juju status improvements will likely not be in 1.18

Changed in juju-core:
milestone: next-stable → 1.18.3
Revision history for this message
Ryan Harper (raharper) wrote :

This is a regression from 1.16. We've been running openstack services in lxc containers for quite some time, and it only broke after moving up to 1.18.

 I'm not running manual deployments of lxc, rather these are openstack bundles where some services are placed in lxc containers. It's not clear to me how I would mark out in a bundle that I need to install juju-local in the instances that will need to run lxc; manually installing juju-local isn't much of a workaround for my use-case.

Any comments on adding some retry logic when attempting the lxc install instead of giving up after one go?

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.18.3 → 1.18.4
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1317197] Re: juju deployed services to lxc containers stuck in pending

A 'simple' fix is that the code that does the "apt-get install lxc" should
take out the hook lock that we already have to ensure 2 hooks don't try to
apt-get at the same time.

On Fri, May 9, 2014 at 5:05 PM, Curtis Hovey <email address hidden> wrote:

> ** Changed in: juju-core
> Milestone: 1.18.3 => 1.18.4
>
> --
> You received this bug notification because you are subscribed to juju-
> core.
> https://bugs.launchpad.net/bugs/1317197
>
> Title:
> juju deployed services to lxc containers stuck in pending
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju-core/+bug/1317197/+subscriptions
>

Revision history for this message
John A Meinel (jameinel) wrote :

FWIW, we generally recommend against installing things onto a machine, and
then other things inside containers on that machine. The recommendation is
that if you want to use containers, then you should try to isolate all
items in a container.

Now I believe you had issues trying to use 1 KVM container next to a bunch
of LXC containers, but that certainly would have been the recommended
workflow.

It does sound like a short-term workaround is to just wait a bit for the
charm you are installing into the outer KVM finishes coming up before you
start trying to deploy into LXC instances inside of it.

Do we actually need juju-local installed inside a machine in order to use
LXC containers? Because I'm pretty sure we haven't been doing that for
MaaS. "requiredPackages" for containers/lxc just lists "lxc" and now
"cloud-image-utils" (we need the latter to expose trusty containers or
precise hosts).

Anyway, my personal feel is that we introduced a race when we started
installing the package late, and we should be able to just fix that without
having to make any changes to status, etc.

John

On Sat, May 10, 2014 at 2:24 PM, John Meinel <email address hidden> wrote:

> A 'simple' fix is that the code that does the "apt-get install lxc" should
> take out the hook lock that we already have to ensure 2 hooks don't try to
> apt-get at the same time.
>
>
>
> On Fri, May 9, 2014 at 5:05 PM, Curtis Hovey <email address hidden> wrote:
>
>> ** Changed in: juju-core
>> Milestone: 1.18.3 => 1.18.4
>>
>> --
>> You received this bug notification because you are subscribed to juju-
>> core.
>> https://bugs.launchpad.net/bugs/1317197
>>
>> Title:
>> juju deployed services to lxc containers stuck in pending
>>
>> To manage notifications about this bug go to:
>> https://bugs.launchpad.net/juju-core/+bug/1317197/+subscriptions
>>
>
>

Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Andrew Wilkins (axwalk)
Revision history for this message
Andrew Wilkins (axwalk) wrote :

I've proposed a fix for this on trunk; will backport if/when it is accepted.

Status updates are nice, and we should definitely do them, but that's a bigger kettle of fish. There's some complication with this at the moment, because the code that runs the host initialisation (for containers) runs outside the scope of the provisioner. The instance broker for each container type is created on-demand after container initialisation succeeds.

I haven't thought about it deeply about this, but I think it'd be simpler in general, and for status updates in particular, if the provisioner triggered off the host initialisation.

Revision history for this message
Ryan Harper (raharper) wrote :

Looking at the proposed branch I can see we attempt to acquire a lock before running the container initializer; what's not clear is whether there is any retry logic between the two cases which are attempting to acquire the lock. In my log, the container init code ran only once; it appears that with this change it will attempt to acquire a lock and report an error, but not attempt to try more than once. Am I missing the retry logic?

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@raharper: sorry, didn't see your comment till just now. You're not missing anything, there's no retry logic in there.

The initialisation code will not retry if the lock fails; it is currently a catastrophic failure. Acquiring the lock is a blocking procedure though, and an error should not occur except in unusual cases. We should certainly make it more robust, but that will involve a significant amount of refactoring and is out of scope of this specific issue.

Changed in juju-core:
status: In Progress → Fix Committed
Revision history for this message
Ryan Harper (raharper) wrote :

If this only fixes the "stuck in pending" status message then I suppose I should file another bug for the regression in behavior from 1.16.

I need the 1.16 behavior to allow deploying services to machines and to lxc containers on the same machine as well. It's not an unusual use-case; especially for openstack deployments.

Revision history for this message
Andrew Wilkins (axwalk) wrote :

@raharper: I think there may be some confusion from the initial replies on this bug. I'll explain what the problem was, and how I've fixed it. As far as I'm aware, with my change you should get the same behaviour as in 1.16. If there's still an issue, please let me know and I'll got onto it ASAP.

In 1.16 we installed the lxc package during cloud-init, along with a bunch of other packages. To cut down on installation time, the code was changed to defer installation of lxc until an lxc container is first added to the machine. This is where the regression occurred: there was no serialisation of the package installation with charm hooks, which may also execute apt/dpkg commands.

The change I made was simply to serialise the installation of lxc with hook execution. There's no retrying of package installation, because it will simply block until the hooks have finished executing.

Please ignore all the talk about status messages. This is relating to some work we have scheduled for improving feedback, but this is only an enhancement and has not been started yet.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.18.4 → 1.19.3
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
tags: added: landscape
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.