Stale lock causes local provider unit to be stuck pending

Bug #1302935 reported by Tim Van Steenburgh
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   High         Tim Penhey

Bug Description

I'm deploying a trusty workload to lxc, and all unit agent-states get stuck in "pending" and never come up. (Precise workload deploys fine.)

all-machines.log -> http://paste.ubuntu.com/7204338/

$ juju --version
1.17.7-trusty-amd64

The trusty cloud image isn't being downloaded:
root@trusty-vm:/home/tvansteenburgh# ll /var/cache/lxc/
total 12
drwx------ 3 root root 4096 Apr 4 11:08 ./
drwxr-xr-x 18 root root 4096 Mar 25 12:06 ../
drwxr-xr-x 2 root root 4096 Mar 26 14:51 cloud-precise/

root@trusty-vm:/home/tvansteenburgh# lxc-ls --fancy
NAME                   STATE    IPV4  IPV6  AUTOSTART
-----------------------------------------------------
juju-precise-template  STOPPED  -     -     NO

Curtis Hovey (sinzui)
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.19.0
tags: added: deploy local-provider lxc
Changed in juju-core:
importance: High → Critical
Revision history for this message
Dave Cheney (dave-cheney) wrote :

Hi Tim,

Could you please provide some more background so I can attempt to reproduce the issue?

> I'm deploying a trusty workload to lxc, and all unit agent-states get stuck in "pending" and never come up. (Precise workload deploys fine.)

Was this with the local provider? Precise is not a supported series for ppc64el. Can you please provide more details?

Additional questions

* What steps did you use to deploy this environment?

* If it was with the deployer, where is the configuration?

* Can you provide copies of the local charms you use? It looks like haproxy, memcached, mysql and sugarcrm.

* wolfe-01 does not have direct access to the internet, which may be blocking download of the lxc template images. Can you please try exporting the following values in your environment and trying again:

http_proxy=http://10.245.64.1:3128/
https_proxy=http://10.245.64.1:3128/

You will also need to tell juju _NOT_ to use a proxy to talk to the API server running on your local machine, as it is bound to a private address on the br0 interface, which is not visible to the proxy:

no_proxy="10.0.3.1"
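For reference, the three settings above could be exported together in the shell that runs juju, roughly as follows. This is only a sketch: the addresses are the ones quoted in this comment and apply to that particular network, so substitute your own proxy and API server address.

```shell
# Proxy addresses quoted from the comment above; they are specific to that
# network and are not general defaults.
export http_proxy="http://10.245.64.1:3128/"
export https_proxy="http://10.245.64.1:3128/"

# Keep traffic to the local API server (a private address, per the comment)
# from being sent through the proxy.
export no_proxy="10.0.3.1"
```

The exports only affect the current shell session and the commands started from it.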

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

Hey Dave,

We got some communication wires crossed, which is undoubtedly my fault. This problem didn't happen on wolfe-01, it happened on my trusty vm (x86_64) using local provider. I did see a failed config-changed hook with go stack trace on wolfe-01, but it was rebooted before I could grab it, so I moved to testing the workload locally, and ran into this problem.

The charms used:

  parent branch: bzr+ssh://bazaar.launchpad.net/~charmers/charms/precise/haproxy/trunk/
  parent branch: bzr+ssh://bazaar.launchpad.net/~charmers/charms/precise/memcached/trunk/
  parent branch: bzr+ssh://bazaar.launchpad.net/~mbruzek/charms/trusty/mysql/apache2fix/
  parent branch: bzr+ssh://bazaar.launchpad.net/~cabs-team/charms/trusty/sugarcrm/trunk/

The deploy script:

$ cat ~/sugarcrm_deploy.sh
#!/bin/bash

set -ex

cd ~/src/charms/
juju deploy local:trusty/sugarcrm
juju deploy local:trusty/mysql
juju deploy local:trusty/memcached
juju deploy local:trusty/haproxy
juju set mysql dataset-size="1G"
juju add-relation sugarcrm mysql
juju add-relation sugarcrm memcached
juju add-relation sugarcrm haproxy
juju expose haproxy

Thanks for looking.

Revision history for this message
Tim Penhey (thumper) wrote :

I have just done the following:

# start my local provider
juju bootstrap

cd ~/sandbox/
mkdir charms
cd charms/
bzr init-repo trusty
cd trusty/
bzr branch bzr+ssh://bazaar.launchpad.net/~mbruzek/charms/trusty/mysql/apache2fix/ mysql
cd ..
juju deploy local:trusty/mysql

watch juju status eventually showed:

environment: local
machines:
  "0":
    agent-state: started
    agent-version: 1.19.0.1
    dns-name: localhost
    instance-id: localhost
    series: trusty
  "1":
    agent-state: started
    agent-version: 1.19.0.1
    dns-name: 10.0.3.159
    instance-id: tim-local-machine-1
    series: trusty
    hardware: arch=amd64
services:
  mysql:
    charm: local:trusty/mysql-311
    exposed: false
    relations:
      cluster:
      - mysql
    units:
      mysql/0:
        agent-state: started
        agent-version: 1.19.0.1
        machine: "1"
        public-address: 10.0.3.159

It did take quite a while for the first container to start, and this would have been the trusty template starting.
I'm not sure why your trusty template didn't get created. However you can now catch extremely verbose logging
by doing the following:

export JUJU_LOGGING_CONFIG='<root>=INFO; juju.container=TRACE; juju.provisioner=TRACE; golxc=TRACE'
juju bootstrap

The golxc trace will then output every call to lxc we make along with the output, and we should get a very clear idea of what is going on.

Changed in juju-core:
status: Triaged → Incomplete
assignee: nobody → Tim Penhey (thumper)
importance: Critical → High
Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

Hi Tim,

I reran the same deployment after turning up logging as you suggested: http://paste.ubuntu.com/7215357/

Revision history for this message
Dave Cheney (dave-cheney) wrote : Re: [Bug 1302935] Re: trusty local provider unit agents stuck pending

This looks like the problem

machine-0: 2014-04-07 00:44:23 INFO juju.container.lxc clonetemplate.go:78 wait for fslock on juju-trusty-template
machine-0: 2014-04-07 00:44:23 INFO juju.utils.fslock fslock.go:146 attempted lock failed "juju-trusty-template", ensure clone exists, currently held: ensure clone exists

This stale lock is preventing you from starting new containers. You'll
probably find that lock file inside ~/.juju/local/locks. I would
recommend doing a

juju destroy-environment -y local

You may need to add --force if it does not terminate cleanly.

Then check that you have no lxc containers running; sudo lxc-ls or
pstree are useful here.

Finally, remove that lock file if it exists and try the bootstrap/deploy again.
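
To confirm which lock is involved before removing anything, the fslock message quoted above can be pulled out of a saved log. A minimal sketch, assuming the log is in a local file with each message on a single line; lock_failures is a hypothetical helper name, not a juju command.

```shell
# Hypothetical helper (not part of juju): print the name of every lock that
# fslock reported as "attempted lock failed" in the given log file.
lock_failures() {
    # The lock name is the quoted string right after the message text.
    grep -o 'attempted lock failed "[^"]*"' "$1" | cut -d'"' -f2 | sort -u
}
```

Running it as lock_failures all-machines.log prints one lock name per line, with repeats collapsed.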

Revision history for this message
Tim Penhey (thumper) wrote : Re: trusty local provider unit agents stuck pending

No... the lock dir is /var/lib/juju/locks and will be a file called juju-trusty-template.

The lock dir is outside the local provider data dir as it is shared across all local environments.

It looks as if you Ctrl-C'ed an earlier attempt...
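
A minimal sketch of the cleanup, using the directory and lock name given in this comment as defaults. remove_stale_lock is a hypothetical helper, not a juju command, and it does not check whether anything still holds the lock, so make sure the environment is destroyed and no containers are running first.

```shell
# Hypothetical helper (not part of juju). Defaults are taken from the comment
# above: lock directory /var/lib/juju/locks, lock name juju-trusty-template.
remove_stale_lock() {
    lock_dir="${1:-/var/lib/juju/locks}"
    lock_name="${2:-juju-trusty-template}"
    lock_path="$lock_dir/$lock_name"

    if [ ! -e "$lock_path" ]; then
        echo "no lock found at $lock_path"
        return 0
    fi

    # Caller's responsibility: nothing that could hold the lock is running.
    echo "removing stale lock $lock_path"
    rm -r -- "$lock_path"
}
```

With no arguments the function targets /var/lib/juju/locks/juju-trusty-template, which will most likely need root privileges to remove.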

Revision history for this message
Tim Van Steenburgh (tvansteenburgh) wrote :

Removing the lock file fixed the problem!

Revision history for this message
John A Meinel (jameinel) wrote :

It sounds like this is no longer a critical bug, but there is still an open question: why did we end up with a stale lock file, and how do we recover if we can't prevent it?

tags: added: ppc64el
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.0 → 1.19.1
status: Incomplete → Triaged
Curtis Hovey (sinzui)
summary: - trusty local provider unit agents stuck pending
+ Stale lock causes local provider unit to be stuck pending
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.19.1 → 1.20.0
Revision history for this message
John A Meinel (jameinel) wrote :

I feel the part about a stale lock file being left behind is now bug #1311668

Revision history for this message
John A Meinel (jameinel) wrote :

closing *this* part of the bug as fix released, and the remainder is bug #1311668

Changed in juju-core:
milestone: 1.20.0 → 1.19.1
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released