Juju restore fails Could not get lock /var/lib/dpkg/lock

Bug #1342937 reported by Curtis Hovey
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju-core
Fix Released
Critical
Horacio Durán

Bug Description

All 3 runs of the backup restore test failed because of a race condition in commit 2357de0d
    http://juju-ci.vapour.ws:8080/job/functional-backup-restore

The issue was not present or rarely happened in the last week. commit abf3de87 passed. several commit merged after this:

7083058 Merge pull request #317 from jameinel/remove-apiconn-1326802
dfbe505 Merge pull request #121 from klyachin/120-networker
ea144d9 Merge pull request #316 from davecheney/137-names-refactor-part-ix
2357de0 Merge pull request #226 from gabriel-samfira/windows-toolsdir

From http://juju-ci.vapour.ws:8080/job/functional-backup-restore/1150/console

Starting restore.
juju --show-log destroy-environment test-function-aws-1 --force -y
2014-07-16 19:50:42 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-precise-amd64 gc]
2014-07-16 19:50:42 INFO juju.provider.ec2 ec2.go:204 opening environment "test-function-aws-1"
2014-07-16 19:50:42 INFO juju.provider.common destroy.go:15 destroying environment "test-function-aws-1"
2014-07-16 19:50:44 INFO juju.cmd supercommand.go:329 command finished
Restore failed:
2014-07-16 19:46:26 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-precise-amd64 gc]
Launching instance
 - i-57b79b7d
Waiting for address
Attempting to connect to ip-172-31-9-36.ec2.internal:22
Attempting to connect to 172.31.9.36:22
Attempting to connect to ec2-54-210-77-42.compute-1.amazonaws.com:22
Attempting to connect to 54.210.77.42:22
Warning: Permanently added 'ec2-54-210-77-42.compute-1.amazonaws.com,54.210.77.42' (ECDSA) to the list of known hosts.
Logging to /var/log/cloud-init-output.log on remote host
Installing add-apt-repository
Adding apt repository: deb http://ubuntu-cloud.archive.canonical.com/ubuntu precise-updates/cloud-tools main
Running apt-get update
Running apt-get upgrade
Installing package: curl
Installing package: cpu-checker
Installing package: bridge-utils
Installing package: rsyslog-gnutls
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_download} bytes; speed %{speed_download} bytes/s ' -o $bin/tools.tar.gz 'http://juju-dist.s3.amazonaws.com/testing/tools/releases/juju-1.21-alpha1-precise-amd64.tgz'
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
error: cannot restore bootstrap machine: update script failed: ssh command failed: subprocess encountered error code 100 ("Warning: Permanently added 'ec2-54-210-77-42.compute-1.amazonaws.com,172.31.9.36' (ECDSA) to the list of known hosts.\r\n+ export LC_ALL=C\n+ LC_ALL=C\n+ tar xzf juju-backup.tgz\n+ test -d juju-backup\n+ apt-get --option=Dpkg::Options::=--force-confold --option=Dpkg::options::=--force-unsafe-io --assume-yes --quiet install mongodb-clients\nE: Could not get lock /var/lib/dpkg/lock - open (11: Resource temporarily unavailable)\nE: Unable to lock the administration directory (/var/lib/dpkg/), is another process using it?\n")
2014-07-16 19:50:42 ERROR juju.cmd supercommand.go:323 subprocess encountered error code 1

I will escalate this to critical if the next revision cannot pass this test.

Curtis Hovey (sinzui)
Changed in juju-core:
importance: High → Critical
Revision history for this message
Curtis Hovey (sinzui) wrote :

Restore rarely passes now, but when 1.20.2 was tested a few hours ago, both tests that use restore worked fine and passed on the first try.

Also, there may be more than one issue, or that the issue is mutating because we also see an equal number of errors that look like this:

Starting restore.
juju --show-log destroy-environment test-function-aws-1 --force -y
2014-07-17 09:36:01 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-precise-amd64 gc]
2014-07-17 09:36:01 INFO juju.provider.ec2 ec2.go:204 opening environment "test-function-aws-1"
2014-07-17 09:36:01 INFO juju.provider.common destroy.go:15 destroying environment "test-function-aws-1"
2014-07-17 09:36:02 INFO juju.cmd supercommand.go:329 command finished
Restore failed:
2014-07-17 09:35:48 INFO juju.cmd supercommand.go:37 running juju [1.21-alpha1-precise-amd64 gc]
Launching instance
 - i-d5042aff
Waiting for address
ERROR bootstrap failed: refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)
Stopping instance...
error: cannot re-bootstrap environment: cannot bootstrap new instance: refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)
2014-07-17 09:36:01 ERROR juju.cmd supercommand.go:323 subprocess encountered error code 1

Revision history for this message
Curtis Hovey (sinzui) wrote :

I have some new observational issue about this issue. We are seeing slow apt access in ec2 this week. Unit tests run in ec2 had a high incidence of failure because they couldn't setup. Suspecting the race/timing problem related to slow apt, I move the restore tests to hp cloud, and they pass more often. While CI is reporting the restore tests pass, it is not reporting that they pass in Hp, but don't pass in ec2. This bug is still critical because we expect users to restore tests in ec2, the dpkg/apt conflict needs to be resolved.

Changed in juju-core:
assignee: nobody → Horacio Durán (hduran-8)
Revision history for this message
Horacio Durán (hduran-8) wrote :

This is a race condition, it is triggered when networker is trying to install vlan package:

This is a snip of `ps faux` for the machine at the moment of failure:

root 4622 8.9 0.5 599892 19904 ? Ssl 17:51 0:00 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
root 4928 0.0 0.0 17856 1424 ? S 17:51 0:00 \_ /bin/bash -s
root 4930 48.0 1.1 69068 45532 ? R 17:51 0:00 \_ apt-get --option Dpkg::Options::=--force-confold --assume-yes install vlan

I am now trying to figure out if networker in time does any check before running apt-get or it is also a possible point of failure.

Changed in juju-core:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.