Comment 27 for bug 1410876

John A Meinel (jameinel) wrote:

So I tried looking at 20150129.../32588, which has this in its status.yaml:
    containers:
      5/lxc/0:
        agent-state-info: 'error executing "lxc-wait": lxc_container: utils.c: mkdir_p:
          220 Not a directory - failed to create directory ''/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'';
          lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy
          snapshot %s - failed to allocate a pty; Insufficent privileges to control
          juju-trusty-lxc-template'
        instance-id: pending
        series: trusty
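As an aside, the doubled slash in '/run/lock/lxc//var/lib/...' looks like a lock directory naively concatenated with an already-absolute container path. A minimal model of how such a path could get built (this is an assumption about the behavior, not LXC's actual C code; the lock root is taken from the error message):

```python
LOCK_DIR = "/run/lock/lxc"  # lock root, as seen in the error message
container_path = "/var/lib/lxc/juju-trusty-lxc-template/snaps"

# Joining with "/" when the second component is already absolute
# reproduces the "//" seen in the mkdir_p failure above.
lock_path = LOCK_DIR + "/" + container_path
print(lock_path)
```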

I dug into the logs for machine-0 and machine-5:

For machine-5, we see it notice that it needs to create the 5/lxc/0 instance and grab the lock to create the template, but then only that something failed.

machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:119 Starting up provisioner task machine-5
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:331 found machine "5/lxc/0" pending provisioning
machine-5: 2015-01-29 10:44:15 INFO juju.provisioner.lxc lxc-broker.go:65 starting lxc container for machineId: 5/lxc/0
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:77 wait for flock on juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:15 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:117 template does not exist, creating
machine-5: 2015-01-29 10:44:32 ERROR juju.container.lxc clonetemplate.go:167 container failed to start: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner.lxc lxc-broker.go:97 failed to start container: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner provisioner_task.go:418 cannot start instance for machine "5/lxc/0": error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
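The acquire/release pair in that log is a straightforward exclusive file lock serializing template creation. A rough sketch of the pattern (juju's implementation is in Go; this Python model and the lock path are illustrative only):

```python
import fcntl
import os

# Illustrative lock path; juju's real lock location differs.
lock_path = "/tmp/juju-trusty-lxc-template.lock"

fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
try:
    # Blocks until no other process holds the exclusive lock, which is
    # how concurrent attempts to create the template are serialized.
    fcntl.flock(fd, fcntl.LOCK_EX)
    print('acquired lock "juju-trusty-lxc-template", ensure clone exists')
    # ... create or clone the template while holding the lock ...
finally:
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
    print('released lock "juju-trusty-lxc-template"')
```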

Note that "Could not destroy snapshot %s" doesn't occur in our codebase (that I could see). I'm pretty sure it comes from the LXC codebase. (I was hoping to fix the bare %s bug.)
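To illustrate the kind of bug that leaves a bare %s in the output (modelled in Python here; the real code would be LXC's C, and the snapshot name is hypothetical): if the message is passed along as plain data rather than through the formatter, the placeholder never gets expanded:

```python
snapshot = "snap0"  # hypothetical snapshot name

# Correct: the placeholder is consumed by the formatting step.
formatted = "Could not destroy snapshot %s" % snapshot
print(formatted)

# Buggy pattern: the format string is passed along as plain data and
# emitted verbatim, so the literal "%s" survives into the log output,
# matching the error above.
message = "Could not destroy snapshot %s"
print(message)
```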

If you look on machine-0 around the time this is happening, you can see the requests from machine-5 to get information on parameters for starting the container:

2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":50,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 370.403us {"RequestId":50,"Response":{"Results":[{"Error":{"Message":"machine 5/lxc/0 is not provisioned","Code":"not provisioned"},"Result":""}]}} Provisioner[""].InstanceId
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":51,"Type":"Provisioner","Request":"Status","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 371.87us {"RequestId":51,"Response":{"Results":[{"Error":null,"Id":"","Life":"","Status":"pending","Info":"","Data":{}}]}} Provisioner[""].Status
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":52,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 376.336us {"RequestId":52,"Response":{"Results":[{"Error":{"Message":"machine 5/lxc/0 is not provisioned","Code":"not provisioned"},"Result":""}]}} Provisioner[""].InstanceId
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":53,"Type":"Provisioner","Request":"SetPasswords","Params":{"Changes":[{"Tag":"machine-5-lxc-0","Password":"MiO79JBQtzWDmVsWZ65/qfvp"}]}}
2015-01-29 10:44:15 INFO juju.state.apiserver.common password.go:98 setting password for "machine-5-lxc-0"
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 487.965466ms {"RequestId":53,"Response":{"Results":[{"Error":null}]}} Provisioner[""].SetPasswords
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":54,"Type":"Provisioner","Request":"ProvisioningInfo","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 605.548us {"RequestId":54,"Response":{"Results":[{"Error":null,"Result":{"Constraints":{"tags":["hw-ok"]},"Series":"trusty","Placement":"","Networks":[],"Jobs":["JobHostUnits"]}}]}} Provisioner[""].ProvisioningInfo
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":55,"Type":"Provisioner","Request":"ContainerConfig","Params":{}}
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 659.621us {"RequestId":55,"Response":{"ProviderType":"maas","AuthorizedKeys":"<SNIPPED>","SSLHostnameVerification":true,"Proxy":{"Http":"","Https":"","Ftp":"","NoProxy":""},"AptProxy":{"Http":"","Https":"","Ftp":"","NoProxy":""}}} Provisioner[""].ContainerConfig

But then there is nothing about machine-5 for the next 17s, until machine-5 informs us that the container could not be started:
2015-01-29 10:44:32 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":56,"Type":"Provisioner","Request":"SetStatus","Params":{"Entities":[{"Tag":"machine-5-lxc-0","Status":"error","Info":"error executing \"lxc-wait\": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template","Data":null}]}}
2015-01-29 10:44:32 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 395.460665ms {"RequestId":56,"Response":{"Results":[{"Error":null}]}} Provisioner[""].SetStatus

We are certainly missing the TRACE-level logs for LXC operations on
machine-5, and we are also missing logging from the Proxy code that would
tell us when someone makes a request and it fails.

I'm not sure how artifacts-1410876 aligns with anything else. The error message there was different, though, and included some information about some other commands we were trying to run that were failing. (It showed a 500 ICE for the Juju server, IIRC.)

From what I can tell, none of the oil-triage-20150131 logs have content in them.

I did do:
 grep -B5 -A8 -rnI "lxc container cloning failed" */pipeline_deploy/*/juju_status.yaml

There appear to be only 5 failures, but they don't seem to be correlated. Specifically, the failures occurred on machines 2, 3, 6, 6, and 5, on hardware hayward-50.oil, hayward-16.oil, hayward-63.oil, hayward-50.oil, and hayward-53.oil respectively.

The only overlap is 'machine 6' twice and 'hayward-50' twice, but it looks to be just some sort of internal race as to which instance or machine it happens on (I see other status outputs where hayward-50 successfully creates a container/0 on it).