So I tried looking at 20150129.../32588 which has this in the status.yaml:
containers:
  5/lxc/0:
    agent-state-info: 'error executing "lxc-wait": lxc_container: utils.c: mkdir_p:
      220 Not a directory - failed to create directory ''/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'';
      lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy
      snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template'
    instance-id: pending
    series: trusty
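As an aside, rather than eyeballing each status.yaml by hand, a throwaway sketch like the following pulls the error lines out of a whole run. This is my own scratch code, not part of any tooling, and the artifact layout in the glob is assumed from the grep further down:

package main

import (
	"fmt"
	"io/ioutil"
	"path/filepath"
	"strings"
)

func main() {
	// Assumed layout, matching the grep used later in this comment.
	files, err := filepath.Glob("*/pipeline_deploy/*/juju_status.yaml")
	if err != nil {
		panic(err)
	}
	for _, f := range files {
		data, err := ioutil.ReadFile(f)
		if err != nil {
			continue // unreadable artifact; skip it
		}
		for n, line := range strings.Split(string(data), "\n") {
			// Only catches the first line of a wrapped YAML string,
			// which is enough to locate the failing machine.
			if strings.Contains(line, "agent-state-info") && strings.Contains(line, "error") {
				fmt.Printf("%s:%d: %s\n", f, n+1, strings.TrimSpace(line))
			}
		}
	}
}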
I dug into the logs for machine-0 and machine-5:
For machine-5 we see it notice that it wants to create a 5/lxc/0 instance, and that it grabs the lock to create the template, but then we learn only that something failed:
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:119 Starting up provisioner task machine-5
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_task.go:331 found machine "5/lxc/0" pending provisioning
machine-5: 2015-01-29 10:44:15 INFO juju.provisioner.lxc lxc-broker.go:65 starting lxc container for machineId: 5/lxc/0
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:77 wait for flock on juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:15 INFO juju.container lock.go:50 acquire lock "juju-trusty-lxc-template", ensure clone exists
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:117 template does not exist, creating
machine-5: 2015-01-29 10:44:32 ERROR juju.container.lxc clonetemplate.go:167 container failed to start: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 INFO juju.container lock.go:66 release lock "juju-trusty-lxc-template"
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner.lxc lxc-broker.go:97 failed to start container: error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner provisioner_task.go:418 cannot start instance for machine "5/lxc/0": error executing "lxc-wait": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
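For context, the flow those log lines trace through clonetemplate.go looks roughly like this. It is a paraphrase reconstructed from the log messages, not the actual Juju source; the lock path and the createTemplate stub are placeholders:

package main

import (
	"fmt"
	"os"
	"syscall"
)

// Paraphrase of the ensure-template flow seen above: take an exclusive
// flock, create the template only if it is missing, and release the lock
// on the way out even when creation fails (as the 10:44:32 release shows).
func ensureCloneTemplate(lockPath, templatePath string) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0644)
	if err != nil {
		return err
	}
	defer f.Close()

	// "wait for flock on juju-trusty-lxc-template": blocks until held.
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	// "release lock ...": runs on the error path too.
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	if _, err := os.Stat(templatePath); os.IsNotExist(err) {
		// "template does not exist, creating": in Juju this is where
		// lxc-create/lxc-start/lxc-wait run, and where the mkdir_p
		// failure above was reported from.
		return createTemplate(templatePath)
	}
	return nil
}

// createTemplate stands in for the real lxc-create/lxc-wait sequence.
func createTemplate(path string) error {
	return fmt.Errorf("stub: would create template at %s", path)
}

func main() {
	err := ensureCloneTemplate("/tmp/juju-trusty-lxc-template.lock",
		"/var/lib/lxc/juju-trusty-lxc-template")
	if err != nil {
		fmt.Println("container failed to start:", err)
	}
}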
Note that "Could not destroy snapshot %s" doesn't occur in our codebase (that I could see). I'm pretty sure it is something in the LXC codebase. (I was hoping to fix the bare %s bug).
If you look on machine-0 around the time this is happening, you can see the requests from machine-5 to get information on parameters for starting the container:
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":50,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 370.403us {"RequestId":50,"Response":{"Results":[{"Error":{"Message":"machine 5/lxc/0 is not provisioned","Code":"not provisioned"},"Result":""}]}} Provisioner[""].InstanceId
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":51,"Type":"Provisioner","Request":"Status","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 371.87us {"RequestId":51,"Response":{"Results":[{"Error":null,"Id":"","Life":"","Status":"pending","Info":"","Data":{}}]}} Provisioner[""].Status
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":52,"Type":"Provisioner","Request":"InstanceId","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 376.336us {"RequestId":52,"Response":{"Results":[{"Error":{"Message":"machine 5/lxc/0 is not provisioned","Code":"not provisioned"},"Result":""}]}} Provisioner[""].InstanceId
2015-01-29 10:44:14 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":53,"Type":"Provisioner","Request":"SetPasswords","Params":{"Changes":[{"Tag":"machine-5-lxc-0","Password":"MiO79JBQtzWDmVsWZ65/qfvp"}]}}
2015-01-29 10:44:15 INFO juju.state.apiserver.common password.go:98 setting password for "machine-5-lxc-0"
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 487.965466ms {"RequestId":53,"Response":{"Results":[{"Error":null}]}} Provisioner[""].SetPasswords
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":54,"Type":"Provisioner","Request":"ProvisioningInfo","Params":{"Entities":[{"Tag":"machine-5-lxc-0"}]}}
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 605.548us {"RequestId":54,"Response":{"Results":[{"Error":null,"Result":{"Constraints":{"tags":["hw-ok"]},"Series":"trusty","Placement":"","Networks":[],"Jobs":["JobHostUnits"]}}]}} Provisioner[""].ProvisioningInfo
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":55,"Type":"Provisioner","Request":"ContainerConfig","Params":{}}
2015-01-29 10:44:15 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 659.621us {"RequestId":55,"Response":{"ProviderType":"maas","AuthorizedKeys":"<SNIPPED>","SSLHostnameVerification":true,"Proxy":{"Http":"","Https":"","Ftp":"","NoProxy":""},"AptProxy":{"Http":"","Https":"","Ftp":"","NoProxy":""}}} Provisioner[""].ContainerConfig
but then nothing about machine-5 for the next 17s until machine-5 informs us that the container could not be started:
2015-01-29 10:44:32 DEBUG juju.state.apiserver apiserver.go:150 <- [5B] machine-5 {"RequestId":56,"Type":"Provisioner","Request":"SetStatus","Params":{"Entities":[{"Tag":"machine-5-lxc-0","Status":"error","Info":"error executing \"lxc-wait\": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/lxc//var/lib/lxc/juju-trusty-lxc-template/snaps'; lxc_container: lxccontainer.c: do_snapshot_destroy: 3272 Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template","Data":null}]}}
2015-01-29 10:44:32 DEBUG juju.state.apiserver apiserver.go:157 -> [5B] machine-5 395.460665ms {"RequestId":56,"Response":{"Results":[{"Error":null}]}} Provisioner[""].SetStatus
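That 17s gap (10:44:15 to 10:44:32) is easy to miss when scanning by eye. A throwaway filter like this one, which assumes the timestamp prefix format shown in the excerpts above, surfaces it from a machine-0 log fed on stdin:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

func main() {
	// Report gaps longer than 10s between consecutive machine-5 entries.
	const layout = "2006-01-02 15:04:05" // matches the log prefix above
	var last time.Time
	var lastLine string
	sc := bufio.NewScanner(os.Stdin)
	sc.Buffer(make([]byte, 1024*1024), 1024*1024) // apiserver lines get long
	for sc.Scan() {
		line := sc.Text()
		if len(line) < 19 || !strings.Contains(line, "machine-5") {
			continue
		}
		ts, err := time.Parse(layout, line[:19])
		if err != nil {
			continue // no timestamp prefix; skip
		}
		if !last.IsZero() && ts.Sub(last) > 10*time.Second {
			fmt.Printf("%v gap between:\n  %s\n  %s\n", ts.Sub(last), lastLine, line)
		}
		last, lastLine = ts, line
	}
}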
We are certainly missing both the TRACE-level logs for LXC operations on machine-5 and logging from the Proxy code that would tell us someone is making a request but that it is failing.
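If memory serves, these loggers are all loggo modules, so the missing detail should be recoverable by raising just the container module to TRACE; the spec string below is the same format the environment's logging-config setting takes. A minimal standalone sketch of the mechanism (module name taken from the log lines above):

package main

import "github.com/juju/loggo"

var logger = loggo.GetLogger("juju.container.lxc")

func main() {
	// Raise only this module to TRACE, leaving everything else quiet.
	if err := loggo.ConfigureLoggers("<root>=WARNING; juju.container.lxc=TRACE"); err != nil {
		panic(err)
	}
	// The kind of line whose absence hurt here: exactly what we exec'd.
	logger.Tracef("running: lxc-wait --name %s", "juju-trusty-lxc-template")
}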
I'm not sure how artifacts-1410876 aligns with anything else. The error message there was different, though, and it included information about some other commands we were trying to run that were failing. (It showed a 500 ICE for the Juju server, IIRC.)
From what I can tell, none of the oil-triage-20150131 logs have content in them.
I did do:
grep -B5 -A8 -rnI "lxc container cloning failed" */pipeline_deploy/*/juju_status.yaml
There appear to be only 5 failures, but they don't seem to be correlated. Specifically, the failures happened on machines 2, 3, 6, 6, and 5, and on hardware hayward-50.oil, hayward-16.oil, hayward-63.oil, hayward-50.oil, and hayward-53.oil.
The only things that repeat are 'machine 6' twice and 'hayward-50' twice, but it looks to be just some sort of internal race as to which instance or which machine it happens on (I see other status outputs where hayward-50 successfully creates a container/0 on it).