Error executing lxc-clone: lxc_container: utils.c: mkdir_p 220 Not a directory - Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges to control juju-trusty-lxc-template
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| juju-core | Invalid | High | Unassigned | |
| lxc | Fix Released | Undecided | Unassigned | |
| lxc (Ubuntu) | Fix Released | High | Unassigned | |
| lxc (Ubuntu Trusty) | Fix Released | High | Unassigned | |
Bug Description
This is for:
++ export OPENSTACK_
++ OPENSTACK_
++ export COMPUTE=nova-lxc
++ COMPUTE=nova-lxc
++ export UBUNTU_
++ UBUNTU_
Tool version is 1.20.14:
Launching instance
WARNING picked arbitrary tools &{1.20.
- /MAAS/api/
and agent-version on node says 1.20.13.
Juju-status.yaml:
'5':
  agent-state: started
  agent-version: 1.20.13
  containers:
    5/lxc/0:
      agent-state-info: 'error executing "lxc-clone": lxc_container: utils.c: mkdir_p:
        220 Not a directory - failed to create directory ''/run/
        lxc_container: lxccontainer.c: do_snapshot_
        Could not destroy snapshot %s - failed to allocate a pty; Insufficent privileges
        to control juju-trusty-
      instance-id: pending
      series: trusty
    5/lxc/1:
      agent-state-info: cannot clone a running container
      instance-id: pending
      series: trusty
  dns-name: apsaras.oil
  hardware: arch=amd64 cpu-cores=4 mem=32768M tags=hw-
  instance-id: /MAAS/api/
  series: trusty
and from log:
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
machine-0: 2015-01-14 15:32:12 DEBUG juju.state.
| description: | updated |
| tags: | added: oil |
| tags: | added: lxc trusty |
| Changed in juju-core: | |
| status: | New → Triaged |
| importance: | Undecided → High |
| milestone: | none → 1.23 |
| Larry Michel (lmic) wrote : | #1 |
| Larry Michel (lmic) wrote : | #2 |
Openstack bundle file
| Horacio Durán (hduran-8) wrote : | #3 |
I took a look and asked the OIL team to try to reproduce this in newer versions and different configurations (lxc-use-
The relevant line on the logs is:
juju_debug_
and
2015-01-15 11:41:05 ERROR juju.provisioner provisioner_
After checking the code I found that the clone operation's error output is relayed directly from golxc.Clone, so I cannot determine exactly what is going on there.
Just in case the error could be in a path I checked http://
| Tim Penhey (thumper) wrote : | #4 |
The problem is a little hard to diagnose with the current output.
1) are you able to rerun the tests with the following config?
logging-
- you can either do this at bootstrap time like this:
juju bootstrap --logging-
- or on a running environment like this:
juju set-env 'logging-
2) Has this ever passed before? If so, what other config settings have changed?
3) Can we also get the values from /etc/lxc ?
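The exact values are truncated above; a sketch of both forms, assuming the flag is --logging-config and a value like <root>=DEBUG (the value visible in the machine log quoted in comment #31):

```sh
# Hedged sketch; the real flag name and value are truncated above.
# "<root>=DEBUG" is the value seen in the machine log quoted in comment #31.

# At bootstrap time:
juju bootstrap --logging-config '<root>=DEBUG'

# Or on a running environment:
juju set-env 'logging-config=<root>=DEBUG'
```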
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| Curtis Hovey (sinzui) wrote : | #5 |
Does this issue really affect 1.21-beta5? This issue doesn't show any evidence of regression. I don't think this should block the release of 1.21-rc1.
| Changed in juju-core: | |
| milestone: | 1.23 → none |
| Larry Michel (lmic) wrote : | #6 |
I am looking into the frequency of this bug to see when it was first and last recreated. I am also attempting to recreate it with the suggested debug settings. Attaching the deployment file that was used for this particular failure.
| no longer affects: | juju-core/1.21 |
| Larry Michel (lmic) wrote : | #7 |
We are hitting this for 1.21 in our production environment. Because of other blocking issues, we are not able to recreate it in a test environment yet. Please see all debug data for 48 recreates for the month of January, with the last 4 being on 1.21. The debug flag is turned on; what's missing is the /etc/lxc content, and we're looking into how we can collect this info in the production environment.
Has this ever worked? Since we're seeing this for different configurations, the answer would pertain to what's common, and neutron-contrail is the common thread.
| Changed in juju-core: | |
| status: | Incomplete → New |
| Curtis Hovey (sinzui) wrote : | #8 |
Yes. Juju is very good at creating lxc containers; it always works for the release. Every revision is tested in a minimum of 14 combinations with lxc, then retested in weekly tests. If /etc/lxc were missing there would be systemic failures in CI (and this has happened). As for this being universally broken... only OIL is reporting this case.
As for /etc/lxc being missing, I believe the lxc package wasn't installed. Maybe there is a case where it is not installed, or apt is still installing packages.
| no longer affects: | juju-core/1.22 |
| Changed in juju-core: | |
| status: | New → Triaged |
| milestone: | none → 1.23 |
| Larry Michel (lmic) wrote : | #9 |
Curtis, sorry for the confusion. /etc/lxc is not missing from the system. I meant that its content is missing from the logs since we are not currently collecting its content in the event of a failure.
| Larry Michel (lmic) wrote : | #10 |
Artifacts from one of the builds.
| Greg Lutostanski (lutostag) wrote : | #11 |
I will run a deployment with the same yaml bundle and see if we can repro it offline, so we can poke at it interactively and see what is happening.
| Jason Hobbs (jason-hobbs) wrote : | #12 |
We're still seeing this with 1.22.
| Changed in juju-core: | |
| milestone: | 1.23 → 1.24-alpha1 |
| tags: | added: stakeholder-critical |
| Changed in juju-core: | |
| milestone: | 1.24-alpha1 → 1.25.0 |
| John A Meinel (jameinel) wrote : | #13 |
These files in artifacts.zip make it a bit hard to debug on our end:
juju_debug_log.txt
cat: /var/log/
juju_debug_
cat: /var/log/
I'm pretty sure those are owned as root, probably with 700 permissions. (Depending on debug level secrets can end up in those files, which is why they aren't world readable.)
I'm also having a really hard time figuring out some of the formatting of that error message, but I'm trying.
It appears to be trying to download the root image from a local machine:
+ url2=https:/
What is 10.245.0.177? Given the port 17070, it is almost definitely the Juju API server.
+ cd /var/cache/
+ ''['' 0 -eq 1 '']'';
It looks like maybe it is failing to cd $cache, even though earlier it does issue a "mkdir -p $cache".
Alternatively, that check may actually just be seeing whether the file already exists, and since it doesn't, it goes ahead and issues the wget request. wget seems unhappy:
+ wget https:/
--2015-02-03 01:16:43--
https:/
Connecting to 10.245.
HTTP request sent, awaiting response... 500 Internal Server Error;
2015-02-03 01:16:43 ERROR 500: Internal Server Error.;
If we assume 10.245.0.177 is a juju state server hosting root images, it is getting a 500 ISE.
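As a reading aid, here is a hedged reconstruction of the check-then-fetch pattern being described; the cache path, filename, and URL shape are assumptions, since the real values are truncated in the trace above:

```sh
# Hedged reconstruction of the traced script's logic; the cache path, image
# filename, and URL below are assumptions (the real values are truncated above).
cache=/var/cache/lxc/cloud-trusty
filename=trusty-server-cloudimg-amd64-root.tar.gz
url2="https://10.245.0.177:17070/$filename"

mkdir -p "$cache"
cd "$cache" || exit 1

cached=0
[ -e "$filename" ] && cached=1
if [ "$cached" -eq 1 ]; then      # the "'[' 0 -eq 1 ']'" seen in the trace:
    echo "image already cached"   # 0 means the file is not cached yet...
else
    wget "$url2"                  # ...so the script falls through to wget,
fi                                # which is what got the 500 from the API server
```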
| John A Meinel (jameinel) wrote : | #14 |
So Ian noted that the Juju API server is acting as a caching proxy here. The client requests the file from Juju, and Juju is supposed to download it from cloud-images, cache the file, and then return it.
Without the log files (all-machines.log and/or machine-0.log) it is hard to know what the internal failure is.
Can you possibly try changing the line:
+ juju ssh -e maas 0 'cat /var/log/
to
+ juju ssh -e maas 0 'sudo cat /var/log/
?
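Assuming the standard log location (the path is truncated above), the amended line would look something like:

```sh
# Assumed full form of the suggested change; /var/log/juju/all-machines.log is
# the conventional location, but the exact path is truncated in the comment above.
juju ssh -e maas 0 'sudo cat /var/log/juju/all-machines.log'
```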
| Alexis Bruemmer (alexis-bruemmer) wrote : | #15 |
Based on John's last comment, logs will be needed before further progress can be made on this bug.
| Changed in juju-core: | |
| status: | Triaged → Incomplete |
| no longer affects: | juju-core/1.24 |
| Larry Michel (lmic) wrote : | #16 |
John, the logs you are requesting are already attached to this bug. I looked into the tarball logfiles-
All the other directories have complete all-machines.log output.
| Changed in juju-core: | |
| status: | Incomplete → New |
| Larry Michel (lmic) wrote : | #17 |
It just so happens that the archive manager displays these 2 directories first, so I understand why it would look like they're all bad. You should see a total of 43 complete all-machines.log files in that tarball. Hopefully, these help.
lmic@lmic-
cat: /var/log/
cat: /var/log/
cat: /var/log/
cat: /var/log/
cat: /var/log/
lmic@lmic-
5
lmic@lmic-
48
| Changed in juju-core: | |
| status: | New → Triaged |
| Ante Karamatić (ivoks) wrote : | #18 |
'failed to allocate a pty' might indicate exhausted ptys. Is /dev/pts mounted? How many are in use (sudo lsof /dev/pts/*) when it fails?
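A few quick checks along those lines, to run on the affected host when the failure occurs (a sketch):

```sh
# Quick pty-exhaustion checks suggested above; run on the affected machine
# while the failure is happening.
mount | grep devpts                       # is /dev/pts mounted?
sudo lsof /dev/pts/* 2>/dev/null | wc -l  # how many ptys are in use?
cat /proc/sys/kernel/pty/nr               # ptys currently allocated
cat /proc/sys/kernel/pty/max              # system-wide pty limit
```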
| Jason Hobbs (jason-hobbs) wrote : | #19 |
I think the 'failed to allocate a pty' bit is a red herring - it looks like the original failure is this:
error executing "lxc-clone": lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lock/
This error comes from lxc's use of libc's mkdir(), which is setting errno to ENOTDIR, meaning one of the components in the directory it's trying to create is not actually a directory.
So, which part of the '/run/lock/
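One way to answer that, if the failure can be caught live, is to run the failing clone under strace and look for the mkdir call that returns ENOTDIR (a sketch; the template name is taken from this report and the clone name is made up):

```sh
# Sketch: trace the failing clone to see exactly which mkdir gets ENOTDIR.
# "juju-trusty-lxc-template" is the template named in this report; "test-clone"
# is an arbitrary target name.
sudo strace -f -e trace=mkdir,mkdirat \
    lxc-clone -s -o juju-trusty-lxc-template -n test-clone 2>&1 | grep ENOTDIR
```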
| Jason Hobbs (jason-hobbs) wrote : | #20 |
Well, I don't know. The lxc code looks like it tries another path if that mkdir fails...
| Serge Hallyn (serge-hallyn) wrote : | #21 |
could you please show the result of:
full=/run/
while [ "$full" != "/" ]; do
    echo "XXX: $full"
    ls -ld "$full"
    full=`dirname "$full"`
done
| Jason Hobbs (jason-hobbs) wrote : | #22 |
The problem with running scripts like that is that this is all coming from an automated environment which cleans up after failures. We haven't been able to add a script to do that yet.
Did see another failure of this that has a slightly different error message:
cleanup of containers needed: error executing "lxc-wait": lxc_container:
utils.c: mkdir_p: 220 Not a directory - failed to create directory ''/run/
snapshot %s - failed to allocate a pty; Insufficent privileges to control
| Jason Hobbs (jason-hobbs) wrote : | #23 |
It looks like the lxc commands that juju uses aren't logged at DEBUG log level. It would be really helpful for debugging this if they were.
For example, from the error here, it's not clear what 'lxc-clone' command is being called:
2015-01-15 11:41:05 ERROR juju.provisioner provisioner_
Where is the "%s" coming from? Either lxc or juju is not converting a format specifier and it's not clear which. Logging what lxc commands juju calls would clear that up.
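As a generic illustration of how a literal %s can survive (this is not lxc's or juju's actual code): a message template leaks its format specifier when it is emitted verbatim instead of being run through its formatter.

```sh
# Generic illustration only; not lxc's or juju's actual logging code.
msg='Could not destroy snapshot %s'
printf "$msg\n" snap0    # formatted: the snapshot name is substituted
echo "$msg"              # emitted verbatim: the literal "%s" leaks through,
                         # which is the symptom seen in the error above
```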
| Ian Booth (wallyworld) wrote : | #24 |
The calls are logged at trace level. You could enable that:
juju set-env logging-
Or have logging-config set in the yaml.
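The exact value is truncated above; a sketch of what enabling that might look like, assuming the logger name golxc (which is what appears in the TRACE lines quoted later in this bug):

```sh
# Sketch; the exact setting is truncated above. "golxc" is assumed from the
# TRACE lines quoted later in this bug (e.g. "TRACE golxc.run.lxc-wait").
juju set-env 'logging-config=<root>=DEBUG;golxc=TRACE'

# Or put the same value under logging-config in the environment's yaml
# before bootstrapping.
```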
| Jason Hobbs (jason-hobbs) wrote : | #25 |
To do logging config in the yaml, we'd add:
logging-config: "<root>
to our environments' config in our environment.yaml, right?
| John A Meinel (jameinel) wrote : | #26 |
Sorry about missing the logs. I had looked at the latest comment (artifacts-
I'll try to dig through the bigger tarball.
logging-config looks correct to me.
| John A Meinel (jameinel) wrote : | #27 |
So I tried looking at 20150129.../32588 which has this in the status.yaml:
containers:
5/lxc/0:
220 Not a directory - failed to create directory ''/run/
snapshot %s - failed to allocate a pty; Insufficent privileges to control
series: trusty
I dug into the logs for machine-0 and machine-5:
For machine-5 we see it notice that it wants to create a 5/lxc/0 instance and that it grabs the lock to create the template, but then only that something failed.
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_
machine-5: 2015-01-29 10:44:14 INFO juju.provisioner provisioner_
machine-5: 2015-01-29 10:44:15 INFO juju.provisione
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.go:77 wait for flock on juju-trusty-
machine-5: 2015-01-29 10:44:15 INFO juju.container lock.go:50 acquire lock "juju-trusty-
machine-5: 2015-01-29 10:44:15 INFO juju.container.lxc clonetemplate.
machine-5: 2015-01-29 10:44:32 ERROR juju.container.lxc clonetemplate.
machine-5: 2015-01-29 10:44:32 INFO juju.container lock.go:66 release lock "juju-trusty-
machine-5: 2015-01-29 10:44:32 ERROR juju.provisione
machine-5: 2015-01-29 10:44:32 ERROR juju.provisioner provisioner_
Note that "Could not destroy snapshot %s" doesn't occur in our codebase (that I could see). I'm pretty sure it is something in the LXC codebase. (I was hoping to fix the bare %s bug).
If you look on machine-0 around the time this is h...
| Jason Hobbs (jason-hobbs) wrote : | #28 |
I've attached the debug log and status from a failure we hit yesterday, this time with LXC trace logging enabled. It doesn't show the failed command in the debug log though, so it's not capturing what I wanted to see.
| Jason Hobbs (jason-hobbs) wrote : | #29 |
| Jason Hobbs (jason-hobbs) wrote : | #30 |
It looks like the lxc-clone command being used isn't logged even in the successful case. Here are example logs that show the template being created and then lxc-0 up and running, but no logged clone command.
http://
Why is the clone command not being logged?
| Ryan Beisner (1chb1n) wrote : | #31 |
We are experiencing this in our automated OpenStack bare metal testing (UOSCI). It appears to be a race, as it is an intermittent issue. It is frequent enough to cause noticeable test failures, though.
The observable symptom is that juju-deployer times out, as some container units are forever in an "allocating" & "Waiting for agent initialization to finish" state.
The impact is that we cannot reliably test OpenStack deployments on baremetal.
Juju version is 1.24.2-
# from the machine log:
2015-07-21 19:38:35 DEBUG juju.worker.logger logger.go:45 reconfiguring logging from "<root>=DEBUG" to "<root>
2015-07-21 19:41:05 ERROR juju.provisione
2015-07-21 19:41:05 ERROR juju.provisioner provisioner_
2015-07-21 19:41:06 ERROR juju.provisione
2015-07-21 19:41:06 ERROR juju.provisioner provisioner_
# More detail:
http://
| Ryan Beisner (1chb1n) wrote : | #32 |
# machines where juju deploy to lxc failed:
ubuntu@
lxc:
Installed: 1.1.2-0ubuntu3~
Candidate: 1.1.2-0ubuntu3~
Version table:
*** 1.1.2-0ubuntu3~
500 http://
100 /var/lib/
1.
500 http://
1.0.3-0ubuntu3 0
500 http://
ubuntu@
# machines where juju deploy to lxc succeeded:
ubuntu@
lxc:
Installed: 1.0.7-0ubuntu0.1
Candidate: 1.1.2-0ubuntu3~
Version table:
1.
500 http://
*** 1.0.7-0ubuntu0.1 0
500 http://
100 /var/lib/
1.0.3-0ubuntu3 0
500 http://
See the paste in my previous comment for the juju stat view.
| Ryan Beisner (1chb1n) wrote : | #33 |
Observations re: the deployment referenced @ http://
Machines 1,2,3 - all have lxc 1.0.7, and happy containers. The base charm is ceph.
Machines 4,5,6 - all have lxc 1.1.2, and no love. The base charm is nova-compute.
# juju run --machine 0,1,2,3,4,5,6 "apt-cache policy lxc"
http://
While this does illustrate a behavioral inconsistency between ceph and nova-compute, in that they apparently add the cloud archive ppa at different moments during hook execution, I don't think that this is inherently an OpenStack or charm issue.
| Serge Hallyn (serge-hallyn) wrote : | #34 |
Hi,
I think I see what's going on.
lxc mutexes container actions using a lockfile, something like /run/lock/
When you create a snapshot, the lockfile /run/lock/
If the container's own lockfile happens to exist at the moment you attempt to create the snapshot lockfile, then you would see this 'mkdir: Not a directory' error, because it tries to create that path as a directory when it already exists as a file.
There are several ways this could be fixed in lxc, but as a workaround the best I can suggest is not to act on a snapshot while concurrently acting on the parent container (from another shell/task).
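A minimal sketch of that collision, using a scratch directory and the template name from this bug (the real /run/lock paths are truncated throughout this report, so these are stand-ins):

```sh
# Minimal sketch of the collision described above; the lock directory is a
# stand-in, since the real /run/lock paths are truncated in this report.
lockdir=/tmp/fake-run-lock/lxc
mkdir -p "$lockdir"

# The container's own lockfile exists as a regular file...
touch "$lockdir/juju-trusty-lxc-template"

# ...so when the snapshot code needs a lock *under* a directory of the same
# name, mkdir -p hits the file and fails with ENOTDIR ("Not a directory"),
# matching the lxc-clone/lxc-wait errors in this bug.
mkdir -p "$lockdir/juju-trusty-lxc-template/snaps"
```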
| Serge Hallyn (serge-hallyn) wrote : | #35 |
The nicest fix is probably to use /run/lock/
| Larry Michel (lmic) wrote : | #36 |
We've been hitting this on juno and kilo, 18 out of 214 builds.
LXC versions on systems I checked: lxc:amd64 (1.0.7-0ubuntu0.1)
| tags: | added: openstack-provider uosci |
| Serge Hallyn (serge-hallyn) wrote : | #37 |
It turns out this was actually fixed in the latest security release, IIUC.
| Jason Hobbs (jason-hobbs) wrote : | #38 |
Is that fix going to be available in trusty?
| Ryan Beisner (1chb1n) wrote : | #39 |
FYI: following the security release, I've not been able to reproduce this manually, and our automated testing, which was failing, is now passing.
| Ryan Beisner (1chb1n) wrote : | #40 |
Package link for convenience: https:/
| Changed in juju-core: | |
| status: | Triaged → Invalid |
| Changed in juju-core: | |
| milestone: | 1.25.0 → none |
| Jason Hobbs (jason-hobbs) wrote : | #41 |
We're still seeing this with 1.0.7-0ubuntu0.2 and juju-core 1.24.4. I've attached the juju machine log, which contains the log of the LXC activity on that machine, including the failing commands, starting with this one:
2015-08-13 04:09:12 TRACE golxc.run.lxc-wait golxc.go:448 run: lxc-wait [--name juju-machine-
2015-08-13 04:09:12 TRACE golxc.run.lxc-wait golxc.go:458 run failed output: lxc_container: utils.c: mkdir_p: 220 Not a directory - failed to create directory '/run/lxc/
failed to create lock
lxc_container: lxccontainer.c: do_snapshot_
Insufficent privileges to control juju-machine-
| Serge Hallyn (serge-hallyn) wrote : | #42 |
Hi,
thanks for pointing that out. The trusty package appears to be subtly different - its src/lxc/lxclock.c creates "/run/lxc/
| Changed in lxc: | |
| status: | New → Fix Released |
| Changed in lxc (Ubuntu): | |
| status: | New → Fix Released |
| importance: | Undecided → High |
| Changed in lxc (Ubuntu Trusty): | |
| importance: | Undecided → High |
| status: | New → Triaged |
| Serge Hallyn (serge-hallyn) wrote : | #44 |
Hello Larry, or anyone else affected,
Accepted lxc into trusty-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-
Further information regarding the verification process can be found at https:/
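For anyone verifying: the usual steps to pull the package from trusty-proposed look like this (a sketch; check apt-cache policy for the exact version to report back):

```sh
# Standard way to pull in the trusty-proposed lxc for verification (a sketch;
# confirm the exact version with apt-cache policy after enabling the pocket).
sudo add-apt-repository 'deb http://archive.ubuntu.com/ubuntu trusty-proposed main universe'
sudo apt-get update
sudo apt-get install -t trusty-proposed lxc
apt-cache policy lxc
```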
| Changed in lxc (Ubuntu Trusty): | |
| status: | Triaged → Fix Committed |
| tags: | added: verification-needed |
| Brian Murray (brian-murray) wrote : | #46 |
I've accepted this into -proposed but the bug would benefit from a more detailed test case and regression statement.
| Serge Hallyn (serge-hallyn) wrote : | #47 |
@jason-hobbs or @lmic,
can you provide a concise recipe for reproducing this in the Description?
This bug is hit very rarely - we deploy thousands of containers a day and
see it a few times a week. I don't have a way to reproduce it reliably
other than waiting to hit it in OIL.
On Fri, Aug 21, 2015 at 12:22 PM, Serge Hallyn <email address hidden>
wrote:
> @jason-hobbs or @lmic,
>
> can you provide a concise recipe for reproducing this in the
> Description?
| Changed in lxc (Ubuntu Trusty): | |
| status: | Fix Committed → Fix Released |
Logs from one failure