Juju cannot create vivid containers

Bug #1442308 reported by Curtis Hovey on 2015-04-09
This bug affects 8 people
Affects              Status     Importance  Assigned to      Milestone
juju-core            -          High        Cheryl Jennings  -
  1.23               -          Medium      Unassigned       -
  1.24               -          High        Cheryl Jennings  -
openstack-installer  Confirmed  High        Unassigned       -

Bug Description

Per bug 1436415 and bug 1441319, juju 1.23 on vivid cannot do a trivial deployment because it cannot clone templates to make the containers.

machines:
  "0":
    agent-state: started
    agent-version: 1.23-beta4.1
    dns-name: localhost
    instance-id: localhost
    series: vivid
    state-server-member-status: has-vote
  "1":
    agent-state-info: 'failed to retrieve the template to clone: template container
      "juju-vivid-lxc-template" did not stop'
    instance-id: pending
    series: vivid
  "2":
    agent-state-info: 'lxc container cloning failed: cannot clone a running container'
    instance-id: pending
    series: vivid

All the logs for several tries are available for local-deploy-vivid-amd64 (non-voting) at
     http://reports.vapour.ws/releases/2523
^ The console log and the machine logs are present.
Note the test was made non-voting because vivid became very unstable when it switched to systemd, but Ubuntu will require this test to pass to accept this in Vivid. So Core or Ubuntu need to solve whatever is broken. Vivid's lxc has gotten updates, so we know there were bugs in lxc that needed fixing.

But as of a few commits ago, master (1.24) can! It has passed twice without human intervention. 1.23 has some of the commits just added to master, but not all. A possible fix may be in
Commit d9fe120 Merge pull request #2033 from tasdomas/uniter-send-metric-batches …
Commit cd3494e Merge pull request #2016 from cherylj/env-users …

ADDENDUM:
We now understand that Juju cannot create vivid containers because an upstart script is used to shut the container down. We don't know why master passed, but human intervention could have been a factor. Once a template container is made, juju can make additional vivid containers.

Juju can deploy trusty containers (for trusty charms). The principal scenario for developing and testing charms works.

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Curtis Hovey (sinzui) wrote :

Attached is /var/lib/juju/containers/juju-vivid-lxc-template/container.log from the machine. It is 120M uncompressed.

Cheryl Jennings (cherylj) wrote :

I know for certain that my commit (#2016) wouldn't affect this problem, and I don't believe #2033 would either. Going to take a look at the lxc log to see if I can get any more information.

Curtis Hovey (sinzui) wrote :

I have good news. beta4 mostly works on vivid. It cannot complete the vivid template creation or destroy-environment, but it can deploy trusty and precise charms.

The BAD:
1. beta4 says '"juju-vivid-lxc-template" did not stop'. Trying to deploy a vivid charm (explicitly or implicitly) will fail.

2. Destroy environment cannot shut down mongod, preventing any subsequent bootstraps (without human intervention):
    ERROR while stopping mongod: fork/exec /sbin/initctl: no such file or directory
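
The initctl failure above would be expected on a host that booted with systemd, since /sbin/initctl is upstart's control tool. A minimal sketch of a check for the running init system follows. detect_init is a hypothetical helper, not something juju provides; testing for /run/systemd/system is the conventional way to tell that systemd is in charge.

```shell
#!/bin/sh
# detect_init is a hypothetical helper for illustration; it is not part of juju.
detect_init() {
    if [ -d /run/systemd/system ]; then
        # This directory exists only when systemd is the running init system.
        echo systemd
    elif [ -x /sbin/initctl ]; then
        # initctl is upstart's control tool.
        echo upstart
    else
        echo unknown
    fi
}

detect_init
```

On a vivid host this should report systemd, which is why a stop command issued through /sbin/initctl cannot work there.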

The UGLY

1. A human can complete what juju doesn't and get a working setup.
    sudo lxc-stop juju-vivid-lxc-template
    juju destroy-environment local
    sudo killall -ABRT mongod
    juju bootstrap
    juju deploy <charm>

2. After running juju destroy-environment, run
    sudo killall -ABRT mongod

The GOOD

1. By setting "default-series: trusty" or "precise" in environments.yaml, users can remove ambiguous situations where juju will try to deploy a vivid charm. Users will need to type "vivid" and presumably know they will need to do some workarounds. Trusty charms will just work, but destroy-environment is still broken.
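
A sketch of the setting described in item 1, assuming the usual juju 1.x environments.yaml layout. It writes an example file in the current directory rather than touching ~/.juju/environments.yaml; the file name and the keys other than default-series are illustrative.

```shell
#!/bin/sh
# Illustrative output path; the real file normally lives at ~/.juju/environments.yaml.
conf=./environments.yaml.example

cat > "$conf" <<'EOF'
default: local
environments:
  local:
    type: local
    # Charms deployed without an explicit series will use trusty.
    default-series: trusty
EOF

cat "$conf"
```

With this in place, a vivid charm is only attempted when the user names the series explicitly.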

Curtis Hovey (sinzui) wrote :

I have updated CI to NOT delete the template containers, to contrive deployments so that we can see the destroy-environment issue.

Curtis Hovey (sinzui) wrote :

Tim, using a newly built 1.23, and I, using the last built 1.23, cannot reproduce the destroy-environment error that I saw earlier today. Maybe the machine was dirty from earlier testing, such as trying to deploy vivid charms. 1.23 is suitable to do this:

juju init
juju switch local
juju bootstrap
juju deploy ubuntu
juju status
juju destroy-environment

The ubuntu charm is implicitly trusty, and juju can create trusty template containers. This case matches what Ubuntu will test.

The only remaining issue that must be solved soon is ensuring juju can create vivid template containers.

Curtis Hovey (sinzui) on 2015-04-10
description: updated
summary: - 1.23 cannot deploy on vivid, but master can
+ Juju cannot create vivid containers
Changed in juju-core:
milestone: 1.23.0 → 1.24-alpha1
John A Meinel (jameinel) wrote :

We have seen this in the wild on Trusty when a container hangs trying to shut down. We think that case was because of I/O blocking preventing the container from shutting down for more than 5 minutes.

There are other possibilities where we're using the init system (upstart vs systemd) to issue the cleanup and shutdown commands, rather than just running them at the end of cloud-init. I'm not sure why we need a service to shut down cleanly.

Curtis Hovey (sinzui) on 2015-04-27
Changed in juju-core:
milestone: 1.24-alpha1 → none
milestone: none → 1.24.0
Curtis Hovey (sinzui) on 2015-04-27
Changed in juju-core:
milestone: 1.24.0 → 1.25.0
Curtis Hovey (sinzui) on 2015-05-04
no longer affects: juju-core/1.23
Adam Stokes (adam-stokes) wrote :

here is my status output:

http://paste.ubuntu.com/11132718/

Changed in cloud-installer:
status: New → Confirmed
importance: Undecided → High
tags: added: cloud-installer
Adam Stokes (adam-stokes) wrote :

Just FYI, we're blocked on this for getting nclxd support added into our installer.

Thanks!

Cheryl Jennings (cherylj) wrote :

Attempting to recreate locally.

Cheryl Jennings (cherylj) wrote :

Was able to recreate, and I think this is due to problems in the cloud-init script for vivid. I see this error in the console.log for the vivid container:

[ 4429.946677] cloud-init[7627]: + /bin/systemctl link /var/lib/juju/init/juju-template-restart/juju-template-restart.service
[ 4429.950535] cloud-init[7627]: Created symlink from /etc/systemd/system/juju-template-restart.service to /var/lib/juju/init/juju-template-restart/juju-template-restart.service.
[ 4430.388097] cloud-init[7627]: + /bin/systemctl daemon-reload
[ 4430.723843] cloud-init[7627]: + /bin/systemctl enable /var/lib/juju/init/juju-template-restart/juju-template-restart.service
[ 4431.081232] cloud-init[7627]: The unit files have no [Install] section. They are not meant to be enabled
[ 4431.081770] cloud-init[7627]: using systemctl.
[ 4431.084314] cloud-init[7627]: Possible reasons for having this kind of units are:
[ 4431.084638] cloud-init[7627]: 1) A unit may be statically enabled by being symlinked from another unit's
[ 4431.084958] cloud-init[7627]: .wants/ or .requires/ directory.
[ 4431.085327] cloud-init[7627]: 2) A unit's purpose may be to act as a helper for some other unit which has
[ 4431.085642] cloud-init[7627]: a requirement dependency on it.
[ 4431.086912] cloud-init[7627]: 3) A unit may be started when needed via activation (socket, path, timer,
[ 4431.087249] cloud-init[7627]: D-Bus, udev, scripted systemctl call, ...).

Going to investigate what this unit file should look like.

Cheryl Jennings (cherylj) wrote :

Worked with Eric Snow and added the [Install] section, even though this is a transient config, which eliminated the above error. Even with that change, the container still did not stop. Looking at the container itself, I see that the juju-template-restart service is still loaded, but is inactive and dead. Digging into the syslog for the container, I see a new error:

May 15 19:29:36 juju-vivid-lxc-template cloud-init[7541]: + /bin/systemctl enable /var/lib/juju/init/juju-template-restart/juju-template-restart.service
May 15 19:29:36 juju-vivid-lxc-template cloud-init[7541]: Created symlink from /etc/systemd/system/multi-user.target.wants/juju-template-restart.service to /var/lib/juju/init/juju-template-restart/juju-template-restart.service.
May 15 19:29:36 juju-vivid-lxc-template systemd[1]: message repeated 2 times: [ Reloading.]
May 15 19:29:36 juju-vivid-lxc-template systemd[1]: [/var/lib/juju/init/juju-template-restart/juju-template-restart.service:6] Failed to add dependency on cloud-final, ignoring: Invalid argument
May 15 19:29:36 juju-vivid-lxc-template systemd[1]: [/var/lib/juju/init/juju-template-restart/juju-template-restart.service:7] Failed to add dependency on cloud-final, ignoring: Invalid argument

A quick google search shows that maybe we need to add the dependency on cloud-final.service. Going to give that a go and see what happens.

Cheryl Jennings (cherylj) wrote :

I'm quite certain I have a fix for this now, but I want to check with ericsnow to do some sanity checking before committing.

The short version is that we need to alter our cloud-config to properly use systemd to halt the container once cloud-init completes.

There were a couple of issues with the current config, but basically the unit file and its handling should change to:
1 - Include an [Install] section, even if this is a transient service.
2 - Not specify "Conflicts" when we want to stop after some other service completes. Using "Conflicts" will actually kill the service we're trying to run after if we start while it is still running.
3 - Change the "After" to be the cloud-config.target, rather than cloud-final(.service). See the cloud-config.target file for more info: https://github.com/stackforge/cloud-init/blob/master/inits/systemd/cloud-config.target
4 - After we enable the juju-template-restart service, we need to explicitly start it. But since we've added the cloud-config.target, we won't halt the system until after cloud-init completes.
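
The four changes above can be sketched as a unit file plus the commands that install it. This is an illustration under stated assumptions, not the unit juju generates: the ExecStart command, the Description text, and the local output directory are made up for the example, while the service name, WantedBy target, and systemctl steps follow the logs quoted earlier in this thread.

```shell
#!/bin/sh
# Sketch only. Writes to a local example directory instead of
# /var/lib/juju/init/juju-template-restart/ so it can be run safely.
unit_dir=./juju-template-restart-example
unit="$unit_dir/juju-template-restart.service"
mkdir -p "$unit_dir"

cat > "$unit" <<'EOF'
[Unit]
Description=Halt the template container after cloud-init (illustrative)
# Item 3: order after cloud-config.target rather than cloud-final.
# Item 2: note that there is no Conflicts= line.
After=cloud-config.target

[Service]
Type=oneshot
# Assumed shutdown command; the real service also removes itself.
ExecStart=/sbin/shutdown -h now

[Install]
# Item 1: an [Install] section so that "systemctl enable" succeeds.
WantedBy=multi-user.target
EOF

# Item 4: inside the container, the cloud-init script would then run
# something like the following (shown as comments, not executed here):
#   /bin/systemctl link <path to the unit file>
#   /bin/systemctl daemon-reload
#   /bin/systemctl enable <path to the unit file>
#   /bin/systemctl start juju-template-restart.service
cat "$unit"
```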

After making the modifications above, I was able to deploy a trivial vivid charm successfully. I will get the code changes in tomorrow once I've chatted with Eric.

Basically, the issue is that we want to use systemd to start a service that will halt the container once cloud-init completes. As part of this service, we want it to remove itself such that other containers started from the template we're creating don't just halt once they start up.

Cheryl Jennings (cherylj) wrote :

Gah, didn't mean to include that last paragraph in the previous comment...

Eric Snow (ericsnowcurrently) wrote :

@Cheryl, the solution you've outlined sounds correct. It may be worth asking smoser about it.

Eric Snow (ericsnowcurrently) wrote :

For the record, using the init system to reboot at the end of cloud-init feels like a hack to me. It is certainly fragile, as this bug attests. However, it may still be the best solution. Furthermore, I don't have any better alternatives to offer.

Cheryl Jennings (cherylj) wrote :

Turns out all this hand waving with systemd may be unneeded. Cloud-config has an option to halt a system after cloud-init completes. Going to give that a try and see if it works for our use case.

Cheryl Jennings (cherylj) wrote :

Using power_state in the cloud-config did power off the system as expected. Testing this change with precise and trusty.
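
For reference, a sketch of what a power_state stanza in cloud-config can look like. The thread does not show the exact config that was tested, so the mode, message, and timeout values here are assumptions, and the output file name is illustrative.

```shell
#!/bin/sh
# Illustrative file name; values below are assumptions, not juju's config.
out=./power-state.cloud-config.example

cat > "$out" <<'EOF'
#cloud-config
power_state:
  # Power the machine off once cloud-init has finished.
  mode: poweroff
  message: template setup complete
  timeout: 30
EOF

cat "$out"
```

This relies on the cloud-init version inside the container supporting the power_state module.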

Cheryl Jennings (cherylj) wrote :

Sent an email to smoser to find out if we'll ever be in a situation where the cloud-init version we're using doesn't support power_state, and if we'll be able to determine that when we're generating our cloud-config.

Cheryl Jennings (cherylj) wrote :

After talking with thumper, I'll be making the changes needed to get the systemd logic working properly. The power_state option in cloud-config is not guaranteed to be present in all cases, so we'd have to inject some os/version logic to determine whether it is available, and that logic would need to be updated every time a new os/version needed lxc support in juju.

Cheryl Jennings (cherylj) wrote :

I have a patch up for review for 1.24: http://reviews.vapour.ws/r/1789/

Changed in juju-core:
status: Triaged → In Progress
Curtis Hovey (sinzui) wrote :

We do not intend to back port the fix to 1.23.4 because 1.24.0 is scheduled for proposed this week. If there is a future change in plans, we will need to back port.

Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2015-06-25
Changed in juju-core:
status: Fix Committed → Fix Released