juju userdata should not restart networking

Reported by Kent Baxley on 2013-11-05
This bug affects 5 people
Affects (importance, assigned to):
MAAS: Undecided, Unassigned
juju-core: High, Andrew Wilkins
dbus (Ubuntu): Low, Unassigned
juju-core (Ubuntu): status tracked in Trusty
  Saucy: High, Unassigned
  Trusty: High, Unassigned

Bug Description

[Impact]

juju fails to deploy charms on Saucy.

[Steps to Reproduce]

On Saucy:

root@foo:~# dpkg-query -W dbus
dbus 1.6.12-0ubuntu10
root@foo:~# status dbus
dbus start/running, process 817
root@foo:~# service networking restart
networking stop/waiting
networking start/running
root@foo:~# status dbus
dbus stop/waiting

Verified on amd64. I believe this also affects armhf.

[Analysis]

Either:

1) dbus shouldn't fail when networking is restarted; or
2) juju should not restart networking in this way as a runcmd in cloud-init userdata, and then expect dbus to work.

The relevant part of juju's userdata is:

- "cat > /etc/network/eth0.config << EOF\niface eth0 inet manual\n\nauto br0\niface
  br0 inet dhcp\n bridge_ports eth0\nEOF\n"
- sed -i "s/iface eth0 inet dhcp/source \/etc\/network\/eth0.config/" /etc/network/interfaces
- service networking restart

I assume juju is restarting networking to bring up the bridge that it has just configured; in that case, an "ifdown eth0" followed by "ifup" calls as required should suffice. There is no need to restart everything (e.g. lo), which I [rbasak] presume is causing the issue.
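A minimal sketch of the targeted approach, assuming the stock ifupdown tools and the same eth0/br0 names as juju's userdata. The function name setup_bridge, its two path arguments, and the RUN variable (a command prefix such as "echo" for dry runs) are hypothetical and not part of juju. eth0 is taken down before the interfaces file is edited, so that ifdown still sees the stanza the interface was brought up with.

```shell
# Sketch only: bring up br0 over eth0 without "service networking restart".
# setup_bridge, its arguments and RUN are hypothetical, for illustration.
setup_bridge() {
    interfaces=${1:-/etc/network/interfaces}
    bridge_config=${2:-/etc/network/eth0.config}

    # Take eth0 down while its old stanza is still in the interfaces file.
    ${RUN:-} ifdown eth0 || return 1

    # Same bridge stanza that juju's userdata writes.
    cat > "$bridge_config" << EOF
iface eth0 inet manual

auto br0
iface br0 inet dhcp
 bridge_ports eth0
EOF

    # Replace the eth0 DHCP stanza with a "source" line for the new file.
    sed "s|iface eth0 inet dhcp|source $bridge_config|" "$interfaces" \
        > "$interfaces.new" && mv "$interfaces.new" "$interfaces" || return 1

    # Bring up only the bridge; lo and other interfaces are left alone.
    ${RUN:-} ifup br0
}
```

With RUN set to "echo" and the two arguments pointing at copies of the files, the function only prints the ifdown/ifup commands and edits the copies, which makes it possible to inspect the result before touching a live node.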

Note that dbus appears to silently fail; if it is going to stop, it should do so with a sensible log message.

[Original Description]

I'm running a MAAS and Juju environment using 64-bit Ubuntu Server 13.10 for the MAAS node as well as all juju nodes, including the bootstrap node.

I can't seem to deploy any charms to the nodes without the services getting stuck in a 'pending' state. The problem seems to be that upon charm deployment dbus for some reason is stopped, and thus the whole process of deploying the charm gets stuck.

It is almost verbatim the problem someone appears to have hit on raring, too:

http://askubuntu.com/questions/364714/juju-deploy-of-charm-mysql-in-maas-provider-failing-after-successful-bootstrap

For me, however, it's not just mysql...it's any service I try to deploy.

juju release: 1.16.2-0ubuntu1~ubuntu13.10.1~juju1
maas version: 1.4+bzr1693+dfsg-0
Ubuntu OS: 13.10, 64-bit for everything

Steps to reproduce:

1) Set up a maas server using 13.10 with all the latest updates.
2) Install juju-core from the stable ppa (1.16.2 in this case).
3) Enlist and commission some nodes.
4) Download some charms locally (I'm in a semi-restricted network environment, so it's easier for me to pull down the charms locally). For example, bzr branch lp:charms/rabbitmq-server
5) bootstrap the juju environment. Due to restrictions in my network I have to use "juju bootstrap --upload-tools". Firewall blocks the ability to run "juju sync-tools" beforehand.
6) Once the environment is bootstrapped, try to deploy a charm. This is what I used:

juju deploy --repository=charms/ local:rabbitmq-server --show-log

Actual results:

Installation of the OS goes fine, along with the cloud-init after reboot. I can also ssh into the node via juju.
The charm, however, stays in a 'pending' state long after it is deployed on the server.
ssh-ing into the node and looking at /var/log/juju/machine-x.log reveals that the deployment stage fails due to being unable to connect to the system bus.
This seems to prevent the charm from ever deploying.
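
The failure can be confirmed from the machine log with a plain grep. A minimal sketch: the function name is hypothetical, and the log path has to be supplied by the caller, since the report only gives it as /var/log/juju/machine-x.log.

```shell
# Print the lines of a juju machine log that report a missing system bus.
# Exits non-zero when no such line is found (grep's usual behaviour).
# dbus_errors_in_log is a hypothetical helper name.
dbus_errors_in_log() {
    grep -F 'Unable to connect to system bus' "$1"
}
```

A non-zero exit status means the log contains no such error, so the stuck 'pending' state probably has a different cause.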

Expected results:
OS installs and charm deploys without issue.

Workaround:
I can work around this by ssh-ing into the node and starting dbus by hand. I then go back and re-deploy the charm via juju using:

juju destroy-unit rabbitmq-server/0

juju destroy-service rabbitmq-server

juju deploy --to 1 --repository=charms/ local:rabbitmq-server --show-log

At that point everything starts working again. I can even deploy another charm onto the same machine without problems once dbus is up and going.
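
The "starting dbus by hand" step of the workaround can be sketched as below, assuming an upstart-based node where "status dbus" and "start dbus" are available and the script runs as root. The function name and the STATUS_CMD/START_CMD overrides are hypothetical; they exist only so the logic can be tried without touching a real service.

```shell
# Start dbus only if upstart does not already report it as running.
# STATUS_CMD and START_CMD are hypothetical overrides for dry runs.
ensure_dbus() {
    # A failing status command is treated the same as "not running".
    state=$(${STATUS_CMD:-status dbus} 2>/dev/null) || state=""
    case "$state" in
        *start/running*) echo "dbus already running" ;;
        *) ${START_CMD:-start dbus} ;;
    esac
}
```

This only revives dbus; the unit still has to be destroyed and redeployed with the juju commands above.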

I'm not sure why dbus is failing to start, and I can't tell whether dbus was running and then suddenly shut down prior to charm deployment.

I've attached a machine log from an attempt to deploy rabbitmq-server onto a bare metal node using 13.10. The same thing happens to any charm I try to deploy in this manner.

Let me know if you need anything else.

Kent Baxley (kentb) wrote :
Curtis Hovey (sinzui) on 2013-11-06
tags: added: deploy maas-provider
tags: added: dbus
Curtis Hovey (sinzui) wrote :

I think this is really a maas + lxc bug. I need to look into this further to understand where the fix really needs to be made.

Curtis Hovey (sinzui) on 2013-11-06
summary: - Juju deploy of Charm in MAAS provider failing after successful
- bootstrap. Juju status stuck in “Pending” state
+ Juju deploy of Charm in MAAS fails because dbus fails
affects: juju-core → lxc (Ubuntu)

Quoting Curtis Hovey (<email address hidden>):
> ** Summary changed:
>
> - Juju deploy of Charm in MAAS provider failing after successful bootstrap. Juju status stuck in “Pending” state
> + Juju deploy of Charm in MAAS fails because dbus fails
>

Based on what I see in the machine.log, it looks like dbus was not
installed. This is curious since AIUI juju uses the ubuntu-cloud
image. When I install a precise ubuntu-cloud container,
/var/run/dbus/system_bus_socket does exist. With ubuntu containers,
it does not until I 'apt-get install dbus.'

Could you please re-run this and, while the container is hung, log
in with 'sudo lxc-console -n <container-name>', and see whether the
dbus package is installed?

 status: incomplete

Changed in lxc (Ubuntu):
status: New → Incomplete
Serge Hallyn (serge-hallyn) wrote :

Sorry, I see now that you ssh in to start dbus, so obviously it's
installed.

 status: new

Changed in lxc (Ubuntu):
status: Incomplete → New

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in lxc (Ubuntu):
status: New → Confirmed
Jeff Marcom (jeffmarcom) wrote :

Adding juju-core back because the bug still does affect functionality of juju-core and needs to be tracked in juju-core project, regardless of where/how the fix is applied.

no longer affects: lxc
Changed in juju-core:
status: New → Confirmed
Chris Glass (tribaal) wrote :

I hit the same problem, and don't use LXC at all on my setup. I therefore will change this bug to point back to juju-core.

My MaaS lives in a KVM machine on my laptop.
It can PXE boot/start-stop other KVM machines living on the same NAT'ed network.
- "juju bootstrap --upload-tools" works (the node is spun up, and after the ubuntu install, "juju status" works).
- I can "juju deploy" any charm, however:
- The machines come up (the node boots, ubuntu gets installed), but "juju status" reports "pending" forever.

The machine that is brought up has the same symptoms as reported above:

The machine log in /var/log/juju shows:
2013-11-05 18:39:02 ERROR juju runner.go:211 worker: exited "deployer": exec ["start" "--system" "jujud-unit-rabbitmq-server-0"]: exit status 1 (start: Unable to connect to system bus: Failed to connect to socket /var/run/dbus/system_bus_socket: No such file or directory)

Running this command by hand works when run without "--system".

Once run, /var/log/juju contains a unit log:
http://pastebin.ubuntu.com/6371502/

The cloud-init log and cloud-init-output log are:
http://pastebin.ubuntu.com/6371297/
http://pastebin.ubuntu.com/6371310/

I further tried to fix the situation manually by symlinking "/var/lib/juju/tools/1.16.2.1-saucy-amd64/" to "/var/lib/juju/tools/unit-mysql-0", but to no avail.

Chris Glass (tribaal) wrote :

Changing back to juju-core since this affects me, and I'm not using LXC containers at all.

affects: lxc (Ubuntu) → juju-core (Ubuntu)
tags: added: landscape
Geoff Teale (tealeg) wrote :

I can confirm that I experience exactly the same issue using Juju with MAAS and nodes all under KVM.

Curtis Hovey (sinzui) on 2013-11-06
Changed in juju-core:
status: Confirmed → Triaged
importance: Undecided → High
affects: juju-core (Ubuntu) → maas
Changed in maas:
status: Confirmed → New

What release series are you deploying to? As in, what gets installed on the MAAS nodes? Does deploying to a different series help at all?

Changed in maas:
status: New → Incomplete
Kent Baxley (kentb) wrote :

Julian,

The series for all MAAS nodes being deployed is 13.10. I will check with an older release, but, there were also complaints about nodes being deployed with 13.04 also having this problem.

Geoff Teale (tealeg) wrote :

Julian, I've also been deploying 13.10. I can step back through the releases tomorrow when I'm at work if you'd like (I'm in Germany).

It would be very useful if you could deploy precise (12.04). It's what
we test with in the QA lab and there are no problems.

I suspect this is a server image issue, but let's confirm on that
release testing.

Thanks guys.

I do not hit this bug when using 12.04 nodes (even with a 13.10 MaaS server), so it seems likely to be a problem with the saucy server image indeed.
I'll do some more matrix testing with at least trusty and raring, to try to find where the regression was introduced.

What's the best place to assign this? It is likely that some users will run into this in the future, we should try to fix the saucy image.

Curtis Hovey (sinzui) wrote :

<andreas> ah, there is another difference between my env and tribaal's
<andreas> I used the fast path installer, which basically downloads the cloud image and dumps it on the nodes
<andreas> instead of going through the debian installer
<andreas> so if using fast-path is the big difference, that would explain why it works in the cloud too, because it's the same image.

Kent Baxley (kentb) wrote :

I can also confirm that using the fast path installer with 13.10 does not have any dbus failures associated with it. I can deploy a charm on my MAAS nodes (so far) without dbus crapping out on me.

Robie Basak (racb) wrote :

MAAS on its own does not deploy a stopped dbus service in my testing; it only happens when juju is involved. I've isolated this to "service networking restart" done by juju using cloud-init userdata. I've updated the description. I think a fix for juju would be straightforward: don't do that; use more targeted "ifdown" and "ifup" as required instead. Note that the "ifdown" should be called before changing /etc/network/interfaces.

However, I'm adding a dbus task, as it is arguable that dbus should not die in this case, and certainly not silently. A "Won't Fix" resolution for dbus dying might be acceptable, though, if it is not expected to survive all network interfaces being deconfigured. But a log message explaining this would be nice.

description: updated
Changed in maas:
status: Incomplete → Invalid
tags: added: midway
Robie Basak (racb) on 2013-11-26
Changed in juju-core (Ubuntu Trusty):
status: New → Triaged
importance: Undecided → High
Changed in juju-core (Ubuntu Saucy):
status: New → Triaged
importance: Undecided → High
no longer affects: dbus (Ubuntu Saucy)
no longer affects: dbus (Ubuntu Trusty)
tags: added: ubuntu-openstack
William Reade (fwereade) on 2013-11-28
Changed in juju-core:
milestone: none → 1.17.1
Martin Packman (gz) on 2014-01-23
Changed in juju-core:
milestone: 1.17.1 → 1.18.0
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dbus (Ubuntu):
status: New → Confirmed
Nicola Larosa (teknico) wrote :

I'm hitting this problem inside Precise KVM nodes. The bottom of cloud-init-output.log has this:

+ start jujud-machine-3
jujud-machine-3 start/running, process 2244
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/handlers.py", line 807, in emit
    self._connect_unixsocket(self.address)
  File "/usr/lib/python2.7/logging/handlers.py", line 745, in _connect_unixsocket
    self.socket.connect(address)
  File "/usr/lib/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
error: [Errno 111] Connection refused
Logged from file __init__.py, line 116

The MAAS KVM node is on Saucy.

Robie Basak (racb) wrote :

Setting dbus task to Low priority, for triage purposes. I think the primary fix that needs to happen here is in juju.

Changed in dbus (Ubuntu):
importance: Undecided → Low
Serge Hallyn (serge-hallyn) wrote :

@racb

I don't see why this has anything to do with juju.

ubuntu@cloud1:~$ sudo status dbus
dbus start/running, process 464
ubuntu@cloud1:~$ sudo restart networking
<4>init: Disconnected from system bus
<4>init: whoopsie main process ended, respawning
networking start/running
ubuntu@cloud1:~$ sudo status dbus
dbus stop/waiting

Serge Hallyn (serge-hallyn) wrote :

So quite simply,

1. dbus stops on "deconfiguring-networking"

2. restart networking issues deconfiguring-networking

3. dbus only starts on local-filesystems, so it never restarts.

Somehow dbus needs to be made to restart when networking restarts, or it simply should not shut down at deconfiguring-networking.
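
The two conditions can be read straight from the upstart job file. A minimal sketch, assuming the dbus job lives at /etc/init/dbus.conf (the usual location for upstart system jobs; the thread does not state the path). job_events is a hypothetical helper name, and it only catches single-line conditions.

```shell
# Print the "start on" / "stop on" lines of an upstart job file.
# Defaults to the dbus job; the default path is an assumption.
job_events() {
    grep -E '^(start|stop) on' "${1:-/etc/init/dbus.conf}"
}
```

If the output shows a stop event that "restart networking" emits but no matching start event, the job will stay in stop/waiting afterwards, as in the transcript above.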

summary: - Juju deploy of Charm in MAAS fails because dbus fails
+ dbus does not restart when 'restart networking' command is issued.

Ok, stgraber has set me straight - 'restart networking' is not supposed to be used. Use ifdown -a and ifup -a if needed.

summary: - dbus does not restart when 'restart networking' command is issued.
+ juju userdata should not restart networking
Serge Hallyn (serge-hallyn) wrote :

For the record,

#ubuntu-devel: 02/11/14 23:10 <stgraber> hallyn: so tl;dr, they should use ifdown eth0 => 'cat > /etc/network/eth0.conf << EOF\niface br0 inet dhcp\n bridge_ports eth0\nEOF\n" => ifup br0, that'll DTRT

Please make the appropriate changes to juju. I am marking this bug invalid for dbus.

Changed in dbus (Ubuntu):
status: Confirmed → Invalid
Martin Packman (gz) on 2014-03-19
Changed in juju-core:
milestone: 1.20.0 → 1.18.0
Andrew Wilkins (axwalk) on 2014-03-20
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Andrew Wilkins (axwalk)
milestone: 1.18.0 → 1.17.6
Andrew Wilkins (axwalk) on 2014-03-20
Changed in juju-core:
status: In Progress → Fix Committed
Curtis Hovey (sinzui) on 2014-03-20
Changed in juju-core:
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package juju-core - 1.17.6-0ubuntu1

---------------
juju-core (1.17.6-0ubuntu1) trusty; urgency=medium

  * New upstream point release, including fixes for:
    - br0 not bought up by cloud-init with MAAS provider (LP: #1271144).
    - ppc64el enablement for juju/lxc (LP: #1273769).
    - juju userdata should not restart networking (LP: #1248283).
    - error detecting hardware characteristics (LP: #1276909).
    - juju instances not including the default security group (LP: #1129720).
    - juju bootstrap does not honor https_proxy (LP: #1240260).
  * d/control,rules: Drop BD on bash-completion, install bash-completion
    direct from upstream source code.
  * d/rules: Set HOME prior to generating man pages.
  * d/control: Drop alternative dependency on mongodb-server; juju now only
    works on trusty with juju-mongodb.
 -- James Page <email address hidden> Mon, 24 Mar 2014 16:05:44 +0000

Changed in juju-core (Ubuntu Trusty):
status: Triaged → Fix Released