Bootstrap timeout message for MAAS can be improved

Bug #1523608 reported by Ryan Beisner
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Undecided
Unassigned
juju-core
Won't Fix
Undecided
Unassigned

Bug Description

(MAAS 1.9rc3 or 1.9rc2) + (Juju 1.25.1 or 1.25.0): bootstrap fails

Observable as:

Syslog:
Dec 7 18:13:48 fat-machine systemd[1]: Job sys-subsystem-net-devices-juju-br0.device/start timed out.
Dec 7 18:13:48 fat-machine systemd[1]: Timed out waiting for device sys-subsystem-net-devices-juju-br0.device.
Dec 7 18:13:48 fat-machine systemd[1]: Job sys-subsystem-net-devices-juju-br0.device/start failed with result 'timeout'.
Dec 7 18:13:48 fat-machine systemd[1]: Startup finished in 23.443s (kernel) + 1min 59.119s (userspace) = 2min 22.562s.

Juju output:
ERROR failed to bootstrap environment: bootstrap instance started but did not change to Deployed state: instance "/MAAS/api/1.0/nodes/node-d4692494-8228-11e4-8078-d4bed9a84493/" is started but not deployed

Illustrated @:
http://paste.ubuntu.com/13793514/

Impact:

Those with this combo of product will likely not be able to bootstrap or deploy units. Bare metal testing for OpenStack charms is blocked, as we cannot bootstrap.

Tags: uosci
Ryan Beisner (1chb1n)
description: updated
Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Andres Rodriguez (andreserl) wrote :

As per testing and debug, it seems that MAAS successfully deploys the machine. The problem seems to happen once the bridge is created. I'm adding Juju into the loop!

Changed in maas:
status: New → Incomplete
no longer affects: juju
no longer affects: juju (Ubuntu)
Revision history for this message
Ryan Beisner (1chb1n) wrote :

syslog on the unit shows:

Dec 7 18:13:48 fat-machine systemd[1]: Job sys-subsystem-net-devices-juju-br0.device/start timed out.
Dec 7 18:13:48 fat-machine systemd[1]: Timed out waiting for device sys-subsystem-net-devices-juju-br0.device.
Dec 7 18:13:48 fat-machine systemd[1]: Job sys-subsystem-net-devices-juju-br0.device/start failed with result 'timeout'.
Dec 7 18:13:48 fat-machine systemd[1]: Startup finished in 23.443s (kernel) + 1min 59.119s (userspace) = 2min 22.562s.

Ryan Beisner (1chb1n)
description: updated
Ryan Beisner (1chb1n)
description: updated
summary: - MAAS 1.9rc3 + Juju 1.25.0: bootstrap fails, unit has same IP assigned
- to eth0 and juju-br0
+ Timed out waiting for device sys-subsystem-net-devices-juju-br0.device
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.3
Revision history for this message
Andrew McDermott (frobware) wrote : Re: Timed out waiting for device sys-subsystem-net-devices-juju-br0.device

Please could you attach:

  /var/log/cloud-init-output.log
  /var/log/cloud-init.log

Revision history for this message
Andrew McDermott (frobware) wrote :

Which ubuntu series is being bootstrapped?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

juju bootstrap --debug output:
http://paste.ubuntu.com/13799412/

With corresponding logs ...

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
Andrew McDermott (frobware) wrote :

Thanks for all the logs.

Looking through cloud-init-output.log I see that only juju-br0 has an IP address assigned. Is this run from MAAS 1.9rc2 or rc3?

cloud-init-output.log:435

+ ifconfig -a
eth0 Link encap:Ethernet HWaddr 18:03:73:fb:d1:cd
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:1 overruns:0 frame:0
          TX packets:1 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:94 (94.0 B)

eth1 Link encap:Ethernet HWaddr 18:03:73:fb:d1:cf
          inet6 addr: fe80::1a03:73ff:fefb:d1cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:17 errors:0 dropped:0 overruns:0 frame:0
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1088 (1.0 KB) TX bytes:680 (680.0 B)

eth2 Link encap:Ethernet HWaddr 18:03:73:fb:d1:d1
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eth3 Link encap:Ethernet HWaddr 18:03:73:fb:d1:d3
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

juju-br0 Link encap:Ethernet HWaddr 18:03:73:fb:d1:cd
          inet addr:10.245.168.13 Bcast:10.245.175.255 Mask:255.255.248.0
          inet6 addr: fe80::1a03:73ff:fefb:d1cd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:516 (516.0 B) TX bytes:132 (132.0 B)

lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:65536 Metric:1
          RX packets:163 errors:0 dropped:0 overruns:0 frame:0
          TX packets:163 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:13226 (13.2 KB) TX bytes:13226 (13.2 KB)

Revision history for this message
Ryan Beisner (1chb1n) wrote :

@frobware - The logs were after reverting back to MAAS 1.9rc2, so the interface IP behavior is back to the old/expected way. The issue doesn't appear to be directly related to that behavior change, though that should be investigated too imho.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

To further clarify:

We see the same failure raised ("started but not deployed"), using any of these combos:

MAAS 1.9rc2 + Juju 1.25.0
MAAS 1.9rc3 + Juju 1.25.0
MAAS 1.9rc3 + Juju 1.25.1

Which is odd, because we were bootstrapping just fine on Thu Dec 3 with MAAS 1.9rc2 + Juju 1.25.0 in the same lab.

The logs above are Vivid.

When Vivid bootstrap fails, we can at least get back to the unit. When Trusty fails with the same error, we cannot access the node after that.

This all leads me to suspect the images or a delta in their packages, though I've not investigated there too deeply yet.

Revision history for this message
Andrew McDermott (frobware) wrote :

The combos you mention in #13 are all vivid based? Do you have the bandwidth to try the same against wily?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

This could be a thing, certainly something to look into:

<beisner> this just SRU'd and hit images. cloud-init 0.7.7~bzr1091-0ubuntu12
<beisner> last known good in our logs was cloud-init 0.7.7~bzr1091-0ubuntu11

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Ryan,

So I tested your environment, and these is what I find:

1.25.1-0ubuntu1~14.04.1~juju1

Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Ryan,

So I tested your environment, and these is what I find:

juju-core 1.25.1-0ubuntu1~14.04.1~juju1
maas 1.9.0~rc2+bzr4509-0ubuntu1~trusty1

and I successfully did a juju bootstrap:

jenkins@juju-osci-machine-13:~/lp1519527⟫ juju bootstrap --constraints="tags=uosci"

Bootstrapping environment "maas-vivid-jprop"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
 - /MAAS/api/1.0/nodes/node-d65842d0-8228-11e4-9105-d4bed9a84493/
Installing Juju agent on bootstrap instance
Waiting for address
Attempting to connect to cylindrical-base.dellstack:22
Attempting to connect to cylindrical-base.dellstack:22
Attempting to connect to 10.245.168.13:22
[...]
Installing package: tmux
Fetching tools: curl -sSfw 'tools from %{url_effective} downloaded: HTTP %{http_code}; time %{time_total}s; size %{size_downlo
ad} bytes; speed %{speed_download} bytes/s ' --retry 10 -o $bin/tools.tar.gz <[https://streams.canonical.com/juju/tools/propos
ed/juju-1.25.1-vivid-amd64.tgz]>
Bootstrapping Juju machine agent
Starting Juju machine agent (jujud-machine-0)
Bootstrap agent installed
Waiting for API to become available
Waiting for API to become available
Waiting for API to become available
Bootstrap complete

And it works just fine:

jenkins@juju-osci-machine-13:~/lp1519527⟫ juju status
environment: maas-vivid-jprop
machines:
  "0":
    agent-state: started
    agent-version: 1.25.1
    dns-name: cylindrical-base.dellstack
    instance-id: /MAAS/api/1.0/nodes/node-d65842d0-8228-11e4-9105-d4bed9a84493/
    series: vivid
    hardware: arch=amd64 cpu-cores=8 mem=49152M tags=node-d65842d0-8228-11e4-9105-d4bed9a84493,uosci,gateway
    state-server-member-status: has-vote
services: {}

So, from my perspective, this seems to work without any issues whatsoever. I have a question though, when you were deploying, did you do it manually or were you using juju deployer?

I'm marking this bug as Invalid for MAAS, as I was able to bootstrap successfully.

Changed in maas:
status: Incomplete → Invalid
Revision history for this message
Ryan Beisner (1chb1n) wrote :
Download full text (13.3 KiB)

We don't use the manual provider. This is observed (and reproduced again just now) via a simple `juju bootstrap`:

Tue Dec 8 13:00:37 UTC 2015
http://paste.ubuntu.com/13822321/

...

Tue Dec 8 12:42:03 UTC 2015
2015-12-08 12:42:04 INFO juju.cmd supercommand.go:37 running juju [1.25.1-trusty-amd64 gc]
2015-12-08 12:42:04 DEBUG juju.provider.maas environprovider.go:28 opening environment "maas-vivid-jprop".
2015-12-08 12:42:04 DEBUG juju.provider.maas environ.go:139 MAAS API capabilities: [devices-management ipv6-deployment-ubuntu network-deployment-ubuntu networks-management static-ipaddresses storage-deployment-ubuntu]
2015-12-08 12:42:04 DEBUG juju.environs.configstore disk.go:351 writing jenv file
2015-12-08 12:42:04 DEBUG juju.environs.configstore disk.go:475 writing jenv file to /var/lib/jenkins/.juju/environments/maas-vivid-jprop.jenv
2015-12-08 12:42:04 INFO juju.network network.go:242 setting prefer-ipv6 to false
2015-12-08 12:42:05 INFO juju.cmd cmd.go:129 Bootstrapping environment "maas-vivid-jprop"
2015-12-08 12:42:05 DEBUG juju.environs.bootstrap bootstrap.go:102 environment "maas-vivid-jprop" supports service/machine networks: true
2015-12-08 12:42:05 DEBUG juju.environs.bootstrap bootstrap.go:104 network management by juju enabled: true
2015-12-08 12:42:05 INFO juju.environs.bootstrap tools.go:77 looking for bootstrap tools: version=<nil>
2015-12-08 12:42:05 INFO juju.environs.tools tools.go:85 finding tools in stream "proposed"
2015-12-08 12:42:05 INFO juju.environs.tools tools.go:87 reading tools with major.minor version 1.25
2015-12-08 12:42:05 DEBUG juju.environs.tools urls.go:108 trying datasource "keystone catalog"
2015-12-08 12:42:05 DEBUG juju.environs.simplestreams simplestreams.go:659 using default candidate for content id "com.ubuntu.juju:proposed:tools" are {20151126 mirrors:1.0 content-download streams/v1/cpc-mirrors.sjson []}
2015-12-08 12:42:06 DEBUG juju.environs.simplestreams simplestreams.go:429 read metadata index at "https://streams.canonical.com/juju/tools/streams/v1/index2.sjson"
2015-12-08 12:42:07 DEBUG juju.environs.simplestreams simplestreams.go:968 metadata: &{map[com.ubuntu.juju:12.04:amd64:{ 1.21.0 amd64 map[20151123:0xc2101057e0]} com.ubuntu.juju:14.04:ppc64el:{ 1.21.0 ppc64el map[20151123:0xc2102de1e0]} com.ubuntu.juju:win81:amd64:{ 1.21.0 amd64 map[20151123:0xc2102ef4e0]} com.ubuntu.juju:14.04:amd64:{ 1.21.0 amd64 map[20151123:0xc210105b40]} com.ubuntu.juju:14.04:arm64:{ 1.21.0 arm64 map[20151123:0xc210105c60]} com.ubuntu.juju:14.10:arm64:{ 1.21.0 arm64 map[20151123:0xc2102de660]} com.ubuntu.juju:15.04:i386:{ 1.21.0 i386 map[20151123:0xc2102eb360]} com.ubuntu.juju:15.10:i386:{ 1.22.7 i386 map[20151123:0xc2102ec300]} com.ubuntu.juju:win2012hv:amd64:{ 1.21.0 amd64 map[20151123:0xc2102ece40]} com.ubuntu.juju:win2012hvr2:amd64:{ 1.21.0 amd64 map[20151123:0xc2102ef540]} com.ubuntu.juju:win2012r2:amd64:{ 1.21.0 amd64 map[20151123:0xc2102ef180]} com.ubuntu.juju:14.10:ppc64el:{ 1.21.0 ppc64el map[20151123:0xc2102ded80]} com.ubuntu.juju:win7:amd64:{ 1.21.0 amd64 map[20151123:0xc2102ef300]} com.ubuntu.juju:12.04:i386:{ 1.21.0 i386 map[20151123:0xc210105a20]} com.ubuntu.juju:14....

Revision history for this message
Ryan Beisner (1chb1n) wrote :

^ Exact cmd:

juju bootstrap --debug --constraints="tags=uosci"

Ryan Beisner (1chb1n)
description: updated
Revision history for this message
Ryan Beisner (1chb1n) wrote :

*AAAAA HAAAAA*

Juju's default bootstrap timeout is 600s (10 min). Before the SRU for bug 1177432 landed in images, we were able to bootstrap within 10 minutes in our lab (~9 min, so it was close anyway). FWIW, much of that time is in hardware POST.

With the enablement of multiverse, restricted and backports in cloud images (and the resultant MAAS image builds), our bootstrap time was pushed beyond that 10 min default timeout.

More background on that SRU and cloud image change here:
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1177432

The work-around for us was to increase Juju's bootstrap timeout via environments.yaml by added this to every environment definition:
    bootstrap-timeout: 900

It was not at all clear that the Juju bootstrap timeout threshold was being hit. I've raised Juju bug 1523608 to recommend better user feedback in the timeout error.

I believe all other indicators/hunches on this bug are symptomatic, and can likely be invalidated. I've not re-encountered the errors after increasing the bootstrap timeout value.

Changed in juju-core:
milestone: 1.25.3 → 1.25.4
summary: - Timed out waiting for device sys-subsystem-net-devices-juju-br0.device
+ Bootstrap timeout message for MAAS can be improved
Changed in juju-core:
milestone: 1.25.4 → 1.25.5
Changed in juju-core:
milestone: 1.25.5 → 1.25.6
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.6 → 1.25.7
Changed in juju-core:
status: Triaged → Won't Fix
importance: High → Undecided
milestone: 1.25.7 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.