MAAS disconnect treated as fatal during bootstrap

Bug #1753595 reported by Ashley Lai
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Fix Released
High
Eric Claude Jones

Bug Description

juju may not wait long enough for the bootstrap node to become 'deployed'.

https://solutions.qa.canonical.com/#/qa/testRun/759a4e7a-5a46-4574-bfb5-b218c75f01c5

19:53:41 INFO juju.environs.sync sync.go:333 using official agent binary 2.3.5-xenial-amd64 (28053kB)
19:53:42 INFO cmd bootstrap.go:389 Starting new instance for initial controller
19:53:42 INFO cmd bootstrap.go:151 Launching controller instance(s) on foundations-maas...
19:53:43 DEBUG juju.cloudconfig.instancecfg instancecfg.go:835 Setting numa ctl preference to false
19:53:44 DEBUG juju.service discovery.go:63 discovered init system "systemd" from series "xenial"
19:53:44 DEBUG juju.provider.maas environ.go:1019 maas user data; 3832 bytes
19:53:46 DEBUG juju.provider.maas environ.go:1051 started instance "4cfdbq"
19:53:46 INFO cmd bootstrap.go:225 - 4cfdbq (arch=amd64 mem=32G cores=8)
19:53:46 INFO juju.environs.bootstrap bootstrap.go:638 newest version: 2.3.5
19:53:46 INFO juju.environs.bootstrap bootstrap.go:653 picked bootstrap agent binary version: 2.3.5
19:53:46 INFO cmd bootstrap.go:425 Installing Juju agent on bootstrap instance
19:53:47 INFO cmd bootstrap.go:517 Fetching Juju GUI 2.12.1
19:57:17 ERROR juju.cmd.juju.commands bootstrap.go:528 failed to bootstrap model: bootstrap instance started but did not change to Deployed state: getting instance "4cfdbq": unexpected: Get http://10.244.40.33/MAAS/api/2.0/machines/?agent_name=10c266c3-9978-43a4-8a64-b9dea0e790f5&id=4cfdbq: dial tcp 10.244.40.33:80: i/o timeout
19:57:17 DEBUG juju.cmd.juju.commands bootstrap.go:529 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:620: failed to bootstrap model} {github.com/juju/juju/provider/maas/environ.go:185: bootstrap instance started but did not change to Deployed state} {github.com/juju/juju/provider/maas/environ.go:1185: } {github.com/juju/juju/provider/maas/environ.go:1750: getting instance "4cfdbq"} {github.com/juju/juju/provider/maas/environ.go:1541: } {github.com/juju/gomaasapi/errors.go:41: } {github.com/juju/gomaasapi/controller.go:811: } {github.com/juju/gomaasapi/controller.go:835: } {Get http://10.244.40.33/MAAS/api/2.0/machines/?agent_name=10c266c3-9978-43a4-8a64-b9dea0e790f5&id=4cfdbq: dial tcp 10.244.40.33:80: i/o timeout}])

maas log:
Mar 5 19:53:45 swoobat maas.power: [info] Changing power state (on) of node: juju-3 (4cfdbq)
Mar 5 19:53:50 swoobat maas.power: [info] Changed power state (on) of node: juju-3 (4cfdbq)
Mar 5 19:54:02 swoobat maas.interface: [info] eno1 (physical) on swoobat: New MAC, IP binding observed: 52:54:00:6f:5d:37, 10.244.40.219
Mar 5 19:55:09 swoobat maas.rpc.rackcontrollers: [warn] leafeon: Detected an external DHCP server on an interface with no VLAN defined: 'eno49 (physical) on leafeon': 10.245.208.5
Mar 5 19:58:12 swoobat maas.node: [info] juju-3: Status transition from DEPLOYING to DEPLOYED

Revision history for this message
Ashley Lai (alai) wrote :
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1753595] Re: juju bootstrap not seeing node in deployed state

Juju should have a default bootstrap timeout of 1200 seconds (AFAICT). That
doesn't seem to match what you're seeing here, but I know it can be
configured with
juju bootstrap --config bootstrap-timeout=SECONDS

Is it possible that a value was being passed. What is the total time from
the start of the process?

Actually, looking at the Juju log, it was something else:

19:57:17 ERROR juju.cmd.juju.commands bootstrap.go:528 failed to bootstrap
model: bootstrap instance started but did not change to Deployed state:
getting instance "4cfdbq": unexpected: Get http://10.244.40.33/MAAS/api/
2.0/machines/?agent_name=10c266c3-9978-43a4-8a64-b9dea0e790f5&id=4cfdbq:
dial tcp 10.244.40.33:80 <http://10.244.40.33/>: i/o timeout

^- That says that while we were polling MAAS to see if the machine was
ready, MAAS actually refused our connection request. So we failed to
contact maas to get updated status information on the node that we were
deploying.

Do you have any idea why we would be getting a TCP CONNECT failure talking
to MAAS during this time?

On Mon, Mar 5, 2018 at 11:49 PM, Ashley Lai <email address hidden>
wrote:

> ** Attachment added: "infra-logs.tar"
> https://bugs.launchpad.net/juju/+bug/1753595/+attachment/
> 5070257/+files/infra-logs.tar
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1753595
>
> Title:
> juju bootstrap not seeing node in deployed state
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1753595/+subscriptions
>

Revision history for this message
Ashley Lai (alai) wrote : Re: juju bootstrap not seeing node in deployed state

We passed in bootstrap-timeout=1800

From start from finish, it is less than 5 minutes. It is possible that the network went down when juju sent a request to maas. I'll keep an eye on our future runs and see if this problem comes up again.

19:53:27 DEBUG juju.cmd supercommand.go:57 args: []string{"/snap/juju/3614/bin/juju", "bootstrap", "--debug", "--config", "bootstrap-timeout=1800", "foundations-maas", "foundations-maas", "--bootstrap-constraints", "tags=juju", "--to", "juju-3"}
...
19:57:17 ERROR juju.cmd.juju.commands bootstrap.go:528 failed to bootstrap model: bootstrap instance started but did not change to Deployed state: getting instance "4cfdbq": unexpected: Get http://10.244.40.33/MAAS/api/2.0/machines/?agent_name=10c266c3-9978-43a4-8a64-b9dea0e790f5&id=4cfdbq: dial tcp 10.244.40.33:80: i/o timeout

John A Meinel (jameinel)
Changed in juju:
importance: Undecided → High
status: New → Incomplete
summary: - juju bootstrap not seeing node in deployed state
+ MAAS disconnect treated as fatal during bootstrap
Revision history for this message
Ashley Lai (alai) wrote :

We hit this issue again. It looks like juju destroys the model while juju-3 node is deploying. Not sure why juju not able to talk to maas but the log shows maas is up and the node is deploying.

maas.log on 10.244.40.31:
Mar 13 05:32:12 meinfoo maas.node: [info] juju-3: Status transition from READY to ALLOCATED
Mar 13 05:32:12 meinfoo maas.node: [info] juju-3: allocated to user root
Mar 13 05:32:14 meinfoo maas.interface: [info] Allocated automatic IP address 10.244.40.200 for ens6 (physical) on juju-3.
Mar 13 05:32:14 meinfoo maas.node: [info] juju-3: Status transition from ALLOCATED to DEPLOYING
Mar 13 05:32:14 meinfoo maas.power: [info] Changing power state (on) of node: juju-3 (b4crkm)
Mar 13 05:32:19 meinfoo maas.power: [info] Changed power state (on) of node: juju-3 (b4crkm)
Mar 13 05:32:32 meinfoo maas.interface: [info] eno1 (physical) on meinfoo: New MAC, IP binding observed: 52:54:00:f7:17:0f, 10.244.40.200
Mar 13 05:33:56 meinfoo maas.node: [info] juju-3: Releasing node
Mar 13 05:33:56 meinfoo maas.node: [info] juju-3: Status transition from DEPLOYING to RELEASING

juju output:
05:31:56 DEBUG juju.cmd supercommand.go:57 args: []string{"/snap/juju/3686/bin/juju", "bootstrap", "--debug", "--config", "bootstrap-timeout=1800", "foundations-maas", "foundations-maas", "--bootstrap-constraints", "tags=juju", "--to", "juju-3"}
...
05:32:16 INFO cmd bootstrap.go:517 Fetching Juju GUI 2.12.1
05:33:56 ERROR juju.cmd.juju.commands bootstrap.go:528 failed to bootstrap model: bootstrap instance started but did not change to Deployed state: getting instance "b4crkm": unexpected: Get http://10.244.40.33/MAAS/api/2.0/machines/?agent_name=5566d620-9660-4a30-8323-fc53fc57c218&id=b4crkm: dial tcp 10.244.40.33:80: i/o timeout
05:33:56 DEBUG juju.cmd.juju.commands bootstrap.go:529 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:620: failed to bootstrap model} {github.com/juju/juju/provider/maas/environ.go:185: bootstrap instance started but did not change to Deployed state} {github.com/juju/juju/provider/maas/environ.go:1185: } {github.com/juju/juju/provider/maas/environ.go:1750: getting instance "b4crkm"} {github.com/juju/juju/provider/maas/environ.go:1541: } {github.com/juju/gomaasapi/errors.go:41: } {github.com/juju/gomaasapi/controller.go:811: } {github.com/juju/gomaasapi/controller.go:835: } {Get http://10.244.40.33/MAAS/api/2.0/machines/?agent_name=5566d620-9660-4a30-8323-fc53fc57c218&id=b4crkm: dial tcp 10.244.40.33:80: i/o timeout}])
05:33:56 DEBUG juju.cmd.juju.commands bootstrap.go:1126 cleaning up after failed bootstrap
05:33:56 INFO juju.provider.common destroy.go:20 destroying model "controller"
05:33:56 INFO juju.provider.common destroy.go:31 destroying instances
05:33:57 INFO juju.provider.common destroy.go:55 destroying storage
05:33:57 INFO cmd supercommand.go:465 command finished

https://solutions.qa.canonical.com/#/qa/testRun/7e4bb7a4-8898-4855-b162-9991c7048668

Changed in juju:
status: Incomplete → New
Revision history for this message
Chris Gregan (cgregan) wrote :

escalated to field high

Tim Penhey (thumper)
Changed in juju:
status: New → Triaged
Changed in juju:
assignee: nobody → Eric Claude Jones (ecjones)
Changed in juju:
status: Triaged → In Progress
Revision history for this message
Eric Claude Jones (ecjones) wrote :
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
milestone: none → 2.3.5
Revision history for this message
Eric Claude Jones (ecjones) wrote :
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.