juju 2 unavailable after bootstrap - possible infinite recursion loop

Bug #1635464 reported by Brad Marshall
This bug affects 3 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Dimiter Naydenov

Bug Description

Today I hit an odd bug with juju 2 and MAAS 2 on Xenial: after bootstrapping, juju status looks fine, but I'm unable to deploy anything. The error messages are a little strange; most commands fail with:

  ERROR upgrade in progress (upgrade in progress)

Looking at the machine-0 logs, I see the following:

2016/10/21 01:08:23 http: TLS handshake error from x.y.z.42:52912: remote error: bad certificate
runtime: goroutine stack exceeds 1000000000-byte limit
fatal error: stack overflow

followed by a stack trace thousands of lines long (which I'll attach). I had a quick discussion with wallyworld about it, and it appears there's an infinite recursion somewhere.

Possibly the only interesting thing about the deployment is that the bootstrap nodes have multiple NICs. One theory discussed was that the combination of multiple NICs and spaces might be missing test coverage.

$ dpkg-query -W juju
juju 1:2.0.0-0ubuntu1~16.04.2~juju1

$ dpkg-query -W maas
maas 2.0.0+bzr5189-0ubuntu1~16.04.1

$ lsb_release -d
Description: Ubuntu 16.04.1 LTS

Please let us know if you need any further information.

Brad Marshall (brad-marshall) wrote :

Stack trace from machine-0

Brad Marshall (brad-marshall) wrote :

To clarify a bit more about the multiple NICs, all 4 have connectivity, but once deployed only one has an IP address allocated to it. Please let me know if you need any more information about the setup.

Ian Booth (wallyworld)
Changed in juju:
milestone: none → 2.0.1
importance: Undecided → High
status: New → Triaged
assignee: nobody → Richard Harding (rharding)
Ian Booth (wallyworld)
Changed in juju:
importance: High → Critical
Brad Marshall (brad-marshall) wrote :

This has recurred after a redeploy and appears to be blocking deployments with juju 2. I'd appreciate it if we could investigate as a matter of urgency.

tags: added: eda
tags: added: gap
Changed in juju:
status: Triaged → In Progress
assignee: Richard Harding (rharding) → Dimiter Naydenov (dimitern)
Dimiter Naydenov (dimitern) wrote :

Even though I couldn't reproduce this normally by bootstrapping, the culprit is pretty obviously in worker/peergrouper/shim.go:

func (st *stateShim) Space(name string) (SpaceReader, error) {
 return st.Space(name)
}

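stateShim embeds *state.State but also defines its own Space() method, so st.Space() resolves to the shim's method rather than the embedded one. The call above should have been st.State.Space(); as written, we get the infinite recursion.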
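For illustration, here is a minimal, self-contained sketch of the shadowing pattern and the qualified call that avoids it. The State, Space, and SpaceReader types below are simplified stand-ins invented for this example, not the real juju definitions, and the body of the shim method is an assumption about how the fix might look rather than the committed change.

```go
package main

import "fmt"

// Space and State are hypothetical stand-ins for the juju state types.
type Space struct{ name string }

func (s *Space) Name() string { return s.name }

type State struct{}

// Space is the method the shim is supposed to delegate to.
func (s *State) Space(name string) (*Space, error) {
	return &Space{name: name}, nil
}

// SpaceReader is a narrow, hypothetical interface the shim exposes.
type SpaceReader interface {
	Name() string
}

// stateShim embeds *State, so State's methods are promoted onto it
// unless the shim declares a method with the same name.
type stateShim struct {
	*State
}

// Space shadows the promoted State.Space. Writing st.Space(name) here
// would select this same method and recurse until the stack overflows.
// Qualifying the call with the embedded field selects State.Space.
func (st *stateShim) Space(name string) (SpaceReader, error) {
	space, err := st.State.Space(name)
	if err != nil {
		return nil, err
	}
	return space, nil
}

func main() {
	shim := &stateShim{State: &State{}}
	space, err := shim.Space("default")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(space.Name())
}
```

The Go compiler accepts the self-call without complaint because unconditional recursion is legal, which is why a bug like this only surfaces at runtime, once the code path is actually exercised.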

I've managed to reproduce this by bootstrapping on MAAS 2.1 and then using the mongo CLI to modify the controllers collection, setting MongoSpaceName="default" and MongoSpaceState="valid", which triggers the call to Space() and the recursion. I'll describe the QA steps to get to that state in the PR I'm about to propose.

Dimiter Naydenov (dimitern) wrote :
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released