MAAS provider bridge script on trusty does not handle LACP bonds

Bug #1594855 reported by Dimiter Naydenov
This bug affects 3 people
Affects         Status        Importance  Assigned to       Milestone
Canonical Juju  Fix Released  High        Andrew McDermott
juju-core       Fix Released  High        Andrew McDermott
1.25            Fix Released  High        Andrew McDermott

Bug Description

provider/maas: add-juju-bridge.py script needs better handling around bridging a LACP bond on trusty nodes.

On xenial nodes the script works, but on trusty a short delay turned out to be needed between the ifdown and ifup commands. This only affects the initial boot after deployment.

Related bugs:
https://bugs.launchpad.net/juju-core/+bug/1576674
https://bugs.launchpad.net/maas/+bug/1590689
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=742410
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=791906

The root cause appears to be a race condition in the ifenslave package that occurs when slave NICs are brought up and are waiting for the bond master to be initialized.
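The workaround described above (a short pause between ifdown and ifup) can be sketched roughly as follows. This is an illustration only, not the actual add-juju-bridge.py change: the function name restart_interfaces, the 3-second default delay, and the injectable run/sleep parameters are assumptions made for this example.

```python
import subprocess
import time


def restart_interfaces(interfaces, delay_seconds=3.0,
                       run=subprocess.check_call, sleep=time.sleep):
    """Bring interfaces down, pause, then bring them back up.

    The pause gives the bonding setup time to settle before ifup runs.
    delay_seconds is a guess; the bug report only says "a short delay".
    run and sleep are injectable so the sequence can be checked without
    touching real network configuration.
    """
    names = list(interfaces)
    for name in names:
        run(["ifdown", name])
    # Hypothetical delay value; tune for the hardware in question.
    sleep(delay_seconds)
    for name in names:
        run(["ifup", name])
```

Calling restart_interfaces(["bond0"]) as root would run ifdown bond0, wait, then run ifup bond0. Passing fake run and sleep callables makes it possible to verify the ordering in a test.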

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Related ifenslave bug describing the workaround: https://bugs.launchpad.net/ubuntu/+source/ifenslave/+bug/1269921

Changed in juju-core:
milestone: none → 2.0-beta10
Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Felipe Reyes (freyes)
tags: added: sts
Revision history for this message
Andrew McDermott (frobware) wrote :
Changed in juju-core:
assignee: Andrew McDermott (frobware) → nobody
tags: added: blocker
Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
Changed in juju-core:
status: Triaged → In Progress
Changed in juju-core:
status: In Progress → Fix Committed
tags: removed: blocker
Curtis Hovey (sinzui)
Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Hrvoje (hrvoje-habjanic) wrote :

Hi.

When can we expect update packages for trusty?

ppa:juju/stable still carries the old 1.25.5, and it references downloadable content by the 1.25.5 version ...

H.

Revision history for this message
Dimiter Naydenov (dimitern) wrote :

Hey, 1.25.5 is the latest stable release. This bug fix will be part of the upcoming 1.25.6 stable release (hopefully some time next week).

Revision history for this message
mahmoh (mahmoh) wrote :

Hi, I hit this problem in a customer test env with Juju 1.25.5 and MAAS 1.9.3 bzr4577, with nodes deploying trusty + wily-hwe & Liberty. eth0 was bonded, juju-br0 on top of the bond shared an IP with eth0, and the agents failed to come up and register 80% of the time. I tested 1.25.6 and the nodes came up immediately. I will make a couple more passes to re-verify, but the first run looks great. Thank you.

Revision history for this message
mahmoh (mahmoh) wrote :

I think I spoke too soon. I am now seeing agents lost after deployment, even though the metal nodes come up reliably where they didn't before:

...
2: eth0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
    inet 10.6.2.153/24 brd 10.6.2.255 scope global eth0
...
3: eth1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 9000 qdisc mq master bond0 state UP group default qlen 1000
...
12: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 9000 qdisc noqueue master juju-br0 state UP group default
...
13: juju-br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default
    inet 10.6.2.153/24 brd 10.6.2.255 scope global juju-br0
...

juju-br0 is on the bond which is on eth0 + eth1 (w/ no IP)
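In the listing above, the same address (10.6.2.153/24) shows up on both eth0 and juju-br0. A small script can spot that kind of duplicate in saved `ip addr` output. The following is a rough sketch; the function names are made up for this example, and the parsing assumes only the line layout seen in the paste above (interface lines starting with an index, indented `inet` lines below them).

```python
import re
from collections import defaultdict

# "2: eth0: <...>" style header lines; stops at ':' or '@' in the name.
IFACE_RE = re.compile(r"^\d+:\s+([^:@\s]+)")
# Indented "inet ADDR/PREFIX ..." lines (IPv4 only; inet6 is ignored).
INET_RE = re.compile(r"^\s+inet\s+(\S+)")


def addresses_by_interface(ip_addr_output):
    """Map interface name -> list of IPv4 addresses found under it."""
    result = defaultdict(list)
    current = None
    for line in ip_addr_output.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        inet = INET_RE.match(line)
        if inet and current is not None:
            result[current].append(inet.group(1))
    return dict(result)


def shared_addresses(ip_addr_output):
    """Return {address: [interfaces]} for addresses on more than one interface."""
    owners = defaultdict(list)
    for iface, addrs in addresses_by_interface(ip_addr_output).items():
        for addr in addrs:
            owners[addr].append(iface)
    return {addr: ifaces for addr, ifaces in owners.items() if len(ifaces) > 1}
```

Feeding the captured output to shared_addresses() would flag any address configured on more than one interface, which is a quick check before digging into why an agent is unreachable.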

EVENTUAL END RESULT:

neutron-gateway/0 blocked idle 1.25.6.1 1 neutron-gateway.maas Missing relations: messaging

nodes-bond-mme/0 active idle 1.25.6.1 2 compute-03.maas

nodes-bond-mme/1 unknown lost 1.25.6.1 3 compute-04.maas agent is lost, sorry! See 'juju status-history nodes-bond-mme/1'

nodes-sriov/0 active idle 1.25.6.1 4 compute-01.maas

nodes-sriov/1 unknown lost 1.25.6.1 5 compute-02.maas agent is lost, sorry! See 'juju status-history nodes-sriov/1'

Revision history for this message
mahmoh (mahmoh) wrote :

$ juju status-history nodes-bond-mme/1
TIME TYPE STATUS MESSAGE
27 Jul 2016 13:35:47Z workload unknown Waiting for agent initialization to finish
27 Jul 2016 13:35:47Z agent allocating
27 Jul 2016 13:47:18Z workload maintenance installing charm software
27 Jul 2016 13:47:19Z agent executing running install hook
27 Jul 2016 13:47:19Z workload maintenance setting up lxc clone hook
27 Jul 2016 13:47:20Z workload maintenance setting up GRUB for SRIOV, huge-pages, CPU isolation
27 Jul 2016 13:47:28Z workload maintenance setting up network interfaces
27 Jul 2016 13:47:28Z workload maintenance setting up VNF CPU governor
27 Jul 2016 13:47:30Z workload maintenance node rebooting
27 Jul 2016 13:47:30Z workload active
27 Jul 2016 13:47:30Z workload active
27 Jul 2016 13:47:31Z agent failed run install hook
27 Jul 2016 13:47:31Z agent failed run install hook
$ juju status-history nodes-sriov/1
TIME TYPE STATUS MESSAGE
27 Jul 2016 13:35:58Z workload unknown Waiting for agent initialization to finish
27 Jul 2016 13:35:58Z agent allocating
27 Jul 2016 13:47:22Z workload maintenance installing charm software
27 Jul 2016 13:47:23Z agent executing running install hook
27 Jul 2016 13:47:23Z workload maintenance setting up lxc clone hook
27 Jul 2016 13:47:24Z workload maintenance setting up GRUB for SRIOV, huge-pages, CPU isolation
27 Jul 2016 13:47:37Z workload maintenance setting up network interfaces
27 Jul 2016 13:47:37Z workload maintenance setting up VNF CPU governor
27 Jul 2016 13:47:39Z workload maintenance node rebooting
27 Jul 2016 13:47:39Z workload active
27 Jul 2016 13:47:39Z workload active
27 Jul 2016 13:47:40Z agent failed run install hook
27 Jul 2016 13:47:40Z agent failed run install hook

affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta10 → none
milestone: none → 2.0-beta10
Changed in juju-core:
assignee: nobody → Andrew McDermott (frobware)
importance: Undecided → High
status: New → Fix Released