Network connectivity lost on node reboot

Bug #1640812 reported by Brent Eagles
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Brent Eagles

Bug Description

A recent change (see [1]) to neutron sets the OVS fail mode to "secure" on all physical bridges, i.e. bridges that appear in the bridge mappings. The fail_mode is stored persistently in the OVS database, so the bridges come back up in this mode on reboot. The details of what the secure fail mode means can be found in the OVS documentation, but the net result is that these bridges will not pass traffic on reboot. Nodes with a single NIC bridged to an OVS bridge that neutron also uses directly (basically the default for any controller, and also for compute nodes when using DVR), and that are configured via DHCP, will not be able to acquire their IP addresses on reboot and will effectively be *unplugged* from the network. The lack of IP addresses causes mayhem.

This seems to affect RHEL 7.3 consistently and current CentOS 7.2 only sporadically. I don't know about other releases of RHEL or CentOS. AFAIK, at the moment this would affect any deployment that matches the above description and is running a version of neutron with the patch [1].

[1] https://review.openstack.org/#/c/355315/

So far the workaround seems to be to modify the os-net-config generated interface files for OVS bridges that are in the path of essential overcloud traffic (e.g. br-ex on single-NIC nodes) to set the fail_mode to standalone on boot. This gives the bridge a chance to acquire its address via DHCP before the neutron agent starts and sets it back to the secure fail_mode.
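For reference, the same change can be made by hand with ovs-vsctl (br-ex is the single-NIC default bridge name from the description above; substitute your own bridge as needed):

```shell
# Inspect the current fail mode; after the neutron change in [1]
# this will report "secure" on bridges in the bridge mappings.
ovs-vsctl get-fail-mode br-ex

# Switch the bridge to standalone so it forwards traffic (and can DHCP)
# before the neutron agent starts. The setting persists in ovsdb, so
# the neutron agent will flip it back to secure once it is running.
ovs-vsctl set-fail-mode br-ex standalone
```

This is only a stopgap for an already-broken node; the persistent fix is to set the mode in the generated ifcfg/os-net-config configuration so it applies on every boot.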

Revision history for this message
Brent Eagles (beagles) wrote :
Changed in tripleo:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Brent Eagles (beagles)
milestone: none → ocata-1
Revision history for this message
Brent Eagles (beagles) wrote :

As an aside, this issue raises the question of whether it is valid for neutron to modify an OVS bridge that it *did not create* in a fundamental way like this. If so, it implies a contract between the deployer and neutron under which the deployer can make "no assumptions" about what will happen to the bridge once neutron has been configured to access it. If this implied contract is valid, required and acceptable, then bridges used for neutron should not be used for anything else. The implication for tripleo is that we should reconsider how we use OVS bridges for network configuration in the overcloud. For example, in single NIC situations, instead of having

(tripleo configured)
- eth0
  - br-ex - used for control plane access, internal api, management, external, etc. Neutron is also configured to use this for external traffic (e.g. the dataplane in our defaults), which is why the fail_mode gets altered.

(neutron configured)

- br-int
- br-tun

To something like:
(tripleo configured)
- eth0
  - br-ctl - used as br-ex is currently used, except neutron knows nothing about it
- br-ex - patched to br-ctl; ostensibly for external traffic, and this is what neutron in the overcloud is configured to use
(neutron configured)
- br-int
- br-tun

At the cost of an extra bridge (OVS bridge to OVS bridge with patch ports is allegedly cheap, btw) we get:
 - an independently configured bridge for overcloud traffic, which insulates non-tenant node traffic against changes to neutron, including upgrades, neutron bugs, etc.
 - insulation for neutron from changes to the underlying network that it doesn't "care" about.
 - only a small difference between a single-NIC environment and one with a dedicated NIC for external traffic: instead of being patched to br-ctl, br-ex is connected directly to the external NIC.
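The br-ctl/br-ex split above can be sketched with plain ovs-vsctl commands (bridge and port names are from the proposal; an actual tripleo deployment would express this in os-net-config templates rather than running these by hand):

```shell
# Control-plane bridge: owns the physical NIC and carries the node's own
# traffic. Neutron is never told about it, so it is untouched by
# neutron's fail_mode changes.
ovs-vsctl add-br br-ctl
ovs-vsctl add-port br-ctl eth0

# External bridge handed to neutron, connected to br-ctl via a pair of
# patch ports (one on each bridge, each pointing at its peer).
ovs-vsctl add-br br-ex
ovs-vsctl add-port br-ctl patch-ex -- \
    set interface patch-ex type=patch options:peer=patch-ctl
ovs-vsctl add-port br-ex patch-ctl -- \
    set interface patch-ctl type=patch options:peer=patch-ex
```

With this layout, neutron can set br-ex to the secure fail_mode without affecting the node's ability to DHCP over br-ctl on reboot.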

Changed in tripleo:
status: Confirmed → In Progress
Revision history for this message
Brent Eagles (beagles) wrote :

This actually appears to be happening on CentOS with enough frequency at the moment that I'd have to say it is affected also.

Revision history for this message
Marios Andreou (marios-b) wrote :

Adding a note for context, this is also discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1386299

Revision history for this message
Ben Nemec (bnemec) wrote :

Note that in addition to the current proposed patches, this requires updates to both the upgrade docs and tripleo-ci to include this parameter in non-tht nic-configs.
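A non-tht nic-config entry would then look something like the following os-net-config fragment (assuming the fail_mode property added to ovs_bridge by the os-net-config fix; key names and placement should be checked against the merged schema):

```yaml
network_config:
  - type: ovs_bridge
    name: br-ex
    use_dhcp: true
    # Come up forwarding on boot so DHCP works before the neutron
    # agent starts and switches the bridge to secure.
    fail_mode: standalone
    members:
      - type: interface
        name: nic1
```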

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Brent Eagles (<email address hidden>) on branch: master
Review: https://review.openstack.org/396285

Revision history for this message
Marios Andreou (marios-b) wrote :

Adding a note for reviewers as there was movement on this last night. The approach proposed by Brent with https://review.openstack.org/#/c/396285/ (tripleo-heat-templates), dependent on https://review.openstack.org/#/c/395795/ (os-net-config), is being abandoned in favor of the new fix, also from Brent, at https://review.openstack.org/#/c/397405/ (os-net-config). This needs to go to master and then newton ASAP.

Updating external trackers - needinfo beagles please sanity check.

Revision history for this message
Steven Hardy (shardy) wrote :

https://review.openstack.org/#/c/395854/ still WIP, deferring to ocata-2

Changed in tripleo:
milestone: ocata-1 → ocata-2
tags: added: mitaka-backport-potential newton-backport-potential
Brent Eagles (beagles)
Changed in tripleo:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-net-config 6.0.0.0b1

This issue was fixed in the openstack/os-net-config 6.0.0.0b1 development milestone.

Changed in tripleo:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/os-net-config 5.1.0

This issue was fixed in the openstack/os-net-config 5.1.0 release.
