restart of openvswitch-switch causes instance network down when l2population enabled

Bug #1460164 reported by James Troup on 2015-05-29
This bug affects 6 people
Affects                Importance  Assigned to
Ubuntu Cloud Archive   Medium      James Page
  Icehouse             Undecided   Unassigned
  Juno                 Medium      Unassigned
  Kilo                 Medium      James Page
neutron                Undecided   James Page
  Kilo                 Undecided   Unassigned
neutron (Ubuntu)       High        Unassigned
  Trusty               High        James Page
  Wily                 High        James Page
  Xenial               High        Unassigned

Bug Description

[Impact]
Restarts of openvswitch (typically on upgrade) result in loss of tunnel connectivity when the l2population driver is in use. This results in loss of access to all instances on the affected compute hosts.

[Test Case]
Deploy a cloud with ml2/ovs and l2population enabled.
Boot instances.
Restart ovs; instance connectivity will be lost until the neutron-openvswitch-agent is restarted on the compute hosts.
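
A quick symptom check on an affected compute host (a hypothetical helper, not part of the original report): after the ovs restart the tunnel ports are still listed in ovsdb, but the flows on br-tun stay missing until the agent is restarted.

    #!/usr/bin/env python
    # Hypothetical symptom check (not from the bug report): after restarting
    # openvswitch-switch, tunnel ports remain in ovsdb while the flows on
    # br-tun stay missing until neutron-openvswitch-agent is restarted.
    import subprocess

    def run(cmd):
        return subprocess.check_output(cmd).decode()

    tunnel_ports = [p for p in run(["ovs-vsctl", "list-ports", "br-tun"]).splitlines()
                    if p.startswith(("gre-", "vxlan-"))]
    flows = [f for f in run(["ovs-ofctl", "dump-flows", "br-tun"]).splitlines()
             if "actions=" in f]

    print("tunnel ports in ovsdb: %d" % len(tunnel_ports))
    print("flows on br-tun:       %d" % len(flows))
    # <= 1 because a freshly restarted vswitchd leaves only the default flow.
    if tunnel_ports and len(flows) <= 1:
        print("symptom present: tunnel ports exist but their flows are missing")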

[Regression Potential]
Minimal - the fix has already landed in multiple stable branches upstream.

[Original Bug Report]
On 2015-05-28, our Landscape auto-upgraded packages on two of our
OpenStack clouds. On both clouds, but only on some compute nodes, the
upgrade of openvswitch-switch and corresponding downtime of
ovs-vswitchd appear to have triggered some sort of race condition
within neutron-plugin-openvswitch-agent, leaving it in a broken state;
any new instances come up with non-functional networking, but pre-existing
instances appear unaffected. Restarting n-p-ovs-agent on the affected
compute nodes is sufficient to work around the problem.

The packages Landscape upgraded (from /var/log/apt/history.log):

Start-Date: 2015-05-28 14:23:07
Upgrade: nova-compute-libvirt:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-login0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), nova-compute-kvm:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), systemd-services:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-common:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), nova-common:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), python-nova:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-daemon0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), libpam-systemd:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), udev:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub2-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), openvswitch-switch:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2), libudev1:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-client:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), python-eventlet:amd64 (0.13.0-1ubuntu2, 0.13.0-1ubuntu2.1), python-novaclient:amd64 (2.17.0-0ubuntu1.1, 2.17.0-0ubuntu1.2), grub-pc-bin:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), grub-pc:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), nova-compute:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), openvswitch-common:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2)
End-Date: 2015-05-28 14:24:47

From /var/log/neutron/openvswitch-agent.log:

2015-05-28 14:24:18.336 47866 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)

Looking at a stuck instance, all the right tunnels and bridges and
whatnot appear to be there:

root@vector:~# ip l l | grep c-3b
460002: qbr7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
460003: qvo7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP mode DEFAULT group default qlen 1000
460004: qvb7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UP mode DEFAULT group default qlen 1000
460005: tap7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UNKNOWN mode DEFAULT group default qlen 500
root@vector:~# ovs-vsctl list-ports br-int | grep c-3b
qvo7ed8b59c-3b
root@vector:~#

But I can't ping the unit from within the qrouter-${id} namespace on
the neutron gateway. If I tcpdump the {q,t}*c-3b interfaces, I don't
see any traffic.

James Troup (elmo) wrote :

I should have said: both clouds are Ubuntu 14.04 running OpenStack Icehouse. I've put all the relevant logs I could think of/find up at:

https://chinstrap.canonical.com/~james/nx/vector-logs.tar.xz

(It's only accessible by Canonical people, sorry.)

James Page (james-page) wrote :

This is caused by the restart of the ovs daemons (part of the upgrade process - it's done post-install to minimize downtime):

2015-05-28 14:24:18.336 47866 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)

I would suspect that the code in neutron is not behaving very nicely in terms of recovering from this disconnect, resulting in the lockup you see.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu):
status: New → Confirmed
James Page (james-page) on 2015-07-24
Changed in neutron (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
JuanJo Ciarlante (jjo) wrote :

FYI today's openvswitch-switch upgrade triggered a cluster-wide outage
on one (or more) of our production openstacks.

tags: added: canonical-bootstack
Mark Shuttleworth (sabdfl) wrote :

Is the fix here to ensure that restarts of one are automatically sequenced with restarts of the other service?

James Page (james-page) wrote :

The neutron code does make some attempts to monitor the state of openvswitch - a restart of the ovs database process should be detected by the agent, and appropriate action taken.

Having the agent detect and respond to the status of openvswitch and the flows it's managing should be the right approach to dealing with this situation; I believe that this part of the codebase has improved since Icehouse, so we'll take a look, do some testing and see whether there are some cherry-picks we can make for Icehouse to improve resilience.
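
For context, later OVS agents detect a restart by installing a canary flow in an otherwise unused table and polling for it on every loop iteration. A rough, self-contained sketch of that technique (the table number and function names here are illustrative, not neutron's exact API):

    # Illustrative sketch of canary-flow based ovs restart detection.
    import subprocess

    CANARY_TABLE = 23
    OVS_NORMAL, OVS_RESTARTED, OVS_DEAD = range(3)

    def install_canary(bridge='br-int'):
        # A do-nothing flow in an otherwise unused table; a restarted
        # vswitchd comes back with empty flow tables, so this flow
        # disappearing is a reliable "ovs was restarted" signal.
        subprocess.check_call(
            ['ovs-ofctl', 'add-flow', bridge,
             'table=%d,priority=0,actions=drop' % CANARY_TABLE])

    def check_ovs_status(bridge='br-int'):
        try:
            out = subprocess.check_output(
                ['ovs-ofctl', 'dump-flows', bridge,
                 'table=%d' % CANARY_TABLE]).decode()
        except (OSError, subprocess.CalledProcessError):
            return OVS_DEAD              # vswitchd/ovsdb not reachable
        if 'actions=drop' not in out:
            return OVS_RESTARTED         # bridge is back, canary flow gone
        return OVS_NORMAL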

James Page (james-page) wrote :

On a fresh Icehouse install I see the following on a restart of ovs:

2015-12-18 11:05:26.855 6876 DEBUG neutron.agent.linux.async_process [-] Halting async process [['ovsdb-client', 'monitor', 'Interface', 'name,ofport', '--format=json']]. stop /usr/lib/python2.7/dist-packages/neutron/agent/linux/async_process.py:90
2015-12-18 11:05:26.857 6876 CRITICAL neutron [-] Trying to re-send() an already-triggered event.

The neutron-plugin-openvswitch-agent then terminates and gets restarted by upstart, triggering a full sync of ovs state:

2015-12-18 11:05:27.229 11075 INFO neutron.common.config [-] Logging enabled!
2015-12-18 11:05:27.230 11075 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] ******************************************************************************** log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1928
2015-12-18 11:05:27.230 11075 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Configuration options gathered from: log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1929
2015-12-18 11:05:27.230 11075 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] command line args: ['--config-file=/etc/neutron/neutron.conf', '--config-file=/etc/neutron/plugins/ml2/ml2_conf.ini', '--log-file=/var/log/neutron/openvswitch-agent.log'] log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1930
2015-12-18 11:05:27.230 11075 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] config files: ['/etc/neutron/neutron.conf', '/etc/neutron/plugins/ml2/ml2_conf.ini'] log_opt_values /usr/lib/python2.7/dist-packages/oslo/config/cfg.py:1931

James Page (james-page) wrote :

This commit is not in Icehouse, and looks to improve the general error handling in this instance:

https://github.com/openstack/neutron/commit/91b7fc7f162751936f7cb15d4add932a4aebd55b

James Page (james-page) wrote :

Critically, the observation made in comment #8 does not always happen - i.e. I don't reliably see the openvswitch-agent process exiting abnormally and then restoring all flows on restart. I suspect this is racy in some way - so if you luck out with a CRITICAL failure on an ovs restart, ovs gets re-configured via a full sync.

James Page (james-page) wrote :

I retested with Liberty, and saw pretty much the same behaviour; digging into this a bit deeper, I think this is related to the l2-population driver usage - with l2 pop disabled, a restart of ovs resulted in a short network outage for the instances, but service was restored quickly - with l2 pop enabled, a full agent restart was required to get things humming again.

James Page (james-page) wrote :

Confirmed that disabling l2-population has the same effect on restarts of ovs on Icehouse as well.

summary: - upgrade of openvswitch-switch can sometimes break neutron-plugin-
- openvswitch-agent
+ restart of openvswitch-switch causes instance network down when
+ l2population enabled
James Page (james-page) wrote :

Looking at the tunnel_sync function in Icehouse:

    def tunnel_sync(self):
        resync = False
        try:
            for tunnel_type in self.tunnel_types:
                details = self.plugin_rpc.tunnel_sync(self.context,
                                                      self.local_ip,
                                                      tunnel_type)
                if not self.l2_pop:
                    tunnels = details['tunnels']
                    for tunnel in tunnels:
                        if self.local_ip != tunnel['ip_address']:

you can quite clearly see that if l2_pop is enabled, then tunnel ports are not set up after the restart.

James Page (james-page) wrote :

On a full agent restart, tunnels are set up:

2015-12-18 14:41:51.505 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Port 826251b8-cc26-435e-9488-16ae67bde4f6 updated. Details: {u'admin_state_up': True, u'network_id': u'15b42697-cf68-4b78-9e19-2d167d0b37cc', u'segmentation_id': 5, u'physical_network': None, u'device': u'826251b8-cc26-435e-9488-16ae67bde4f6', u'port_id': u'826251b8-cc26-435e-9488-16ae67bde4f6', u'network_type': u'gre'}
2015-12-18 14:41:51.506 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Assigning 1 as local vlan for net-id=15b42697-cf68-4b78-9e19-2d167d0b37cc
2015-12-18 14:41:51.757 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-c53ce242-0dc1-41d2-9ab9-de88980dc3ab None] setup_tunnel_port: gre-0a052634 10.5.38.52 gre
2015-12-18 14:41:51.769 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Configuration for device 826251b8-cc26-435e-9488-16ae67bde4f6 completed.
2015-12-18 14:41:51.868 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] setup_tunnel_port: gre-0a052630 10.5.38.48 gre
2015-12-18 14:41:51.974 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] setup_tunnel_port: gre-0a052633 10.5.38.51 gre

but on an openvswitch-switch restart, this does not happen:

2015-12-18 14:42:21.836 17767 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
2015-12-18 14:42:23.103 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Mapping physical network physnet1 to bridge br-data
2015-12-18 14:42:24.923 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Agent tunnel out of sync with plugin!
2015-12-18 14:42:25.188 17767 INFO neutron.agent.securitygroups_rpc [-] Preparing filters for devices set([u'826251b8-cc26-435e-9488-16ae67bde4f6'])
2015-12-18 14:42:25.664 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Port 826251b8-cc26-435e-9488-16ae67bde4f6 updated. Details: {u'admin_state_up': True, u'network_id': u'15b42697-cf68-4b78-9e19-2d167d0b37cc', u'segmentation_id': 5, u'physical_network': None, u'device': u'826251b8-cc26-435e-9488-16ae67bde4f6', u'port_id': u'826251b8-cc26-435e-9488-16ae67bde4f6', u'network_type': u'gre'}
2015-12-18 14:42:25.665 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Assigning 1 as local vlan for net-id=15b42697-cf68-4b78-9e19-2d167d0b37cc
2015-12-18 14:42:26.054 17767 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [-] Configuration for device 826251b8-cc26-435e-9488-16ae67bde4f6 completed.

James Page (james-page) wrote :

The fdb_add method checks whether the agent already has a registered ofport for each tunnel it requires - this record is now stale, so the tunnel setup is skipped. The tunnel ports are still present in the ovsdb, however all of the flows are missing, resulting in a full loss of instance connectivity.
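
For readers unfamiliar with this path, a simplified paraphrase (not the exact neutron code) of how the l2pop fdb path decides whether to create a tunnel port; the cached ofport map is the state that goes stale across an ovs restart:

    # Simplified paraphrase of the l2population tunnel-setup path in the OVS
    # agent (not the exact neutron code). lookup_port consults the agent's
    # cached ofport map, which survives an ovs restart even though the flows
    # it implies have been wiped.
    def fdb_add_tun(self, context, br, lvm, agent_ports, lookup_port):
        for remote_ip, ports in agent_ports.items():
            ofport = lookup_port(lvm.network_type, remote_ip)
            if not ofport:
                # Only create the tunnel port (and install its flows) when
                # the agent believes it does not exist yet.
                ofport = self.setup_tunnel_port(br, remote_ip,
                                                lvm.network_type)
                if ofport == 0:
                    continue
            for port in ports:
                self.add_fdb_flow(br, port, remote_ip, lvm, ofport)

With the stale cache still reporting valid ofports after the restart, setup_tunnel_port is never reached, which is why the flows never come back without a full agent restart.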

James Page (james-page) wrote :

Resetting the ofport trackers appears to resolve the problem for an openvswitch-switch restart.

Fix proposed to branch: master
Review: https://review.openstack.org/259485
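
In outline, the fix clears the cached ofport handles whenever the agent detects that openvswitch has been restarted, so the l2pop fdb path recreates the tunnel ports and flows on the next sync. A rough sketch of the idea (the review above contains the real change):

    # Rough outline of the fix, inside the agent's main loop: on a detected
    # openvswitch restart, forget every cached tunnel ofport so the l2pop
    # fdb_add path recreates the tunnel ports and their flows.
    if ovs_restarted and self.enable_tunneling:
        self.setup_tunnel_br()
        tunnel_sync = True
        # Stale entries here are what made fdb_add skip tunnel setup.
        self.tun_br_ofports = dict(
            (tunnel_type, {}) for tunnel_type in self.tun_br_ofports)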

Changed in neutron:
assignee: nobody → James Page (james-page)
status: New → In Progress
tags: added: sts
Miguel Angel Ajo (mangelajo) wrote :

Are liberty (most likely) or kilo affected by this bug?

If so, please add kilo-backport-potential and/or liberty-backport-potential.

tags: added: kilo-backport-potential liberty-backport-potential
Miguel Angel Ajo (mangelajo) wrote :

Based on the IRC conversation with James Page, I added the backport flags.

tags: added: l2-pop

Reviewed: https://review.openstack.org/259485
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=17c14977ce0e2291e911739f8c85838f1c1f3473
Submitter: Jenkins
Branch: master

commit 17c14977ce0e2291e911739f8c85838f1c1f3473
Author: James Page <email address hidden>
Date: Fri Dec 18 15:02:11 2015 +0000

    Ensure that tunnels are fully reset on ovs restart

    When the l2population mechanism driver is enabled, if ovs is restarted
    tunnel ports are not re-configured in full due to stale ofport handles
    in the OVS agent.

    Reset all handles when OVS is restarted to ensure that tunnels are
    fully recreated in this situation.

    Change-Id: If0e034a034a7f000a1c58aa8a43d2c857dee6582
    Closes-bug: #1460164

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/272566
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=05f8099a68918342b6e0a65e78619becfa19a4ae
Submitter: Jenkins
Branch: stable/liberty

commit 05f8099a68918342b6e0a65e78619becfa19a4ae
Author: James Page <email address hidden>
Date: Fri Dec 18 15:02:11 2015 +0000

    Ensure that tunnels are fully reset on ovs restart

    When the l2population mechanism driver is enabled, if ovs is restarted
    tunnel ports are not re-configured in full due to stale ofport handles
    in the OVS agent.

    Reset all handles when OVS is restarted to ensure that tunnels are
    fully recreated in this situation.

    Change-Id: If0e034a034a7f000a1c58aa8a43d2c857dee6582
    Closes-bug: #1460164
    (cherry picked from commit 17c14977ce0e2291e911739f8c85838f1c1f3473)

tags: added: in-stable-liberty
James Page (james-page) on 2016-02-11
Changed in neutron (Ubuntu Xenial):
status: Triaged → Fix Released
Changed in neutron (Ubuntu Wily):
importance: Undecided → High
status: New → In Progress
assignee: nobody → James Page (james-page)
James Page (james-page) on 2016-02-11
description: updated
Changed in neutron (Ubuntu Trusty):
status: New → In Progress
assignee: nobody → James Page (james-page)
importance: Undecided → High

Hello James, or anyone else affected,

Accepted neutron into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/1:2014.1.5-0ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Chris J Arges (arges) wrote :

Hello James, or anyone else affected,

Accepted neutron into wily-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:7.0.3-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Wily):
status: In Progress → Fix Committed
Corey Bryant (corey.bryant) wrote :

This has passed testing so I'm marking it as verification-done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu3

---------------
neutron (1:2014.1.5-0ubuntu3) trusty; urgency=medium

  [ Corey Bryant ]
  * d/p/make_del_fdb_flow_idempotent.patch: Cherry pick from Juno
    to prevent KeyError on duplicate port removal in del_fdb_flow()
    (LP: #1531963).
  * d/tests/*-plugin: Fix race between service restart and pidof test.

  [ James Page ]
  * d/p/ovs-restart.patch: Ensure that tunnels are fully reset on ovs
    restart (LP: #1460164).

 -- Corey Bryant <email address hidden> Wed, 10 Feb 2016 14:52:04 -0500

Changed in neutron (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:7.0.3-0ubuntu1

---------------
neutron (2:7.0.3-0ubuntu1) wily; urgency=medium

  * New upstream point release (LP: #1544568):
    - Ensure that tunnels are fully reset on ovs restart (LP: #1460164).
    - d/p/iproute2-compat.patch: Drop, included upstream.

 -- James Page <email address hidden> Thu, 11 Feb 2016 17:07:16 +0000

Changed in neutron (Ubuntu Wily):
status: Fix Committed → Fix Released
James Page (james-page) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package neutron - 1:2015.1.3-0ubuntu1
---------------

 neutron (1:2015.1.3-0ubuntu1) trusty-kilo; urgency=medium
 .
   [ Corey Bryant ]
   * d/tests/neutron-agents: Give daemon 5 seconds to start to prevent race.
 .
   [ James Page ]
   * New upstream stable release (LP: #1559215):
     - d/p/dhcp-protect-against-case-when-device-name-is-none.patch: Dropped,
       included upstream.
   * Ensure tunnels are fully reconstructed after openvswitch restarts
     (LP: #1460164):
     - d/p/ovs-restart.patch: Cherry picked from upstream review.

James Page (james-page) on 2016-03-30
Changed in cloud-archive:
status: In Progress → Invalid

Reviewed: https://review.openstack.org/272643
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b79ed67ef703be3a034ed0cf95d401b0711dae46
Submitter: Jenkins
Branch: stable/kilo

commit b79ed67ef703be3a034ed0cf95d401b0711dae46
Author: James Page <email address hidden>
Date: Fri Dec 18 15:02:11 2015 +0000

    Ensure that tunnels are fully reset on ovs restart

    When the l2population mechanism driver is enabled, if ovs is restarted
    tunnel ports are not re-configured in full due to stale ofport handles
    in the OVS agent.

    Reset all handles when OVS is restarted to ensure that tunnels are
    fully recreated in this situation.

    Change-Id: If0e034a034a7f000a1c58aa8a43d2c857dee6582
    Closes-bug: #1460164
    (cherry picked from commit 17c14977ce0e2291e911739f8c85838f1c1f3473)

tags: added: in-stable-kilo

This issue was fixed in the openstack/neutron 2015.1.4 release.

tags: removed: kilo-backport-potential liberty-backport-potential

Corey Bryant (corey.bryant) wrote :

Marking Juno as "Won't fix" for the Ubuntu Cloud Archive since it is EOL.
