On 2015-05-28, Landscape auto-upgraded packages on two of our
OpenStack clouds. On both clouds, but only on some compute nodes, the
upgrade of openvswitch-switch and the corresponding downtime of
ovs-vswitchd appears to have triggered some sort of race condition
within neutron-plugin-openvswitch-agent, leaving it in a broken state:
any new instances come up with non-functional networking, but
pre-existing instances appear unaffected. Restarting
neutron-plugin-openvswitch-agent on the affected compute nodes is
sufficient to work around the problem.
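For the record, the workaround is just (assuming the service name as
shipped on Ubuntu 14.04):

root@vector:~# service neutron-plugin-openvswitch-agent restart

after which new instances come up with working network again.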
The packages Landscape upgraded (from /var/log/apt/history.log):

Start-Date: 2015-05-28 14:23:07
Upgrade: nova-compute-libvirt:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-login0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), nova-compute-kvm:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), systemd-services:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-common:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), nova-common:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), python-nova:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), libsystemd-daemon0:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), libpam-systemd:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), udev:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), grub2-common:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), openvswitch-switch:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2), libudev1:amd64 (204-5ubuntu20.11, 204-5ubuntu20.12), isc-dhcp-client:amd64 (4.2.4-7ubuntu12.1, 4.2.4-7ubuntu12.2), python-eventlet:amd64 (0.13.0-1ubuntu2, 0.13.0-1ubuntu2.1), python-novaclient:amd64 (2.17.0-0ubuntu1.1, 2.17.0-0ubuntu1.2), grub-pc-bin:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), grub-pc:amd64 (2.02~beta2-9ubuntu1.1, 2.02~beta2-9ubuntu1.2), nova-compute:amd64 (2014.1.4-0ubuntu2, 2014.1.4-0ubuntu2.1), openvswitch-common:amd64 (2.0.2-0ubuntu0.14.04.1, 2.0.2-0ubuntu0.14.04.2)
End-Date: 2015-05-28 14:24:47

From /var/log/neutron/openvswitch-agent.log:

2015-05-28 14:24:18.336 47866 ERROR neutron.agent.linux.ovsdb_monitor [-] Error received from ovsdb monitor: ovsdb-client: unix:/var/run/openvswitch/db.sock: receive failed (End of file)
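That error is the agent's ovsdb monitor losing its connection when
ovsdb-server goes down during the upgrade. The monitor is just a
long-running ovsdb-client subprocess; as far as I can tell from the
Icehouse ovsdb_monitor source, it is spawned roughly as:

root@vector:~# ovsdb-client monitor Interface name,ofport --format=json

so an openvswitch-switch restart will always EOF it. The bug is
presumably that the agent sometimes fails to recover (respawn the
monitor and resync ports) afterwards, which would explain why only
new instances are affected.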
Looking at a stuck instance, all the right tunnels, bridges, and
whatnot appear to be there:
root@vector:~# ip l l | grep c-3b
460002: qbr7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
460003: qvo7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP mode DEFAULT group default qlen 1000
460004: qvb7ed8b59c-3b: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UP mode DEFAULT group default qlen 1000
460005: tap7ed8b59c-3b: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master qbr7ed8b59c-3b state UNKNOWN mode DEFAULT group default qlen 500
root@vector:~# ovs-vsctl list-ports br-int | grep c-3b
qvo7ed8b59c-3b
root@vector:~#
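One more thing worth checking on a stuck port is whether the agent
ever assigned it a local VLAN tag on br-int, something along these
lines (port name taken from the listing above):

root@vector:~# ovs-vsctl get Port qvo7ed8b59c-3b tag

If that prints [] or 4095 (the dead VLAN the agent parks unwired
ports on), the agent saw the port but never finished wiring it up,
which fits a hung/broken agent rather than a problem in OVS itself.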
But I can't ping the unit from within the qrouter-${id} namespace on
the neutron gateway. If I tcpdump the {q,t}*c-3b interfaces, I don't
see any traffic.
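(For concreteness, that was along the lines of:

root@gateway:~# ip netns exec qrouter-${id} ping <instance address>
root@vector:~# tcpdump -ni tap7ed8b59c-3b

with <instance address> standing in for the instance's actual fixed
IP; nothing shows up on the tap, qvb, or qvo interfaces while the
ping runs.)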