neutron-openvswitch-agent breaks network connection on second reboot

Bug #1798588 reported by puthi
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Issue:
======
  Installation steps for Openstack Queens Compute Nodes:
  1- Install and configure openvswitch, neutron-openvswitch-agent, openstack-nova-compute, libvirt
  2- create virtual bridges
    /bin/ovs-vsctl del-br br-int
    /bin/ovs-vsctl del-br br-bond0
    /bin/ovs-vsctl add-br br-int
    /bin/ovs-vsctl add-br br-bond0
    /bin/ovs-vsctl --may-exist add-bond br-bond0 bond0 eno49 eno50 bond_mode=active-backup
  3- add the mgmt0 port to the br-bond0 bridge, which the hypervisor uses to reach the openstack controllers, by reconfiguring /etc/sysconfig/network-scripts/ifcfg-*
  5- delete the libvirt default network.
  Note: I share the same physical network interfaces between the Data Network (VM network) and the Management Network (nova and neutron traffic to the controllers), and all the physical interfaces are bonded (active-backup).
  6- reboot for the first time
  After the first reboot the network connection functions as expected. On the second reboot, without changing any config, the network never comes back.
  This can be fixed by rerunning step 2 and rebooting once more, but if I reboot the machine yet again the network breaks again. Also, if I stop and disable neutron-openvswitch-agent and rerun step 2, the problem disappears: however many times I reboot, the network connection keeps working (sketched at the end of this description).

  I'm not sure if I'm missing something here, but I have had this same setup working from Juno through Ocata; I never tested on Pike, though.
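
  For reference, a minimal sketch of the agent-disabled scenario described above (service name as packaged on CentOS/RDO; standard systemctl usage, not commands quoted from this report):

  systemctl stop neutron-openvswitch-agent
  systemctl disable neutron-openvswitch-agent
  # then rerun the bridge recreation from step 2 and reboot;
  # in this scenario the network keeps working across any number of reboots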

Setup
=====
OS: CentOS Linux release 7.5.1804 (Core)
Neutron Openvswitch agent tested version: openstack-neutron-openvswitch-12.0.2-1.el7.noarch, openstack-neutron-openvswitch-12.0.4-1.el7.noarch
Openvswitch tested version: openvswitch-2.9.0-3, openvswitch-2.9.0-4

Network Setup:
[eno5,eno6] <= Br-bond0[bond0,mgmt0,phy-br-bond0] <= Br-int[int-br-bond0] <= veths (VMs IF)

# ovs-vsctl show
0f950035-7f7a-4e5f-a337-04ab76945679
    Manager "ptcp:6640:127.0.0.1"
        is_connected: true
    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "int-br-bond0"
            Interface "int-br-bond0"
                type: patch
                options: {peer="phy-br-bond0"}
        Port br-int
            Interface br-int
                type: internal
    Bridge "br-bond0"
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "mgmt0"
            Interface "mgmt0"
                type: internal
        Port "br-bond0"
            Interface "br-bond0"
                type: internal
        Port "bond0"
            Interface "eno6"
            Interface "eno5"
        Port "phy-br-bond0"
            Interface "phy-br-bond0"
                type: patch
                options: {peer="int-br-bond0"}
    ovs_version: "2.9.0"

# cat /etc/sysconfig/network-scripts/ifcfg-mgmt0
DEVICE=mgmt0
ONBOOT=yes
BOOTPROTO=static
TYPE=OVSIntPort
DEVICETYPE=ovs
OVS_BRIDGE=br-bond0
HOTPLUG=no
IPADDR=10.1.1.11
PREFIX=24
DEFROUTE=yes

# cat /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
ONBOOT=yes
DEVICETYPE=ovs
TYPE=OVSBond
OVS_BRIDGE="br-bond0"
BOOTPROTO=none
BOND_IFACES="eno5 eno6"
OVS_OPTIONS="bond_mode=active-backup"
HOTPLUG=no
NM_CONTROLLED=no
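
For reference, a minimal sketch of bringing these interfaces up by hand through the legacy network scripts (assuming the OVS ifup/ifdown helpers shipped with the openvswitch package are installed; standard network-scripts usage, not taken from the report):

ifup bond0    # adds the eno5/eno6 OVS bond to br-bond0
ifup mgmt0    # brings up the internal port carrying the 10.1.1.11 management address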

# egrep -v "^$|^#" /etc/neutron/neutron.conf
[DEFAULT]
auth_strategy = keystone
core_plugin = ml2
service_plugins = router
debug = true
rpc_backend = rabbit
[agent]
[cors]
[database]
[keystone_authtoken]
auth_uri = http://xxxxx:5000
auth_type = password
auth_url = http://xxxxx:35357
project_domain_name = default
user_domain_name = default
project_name = service
username = neutron
password = xxxxx
[matchmaker_redis]
[nova]
[oslo_concurrency]
lock_path = /var/lib/neutron/tmp
[oslo_messaging_amqp]
[oslo_messaging_kafka]
[oslo_messaging_notifications]
[oslo_messaging_rabbit]
amqp_durable_queues = True
rabbit_hosts = xxxx:5672,xxxx:5672,xxxx:5672
rabbit_userid = xxxx
rabbit_password = xxxxx
rabbit_retry_interval = 1
rabbit_retry_backoff = 2
rabbit_max_retries = 0
rabbit_ha_queues = True
[oslo_messaging_zmq]
[oslo_middleware]
[oslo_policy]
[quotas]
[ssl]

# egrep -v "^$|^#" /etc/neutron/plugin.ini
[DEFAULT]
debug = true
[l2pop]
[ml2]
type_drivers = flat,vlan,vxlan
mechanism_drivers = openvswitch,linuxbridge,l2population
[ml2_type_flat]
flat_networks = physnet30
[ml2_type_geneve]
[ml2_type_gre]
tunnel_id_ranges = 1:1000
[ml2_type_vlan]
network_vlan_ranges = physnet30
[ml2_type_vxlan]
[securitygroup]
firewall_driver = neutron.agent.firewall.NoopFirewallDriver
enable_security_group = True
[ovs]
enable_tunneling = False
local_ip = 10.1.1.11
network_vlan_ranges = physnet30
bridge_mappings = physnet30:br-bond0
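
A quick sanity check of the bridge_mappings target on the compute node (a sketch using stock ovs-vsctl commands, not taken from the report):

ovs-vsctl list-br               # should list both br-int and br-bond0
ovs-vsctl list-ports br-bond0   # bond0, mgmt0 and the phy-br-bond0 patch port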

Revision history for this message
Brian Haley (brian-haley) wrote :

Is this a new deployment on CentOS 7.5.1804?

I'm just asking since re-running the ovs-vsctl commands seems to fix things.

Changed in neutron:
status: New → Incomplete
Revision history for this message
puthi (puthi) wrote :

No, this node is the one I used for debugging, but there are another 10 nodes that I deployed with VMs running which behave the same way.

As I debugged, I narrowed it down to the [ovs] plugin config bridge_mappings = physnet30:br-bond0, which does something to the br-bond0 bridge that makes it hang.
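
For context, what the agent applies to a bridge named in bridge_mappings can be inspected directly; a small sketch with stock ovs-vsctl (the secure fail_mode and the local controller it configures are also visible in the ovs-vsctl show output in the bug description):

ovs-vsctl get bridge br-bond0 fail_mode    # the agent sets this to "secure"
ovs-vsctl get bridge br-bond0 controller   # the agent points the bridge at its local OpenFlow controller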

puthi (puthi)
description: updated
Revision history for this message
puthi (puthi) wrote :

I modified step 2 in the first comment to add
    /bin/ovs-vsctl del-br br-int
    /bin/ovs-vsctl del-br br-bond0
as I missed them the last time I reported.

Revision history for this message
puthi (puthi) wrote :

openvswitch-agent.log
The IPs and hostnames have been replaced, just to keep it a bit confidential.
Here are the steps I used to reproduce the problem:
- at 14:07
  I run step 2:
  #!/bin/bash

  /bin/ovs-vsctl del-br br-int
  /bin/ovs-vsctl del-br br-bond0
  /bin/ovs-vsctl add-br br-int
  /bin/ovs-vsctl add-br br-bond0
  /bin/ovs-vsctl --may-exist add-bond br-bond0 bond0 eno5 eno6 bond_mode=active-backup

  and reboot the machine.
- at 14:09:18 the machine is booted and pingable
- at 14:11:01 I reboot the machine again (without running step 2)
- at 14:13:xx the machine is booted and the network is not available any more.

Comparing the two scenarios I tested, the only difference in openvswitch-agent.log is this line:
DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Transaction caused no change do_commit /usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/transaction.py:121

It seems that when the bridge br-bond0 already exists, neutron-openvswitch-agent decides not to do anything to the bridge, and the network breaks from there. But if br-bond0 is deleted first, then on start-up neutron-openvswitch-agent decides to do something and the network starts working.
The problem seems to point to Open vSwitch itself, but when I tested with the neutron-openvswitch-agent service disabled completely, the network never broke again, so it doesn't seem to be an openvswitch problem. I'm running out of ideas for where to look next, too.
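
One way to narrow this down after the failing reboot would be to check whether any OpenFlow flows are actually present on the bridge; a hedged diagnostic sketch (stock ovs-vsctl/ovs-ofctl commands; the assumption, not confirmed in this report, is that a secure fail_mode bridge with no flows installed drops all traffic):

ovs-vsctl get-fail-mode br-bond0   # "secure": the bridge forwards nothing on its own without flows
ovs-ofctl dump-flows br-bond0      # if the agent skipped reprogramming, this may show no (or stale) flows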

On a side note, if I run step 2 and just restart the network service (systemctl restart network), it fixes the problem as well.
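
Spelled out, that is simply (the "network" service is the legacy CentOS 7 network-scripts service used in this setup):

# rerun the bridge recreation commands from step 2, then:
systemctl restart network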

Revision history for this message
puthi (puthi) wrote :

This is the openvswitch log correlated with the previous comment.
Please note that its timestamps are in UTC, so add 7 hours to match the times in openvswitch-agent.log.

Revision history for this message
puthi (puthi) wrote :

Just in case anybody with the same setup as mine runs into this same problem, here is the workaround I used:

- Create an extra script, run before openvswitch starts up, to delete and recreate bridge br-bond0

vim /etc/systemd/system/openvswitch.service.d/fix-ovs-bridge.conf
[Service]
ExecStartPre=/etc/init.d/fix_ovs.sh

- and in /etc/init.d/fix_ovs.sh
#!/bin/bash

/bin/ovs-vsctl del-br br-int
/bin/ovs-vsctl del-br br-bond0
/bin/ovs-vsctl add-br br-int
/bin/ovs-vsctl add-br br-bond0
/bin/ovs-vsctl --may-exist add-bond br-bond0 bond0 eno5 eno6 bond_mode=active-backup

- then run
chmod +x /etc/init.d/fix_ovs.sh
systemctl daemon-reload
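
A hedged verification sketch for the drop-in (standard systemd tooling, not part of the original workaround):

mkdir -p /etc/systemd/system/openvswitch.service.d   # needed before writing the drop-in if the directory did not already exist
systemctl cat openvswitch                            # the merged unit should now show the ExecStartPre= line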

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired