'br-prv' ovs bridge is missing after reboot of compute node

Bug #1555162 reported by Ksenia Svechnikova
This bug affects 7 people
Affects              Status         Importance   Assigned to   Milestone
Fuel for OpenStack   Won't Fix      Critical     Atsuko Ito
Mitaka               Fix Released   Critical     Atsuko Ito

Bug Description

ISO 9.0-28

Deploy cluster with active-backup bonding and Neutron VLAN

Steps to reproduce:
1. Create cluster
2. Add 3 nodes with controller role
3. Add 1 node with compute role and 1 node with cinder role
4. Set up bonding for all interfaces (including admin interface bonding)
5. Run network verification
6. Deploy the cluster
7. Run network verification
8. Run OSTF
9. Save network configuration from slave nodes
10. Reboot all environment nodes
11. Verify that network configuration is the same after reboot

Expected result: Verification passes

Actual result:
At step 11, neutron-openvswitch-agent.log shows:

Mar 9 00:56:30 err: 2016-03-09 00:56:30.068 6390 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-4107d9e2-c48a-42d4-98b6-41e3d2adce2e - - - - -] Bridge br-prv for physical network physnet2 does not exist. Agent terminated!

Link to the swarm: https://product-ci.infra.mirantis.net/view/9.0_swarm/job/9.0.system_test.ubuntu.bonding_ha/39/

This also reproduces on a VLAN non-bonding environment after rebooting one compute node.

Revision history for this message
Ksenia Svechnikova (kdemina) wrote :
description: updated
summary: - Network settings change after reboot compute
+ 'br-prv' ovs bridge is missing after reboot of compute node
description: updated
tags: added: team-network
Changed in fuel:
status: New → Confirmed
tags: added: l23network
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Bilunov (dbilunov)
Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
Atsuko Ito (yottatsa) wrote :

This is not an l23network issue. It looks like an upstart issue. I worked around this bug by removing /etc/init/cloud-* and /etc/init/failsafe.conf.

Revision history for this message
Atsuko Ito (yottatsa) wrote :

I had the same issue a year ago at Yandex, when we rebuilt Open vSwitch and started using upstart instead of rc.d.
I recommend rebuilding our custom package to use rc.d again.

To check this, execute the following commands and reboot the machine:
rm /etc/init/openvswitch-switch.conf
update-rc.d openvswitch-switch defaults

Revision history for this message
Atsuko Ito (yottatsa) wrote :

1553733 was merged, please revalidate

tags: added: area-linux team-linux
removed: area-library l23network team-network
Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

The issue is still present. I've reproduced it on my venv with HugePages. Snapshot attached.

Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Ksenia Svechnikova (kdemina) wrote :

Confirmed on ISO 92

Revision history for this message
Atsuko Ito (yottatsa) wrote :

Could not reproduce swarm because of https://bugs.launchpad.net/fuel/+bug/1559025

tags: added: feature-huge-pages
Changed in fuel:
assignee: Dmitry Bilunov (dbilunov) → Vladimir Eremin (yottatsa)
Revision history for this message
Atsuko Ito (yottatsa) wrote :
Atsuko Ito (yottatsa)
tags: added: feature-dpdk
removed: feature-huge-pages
Dmitry Klenov (dklenov)
tags: removed: feature-dpdk
Revision history for this message
Atsuko Ito (yottatsa) wrote :

Waiting for tonight's swarm run

Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Bug https://bugs.launchpad.net/fuel/+bug/1564934 looks the same: after a controller restart, br-floating was not created, so the OVS agent failed to start:
 2016-04-01 00:15:53.545 21019 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-24b0cb67-4150-486c-9b66-a89cd9d58f93 - - - - -] Bridge br-floating for physical network physnet1 does not exist. Agent terminated!

description: updated
tags: removed: need-info
Atsuko Ito (yottatsa)
Changed in fuel:
assignee: Vladimir Eremin (yottatsa) → Fuel Library Team (fuel-library)
tags: added: team-network tricky
removed: team-linux
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

https://product-ci.infra.mirantis.net/view/9.0_swarm/job/9.0.system_test.ubuntu.bonding_ha/ has been green for 3 days already.
I've run a test on fuel-9.0-191-2016-04-12_02-00-00.iso

1) Run 'bonding_ha' systest group (it's successful)
2) Revert-resume deploy_bonding_neutron_vlan
3) Reboot 2 controllers and 1 compute 200+ times (every 5 minutes during 16+ hours)

Then I checked the neutron logs on all nodes for the "Bridge .* does not exist. Agent terminated" pattern. Zero matches.
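
The log check described above could be scripted roughly as follows. This is only a sketch: the function name count_bridge_errors is made up for illustration, and the log file paths are whatever you pass in (the actual location of the neutron logs on a given node is an assumption you need to verify).

```shell
#!/bin/sh
# Hypothetical helper (not part of Fuel or Neutron): count log lines that
# match the "Bridge ... does not exist. Agent terminated" error.
# Arguments: one or more log files to scan.
count_bridge_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    cat -- "$@" | grep -cE 'Bridge .* does not exist\. Agent terminated' || true
}
```

For example, `count_bridge_errors /var/log/neutron/*.log` (path assumed) should print 0 on a node that came up cleanly after reboot.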

I'll try to reproduce it with the ha_neutron_destructive_vxlan test on the same ISO #191. In the meantime, I'm removing the swarm blocker tag and marking this bug as incomplete, since it's not reproducible anymore.
If you're able to reproduce it on an ISO newer than 9.0-191, please post an update here with a diagnostic snapshot. Thanks.

tags: removed: swarm-blocker
Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Aleksei Stepanov (penguinolog) wrote :
Changed in fuel:
status: Incomplete → Confirmed
tags: added: swarm-blocker
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

@Alexey, thanks for the snapshot. The test from the snapshot uses a custom network template where ALL bridges are OVS ones (provider=ovs), so it was much easier to reproduce and troubleshoot the issue.

OK, so it looks like we have 2 problems here:
1) Improper configuration of OVS bridges and ports in /etc/network/interfaces.d/ifcfg-* files (L23_stored_config in fuel-library)
2) Upstart/pre-if-up configuration for openvswitch-switch service.

Some details on each of those problems:
1) Let's take a look at the stored config for br-ex - http://paste.openstack.org/show/494246/
If you run 'ifdown br-ex' and then 'ifup br-ex', this bridge will lose its connection to the physical network (OVS port enp0s4 will be missing from br-ex). This happens because enp0s4 is not configured as an OVSPort. Another problem is that both ports (enp0s4 and p_ff798dba-0) have "auto" enabled, so the system tries to bring them up directly (not under the bridge) when 'networking' starts, which may fail because the bridges are not yet created. Ports connected to an OVS bridge should not have the "auto" parameter (see examples here https://github.com/openvswitch/ovs/blob/master/debian/openvswitch-switch.README.Debian). Those ports will be brought up by the /etc/network/if-pre-up.d/openvswitch script (see line #7 in this paste http://paste.openstack.org/show/494249/ ).
So we need to fix our manifests to configure OVS bridges and their ports like this: http://paste.openstack.org/show/494250/
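
To spot the "auto on an OVS port" problem on a node, something like the following could be used. It is a rough sketch, not the fix itself: the function name find_auto_ovs_ports is hypothetical, and it assumes each file uses the ifupdown keywords from the Open vSwitch Debian README (an "iface <name>" stanza containing an "ovs_bridge" line). It would not catch a port like enp0s4 that lacks the OVS options entirely.

```shell
#!/bin/sh
# Hypothetical lint (illustration only): print interfaces that are listed
# in an "auto" line AND declare "ovs_bridge" in their iface stanza.
# Argument: path to one ifupdown config file.
find_auto_ovs_ports() {
    awk '
        # "auto" may list several interface names on one line
        $1 == "auto"       { for (i = 2; i <= NF; i++) auto[$i] = 1 }
        # remember which stanza we are in
        $1 == "iface"      { current = $2 }
        # the current stanza attaches its interface to an OVS bridge
        $1 == "ovs_bridge" { ovs[current] = 1 }
        END { for (name in ovs) if (name in auto) print name }
    ' "$1" | sort
}
```

Running it over each /etc/network/interfaces.d/ifcfg-* file should print nothing once the ports are declared with "allow-<bridge>" instead of "auto".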

2) The same /etc/network/if-pre-up.d/openvswitch script is the first one that brings the openvswitch-switch service up (see http://paste.openstack.org/show/494251/). It looks like the problem is related to ovsdb accessibility: the script tries to configure OVS interfaces while ovsdb is not up yet (/var/run/openvswitch/db.sock is not accessible). After I added a simple wait-for-socket loop (see http://paste.openstack.org/show/494252/ ), the problem with missing OVS bridges was solved, and I found the 'NO SOCKET' message in the upstart networking log, so the script evidently did try to execute ovs commands before the socket was ready (which would fail without the wait loop).
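
A wait-for-socket loop of the kind described above might look like this. It is a minimal sketch and not the content of the paste or of the merged patch: the function name wait_for_ovsdb_socket, the retry count, and the one-second interval are assumptions; only the socket path and the 'NO SOCKET' message come from this report.

```shell
#!/bin/sh
# Hypothetical helper (illustration only): block until the ovsdb socket
# exists, or give up after a number of one-second retries.
# $1: socket path (default taken from this bug report)
# $2: maximum number of attempts (assumed default: 30)
wait_for_ovsdb_socket() {
    sock="${1:-/var/run/openvswitch/db.sock}"
    tries="${2:-30}"
    while [ "$tries" -gt 0 ]; do
        # -S is true only if the path exists and is a socket
        [ -S "$sock" ] && return 0
        echo "NO SOCKET" >&2
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

In an if-pre-up hook, a call such as `wait_for_ovsdb_socket || exit 1` placed before the first ovs-vsctl invocation would keep the hook from racing the service start.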

Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/openvswitch (master)

Fix proposed to branch: master
Change author: Alexander Didenko <email address hidden>
Review: https://review.fuel-infra.org/19782

Changed in fuel:
status: Triaged → In Progress
Changed in fuel:
milestone: 9.0 → 10.0
Revision history for this message
Aleksandr Didenko (adidenko) wrote :
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/openvswitch (master)

Reviewed: https://review.fuel-infra.org/19782
Submitter: Pkgs Jenkins <email address hidden>
Branch: master

Commit: a75ff1e6dc800246b65993f8128ccb3c046640ec
Author: Alexander Didenko <email address hidden>
Date: Tue Apr 19 09:20:05 2016

Fix race condition between networking and ovs

Networking upstart service checks if openvswitch-switch service
is running (and starts it if it's not running) via
debian/ifupdown.sh hook. But service may be in 'start/pre-start'
state in which it's loading kmod and not yet able to connect to
the database. Which leads to failures in interfaces configuration.

Backporting upstream fix for LP: #1314887

Change-Id: Ifc975caed9cd17c4d0ff5d9ddd00cad6e215f620
Partial-bug: #1555162

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to packages/trusty/openvswitch (9.0)

Fix proposed to branch: 9.0
Change author: Alexander Didenko <email address hidden>
Review: https://review.fuel-infra.org/19889

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to packages/trusty/openvswitch (9.0)

Reviewed: https://review.fuel-infra.org/19889
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0

Commit: 83e0a677518892162bd9370307c6daf660fbc306
Author: Alexander Didenko <email address hidden>
Date: Wed Apr 20 09:39:11 2016

Fix race condition between networking and ovs

Networking upstart service checks if openvswitch-switch service
is running (and starts it if it's not running) via
debian/ifupdown.sh hook. But service may be in 'start/pre-start'
state in which it's loading kmod and not yet able to connect to
the database. Which leads to failures in interfaces configuration.

Backporting upstream fix for LP: #1314887

Change-Id: Ifc975caed9cd17c4d0ff5d9ddd00cad6e215f620
Partial-bug: #1555162
(cherry picked from commit a75ff1e6dc800246b65993f8128ccb3c046640ec)

tags: removed: swarm-blocker
Revision history for this message
Atsuko Ito (yottatsa) wrote :

The problem with OVS ports being stored as Linux ones arises because we handle them in a strange way: even if the provider is lnx, we can still add the port to an OVS bridge https://github.com/openstack/fuel-library/blob/master/deployment/puppet/l23network/lib/puppet/provider/l2_port/lnx.rb#L132

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Vladimir Eremin (yottatsa)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/309052

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/309052
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=645e01af3fc8dfdc4f33c0e9e8f43bdbcfd564c0
Submitter: Jenkins
Branch: master

commit 645e01af3fc8dfdc4f33c0e9e8f43bdbcfd564c0
Author: Vladimir Eremin <email address hidden>
Date: Thu Apr 21 17:53:22 2016 +0300

    Inherit provider for ports from bridges as default

    If Port provider is not specified and Bridge provider is OVS, we need to
    use OVS provider for Port instead of default one. Otherwise,
    L23_stored_config would choose wrong provider.

    Change-Id: I1213b70be19b6ce7324d69b1763d4bfd900fe3d9
    Closes-Bug: #1555162

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/309445

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/309445
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=81bf170400d01ccd18e4c5b2c47f0dbdaa9f7b0d
Submitter: Jenkins
Branch: stable/mitaka

commit 81bf170400d01ccd18e4c5b2c47f0dbdaa9f7b0d
Author: Vladimir Eremin <email address hidden>
Date: Thu Apr 21 17:53:22 2016 +0300

    Inherit provider for ports from bridges as default

    If Port provider is not specified and Bridge provider is OVS, we need to
    use OVS provider for Port instead of default one. Otherwise,
    L23_stored_config would choose wrong provider.

    Change-Id: I1213b70be19b6ce7324d69b1763d4bfd900fe3d9
    Closes-Bug: #1555162
    (cherry picked from commit 645e01af3fc8dfdc4f33c0e9e8f43bdbcfd564c0)

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Verified for Mitaka on ISO 285

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on packages/trusty/openvswitch (9.0)

Change abandoned by Dmitry Teselkin <email address hidden> on branch: 9.0
Review: https://review.fuel-infra.org/25435
Reason: Merged in https://review.fuel-infra.org/#/q/topic:group/prod-7907

norman shen (jshen28)
Changed in fuel:
status: Fix Committed → Confirmed
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :
Changed in fuel:
status: Confirmed → Fix Committed
status: Fix Committed → Won't Fix