A VM's port is in down state after compute node reboot

Bug #1577721 reported by Sergey Kolekonov
This bug affects 3 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x: Invalid / High / assigned to Oleg Bondarev
  8.0.x:  Invalid / High / assigned to Anatolii Neliubin
  9.x:    Fix Released / High / assigned to Oleg Bondarev

Bug Description

Steps to reproduce:
- Deploy an environment with Neutron + VXLAN (MOS 9.0 + CentOS-based compute node)
- Create required security groups
- Spawn an instance, assign a floating IP address and check that it's available
- Properly reboot a compute node
Expected result:
The instance is reachable via its floating IP.
Actual result:
The instance can't obtain an IP address, the floating IP doesn't work, and the instance's port is in the down state.
Detailed bug description:
MOS 9.0 ISO fuel-9.0-274
The time on the compute and controller nodes differs (by 6 minutes). This may be a root cause of the issue.
Swarm run: https://product-ci.infra.mirantis.net/job/9.0.system_test.ubuntu.rh/94/

Changed in mos:
assignee: nobody → Sergey Kolekonov (skolekonov)
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

Logs from the compute node

Changed in mos:
status: New → Confirmed
tags: added: area-neutron
Changed in mos:
importance: Undecided → High
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Two events detected by ovsdb-client almost at the same time:
2016-05-02 23:03:14.007 2686 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: {"data":[["b13d9fa2-73ff-474b-9d4d-5c651a794201","delete","qvo62f4fa37-a1",-1,["map",[["attached-mac","fa:16:3e:9b:04:3b"],["iface-id","62f4fa37-a181-4db4-8db7-ed12552af141"],["iface-status","active"],["vm-uuid","b739f239-702e-402a-a5b6-815926afc82f"]]]],["61beda06-8062-4176-bca5-11a1378e0b26","insert","qvo62f4fa37-a1",["set",[]],["map",[["attached-mac","fa:16:3e:9b:04:3b"],["iface-id","62f4fa37-a181-4db4-8db7-ed12552af141"],["iface-status","active"],["vm-uuid","b739f239-702e-402a-a5b6-815926afc82f"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:236

2016-05-02 23:03:14.008 2686 DEBUG neutron.agent.linux.async_process [-] Output received from [ovsdb-client monitor Interface name,ofport,external_ids --format=json]: {"data":[["61beda06-8062-4176-bca5-11a1378e0b26","old",null,["set",[]],null],["","new","qvo62f4fa37-a1",2,["map",[["attached-mac","fa:16:3e:9b:04:3b"],["iface-id","62f4fa37-a181-4db4-8db7-ed12552af141"],["iface-status","active"],["vm-uuid","b739f239-702e-402a-a5b6-815926afc82f"]]]]],"headings":["row","action","name","ofport","external_ids"]} _read_stdout /usr/lib/python2.7/site-packages/neutron/agent/linux/async_process.py:236

meaning that port qvo62f4fa37-a1 was deleted and re-created right away.

The OVS agent processes all added ports first and then all deleted ports in each loop. In this case it was the same port, so it was added first and then removed. It's not yet clear what caused this behavior; it's probably a race due to the server and agent nodes being out of sync. More investigation is needed.
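
For illustration, a minimal Python sketch (not the actual Neutron agent code; process_added and process_removed are hypothetical callbacks) of how handling additions before deletions within one loop iteration can leave a re-created port down:

def handle_ovsdb_events(events, process_added, process_removed):
    # Events mirror the ovsdb-client monitor output quoted above: each one
    # carries a "name" and an "action" ("insert"/"new" or "delete").
    added = {e["name"] for e in events if e["action"] in ("insert", "new")}
    removed = {e["name"] for e in events if e["action"] == "delete"}

    # If the same interface (here qvo62f4fa37-a1) was deleted and immediately
    # re-created within one polling interval, processing additions first and
    # deletions afterwards wires the port up and then tears it down again,
    # leaving it in the down state.
    for name in added:
        process_added(name)
    for name in removed:
        process_removed(name)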

Revision history for this message
Sergey Kolekonov (skolekonov) wrote :

Moving to mos-neutron as it seems to be a Neutron issue

Changed in mos:
assignee: Sergey Kolekonov (skolekonov) → MOS Neutron (mos-neutron)
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Oleg Bondarev (obondarev)
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Oleg Bondarev (obondarev) wrote :

So on compute node restart, nova does a hard reboot of all instances on that node, which causes the corresponding OVS ports to be deleted and re-created. This is normal behavior.

From the neutron code I can see that the OVS agent has special handling for the case when an OVS port is deleted and instantly re-created: it just ignores the 'deleted' event, processes the port, and puts it into the ACTIVE state. I was able to reproduce the scenario but didn't hit the bug.
An environment with a reproduction is needed to debug the issue further. Marking as Incomplete for now.
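
As an illustration only (not the actual agent code), the special handling described above could look like the following: a port that appears in both the added and the deleted sets within the same iteration was deleted and instantly re-created, so its 'deleted' event is dropped and the port is processed as added:

def reconcile(added_events, deleted_events):
    # Ports appearing in both sets were deleted and instantly re-created
    # (e.g. during a nova hard reboot of the instance), so their 'deleted'
    # events are ignored and only the 'added' path runs.
    re_created = {e["name"] for e in added_events} & {e["name"] for e in deleted_events}
    deleted_events = [e for e in deleted_events if e["name"] not in re_created]
    return added_events, deleted_events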

Changed in mos:
status: In Progress → Incomplete
Revision history for this message
Alexander Ignatov (aignatov) wrote :

@skolekonov, need one more repro.

Changed in mos:
assignee: Oleg Bondarev (obondarev) → Sergey Kolekonov (skolekonov)
Changed in mos:
status: Incomplete → Confirmed
Changed in mos:
status: Confirmed → In Progress
Revision history for this message
Sergey Kolekonov (skolekonov) wrote :
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Upstream bug filed, please see analysis there: https://bugs.launchpad.net/neutron/+bug/1585623

Revision history for this message
Oleg Bondarev (obondarev) wrote :

Invalid for 10.0 since the fix is in Newton: https://review.openstack.org/#/c/321131/

Revision history for this message
Dina Belova (dbelova) wrote :

Downstream code on review - https://review.fuel-infra.org/#/c/21659/

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (9.0/mitaka)

Reviewed: https://review.fuel-infra.org/21659
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka

Commit: a7c1dd7b286c92c52ff3c7043c838610421abf59
Author: Jenkins <email address hidden>
Date: Mon Jun 6 08:19:38 2016

Merge the tip of origin/stable/mitaka into origin/9.0/mitaka

1bef591 Pass ha_router_port flag for _snat_router_interfaces ports
bd8adf3 Fixed help messages for path_mtu and global_physnet_mtus options
f995169 Fix bgp-speaker-network-remove error
fc3c8a6 OVS: compare names when checking devices both added and deleted

Closes-Bug: #1577721
Change-Id: Icbf0c83de37c75482bd1db89110f186b4faf7c67

Revision history for this message
Dina Belova (dbelova) wrote :

Fix committed as a part of sync with stable/mitaka ^^

tags: added: on-verification
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Verified on:
cat /etc/fuel_build_id:
 487
cat /etc/fuel_build_number:
 487
cat /etc/fuel_release:
 9.0
cat /etc/fuel_openstack_version:
 mitaka-9.0
rpm -qa | egrep 'fuel|astute|network-checker|nailgun|packetary|shotgun':
 fuel-release-9.0.0-1.mos6349.noarch
 fuel-misc-9.0.0-1.mos8459.noarch
 python-packetary-9.0.0-1.mos140.noarch
 fuel-bootstrap-cli-9.0.0-1.mos285.noarch
 fuel-migrate-9.0.0-1.mos8459.noarch
 rubygem-astute-9.0.0-1.mos750.noarch
 fuel-mirror-9.0.0-1.mos140.noarch
 shotgun-9.0.0-1.mos90.noarch
 fuel-openstack-metadata-9.0.0-1.mos8742.noarch
 fuel-notify-9.0.0-1.mos8459.noarch
 nailgun-mcagents-9.0.0-1.mos750.noarch
 python-fuelclient-9.0.0-1.mos325.noarch
 fuel-9.0.0-1.mos6349.noarch
 fuel-utils-9.0.0-1.mos8459.noarch
 fuel-setup-9.0.0-1.mos6349.noarch
 fuel-provisioning-scripts-9.0.0-1.mos8742.noarch
 fuel-library9.0-9.0.0-1.mos8459.noarch
 network-checker-9.0.0-1.mos74.x86_64
 fuel-agent-9.0.0-1.mos285.noarch
 fuel-ui-9.0.0-1.mos2717.noarch
 fuel-ostf-9.0.0-1.mos936.noarch
 fuelmenu-9.0.0-1.mos274.noarch
 fuel-nailgun-9.0.0-1.mos8742.noarch
Env with a RHEL compute node.

Repeated the steps from the description. After rebooting the compute node, the VM is OK and the floating IP is available.

tags: removed: on-verification
Revision history for this message
Anatolii Neliubin (aneliubin) wrote :

A customer hit this issue on MOS 8.0 MU4; please backport the fix.

tags: added: customer-found
Revision history for this message
Max Yatsenko (myatsenko) wrote :

The cause of this bug in "mitaka" (MOS 9.0) was a refactoring that added new code:
https://github.com/openstack/neutron/commit/ccdf211b4cf224d415520c7d70b7f53952674414

So, the patch provided upstream:

https://review.openstack.org/#/c/321131/

solved the issue for "mitaka" and "master".

Since MOS 8.0 ("liberty") doesn't have this code, any issue seen on MOS 8.0 must be caused by other problems that are not related to this bug, and the upstream patch can't be backported to MOS 8.0.
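
A hedged illustration (not Neutron code) of why the upstream fix compares device names when checking ports that were both added and deleted: the re-created interface gets a new ofport, so comparing full event payloads never matches, while comparing the "name" field does:

# Event payloads mimic the ovsdb-client output quoted earlier in this bug.
deleted_event = {"name": "qvo62f4fa37-a1", "ofport": -1}
added_event = {"name": "qvo62f4fa37-a1", "ofport": 2}

print(deleted_event == added_event)                   # False: payloads differ
print(deleted_event["name"] == added_event["name"])   # True: same port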

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Anatoly, this bug is Invalid for Liberty since Liberty doesn't contain the faulty code at all. I cannot reproduce the issue on 8.0-MU-5; no matter how many times I reboot the compute node, all ports come back up and continue working. Please provide more info on the issue from the customer's side: logs, a diagnostic snapshot, etc.

Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Moving to Invalid for 8.0 since there was no feedback and the issue is not reproducible.
