After failover Neutron port stays in binding_failed state

Bug #1483641 reported by Ilya Shakhat

Affects               Status         Importance   Assigned to        Milestone
Mirantis OpenStack    Fix Released   Critical     Eugene Nikanorov
7.0.x                 Fix Released   Critical     Ilya Shakhat

Bug Description

Traffic flow did not recover after a cluster failover. The issue affects a single Neutron port, and the user-visible impact depends on how that port was used. In this case the affected port belongs to a router, so the instance can still get an IP address from DHCP but has no access to the gateway. In the OVS output the port carries tag 4095 (DEAD_VLAN_TAG).

Steps to reproduce:
1. Deploy a cluster with 3 controllers
2. Create several (3-4) networks and routers to increase the chances of failure
3. Spawn instances in every network
4. Suspend the primary controller
5. Look for OVS ports with tag 4095 (a minimal detection sketch follows this list)
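
A minimal detection sketch for step 5, assuming it runs as root on a node where ovs-vsctl is available (the helper name and default bridge are my own choices, not from the report):

import subprocess

DEAD_VLAN_TAG = "4095"

def ports_with_dead_tag(bridge="br-int"):
    # 'ovs-vsctl list-ports' and 'ovs-vsctl get Port <name> tag' are plain
    # ovs-vsctl invocations; nothing Neutron-specific is used here.
    ports = subprocess.check_output(
        ["ovs-vsctl", "list-ports", bridge]).decode().split()
    return [p for p in ports
            if subprocess.check_output(
                ["ovs-vsctl", "get", "Port", p, "tag"]).decode().strip()
            == DEAD_VLAN_TAG]

if __name__ == "__main__":
    print(ports_with_dead_tag())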

Revision history for this message
Ilya Shakhat (shakhat) wrote :

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "874"
  build_id: "2015-08-07_10-37-51"
  nailgun_sha: "b508953557fc248de55a0ac2557b60063a7961c5"
  python-fuelclient_sha: "2b3d634df7ef119191e2d219c003060f599ad198"
  fuel-agent_sha: "57145b1d8804389304cd04322ba0fb3dc9d30327"
  fuel-nailgun-agent_sha: "1512b9af6b41cc95c4d891c593aeebe0faca5a63"
  astute_sha: "e1d3a435e5df5b40cbfb1a3acf80b4176d15a2dc"
  fuel-library_sha: "96781c2bf9584a6848d44f2ff619887bca1b8427"
  fuel-ostf_sha: "c7f745431aa3c147f2491c865e029e0ffea91c47"
  fuelmain_sha: "bdca75d0256338519c7eddd8a840ee6ecba7f992"

tags: added: neutron
Changed in mos:
milestone: none → 7.0
importance: Undecided → High
assignee: nobody → MOS Neutron (mos-neutron)
tags: added: ha
Revision history for this message
Ilya Shakhat (shakhat) wrote :

    Bridge br-int
        fail_mode: secure
        Port "tap88eb7c2b-95"
            tag: 5
            Interface "tap88eb7c2b-95"
                type: internal
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port br-int
            Interface br-int
                type: internal
        Port "tapb73b24e0-61"
            tag: 6
            Interface "tapb73b24e0-61"
                type: internal
        Port "qr-6486ce2a-13"
            tag: 5
            Interface "qr-6486ce2a-13"
                type: internal
        Port "qr-02551ed4-85"
            tag: 4095
            Interface "qr-02551ed4-85"
                type: internal
        Port "qr-1f36a4b7-27"
            tag: 4095
            Interface "qr-1f36a4b7-27"
                type: internal
        Port "tap67c7eee8-41"
            tag: 1
            Interface "tap67c7eee8-41"
                type: internal

root@node-4:~# ip netns
qrouter-f795c549-7060-4dc7-b9f2-7ce1e85ba3d4
qrouter-4de3467e-3423-49de-8175-c5d595b60635
qdhcp-a235918d-760a-4c56-a4fd-f2ea59b8336a
qdhcp-423c1cd8-d223-4c7b-ad9d-f9a0cf0b50fa
qdhcp-6bf11fb7-6a86-4b2c-91b7-e98b04df74c6
haproxy
vrouter

root@node-4:~# neutron router-show f795c549-7060-4dc7-b9f2-7ce1e85ba3d4
+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Field | Value |
+-----------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| admin_state_up | True |
| distributed | False |
| external_gateway_info | {"network_id": "8107cd5a-7729-4cba-9743-36a4afa0cf18", "enable_snat": true, "external_fixed_ips": [{"subnet_id": "d12d89ca-3ed7-4b1e-a13b-73f37df33ff7", "ip_address": "172.18.161.240"}]} |
| ha | False |
| id | f795c549-7060-4dc7-b9f2-7ce1e85ba3d4 |
| name | router04 ...


Revision history for this message
Ilya Shakhat (shakhat) wrote :

2015-08-10 12:42:55.939 8301 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-6c811788-cb50-4aeb-87b8-b32b24556a34 ] Agent rpc_loop - iteration:4108 - port information retrieved. Elapsed:1.402 rpc_loop /usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1738

2015-08-10 12:42:55.945 8301 DEBUG neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-6c811788-cb50-4aeb-87b8-b32b24556a34 ] Starting to process devices in:{'current': set([u'6486ce2a-13a1-4c42-ad23-54bbc8353638', u'88eb7c2b-95a8-4aac-80d3-d272fdc12652', u'b73b24e0-6101-41dc-baff-1b08a80a6a05', u'02551ed4-859d-402d-9f2c-08a39c8ce120', u'67c7eee8-41c5-41e8-ae5b-f3c37246b89b', u'1f36a4b7-2791-4252-bbd7-7d45b52364dd']), 'removed': set([]), 'added': set([u'1f36a4b7-2791-4252-bbd7-7d45b52364dd'])} rpc_loop /usr/lib/python2.7/dist-packages/neutron/plugins/openvswitch/agent/ovs_neutron_agent.py:1745

#1356

2015-08-10 12:42:55.958 8301 DEBUG oslo_messaging._drivers.impl_rabbit [req-6c811788-cb50-4aeb-87b8-b32b24556a34 ] Publisher.send: sending message q-plugin to {'oslo.message': '{"_context_roles": ["admin"], "_msg_id": "0b6d456a5fc4433c853539875a09b922", "_context_request_id": "req-6c811788-cb50-4aeb-87b8-b32b24556a34", "_context_tenant_id": null, "args": {"host": "node-4.domain.tld", "agent_id": "ovs-agent-node-4.domain.tld", "devices": ["1f36a4b7-2791-4252-bbd7-7d45b52364dd"]}, "_unique_id": "2deb058864b84f2f97ec36f818945418", "_context_tenant_name": null, "_context_user": null, "_context_user_id": null, "_context_project_name": null, "_context_read_deleted": "no", "_reply_q": "reply_1960b22b6810400fa4229466f011d48e", "_context_auth_token": null, "_context_tenant": null, "_context_is_admin": true, "version":
 "1.3", "_context_project_id": null, "_context_timestamp": "2015-08-10 10:23:10.189944", "_context_user_name": null, "method": "get_devices_details_list"}', 'oslo.version': '2.0'} with routing key neutron send /usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/impl_rabbit.py:400

--------- devices_details_list contains 1 item

# 1363

# ovs_lib #262

2015-08-10 12:42:56.128 8301 DEBUG neutron.agent.linux.utils [req-6c811788-cb50-4aeb-87b8-b32b24556a34 ] Running command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=name,external_ids,ofport', 'list', 'Interface'] create_process /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:84

2015-08-10 12:42:56.534 8301 DEBUG neutron.agent.linux.utils [req-6c811788-cb50-4aeb-87b8-b32b24556a34 ]
Command: ['sudo', 'neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ovs-vsctl', '--timeout=10', '--oneline', '--format=json', '--', '--columns=name,external_ids,ofport', 'list', 'Interface']
Exit code: 0
Stdin:
Stdout: {"data":[["p_478728e7-0",["map",[]],1],["br-int",["map",[]],65534],["vxlan-c0a80202",["map",[]],2],["tapb73b24e0-61",["map",[["attached-mac","fa:16:3e:43:7e:63"],["iface-id","b73b24e0-6101-41dc-baff-1b08a80a6a05"],["iface-status","active"]]],9],["patch-int",["map",[]],1],["qr-6486ce2a-13",["map",[["attached-mac","fa:16:3e:92:d5:ed"],["iface-id","6486ce2a-13a1-4c42-ad23-54bbc835...

Revision history for this message
Ilya Shakhat (shakhat) wrote :

According to the logs, the flow was:
1. An attempt to bind the port to an agent in dead state (12:42:31, node-4), resulting in a port with binding_failed
2. The agent processes ports and learns that one was added (12:42:55, node-4)
3. The agent asks the server for port details (12:42:55, node-5); the server returns the port in binding_failed state (see the simplified sketch below)
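
The sequence above can be pictured with a short, purely illustrative sketch (simplified names, not the actual ML2/OVS agent code): the server refuses to bind a port while the host's agent is considered dead, and the agent that later picks up the device parks it on the dead VLAN.

DEAD_VLAN_TAG = 4095

def bind_port(port, host_agents):
    # Step 1: binding is attempted while the agent on the host is still dead,
    # so the port ends up with vif_type 'binding_failed'.
    alive = any(agent["alive"] for agent in host_agents)
    port["binding:vif_type"] = "ovs" if alive else "binding_failed"
    return port

def agent_treat_device(port, local_vlan_map):
    # Steps 2-3: the agent notices the added device, asks the server for its
    # details, and with a failed binding never gets a usable local VLAN.
    if port["binding:vif_type"] == "binding_failed":
        local_vlan_map[port["id"]] = DEAD_VLAN_TAG
    return local_vlan_map

port = bind_port({"id": "1f36a4b7-2791"}, host_agents=[{"alive": False}])
print(agent_treat_device(port, {}))  # {'1f36a4b7-2791': 4095}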

Revision history for this message
Ilya Shakhat (shakhat) wrote :

The issue is not fixed by restarting the OVS or L3 agents. The only solution is to reschedule the affected resource to another controller.
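
A hedged sketch of that workaround with python-neutronclient (the agent IDs and credentials are placeholders; the CLI equivalents are neutron l3-agent-router-remove and l3-agent-router-add):

from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="admin",
                        auth_url="http://192.168.0.2:5000/v2.0")

router_id = "f795c549-7060-4dc7-b9f2-7ce1e85ba3d4"  # affected router from the comments above
dead_agent = "<l3-agent-id-on-the-failed-controller>"
live_agent = "<l3-agent-id-on-a-healthy-controller>"

# Detach the router from the agent that was declared dead, then attach it to a
# live one so its ports get rebound and leave the 4095 dead VLAN.
neutron.remove_router_from_l3_agent(dead_agent, router_id)
neutron.add_router_to_l3_agent(live_agent, {"router_id": router_id})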

Changed in mos:
importance: High → Critical
Ilya Shakhat (shakhat)
Changed in mos:
status: New → Triaged
Ilya Shakhat (shakhat)
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Ilya Shakhat (shakhat)
Ilya Shakhat (shakhat)
Changed in mos:
status: Triaged → In Progress
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

This bug can only affect service ports such as DHCP and router ports.
These are the only cases where ports are bound to a host directly.
Otherwise binding is done by the OVS agent through RPC, so the agent can't be dead.
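
A hedged illustration of this distinction with python-neutronclient (port ID, host name and credentials are placeholders): for router and DHCP ports the server side sets binding:host_id itself, much like the admin call below, so a binding attempt can happen while the agent on that host is still marked dead; VM ports are instead plugged by the live OVS agent and reported over RPC.

from neutronclient.v2_0 import client

neutron = client.Client(username="admin", password="secret",
                        tenant_name="admin",
                        auth_url="http://192.168.0.2:5000/v2.0")

# Setting binding:host_id directly is what effectively happens when a router or
# DHCP resource is (re)scheduled to a host; nothing in this path checks whether
# the L2 agent on that host is alive.
neutron.update_port("<router-port-id>",
                    {"port": {"binding:host_id": "node-4.domain.tld"}})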

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/10352
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: 295d5313cb7648334f560e3293a3dd58d4914e48
Author: Ilya Shakhat <email address hidden>
Date: Mon Aug 17 13:00:47 2015

Resync agent upon revival

During failover it is possible that resources assigned to dead
agents turn into bad state. When agent returns to life there's
no way to refresh state of resources. This patch adds notification
for agents about their state, enforcing full sync in case
the agent returns to life.

Closes-Bug: #1483641

Change-Id: Ib8163c205fab3c50cc543c369590a1cf93c68c2e
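
A small server-side sketch of the idea in this commit message (names and the downtime threshold are simplified assumptions of mine, not the actual patch): the server remembers each agent's last heartbeat, and when a report arrives from an agent it had already written off as dead, it says so in the reply, which is the agent's cue to do a full resync.

import time

AGENT_DOWN_TIME = 75  # seconds; illustrative threshold, not the real config value

class AgentStateTracker:
    """Toy stand-in for the server-side agent registry."""

    def __init__(self):
        self._last_heartbeat = {}

    def report_state(self, agent_id):
        now = time.time()
        last = self._last_heartbeat.get(agent_id)
        self._last_heartbeat[agent_id] = now
        # An agent silent for longer than the downtime threshold was considered
        # dead; telling it so lets it refresh its resources.
        if last is not None and now - last > AGENT_DOWN_TIME:
            return "revived"
        return "alive"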

Changed in mos:
status: In Progress → Fix Committed
oleksii shyman (oshyman)
tags: added: on-verification
oleksii shyman (oshyman)
tags: removed: on-verification
Revision history for this message
oleksii shyman (oshyman) wrote :

Verified on ISO 246

Changed in mos:
status: Fix Committed → Fix Released
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Ilya Shakhat <email address hidden>
Review: https://review.fuel-infra.org/13319

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13319
Reason: https://review.openstack.org/#/c/232661/ supersedes this one

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change restored on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change restored by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13319

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/neutron (openstack-ci/fuel-8.0/liberty)

Change abandoned by Eugene Nikanorov <email address hidden> on branch: openstack-ci/fuel-8.0/liberty
Review: https://review.fuel-infra.org/13319
Reason: Abandoning in favor of upstream patch: https://review.openstack.org/#/c/232661/

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Fix proposed to branch: openstack-ci/fuel-8.0/liberty
Change author: Eugene Nikanorov <email address hidden>
Review: https://review.fuel-infra.org/13927

Dmitry Pyzhov (dpyzhov)
tags: added: swarm-blocker
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/neutron (openstack-ci/fuel-8.0/liberty)

Reviewed: https://review.fuel-infra.org/13927
Submitter: Pkgs Jenkins <email address hidden>
Branch: openstack-ci/fuel-8.0/liberty

Commit: fcc1ac0ccc38af5b1ab4301b15f0faee1c71d81f
Author: Eugene Nikanorov <email address hidden>
Date: Thu Nov 19 13:04:03 2015

Resync L3, DHCP and OVS/LB agents upon revival

In big and busy clusters there could be a condition when
rabbitmq clustering mechanism synchronizes queues and during
this period agents connected to that instance of rabbitmq
can't communicate with the server and server considers them
dead moving resources away. After agent become active again,
it needs to cleanup state entries and synchronize its state
with neutron-server.
The solution is to make agents aware of their state from
neutron-server point of view. This is done by changing state
reports from cast to call that would return agent's status.
When agent was dead and becomes alive, it would receive special
AGENT_REVIVED status indicating that it should refresh its
local data which it would not do otherwise.

Conflicts:
        neutron/tests/unit/agent/l3/test_agent.py

Cherry-picked from commit 3b6bd917e4b968a47a5aacb7f590143fc83816d9
Closes-Bug: #1505166
Closes-Bug: #1483641
Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
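
A matching agent-side sketch of the behaviour this commit message describes (simplified, not the actual OVS agent code): the periodic state report becomes a call whose reply carries the server's view of the agent, and an AGENT_REVIVED answer flips the fullsync flag so the next rpc_loop iteration reprocesses all ports, which is what clears the 4095-tagged ones.

AGENT_REVIVED = "revived"  # status name taken from the commit message; value assumed for illustration

class OvsAgentSketch:
    def __init__(self, state_rpc):
        self.state_rpc = state_rpc  # assumed to expose report_state(agent_state) -> status
        self.fullsync = False

    def _report_state(self):
        # Previously a fire-and-forget cast; as a call, the reply tells the
        # agent whether the server had declared it dead in the meantime.
        status = self.state_rpc.report_state(
            {"agent_type": "Open vSwitch agent", "host": "node-4.domain.tld"})
        if status == AGENT_REVIVED:
            # The server considered us dead; resync everything on the next loop.
            self.fullsync = True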

Dmitry Pyzhov (dpyzhov)
no longer affects: mos/8.0.x
Revision history for this message
Kristina Berezovskaia (kkuznetsova) wrote :

Verify on:
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "8.0"
  api: "1.0"
  build_number: "361"
  build_id: "361"
  fuel-nailgun_sha: "53c72a9600158bea873eec2af1322a716e079ea0"
  python-fuelclient_sha: "4f234669cfe88a9406f4e438b1e1f74f1ef484a5"
  fuel-agent_sha: "7463551bc74841d1049869aaee777634fb0e5149"
  fuel-nailgun-agent_sha: "92ebd5ade6fab60897761bfa084aefc320bff246"
  astute_sha: "c7ca63a49216744e0bfdfff5cb527556aad2e2a5"
  fuel-library_sha: "ba8063d34ff6419bddf2a82b1de1f37108d96082"
  fuel-ostf_sha: "889ddb0f1a4fa5f839fd4ea0c0017a3c181aa0c1"
  fuel-mirror_sha: "8adb10618bb72bb36bb018386d329b494b036573"
  fuelmenu_sha: "824f6d3ebdc10daf2f7195c82a8ca66da5abee99"
  shotgun_sha: "63645dea384a37dde5c01d4f8905566978e5d906"
  network-checker_sha: "9f0ba4577915ce1e77f5dc9c639a5ef66ca45896"
  fuel-upgrade_sha: "616a7490ec7199f69759e97e42f9b97dfc87e85b"
  fuelmain_sha: "07d5f1c3e1b352cb713852a3a96022ddb8fe2676"
(neutron+vlan, neutron+dvr+vxlan, neutron+vxlan+l3, 3 controllers, 2 compute)

Repeated these steps many times on different environments. After destroying and resetting controllers, there are no ports in binding_failed state.

Changed in mos:
status: Fix Committed → Fix Released