[R3.0-51]: VM cleanup issue in agent in control-node switchover

Bug #1593716 reported by alok kumar
This bug affects 3 people
Affects              Status                    Importance   Assigned to   Milestone
Juniper Openstack    Status tracked in Trunk
  R2.21.x            Fix Committed             High         Tapan Karwa
  R3.0               New                       High         Tapan Karwa
  R3.1               Fix Committed             High         Tapan Karwa
  Trunk              Fix Committed             High         Tapan Karwa

Bug Description

Deleted VM still present in agent VM list after control-node switchover.

VM d5f6f264-a4cf-4239-8132-603185fe9837 has been deleted in config and Nova but is still present in the agent:
http://10.204.217.94:8085/Snh_VmListReq?uuid=

This is reproducible via the test case test_controlnode_switchover_policy_between_vns_traffic on the regression testbed "testbed_regression_6_node_multi_intf.py.ubuntu-14.04".
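
A quick way to confirm the stale entry is to query the agent introspect shown above; a minimal sketch, assuming the Sandesh XML response carries one <uuid> element per VM (the tag name is an assumption, adjust to the actual response on your build):

import requests
import xml.etree.ElementTree as ET

# Agent introspect endpoint from the bug description above; an empty
# uuid parameter returns the full VM list.
AGENT_INTROSPECT = "http://10.204.217.94:8085/Snh_VmListReq?uuid="
DELETED_VM = "d5f6f264-a4cf-4239-8132-603185fe9837"

resp = requests.get(AGENT_INTROSPECT)
root = ET.fromstring(resp.content)

# Collect every <uuid> element and look for the supposedly deleted VM.
uuids = [el.text for el in root.iter("uuid")]
if DELETED_VM in uuids:
    print("BUG: deleted VM %s still in agent VM list" % DELETED_VM)
else:
    print("OK: VM %s not present in agent" % DELETED_VM)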

OS: ubuntu 14.04
contrail version: 3.0.2.0-51~kilo

Testcase description:
Test to validate that, with a policy containing a rule to allow ICMP forwarding between VMs on different VNs, ping between the VMs passes across a control-node switchover without any traffic drops.

In this test case, the control service is stopped and then, after the switchover, started again.
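
For reference, the switchover step amounts to stopping and restarting the control service on the active control node; a rough fabric-style sketch consistent with the testbed config below (the helper function is hypothetical; 'contrail-control' is the standard service name in contrail packaging):

from fabric.api import run, settings

def flap_control_node(host_string, password):
    """Stop the control service, let agents switch over, then restart it."""
    with settings(host_string=host_string, password=password):
        run('service contrail-control stop')
        # ... ping between the VMs is verified while this peer is down ...
        run('service contrail-control start')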

Ashok and Manish have debugged this and found the problem to be in the agent.

setup Info:
env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2],
    'openstack': [host2],
    'webui': [host3],
    'control': [host1, host2, host3],
    'compute': [host4, host5, host6],
    'collector': [host1, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodea35', 'nodea34', 'nodec53', 'nodec54', 'nodec55', 'nodec56']
}

A gcore of the agent is at mayamruga:/home/bhushana/Documents/technical/bugs/<bugId>

alok kumar (kalok)
tags: added: regression
Revision history for this message
Manish Singh (manishs) wrote :

CN issue. When the XMPP peer goes down, the CN should have flushed the VM subscription.

Revision history for this message
aswani kumar (aswanikumar90) wrote :

Facing the same issue on R3.1 kilo build 3.

no longer affects: juniperopenstack/r3.1
information type: Proprietary → Public
tags: added: contrail-control
removed: vrouter
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/23365
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/23366
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/23366
Committed: http://github.com/Juniper/contrail-controller/commit/112aa3c8f475f290fdd5581c9c5927e36f2d87e2
Submitter: Zuul
Branch: master

commit 112aa3c8f475f290fdd5581c9c5927e36f2d87e2
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Change-Id: I9e8bf4a6cfba36ad45bac36182c73ea2a835cdbf
Closes-bug: #1593716
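
To make the failure mode above concrete, here is a minimal Python model of the stale-marking scheme, illustrative only and not the actual agent code: entries are stamped with the sequence number current at notify time, and a delete is honoured only when the entry's stamp matches the current number.

class CfgPeerState(object):
    def __init__(self):
        self.seq = 0
        self.entries = {}                # config name -> stamped seq

    def notify_all(self):
        # Re-notify every entry to the newly selected cfg server.
        for name in self.entries:
            self.entries[name] = self.seq

    def channel_change(self, new_peer_selected, fixed):
        self.seq += 1                    # S1: new cfg peer selected
        if new_peer_selected:
            self.notify_all()            # everything now stamped S1
            if not fixed:
                self.seq += 1            # S2: the extra increment (the bug)
        else:
            self.seq += 1                # no new peer: mark all config stale

    def delete(self, name):
        # A delete is honoured only when the stamp matches the current
        # sequence number; a mismatch leaves the entry behind.
        if self.entries.get(name) == self.seq:
            del self.entries[name]

for fixed in (False, True):
    st = CfgPeerState()
    st.entries['vm-d5f6f264'] = st.seq
    st.channel_change(new_peer_selected=True, fixed=fixed)
    st.delete('vm-d5f6f264')
    print(fixed, st.entries)   # buggy: VM left behind; fixed: removed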

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/23365
Committed: http://github.com/Juniper/contrail-controller/commit/69cad55c2bd9127060011a1c37cd0ddbb5f4a865
Submitter: Zuul
Branch: R3.1

commit 69cad55c2bd9127060011a1c37cd0ddbb5f4a865
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Change-Id: I9e8bf4a6cfba36ad45bac36182c73ea2a835cdbf
Closes-bug: #1593716

Revision history for this message
Ashok Singh (ashoksr) wrote :

We are hitting the issue
https://bugs.launchpad.net/juniperopenstack/+bug/1593716

The VM object is still present in the agent's config DB; the agent has not sent an Unsubscribe for VM bd4466bb-184d-4a9a-981d-6a6302faf2f7. This can happen on an XMPP channel flap because this issue (https://bugs.launchpad.net/juniperopenstack/+bug/1593716) is present in R2.21.2-36.

(gdb) p Agent::singleton_->cfg_
$3 = (AgentConfig *) 0x7f9830000d00
(gdb) p $3->vm_cfg_table_
There is no member or method named vm_cfg_table_.
(gdb) p $3->cfg_vm_table_
$4 = (autogen::DBTable_Agent_VirtualMachine *) 0x7f983000c830
(gdb) dump_ifmap_entries 0x7f983000c830
--------------------------------------------
0x7f98281e2f70  name=00256b4b-752e-4c10-a8f6-2f7821e07f81
0x7f98044892c0  name=03e0b056-7d87-43c8-83d8-7f96179a8800
0x7f9814111e60  name=05a4ec9e-544f-4843-a729-b3d4427f30c8
0x7f98343f5bb0  name=0daf5cdc-52ab-4317-a7aa-f1318a373926
0x7f9834398d40  name=18c494c9-61d7-4008-bf13-1dff51d96acb
0x7f98142bee80  name=1fc8ae71-6331-495f-a5ee-e4298ece5725
0x7f9810940b70  name=3d95a1e3-f6d0-4804-8c3c-d7d62b78cfb2
0x7f97fc062b90  name=41d4c09b-84bc-48f2-8e2b-0ea575c98ca0
0x7f98042258e0  name=41f56cb8-3117-4822-8708-c478ebc7b08c
0x7f97f0a59390  name=4c4b151b-9cfe-40d5-8eb2-db7e0b4d1e02
0x7f980032d2a0  name=5ac03053-8049-4508-820d-511c7f365bf2
0x7f980401c5d0  name=842a9d4d-e9f4-4c04-a0b2-2875beeedab1
0x7f98346b0900  name=87d3cc5c-50cc-4c92-bbe1-6cd1140601c1
0x7f981c381c90  name=88e80b6c-c6b7-4068-b501-d118999bb6f8
0x7f981c2f04b0  name=aab86767-ddb6-4728-b7b0-65ee09c2cca2
0x7f982c313d00  name=b8f9616d-035b-4c06-8493-2b4d8d57831c
0x7f983411fa80  name=b9b7a201-83dd-4423-90e7-e29c59c6cfa7
0x7f97fc937340  name=bd4466bb-184d-4a9a-981d-6a6302faf2f7
0x7f97fca69fb0  name=d0b1d2ea-d5fb-4c45-8425-68e805e6a6ea
0x7f983c459e80  name=e5be3c1f-935c-4451-b0a9-80f50049c52d
0x7f9830338130  name=eb23f5d4-0706-429f-8e92-36bf6e7d3983
0x7f9844299800  name=eec8f83f-6ee5-47a5-804c-33b5399c7255
0x7f9811083940  name=f7d53973-0bc4-4c47-884f-7ae0d851bf56

XMPP channel flaps are confirmed by the controller trace logs taken from the core file:

controller_trace.log:3220:2017-01-25 09:32:55.881445 AgentXmppSession: peer = "10.3.135.73" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer decommissioned for xmpp channel." file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1465
controller_trace.log:3249:2017-01-25 09:32:55.881880 AgentXmppSession: peer = "10.3.135.72" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer selected as config peer on decommission of old config peer." file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1491
controller_trace.log:3252:2017-01-25 09:32:55.881891 AgentXmppSession: peer = "10.3.135.73" event = "NOT_READY" tree_builder = "10.3.135.72" message = "Peer elected Multicast Tree Builder" file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1543
controller_trace.log:3311:2017-01-26 11:26:12.898147 AgentXmppSession: peer = "10.3.135.74" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer decommissioned for xmpp channel." file = "c...


Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/33789
Submitter: Ashok Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/33789
Committed: http://github.com/Juniper/contrail-controller/commit/52a0fdaae74b2c38b58dc443129fe93a336081a7
Submitter: Zuul (<email address hidden>)
Branch: R2.21.x

commit 52a0fdaae74b2c38b58dc443129fe93a336081a7
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Closes-bug: #1593716
(cherry picked from commit 69cad55c2bd9127060011a1c37cd0ddbb5f4a865)

Change-Id: I7d0685df007599c614745b564b88333efcfbe0f6
