[R3.0-51]: VM cleanup issue in agent in control-node switchover

Bug #1593716 reported by alok kumar
This bug affects 3 people
Affects              Status                    Importance   Assigned to   Milestone
Juniper Openstack    Status tracked in Trunk
  R2.21.x            Fix Committed             High         Tapan Karwa
  R3.0               New                       High         Tapan Karwa
  R3.1               Fix Committed             High         Tapan Karwa
  Trunk              Fix Committed             High         Tapan Karwa

Bug Description

Deleted VM still present in agent VM list after control-node switchover.

VM d5f6f264-a4cf-4239-8132-603185fe9837 has been deleted in config and Nova but is still present in the agent:
http://10.204.217.94:8085/Snh_VmListReq?uuid=

This is reproducible via the test case test_controlnode_switchover_policy_between_vns_traffic on the regression testbed "testbed_regression_6_node_multi_intf.py.ubuntu-14.04".
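
A quick way to confirm the stale entry is to query the agent introspect shown above; a minimal sketch, assuming the Sandesh XML response carries one <uuid> element per VM (the tag name is an assumption, adjust to the actual response on your build):

import requests
import xml.etree.ElementTree as ET

# Agent introspect endpoint from the bug description above; an empty
# uuid parameter returns the full VM list.
AGENT_INTROSPECT = "http://10.204.217.94:8085/Snh_VmListReq?uuid="
DELETED_VM = "d5f6f264-a4cf-4239-8132-603185fe9837"

resp = requests.get(AGENT_INTROSPECT)
root = ET.fromstring(resp.content)

# Collect every <uuid> element and look for the supposedly deleted VM.
uuids = [el.text for el in root.iter("uuid")]
if DELETED_VM in uuids:
    print("BUG: deleted VM %s still in agent VM list" % DELETED_VM)
else:
    print("OK: VM %s not present in agent" % DELETED_VM)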

OS: ubuntu 14.04
contrail version: 3.0.2.0-51~kilo

Testcase description:
Test to validate that, with a policy containing a rule to allow ICMP forwarding between VMs on different VNs, ping between the VMs passes across a control-node switchover without any traffic drops.

In this test case, the control service is stopped and then, after the switchover, started again.
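
For reference, the switchover step amounts to stopping and restarting the control service on the active control node; a rough fabric-style sketch consistent with the testbed config below (the helper function is hypothetical; 'contrail-control' is the standard service name in contrail packaging):

from fabric.api import run, settings

def flap_control_node(host_string, password):
    """Stop the control service, let agents switch over, then restart it."""
    with settings(host_string=host_string, password=password):
        run('service contrail-control stop')
        # ... ping between the VMs is verified while this peer is down ...
        run('service contrail-control start')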

Ashok and Manish have debugged this and found the problem to be in the agent.

setup Info:
env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2],
    'openstack': [host2],
    'webui': [host3],
    'control': [host1, host2, host3],
    'compute': [host4, host5, host6],
    'collector': [host1, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodea35', 'nodea34', 'nodec53', 'nodec54', 'nodec55', 'nodec56']
}

A gcore of the agent is at mayamruga:/home/bhushana/Documents/technical/bugs/<bugId>

alok kumar (kalok)
tags: added: regression
Revision history for this message
Manish Singh (manishs) wrote :

CN issue. When the XMPP peer goes down, the CN should have flushed the VM subscription.

Revision history for this message
aswani kumar (aswanikumar90) wrote :

Facing the same issue on R3.1 kilo build 3.

no longer affects: juniperopenstack/r3.1
information type: Proprietary → Public
tags: added: contrail-control
removed: vrouter
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/23365
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/23366
Submitter: Manish Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/23366
Committed: http://github.com/Juniper/contrail-controller/commit/112aa3c8f475f290fdd5581c9c5927e36f2d87e2
Submitter: Zuul
Branch: master

commit 112aa3c8f475f290fdd5581c9c5927e36f2d87e2
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Change-Id: I9e8bf4a6cfba36ad45bac36182c73ea2a835cdbf
Closes-bug: #1593716
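
To make the failure mode above concrete, here is a minimal Python model of the stale-marking scheme, illustrative only and not the actual agent code: entries are stamped with the sequence number current at notify time, and a delete is honoured only when the entry's stamp matches the current number.

class CfgPeerState(object):
    def __init__(self):
        self.seq = 0
        self.entries = {}                # config name -> stamped seq

    def notify_all(self):
        # Re-notify every entry to the newly selected cfg server.
        for name in self.entries:
            self.entries[name] = self.seq

    def channel_change(self, new_peer_selected, fixed):
        self.seq += 1                    # S1: new cfg peer selected
        if new_peer_selected:
            self.notify_all()            # everything now stamped S1
            if not fixed:
                self.seq += 1            # S2: the extra increment (the bug)
        else:
            self.seq += 1                # no new peer: mark all config stale

    def delete(self, name):
        # A delete is honoured only when the stamp matches the current
        # sequence number; a mismatch leaves the entry behind.
        if self.entries.get(name) == self.seq:
            del self.entries[name]

for fixed in (False, True):
    st = CfgPeerState()
    st.entries['vm-d5f6f264'] = st.seq
    st.channel_change(new_peer_selected=True, fixed=fixed)
    st.delete('vm-d5f6f264')
    print(fixed, st.entries)   # buggy: VM left behind; fixed: removed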

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/23365
Committed: http://github.com/Juniper/contrail-controller/commit/69cad55c2bd9127060011a1c37cd0ddbb5f4a865
Submitter: Zuul
Branch: R3.1

commit 69cad55c2bd9127060011a1c37cd0ddbb5f4a865
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Change-Id: I9e8bf4a6cfba36ad45bac36182c73ea2a835cdbf
Closes-bug: #1593716

Revision history for this message
Ashok Singh (ashoksr) wrote :

We are hitting the issue
https://bugs.launchpad.net/juniperopenstack/+bug/1593716

The VM object is still present in the agent's config DB; the agent has not sent an Unsubscribe for VM bd4466bb-184d-4a9a-981d-6a6302faf2f7. This can happen on an XMPP channel flap because this issue (https://bugs.launchpad.net/juniperopenstack/+bug/1593716) is present in R2.21.2-36.

(gdb) p Agent::singleton_->cfg_
$3 = (AgentConfig *) 0x7f9830000d00
(gdb) p $3->vm_cfg_table_
There is no member or method named vm_cfg_table_.
(gdb) p $3->cfg_vm_table_
$4 = (autogen::DBTable_Agent_VirtualMachine *) 0x7f983000c830
(gdb) dump_ifmap_entries 0x7f983000c830
--------------------------------------------
0x7f98281e2f70  name=00256b4b-752e-4c10-a8f6-2f7821e07f81
0x7f98044892c0  name=03e0b056-7d87-43c8-83d8-7f96179a8800
0x7f9814111e60  name=05a4ec9e-544f-4843-a729-b3d4427f30c8
0x7f98343f5bb0  name=0daf5cdc-52ab-4317-a7aa-f1318a373926
0x7f9834398d40  name=18c494c9-61d7-4008-bf13-1dff51d96acb
0x7f98142bee80  name=1fc8ae71-6331-495f-a5ee-e4298ece5725
0x7f9810940b70  name=3d95a1e3-f6d0-4804-8c3c-d7d62b78cfb2
0x7f97fc062b90  name=41d4c09b-84bc-48f2-8e2b-0ea575c98ca0
0x7f98042258e0  name=41f56cb8-3117-4822-8708-c478ebc7b08c
0x7f97f0a59390  name=4c4b151b-9cfe-40d5-8eb2-db7e0b4d1e02
0x7f980032d2a0  name=5ac03053-8049-4508-820d-511c7f365bf2
0x7f980401c5d0  name=842a9d4d-e9f4-4c04-a0b2-2875beeedab1
0x7f98346b0900  name=87d3cc5c-50cc-4c92-bbe1-6cd1140601c1
0x7f981c381c90  name=88e80b6c-c6b7-4068-b501-d118999bb6f8
0x7f981c2f04b0  name=aab86767-ddb6-4728-b7b0-65ee09c2cca2
0x7f982c313d00  name=b8f9616d-035b-4c06-8493-2b4d8d57831c
0x7f983411fa80  name=b9b7a201-83dd-4423-90e7-e29c59c6cfa7
0x7f97fc937340  name=bd4466bb-184d-4a9a-981d-6a6302faf2f7
0x7f97fca69fb0  name=d0b1d2ea-d5fb-4c45-8425-68e805e6a6ea
0x7f983c459e80  name=e5be3c1f-935c-4451-b0a9-80f50049c52d
0x7f9830338130  name=eb23f5d4-0706-429f-8e92-36bf6e7d3983
0x7f9844299800  name=eec8f83f-6ee5-47a5-804c-33b5399c7255
0x7f9811083940  name=f7d53973-0bc4-4c47-884f-7ae0d851bf56

XMPP channel flaps are confirmed by the controller trace logs taken from the core file:

controller_trace.log:3220:2017-01-25 09:32:55.881445 AgentXmppSession: peer = "10.3.135.73" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer decommissioned for xmpp channel." file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1465
controller_trace.log:3249:2017-01-25 09:32:55.881880 AgentXmppSession: peer = "10.3.135.72" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer selected as config peer on decommission of old config peer." file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1491
controller_trace.log:3252:2017-01-25 09:32:55.881891 AgentXmppSession: peer = "10.3.135.73" event = "NOT_READY" tree_builder = "10.3.135.72" message = "Peer elected Multicast Tree Builder" file = "controller/src/vnsw/agent/controller/controller_peer.cc" line = 1543
controller_trace.log:3311:2017-01-26 11:26:12.898147 AgentXmppSession: peer = "10.3.135.74" event = "NOT_READY" tree_builder = "NULL" message = "BGP peer decommissioned for xmpp channel." file = "c...


Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/33789
Submitter: Ashok Singh (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/33789
Committed: http://github.com/Juniper/contrail-controller/commit/52a0fdaae74b2c38b58dc443129fe93a336081a7
Submitter: Zuul (<email address hidden>)
Branch: R2.21.x

commit 52a0fdaae74b2c38b58dc443129fe93a336081a7
Author: Manish <email address hidden>
Date: Wed Aug 17 14:39:40 2016 +0530

VM cleanup fails sometimes.

Problem:
On a cfg channel change, the sequence of events was to select a new cfg peer if one was available (incrementing the sequence number, say to S1), notify all config to the new cfg server, and then increment the sequence number again (to S2). As a result, all of the notified config was stamped with S1; when those entries were later put up for delete, the current sequence number S2 did not match S1. This inconsistency caused the VM cleanup to fail.
The second increment of the sequence number exists to mark all config as stale when no new cfg server is selected in non-headless mode.

Solution:
If an active peer is selected, do not increment the sequence number again (i.e. no S2).

Closes-bug: #1593716
(cherry picked from commit 69cad55c2bd9127060011a1c37cd0ddbb5f4a865)

Change-Id: I7d0685df007599c614745b564b88333efcfbe0f6
