Comment 11 for bug 1455862

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/11496
Committed: https://github.com/Juniper/contrail-controller/commit/bf9e76793aab5870da98bf49b5fb6c6eed82625f
Submitter: Zuul
Branch: master

commit bf9e76793aab5870da98bf49b5fb6c6eed82625f
Author: Manish <email address hidden>
Date: Thu May 14 17:23:05 2015 +0530

Handle error from XMPP session send.

Problem:
If an XMPP channel send fails because it internally (at the TCP level) translates
into a deferred send, the agent controller module was treating this as a hard
failure. For a route update, as seen in this bug, any such deferral leaves the
route's exported flag set to false. Because the flag is false, the unsubscribe
for the route is never sent when the route is later deleted. The deferred update,
however, eventually goes out, so the control node never receives the delete,
which results in the issue described in the bug.

Solution:
Ideally, every failed send should enqueue further requests to the control node
and replay them whenever the callback (writereadycb) fires, so the agent does
not overload the socket.
As a quick fix, however, the error is no longer used to decide further
operations after the send is done, on the assumption that the send will always
succeed. The following messages are currently sent:
1) VM config sub/unsub
2) Config subscribe for agent
3) VRF sub/unsub
4) Route sub/unsub.
The connection-not-present case is taken care of by channel flap handling.
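
Below is a minimal C++ sketch of the quick-fix pattern; RouteState,
SendRouteUpdate and ExportRoute are hypothetical stand-ins, not the actual
agent code. The point is only that the send return value no longer drives the
exported flag:

    #include <string>

    struct RouteState {
        bool exported = false;
    };

    // Hypothetical stand-in: returns false when the TCP layer defers the
    // send; the data is still queued and goes out once the socket becomes
    // writable again.
    bool SendRouteUpdate(const std::string &msg) {
        (void)msg;
        return false;  // pretend this send was deferred
    }

    void ExportRoute(RouteState *state, const std::string &msg) {
        // Before the fix: state->exported = SendRouteUpdate(msg);
        // A deferred send left exported == false, so the unsubscribe on
        // route delete was never sent.
        SendRouteUpdate(msg);
        state->exported = true;  // assume the send eventually succeeds
    }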

Change-Id: Ib6e0856b5c689b51209add4ab459b8bd2e952143
Closes-bug: 1453483
(cherry picked from commit 0a70915fd3bc1954154e657f59123d1a4597f2a4)

VRF state not deleted in DelPeer walk.

The problem statement remains the same as in this commit:
https://github.com/Juniper/contrail-controller/commit/8e302fcb991c8f5d8f5defb85b9851f8cde5f479

However, the above commit does not solve the issue. The walk count was
incremented when a walk was enqueued, but when the walk is processed it calls
Cancel on any previously started walk, and Cancel decrements the walk count.
This defeats the purpose of moving the walk count increment to enqueue time in
the above commit.
Also consider a walk of a VRF with four route tables. This should result in a
walk count of 5 (1 for the vrf table + 4 for the route tables), but with the
above fix it is 2 (1 for the Vrf + 1 for a route table). The fix did not take
into consideration that the walk count needs to be incremented for each route
table.

Solution:
Use a separate enqueue walk count and restore walk_count to its behaviour
before the above commit. Use both of them to check for walk done.
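
A rough sketch of the two-counter idea, with hypothetical names (the real
walker code is more involved): one counter tracks enqueued walk requests, the
other tracks per-table walks actually started, and walk done is reported only
when both reach zero.

    class WalkCounters {
    public:
        void WalkEnqueued()     { ++enqueued_walk_count_; }
        void WalkDequeued()     { --enqueued_walk_count_; }
        void TableWalkStarted() { ++walk_count_; }  // once per vrf/route table
        void TableWalkDone()    { --walk_count_; }
        // Walk done is declared only when nothing is waiting in the queue
        // and no per-table walk is still running.
        bool AllWalksDone() const {
            return enqueued_walk_count_ == 0 && walk_count_ == 0;
        }
    private:
        int enqueued_walk_count_ = 0;
        int walk_count_ = 0;
    };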

Closes-bug: 1455862
Change-Id: I8d96732375f649e70d6754cb6c5b34c24867ce0c
(cherry picked from commit 705165854bbdaff9c47a2a9443410eec53e4fb37)

Multicast route gets deleted when vxlan id is changed in configured mode

Problem:
In oper multicast, when the local peer's vxlan-id changes, an add was issued
for the route with the new vxlan and a delete for the same route with the old
vxlan. Since the peer is local, the path search compares only the peer and not
the vxlan, so the delete removes the local path and eventually the multicast
route itself.

Solution:
There is no need to withdraw the path from the local peer on a vxlan id change;
just trigger an update of it. This leads to the controller route _export call
which, using the state set on the flood route, can identify that the vxlan id
has changed and takes care of withdrawing the old vxlan and updating with the
new vxlan.
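
A hedged sketch of the idea, with illustrative names only (FloodRoute,
NotifyRouteChanged are not the real oper-db types): on a local vxlan id change
the route is updated and re-notified instead of being deleted and re-added.

    #include <cstdint>

    struct FloodRoute {
        std::uint32_t vxlan_id = 0;
    };

    // Hypothetical stand-in for notifying the route so that the controller
    // route _export path runs again.
    void NotifyRouteChanged(FloodRoute *rt) { (void)rt; }

    void OnLocalVxlanIdChange(FloodRoute *rt, std::uint32_t new_vxlan_id) {
        // Before: delete the path with the old vxlan id, then add it with
        // the new one. The delete matched on peer only, wiping out the
        // local path and eventually the multicast route.
        // After: update in place and let the export code, using the state
        // kept on the flood route, withdraw the old vxlan and send the new.
        rt->vxlan_id = new_vxlan_id;
        NotifyRouteChanged(rt);
    }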

Change-Id: I3afeddd2620615bb477aec5a0c6715fcdc99352b
Closes-bug: 1457007
(cherry picked from commit a77f2c31a2c6fb8a8abe22f3a562767cf230cbef)

Mac route in agent had an invalid label and destination NH.

Problem:
Two issues were seen:
1) The label was wrongly interpreted as MPLS although it was VXLAN, and it
never got corrected. The sequence of events: the EVPN route was received before
the global vrouter config. Encapsulation priorities were absent in the agent,
so the evpn route was programmed with Mpls-gre (the control-node had sent
Vxlan). Since Mpls encap was chosen, the label was interpreted as an mpls label
and not a vxlan id. When the global vrouter config was received, a resync was
done for the route. The resync also failed to rectify the encap to Vxlan (even
though vxlan was now available in the priorities), because the decision to
rectify is based on the vxlan id (i.e. if the vxlan id is 0, default to mpls
since 0 is invalid). In this case the vxlan id was 0, as explained above, so
the encap continued to be Mpls.

2) The nexthop differed between the evpn route and the derived mac route.
This happened because, in the path sync of the evpn route, the return value was
false even though an NH change was seen, so the rebake of the mac route was
skipped. The return value was false because the true set by ChangeNh was
overridden by MplsChange.

Solution:
For case 1) - If the encap in the message sent by the control-node is Vxlan
only, store the label as the vxlan id and mark the mpls label invalid, even
though the tunnel type is computed as Mpls (encapsulation priorities not
received yet). If Mpls encap is sent, use the label as Mpls and reset the vxlan
id to invalid. If both Vxlan and Mpls are sent in the encap, fall back to the
old approach of interpreting the label based on the computed tunnel type.

For case 2) - Fix the return value.
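
A simplified sketch of the case 1) rule; EncapList, RouteLabels and
InterpretLabel are illustrative names and the real message handling is more
involved. The label is interpreted from the encap list carried in the
control-node message rather than from the locally computed tunnel type alone:

    #include <cstdint>

    enum class EncapList { kVxlanOnly, kMplsOnly, kBoth };

    struct RouteLabels {
        std::uint32_t mpls_label;
        std::uint32_t vxlan_id;
    };

    const std::uint32_t kInvalidLabel = 0;

    RouteLabels InterpretLabel(EncapList encap, std::uint32_t label,
                               bool computed_type_is_mpls) {
        switch (encap) {
        case EncapList::kVxlanOnly:
            // Vxlan-only encap from the control-node: treat the label as a
            // vxlan id even if the computed tunnel type is still Mpls
            // (encapsulation priorities not received yet).
            return RouteLabels{kInvalidLabel, label};
        case EncapList::kMplsOnly:
            return RouteLabels{label, kInvalidLabel};
        case EncapList::kBoth:
            // Both encaps sent: fall back to the old behaviour where the
            // computed tunnel type decides how the label is interpreted.
            if (computed_type_is_mpls)
                return RouteLabels{label, kInvalidLabel};
            return RouteLabels{kInvalidLabel, label};
        }
        return RouteLabels{kInvalidLabel, kInvalidLabel};
    }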

Change-Id: Ibeeb3de16d618ecb931c35d8937591d9c9f7f15e
Closes-bug: 1457355
(cherry picked from commit f8f09e3f94590333a3f1a38acff47aeb834c1f18)

Multicast route not deleted on vrf delete.

In the Tor agent the multicast route does not get deleted when the VRF is
deleted, because deletion is triggered only from the logical-switch delete.
Though that is valid, a vrf delete should also result in a route delete.
As mentioned in the bug, in a scaled setup the vrf delete may be received while
the VN delete is delayed. This can leave the multicast route dangling until the
VN delete is received, and if the Vrf delete timeout fires in the meantime it
will crash.
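
A tiny sketch of the intent, with illustrative names only (VrfEntry and the
handlers are stand-ins, not the real Tor-agent code): the multicast route
teardown is hooked to both the logical-switch delete and the vrf delete, so
whichever arrives first cleans up.

    struct VrfEntry;  // opaque stand-in

    // Withdraws the flood/multicast route for the given VRF.
    void DeleteMulticastRoute(VrfEntry *vrf) { (void)vrf; }

    void OnLogicalSwitchDelete(VrfEntry *vrf) {
        DeleteMulticastRoute(vrf);  // existing trigger
    }

    void OnVrfDelete(VrfEntry *vrf) {
        DeleteMulticastRoute(vrf);  // added trigger: no dangling route if
                                    // the VN/logical-switch delete is late
    }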

Change-Id: I8db095a6e99dddeb17bea2edbcbe10fab4c58623
Closes-bug: 1458187
(cherry picked from commit e6c1005a12b70c8843f5dca929d169cef5f2e1e0)

AgentXmppChannel is invalid in route notification.

Problem:
Every AgentXmppChannel has a peer created on channel coming up. When channel
goes down this peer is deleted and a walk is started to delete states and path
for this peer in route and vrf entries. After channel has gone into not-ready
state, it may get timedout and ultimately deleted. However walk is still
underway and the deleted peer of this channel is not yet unregistered from vrf
and route tables. Either this walk notification or any update on db-entry in
these tables will send deleted channel in argument. This will result in crash.

Solution:
On every notification there is a check (IsBgpPeerActive) to find out whether it
is for an active peer or a deleted one. In the same routine, also check whether
the channel passed in is one of the channels being used by the agent. If it is
not, consider the peer inactive.
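
A hedged, simplified sketch of the extra check (types and layout are
illustrative, not the real IsBgpPeerActive signature): before treating the peer
as active, verify that the channel pointer is still one of the channels the
agent holds.

    #include <cstddef>

    const std::size_t kMaxXmppServers = 2;

    struct AgentXmppChannel {
        bool ready = false;
    };

    struct Agent {
        AgentXmppChannel *channels[kMaxXmppServers] = {nullptr, nullptr};
    };

    bool IsBgpPeerActive(const Agent &agent, const AgentXmppChannel *ch) {
        // A deleted channel is no longer referenced by the agent; any
        // notification still carrying it must treat the peer as inactive.
        bool known = false;
        for (std::size_t i = 0; i < kMaxXmppServers; ++i) {
            if (agent.channels[i] == ch)
                known = true;
        }
        if (!known)
            return false;
        return ch->ready;  // plus the existing active/deleted-peer checks
    }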

Change-Id: I8cdb01d9e5a6c83e6f9f7b6785288d9ea5c973d2
Closes-bug: 1458529
(cherry picked from commit 9056ce9349ba7f46fb0d7a8f88af68a9e537944b)

On deletion of AgentXmppChannel, BGP peer was not cleared properly.

Ideally every channel state change is responsible for cleaning up or creating
the BGP peer. However, with commit
https://github.com/Juniper/contrail-controller/commit/6d845c15ca2bd114ded81d4092aa134b929ec39e
the above assumption no longer holds.
To fix this, artificially inject a channel down event on deletion of the agent
xmpp channel. Since the BGP peer is being manipulated, also push the process of
applying discovery servers onto the controller work queue.
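
A rough sketch of the clean-up idea, with hypothetical helper and enum names:
deleting the agent xmpp channel injects a not-ready event so the normal
state-change path releases the BGP peer, and applying discovery servers is
deferred to the controller work queue.

    #include <functional>
    #include <queue>

    enum class XmppChannelEvent { kReady, kNotReady };

    struct AgentXmppChannel {};

    // Existing state-change handler that creates or deletes the BGP peer.
    void HandleXmppClientChannelEvent(AgentXmppChannel *ch,
                                      XmppChannelEvent event) {
        (void)ch;
        (void)event;
    }

    // Hypothetical controller work queue; the real agent uses its own
    // WorkQueue type.
    std::queue<std::function<void()> > controller_work_queue;

    void OnAgentXmppChannelDelete(AgentXmppChannel *ch) {
        // Inject a synthetic "channel down" so the normal state-change
        // path releases the BGP peer instead of leaving it behind.
        HandleXmppClientChannelEvent(ch, XmppChannelEvent::kNotReady);
        // Applying discovery servers touches the same peer state, so it
        // is deferred to the controller work queue rather than run inline.
        controller_work_queue.push([]() { /* apply discovery servers */ });
    }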

Change-Id: Ia733ed1061747153fbfb841f63813ad148ce6bfc
Closes-bug: 1460435
(cherry picked from commit 45df6471a7948f1bc4dadab830cc6d57b42e3859)