Handle error from XMPP session send.
Problem:
If an XMPP channel send fails because the underlying TCP send is deferred, the
agent controller module was treating the deferral as a failure. For a route
update, as seen in the bug, such a deferral resulted in the route's exported
flag being set to false. Because the flag was false, the unsubscribe was not
sent on a later route delete. The deferred update, however, would eventually
have gone out, so the control node received the update but never the delete,
leading to the issue described in the bug.
Solution:
Ideally, every failed send should cause further requests to the control node
to be enqueued and replayed whenever the write-ready callback (writereadycb)
is invoked. That way the agent would not overload the socket.
As a quick fix, however, the send error is no longer used to decide the
subsequent operation once the send is done. This assumes that the send will
always eventually succeed.
Currently following messages are sent:
1) VM config sub/unsub
2) Config subscribe for agent
3) VRF sub/unsub
4) Route sub/unsub.
The case where the connection is not present is taken care of by channel flap
handling.
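The "ideal" fix described above can be sketched as follows. This is a minimal
illustration, not the actual agent API: the class and method names
(DeferredSender, WriteReadyCb) are hypothetical. A deferred TCP send queues
the message for replay from the write-ready callback instead of being reported
as a failure.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical sketch: a deferred send is queued, never treated as lost.
class DeferredSender {
public:
    // Returns true if the message went to the socket immediately, false if
    // the socket deferred it; either way the message is never dropped.
    bool Send(const std::string &msg) {
        if (!socket_ready_) {
            pending_.push_back(msg);   // replay later, do not report failure
            return false;
        }
        ++sent_count_;
        return true;
    }

    // Invoked when the socket becomes writable again (the writereadycb).
    void WriteReadyCb() {
        socket_ready_ = true;
        while (!pending_.empty()) {
            ++sent_count_;             // replay queued subscribes/updates
            pending_.pop_front();
        }
    }

    void set_socket_ready(bool ready) { socket_ready_ = ready; }
    std::size_t pending_count() const { return pending_.size(); }
    std::size_t sent_count() const { return sent_count_; }

private:
    bool socket_ready_ = true;
    std::deque<std::string> pending_;
    std::size_t sent_count_ = 0;
};
```

With this scheme the four message types listed above would all go through the
same queue, so a deferral never flips per-route state such as the exported
flag.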
Change-Id: Ib6e0856b5c689b51209add4ab459b8bd2e952143
Closes-bug: 1453483
(cherry picked from commit 0a70915fd3bc1954154e657f59123d1a4597f2a4)
VRF state not deleted in DelPeer walk.
Problem statement remains the same as in this commit:
https://github.com/Juniper/contrail-controller/commit/8e302fcb991c8f5d8f5defb85b9851f8cde5f479
However, the above commit does not solve the issue.
The reason is that the walk count was incremented when a walk was enqueued,
but when the walk is processed it calls Cancel on any previously started walk,
and that Cancel decrements the walk count. This defeats the purpose of moving
the walk-count increment to enqueue time in the above commit.
Also consider a walk for a VRF with four route tables. This should result in a
walk count of 5 (1 for the VRF table + 4 for the route tables). With the above
fix it was 2 (1 for the VRF table + 1 for route tables), because the fix did
not account for the walk count needing to be incremented for each route table.
Solution:
Use a separate enqueued-walk count and restore walk_count to its behavior
before the above commit. Use both counts to check for walk completion.
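The two-counter scheme can be sketched as below. Names (WalkAccounting,
ProcessVrfWalk) are illustrative, not the agent's actual types: enqueue_count
tracks walks queued but not yet processed, walk_count tracks tables actually
under walk, and Cancel only touches walk_count, so it can no longer zero out
the accounting for a walk still waiting in the queue.

```cpp
// Hypothetical sketch of the separate enqueue/walk counters described above.
struct WalkAccounting {
    int walk_count = 0;     // one per table currently being walked
    int enqueue_count = 0;  // one per queued walk request

    void EnqueueWalk() { ++enqueue_count; }

    // Processing a queued VRF walk: the request leaves the queue, and the
    // VRF table plus each of its route tables starts walking.
    void ProcessVrfWalk(int route_tables) {
        --enqueue_count;
        walk_count += 1 + route_tables;  // 1 for the VRF table itself
    }

    void WalkDoneForTable() { --walk_count; }
    void CancelWalk() { --walk_count; }  // cancels one in-progress table walk

    // Walk is done only when nothing is queued and nothing is walking.
    bool AllWalksDone() const { return walk_count == 0 && enqueue_count == 0; }
};
```

In the four-route-table example from above, ProcessVrfWalk(4) yields a
walk_count of 5, matching the expected accounting.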
Closes-bug: 1455862
Change-Id: I8d96732375f649e70d6754cb6c5b34c24867ce0c
(cherry picked from commit 705165854bbdaff9c47a2a9443410eec53e4fb37)
Multicast route gets deleted when vxlan id is changed in configured mode
Problem:
In oper multicast, when the local peer's vxlan-id changed, an add was issued
for the route with the new vxlan id and a delete for the same route with the
old one. Since the peer is local, the path search compares only the peer and
not the vxlan id, so the delete removed the local path and eventually the
multicast route itself.
Solution:
There is no need to withdraw the path from the local peer on a vxlan-id
change; just trigger an update of the same path. This results in a call to the
controller's route export routine, which, using the state set on the flood
route, can detect that the vxlan id changed and take care of withdrawing the
old vxlan id and updating with the new one.
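A minimal model of the problem and the fix, with hypothetical types (Path,
Route, UpdateLocalVxlan): because path lookup compares only the peer, a delete
issued with the old vxlan id after an add with the new one would remove the
freshly added path. Updating the existing path in place avoids the delete/add
pair entirely.

```cpp
#include <list>

// Hypothetical model: path lookup for a local peer ignores the vxlan id.
struct Path { int peer_id; int vxlan_id; };

struct Route {
    std::list<Path> paths;

    Path *FindPath(int peer_id) {          // peer-only comparison
        for (Path &p : paths)
            if (p.peer_id == peer_id) return &p;
        return nullptr;
    }

    // Fixed behaviour: on vxlan change, rewrite the path and let route
    // re-export handle withdrawal of the old vxlan id.
    void UpdateLocalVxlan(int peer_id, int new_vxlan) {
        if (Path *p = FindPath(peer_id))
            p->vxlan_id = new_vxlan;       // no delete/add pair needed
        else
            paths.push_back(Path{peer_id, new_vxlan});
    }
};
```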
Change-Id: I3afeddd2620615bb477aec5a0c6715fcdc99352b
Closes-bug: 1457007
(cherry picked from commit a77f2c31a2c6fb8a8abe22f3a562767cf230cbef)
Mac route in agent was having invalid label and destination NH.
Problem:
Two issues were seen -
1) The label was wrongly interpreted as an MPLS label even though it was a
VXLAN id, and this was never corrected. The sequence of events: the EVPN route
was received before the global vrouter config, so encapsulation priorities
were absent in the agent and the EVPN route was programmed with MPLS-over-GRE
(the control node had sent Vxlan). Since MPLS encap was chosen, the label was
interpreted as an MPLS label and not a vxlan id. When the global vrouter
config was received, a resync was done for the route, but the resync also
failed to rectify the encap to Vxlan (even though Vxlan was now available in
the priorities) because the decision to rectify is based on the vxlan id (a
vxlan id of 0 is invalid and defaults to MPLS). Here the vxlan id was 0, as
explained above, so the encap remained MPLS.
2) The nexthop differed between the EVPN route and the derived MAC route.
This happened because, in the path sync of the EVPN route, the return value
was false even though an NH change was seen, so the MAC route was not rebaked.
The return value was false because the true value set by ChangeNh was
overridden by MplsChange.
Solution:
For case 1) - If the encap sent by the control node is Vxlan only, treat the
label as a vxlan id and mark the MPLS label invalid, even if the tunnel type
is computed as MPLS (because encapsulation priorities have not been received).
If only MPLS encap is sent, use the label as an MPLS label and reset the vxlan
id to invalid. If both Vxlan and MPLS are sent, fall back to the old approach
of interpreting the label based on the computed tunnel type.
For case 2) - Fix the return value.
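Both fixes can be sketched compactly. The constants and function names below
(Encap flags, InterpretLabel, SyncChanged) are illustrative, not the agent's
real API: the advertised encapsulation list, not the locally computed tunnel
type, decides how the label is interpreted, and change results are accumulated
instead of overwritten.

```cpp
// Hypothetical encoding of the advertised encap list as bit flags.
enum Encap { ENCAP_MPLS = 1, ENCAP_VXLAN = 2 };
const int kInvalidLabel = -1;

struct Labels { int vxlan_id; int mpls_label; };

// Case 1: interpret the label from the advertised encaps; fall back to the
// computed tunnel type only when both encaps were advertised.
Labels InterpretLabel(int advertised, int label, bool computed_is_vxlan) {
    if (advertised == ENCAP_VXLAN)       // Vxlan only: label is the vxlan id
        return Labels{label, kInvalidLabel};
    if (advertised == ENCAP_MPLS)        // Mpls only: label is an mpls label
        return Labels{kInvalidLabel, label};
    // Both advertised: old approach, use the computed tunnel type.
    return computed_is_vxlan ? Labels{label, kInvalidLabel}
                             : Labels{kInvalidLabel, label};
}

// Case 2: change results must be OR-ed together, not assigned, so an earlier
// "changed" result (e.g. from ChangeNh) survives a later "no change".
bool SyncChanged(bool nh_changed, bool mpls_changed) {
    bool ret = false;
    ret |= nh_changed;
    ret |= mpls_changed;
    return ret;
}
```

With this rule, a Vxlan-only advertisement yields a valid vxlan id even while
the tunnel type is still computed as MPLS, matching the scenario in case 1.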
Change-Id: Ibeeb3de16d618ecb931c35d8937591d9c9f7f15e
Closes-bug: 1457355
(cherry picked from commit f8f09e3f94590333a3f1a38acff47aeb834c1f18)
Multicast route not deleted on vrf delete.
In the TOR agent, the multicast route does not get deleted when the VRF is
deleted, because deletion is triggered only from logical-switch delete. Though
that trigger is valid, a VRF delete should also result in route deletion.
As mentioned in the bug, in scaled setups the VRF delete can arrive while the
VN delete is delayed, leaving the multicast route dangling until the VN delete
is received. If the VRF delete timeout fires in the meantime, the agent
crashes.
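The shape of the fix, as a tiny sketch with hypothetical names
(TorMulticastState and its triggers): route withdrawal is idempotent and
hooked to both events, so whichever of VRF delete or logical-switch delete
arrives first cleans up the route.

```cpp
// Hypothetical sketch: multicast route deletion fires from either trigger.
struct TorMulticastState {
    bool route_present = true;

    void DeleteRoute() { route_present = false; }  // safe to call twice

    void LogicalSwitchDelete() { DeleteRoute(); }  // existing trigger
    void VrfDelete() { DeleteRoute(); }            // trigger added by the fix
};
```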
Change-Id: I8db095a6e99dddeb17bea2edbcbe10fab4c58623
Closes-bug: 1458187
(cherry picked from commit e6c1005a12b70c8843f5dca929d169cef5f2e1e0)
AgentXmppChannel is invalid in route notification.
Problem:
Every AgentXmppChannel has a peer created when the channel comes up. When the
channel goes down, this peer is deleted and a walk is started to delete the
states and paths for this peer in the route and VRF entries. After the channel
has gone into the not-ready state, it may time out and ultimately be deleted
while the walk is still underway and the channel's deleted peer is not yet
unregistered from the VRF and route tables. Either this walk notification or
any update on a DB entry in these tables will then pass the deleted channel as
an argument, resulting in a crash.
Solution:
Every notification already checks whether it is for an active or a deleted
peer (IsBgpPeerActive). In the same routine, also check whether the channel
passed in is one of the channels currently used by the agent; if it is not,
consider the peer inactive.
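The extended check can be sketched as below, with the real types simplified to
hypothetical stand-ins (Channel, Agent): besides the existing deleted-peer
check, the channel carried by the notification must still be one of the
channels the agent owns, so a notification holding an already-freed channel is
treated as inactive instead of dereferencing it.

```cpp
#include <set>

// Simplified stand-ins for AgentXmppChannel and the agent's channel set.
struct Channel { bool peer_deleted = false; };

struct Agent {
    std::set<const Channel *> active_channels;

    bool IsBgpPeerActive(const Channel *ch) const {
        // New membership check: reject channels the agent no longer owns.
        if (active_channels.find(ch) == active_channels.end())
            return false;
        // Original check: reject notifications for a deleted peer.
        return !ch->peer_deleted;
    }
};
```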
Change-Id: I8cdb01d9e5a6c83e6f9f7b6785288d9ea5c973d2
Closes-bug: 1458529
(cherry picked from commit 9056ce9349ba7f46fb0d7a8f88af68a9e537944b)
On deletion of AgentXmppchannel, BGP peer was not cleared properly.
Ideally, every channel state change is responsible for cleaning up or creating
the BGP peer. However, with commit
https://github.com/Juniper/contrail-controller/commit/6d845c15ca2bd114ded81d4092aa134b929ec39e
that assumption no longer holds.
To fix this, artificially inject a channel down event on deletion of the agent
XMPP channel. Since the BGP peer is being manipulated, also push the process
of applying discovery servers onto the controller work queue.
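A minimal sketch of the injected down event, with hypothetical types
(EventQueue, the "NOT_READY" marker): destroying the channel posts a synthetic
not-ready event so the normal state-change path cleans up the BGP peer, rather
than relying on a real disconnect arriving first.

```cpp
#include <string>
#include <vector>

// Hypothetical work queue standing in for the controller work queue.
struct EventQueue {
    std::vector<std::string> events;
    void Enqueue(const std::string &e) { events.push_back(e); }
};

struct AgentXmppChannel {
    EventQueue *queue;
    explicit AgentXmppChannel(EventQueue *q) : queue(q) {}
    // Deletion injects a channel down event for the usual cleanup path.
    ~AgentXmppChannel() { queue->Enqueue("NOT_READY"); }
};
```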
Change-Id: Ia733ed1061747153fbfb841f63813ad148ce6bfc
Closes-bug: 1460435
(cherry picked from commit 45df6471a7948f1bc4dadab830cc6d57b42e3859)
Reviewed: https://review.opencontrail.org/11496
Committed: http://github.com/Juniper/contrail-controller/commit/bf9e76793aab5870da98bf49b5fb6c6eed82625f
Submitter: Zuul
Branch: master
commit bf9e76793aab5870da98bf49b5fb6c6eed82625f
Author: Manish <email address hidden>
Date: Thu May 14 17:23:05 2015 +0530