[R2.20.21] TSN HA: 10 min of ARP traffic loss during TSN switch over
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R2.20 | Fix Committed | High | Hari Prasad Killi |
Trunk | Fix Committed | High | Hari Prasad Killi |
Bug Description
Problem
-------------
Observed about 10 min (604 seconds) of ARP traffic loss during TSN switchover.
Description
------------------
ARP traffic is sent from the BMS, and the TSN responds to the ARP requests. Packets are sent at 1000 PPS.
The supervisor-vrouter service was then stopped on the active TSN/TOR agent. The TOR connected to the backup TSN/TOR agent.
Traffic did not resume for more than 10 min.
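To put the outage in terms of packets, the figures above (1000 PPS, 604 seconds) can be combined in a rough estimate. This is only a sketch: it assumes the sender kept a constant rate and that every request went unanswered for the whole window. The function name is illustrative and is not part of any Contrail tool.

```python
def estimated_lost_packets(rate_pps, outage_seconds):
    """Upper-bound estimate of packets sent during the outage window.

    Assumes a constant send rate and that no request was answered
    during the outage (both are assumptions, not measured values).
    """
    return rate_pps * outage_seconds


RATE_PPS = 1000        # ARP request rate reported in this bug
OUTAGE_SECONDS = 604   # traffic loss duration reported in this bug

lost = estimated_lost_packets(RATE_PPS, OUTAGE_SECONDS)
print(f"about {lost} ARP requests sent during the {OUTAGE_SECONDS} s outage")
```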
Two different issues were observed.
On the QFX side, a new TSN route needs to be programmed for broadcast packets, which takes around a minute. After that, ARP packets reach the active TSN but are dropped there.
On the TSN, the TOR tunnel IP is not added as a broadcast route. Traffic resumes only after 10 min. During this time, high CPU usage by contrail-vrouter and the TOR agent was observed.
top - 19:34:11 up 2 days, 3:05, 1 user, load average: 19.93, 8.12, 3.30
Tasks: 339 total, 1 running, 338 sleeping, 0 stopped, 0 zombie
%Cpu(s): 25.3 us, 4.7 sy, 0.0 ni, 69.9 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem: 19779974+total, 12261744 used, 18553800+free, 179420 buffers
KiB Swap: 20128972+total, 0 used, 20128972+free. 7524344 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25554 root 20 0 2451348 729840 20884 S 570.8 0.4 9:11.46 contrail-vroute
25555 root 20 0 2413780 331968 15300 S 283.6 0.2 4:06.66 contrail-tor-ag
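The per-process rows in the `top` output above can be turned into numbers with a small parser, which makes it easier to compare captures taken during a switchover. This is a minimal sketch, assuming the default 12-column `top` layout shown here and the default mode in which 100% equals one fully busy core. The helper names and the threshold are hypothetical.

```python
def parse_top_processes(text):
    """Return (pid, cpu_percent, command) tuples for process rows in `top` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # A process row has at least 12 columns and starts with a numeric PID;
        # summary lines and the column header are skipped by this check.
        if len(fields) >= 12 and fields[0].isdigit():
            rows.append((int(fields[0]), float(fields[8]), " ".join(fields[11:])))
    return rows


def busy_processes(text, threshold=100.0):
    """Keep only processes whose %CPU exceeds the threshold (hypothetical cutoff)."""
    return [row for row in parse_top_processes(text) if row[1] > threshold]


# Process rows copied from the capture in this report.
SAMPLE = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
25554 root 20 0 2451348 729840 20884 S 570.8 0.4 9:11.46 contrail-vroute
25555 root 20 0 2413780 331968 15300 S 283.6 0.2 4:06.66 contrail-tor-ag
"""

for pid, cpu, cmd in busy_processes(SAMPLE):
    # Dividing by 100 gives an approximate count of busy cores.
    print(f"{pid} {cmd}: {cpu:.1f}% CPU (~{cpu / 100:.1f} cores)")
```

With the two rows from this report, both processes are well above one core, which matches the load average climbing toward 20.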
Setup
----------
env.roledefs = {
'all': [host1, host2, host3, host4, host5, host6],
'cfgm': [host1, host2, host3],
'openstack': [host1, host2, host3],
'webui': [host2],
'control': [host1, host3],
'compute': [host4, host5, host6],
'tsn': [host4, host5],
'toragent': [host4, host5],
'collector': [host1, host3],
'database': [host1, host2, host3],
'build': [host_build],
}
env.hostnames = {
'all': ['nodei6', 'nodei7', 'nodei8', 'nodei9', 'nodei10', 'nodei19']
}
The current observations are as follows.
1. The connection between the agent and the QFX is flapping. The OVSDB server closes the connection because it received a wrong packet; this is seen with larger amounts of data. Possibly an issue with the SSL code.
2. It takes more time (on the order of tens of seconds) before walks are initiated.
3. The subscribe message is delayed in reaching the control node.