ToR-agent cannot reply keepalive to HAproxy within 2000ms

Bug #1490788 reported by Daisuke Nakajima
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Juniper Openstack
Status tracked in Trunk
R2.20
New
Medium
Hari Prasad Killi
Trunk
New
Medium
Hari Prasad Killi

Bug Description

When ToR-Agent is busy status, ToR-agent cannot reply keepalive to HAproxy, then HAproxt detects Active ToR-agent as Down status. So, HAproxy change its connection to other.
As a result, QFX traffic will be unstable and some traffic might be dropped.
Also, some Remote MAC address are not stored to QFX.

Typically, ToR-agent gets many MAC address at same time due to OVSDB session failure, ToR Agent consume high CPU. Then the issue will be seen.

Sep 1 10:06:38 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:06:38 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:06:52 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is UP, reason: Layer4 check passed, check duration: 1000ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:06:52 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is UP, reason: Layer4 check passed, check duration: 1000ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:06:55 contrail-1 haproxy[2321]: 172.23.11.41:63264 [01/Sep/2015:10:06:15.065] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/0/40078 2360451 -- 582/3/3/1/0 0/0
Sep 1 10:06:57 contrail-1 haproxy[2321]: 172.23.11.48:53725 [01/Sep/2015:10:04:27.079] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/0/150064 9566 -- 582/3/3/1/0 0/0
Sep 1 10:07:10 contrail-1 haproxy[2321]: 172.23.11.48:65377 [01/Sep/2015:10:07:00.169] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/10004/10414 1500 -- 585/6/6/2/2 0/0
Sep 1 10:07:11 contrail-1 haproxy[2321]: 172.23.11.41:64941 [01/Sep/2015:10:06:56.159] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/15003/15003 1500 -- 584/5/5/1/3 0/0
Sep 1 10:07:11 contrail-1 haproxy[2321]: 172.23.11.48:57517 [01/Sep/2015:10:06:58.157] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/13005/13005 1500 -- 583/4/4/0/2 0/0
Sep 1 10:07:38 contrail-1 haproxy[2321]: 172.23.11.41:53413 [01/Sep/2015:10:06:58.160] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/0/40076 2637916 -- 580/3/3/3/0 0/0
Sep 1 10:07:39 contrail-1 haproxy[2321]: 172.23.11.45:62708 [01/Sep/2015:10:06:29.551] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/0/70088 7611 -- 581/3/3/2/0 0/0
Sep 1 10:07:42 contrail-1 haproxy[2321]: 172.23.11.37:57043 [01/Sep/2015:10:05:22.273] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/0/140147 369147 -- 580/3/3/1/0 0/0
Sep 1 10:07:49 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is DOWN, reason: Layer4 timeout, check duration: 2001ms. 1 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:07:49 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is DOWN, reason: Layer4 timeout, check duration: 2001ms. 1 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:07:49 contrail-1 haproxy[2321]: 172.23.11.48:53428 [01/Sep/2015:10:07:02.176] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/0/46893 8268 CD 585/6/6/2/0 0/0
Sep 1 10:07:53 contrail-1 haproxy[2321]: 172.23.11.37:51487 [01/Sep/2015:10:07:45.448] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/8003/8004 1502 -- 584/5/5/1/1 0/0
Sep 1 10:07:53 contrail-1 haproxy[2321]: 172.23.11.37:52017 [01/Sep/2015:10:07:43.430] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.197 1/999/10085 1502 -- 583/4/4/0/0 0/0
Sep 1 10:07:57 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:07:57 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:08:12 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 4 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:08:12 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 4 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:08:20 contrail-1 haproxy[2321]: 172.23.11.45:56311 [01/Sep/2015:10:07:40.659] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/0/40083 2329151 -- 582/3/3/3/0 0/0
Sep 1 10:08:29 contrail-1 haproxy[2321]: 172.23.11.41:64634 [01/Sep/2015:10:07:39.260] contrail-tor-agent-1 contrail-tor-agent-1/172.23.10.196 1/0/50058 2171454 -- 583/4/4/2/0 0/0
Sep 1 10:08:36 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:08:36 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.196 is UP, reason: Layer4 check passed, check duration: 0ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
Sep 1 10:08:49 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:08:49 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 2 sessions active, 0 requeued, 0 remaining in queue.
Sep 1 10:08:58 contrail-1 haproxy[2321]: Server contrail-tor-agent-1/172.23.10.197 is UP, reason: Layer4 check passed, check duration: 1001ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

information type: Proprietary → Public
description: updated
description: updated
tags: added: customer
tags: added: bms
removed: bmc
Revision history for this message
Daisuke Nakajima (dnakajima) wrote :

One of the workaround is to adjust keepalive timer of haproxy to scale.
e.g.)change 2000ms to 4000ms.

Revision history for this message
Hari Prasad Killi (haripk) wrote :

This is a TCP session between config node (where HA Proxy is running) and the TSN node. We have to check if there are any packet drops in vrouter which could be causing this. Need dropstats output a few times when this issue is observed. Also, please provide what rate of traffic is seen on vrouter at this point.

Other reason could be if the network is flaky.

tags: added: vrouter
Revision history for this message
Daisuke Nakajima (dnakajima) wrote :

I will reproduce it on early next week and collect that you mentioned.

tags: added: releasenote
Revision history for this message
Daisuke Nakajima (dnakajima) wrote :
Download full text (5.7 KiB)

This issue was reproduced at 2.21-102 when 60K unknown unicast traffic which rate was 100kbps reached to TSN.
After that, tor-agent had two connections and seemed not to send any OVSDB messages to QFX.

<?xml-stylesheet type="text/xsl" href="/universal_parse.xsl"?><OvsdbClientResp type="sandesh"><client type="struct" identifier="1"><SandeshOvsdbClient><protocol type="string" identifier="1">PSSL</protocol><server type="string" identifier="2">0.0.0.0</server><port type="u16" identifier="3">9999</port><tor_service_node type="string" identifier="4">10.84.50.4</tor_service_node><keepalive_interval type="i32" identifier="5">10000</keepalive_interval><ha_stale_route_interval type="i32" identifier="6">300000</ha_stale_route_interval><sessions type="list" identifier="7"><list type="struct" size="1"><SandeshOvsdbClientSession><status type="string" identifier="1">Established</status><remote_ip type="string" identifier="2">10.84.50.1</remote_ip><remote_port type="u16" identifier="3">60620</remote_port><connection_time type="string" identifier="4">2015-Oct-13 06:04:23.986580</connection_time><txn_stats type="struct" identifier="5"><SandeshOvsdbTxnStats><txn_initiated type="u64" identifier="1">0</txn_initiated><txn_succeeded type="u64" identifier="2">0</txn_succeeded><txn_failed type="u64" identifier="3">0</txn_failed><txn_pending type="u64" identifier="4">0</txn_pending><pending_send_msg type="u64" identifier="5">0</pending_send_msg></SandeshOvsdbTxnStats></txn_stats></SandeshOvsdbClientSession></list></sessions></SandeshOvsdbClient></client><more type="bool" identifier="0">false</more></OvsdbClientResp>

Oct 13 14:58:31 contrail-02 haproxy[67033]: Server contrail-tor-agent-1/10.84.50.4 is DOWN, reason: Layer4 timeout, check duration: 2002ms. 1 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.
Oct 13 14:58:31 contrail-02 haproxy[67033]: Server contrail-tor-agent-1/10.84.50.4 is DOWN, reason: Layer4 timeout, check duration: 2002ms. 1 active and 0 backup servers left. 3 sessions active, 0 requeued, 0 remaining in queue.

root@contrail-05:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-0 active
contrail-tor-agent-1 active
contrail-tor-agent-2 active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

root@contrail-05:~# dropstats
GARP 0
ARP no where to go 0
Invalid ARPs 6

Invalid IF 0
Trap No IF 0
IF TX Discard 0
IF Drop 0
IF RX Discard 0

Flow Unusable 0
Flow No Memory 0
Flow Table Full 0
Flow NAT no rflow 0
Flow Action Drop 0
Flow Action Invalid 0
Flow Invalid Protocol 0
Flow Queue Limit Exceeded 0

Discards 766814
TTL Exceeded 0
Mcast Clone Fail 0
Cloned Original 2626891

Invalid NH 55
Invalid Label 0
Invalid Protocol...

Read more...

Revision history for this message
Daisuke Nakajima (dnakajima) wrote :
Revision history for this message
Daisuke Nakajima (dnakajima) wrote :

While stopping to send Unknown unicast, ksoftirqd and kworker occupied CPU.

Tasks: 340 total, 9 running, 331 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.0 sy, 0.0 ni, 23.2 id, 0.0 wa, 0.0 hi, 76.8 si, 0.0 st
KiB Mem: 13168126+total, 4836636 used, 12684462+free, 154340 buffers
KiB Swap: 13413273+total, 0 used, 13413273+free. 592932 cached Mem

   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
   263 root 20 0 0 0 0 R 66.8 0.0 28:07.37 ksoftirqd/3
   258 root 20 0 0 0 0 R 66.5 0.0 28:08.08 ksoftirqd/2
   253 root 20 0 0 0 0 R 66.2 0.0 28:20.01 ksoftirqd/1
   299 root 20 0 0 0 0 R 33.6 0.0 14:31.11 kworker/1:1
   300 root 20 0 0 0 0 R 33.2 0.0 14:25.87 kworker/2:1
   301 root 20 0 0 0 0 S 33.2 0.0 15:08.83 kworker/3:1
     3 root 20 0 0 0 0 R 0.3 0.0 0:12.34 ksoftirqd/0
  5961 root 20 0 25088 1860 1172 R 0.3 0.0 0:00.06 top
 57022 root 20 0 1224500 863744 46208 S 0.3 0.7 37:23.54 contrail-vroute

Revision history for this message
Daisuke Nakajima (dnakajima) wrote :

this issue was solved after adding option vr_bridge_oentries=1 in /etc/modprobe.d/vrouter.conf

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.