Comment 33 for bug 1447619

Bogdan Dobrelya (bogdando) wrote:

Status update. We repeated the tests with the following tuning applied (a sketch of these settings follows the list):
* net_ticktime increased to 500 (from the default of 60),
* adjusted sysctls, see http://pastebin.com/6L2pM5ya,
* RX/TX ring buffers increased to 4096.
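For reference, a minimal sketch of how the non-sysctl parts of that tuning can be applied (the interface name eth2 and the config file path are assumptions, and the exact sysctl values are in the pastebin above):

# Raise the Erlang net_ticktime to 500 s in /etc/rabbitmq/rabbitmq.config (path assumed):
#   [{kernel, [{net_ticktime, 500}]}].
# Grow the NIC RX/TX ring buffers to 4096 descriptors (interface name assumed):
ethtool -G eth2 rx 4096 tx 4096
# Verify the new ring sizes:
ethtool -g eth2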
With all of that applied, the results were the same: starting from some point in time there was packet loss, even though the maximum throughput was only ~3-4 Gbit/s out of 10 Gbit/s. Alexander Nevenchannyy (anevenchannyy) found the root cause of the packet loss: incorrect IRQ balancing. He discovered that during a particular iperf test throughput was capped at ~3-4 Gbit/s while the IRQ (softirq) load on a single CPU core exceeded 91%:

root@node-9:~# iperf --client 192.168.0.4 --format m --nodelay --len 8k --time 10 --parallel 1 -u -i1 -b10000M
------------------------------------------------------------
Client connecting to 192.168.0.4, UDP port 5001
Sending 8192 byte datagrams
UDP buffer size: 0.20 MByte (default)
------------------------------------------------------------
[ 3] local 192.168.0.6 port 52336 connected with 192.168.0.4 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0- 1.0 sec 531 MBytes 4452 Mbits/sec
[ 3] 1.0- 2.0 sec 528 MBytes 4425 Mbits/sec
[ 3] 2.0- 3.0 sec 535 MBytes 4485 Mbits/sec
[ 3] 3.0- 4.0 sec 536 MBytes 4498 Mbits/sec
[ 3] 4.0- 5.0 sec 538 MBytes 4514 Mbits/sec
[ 3] 5.0- 6.0 sec 538 MBytes 4517 Mbits/sec
[ 3] 6.0- 7.0 sec 535 MBytes 4490 Mbits/sec
[ 3] 7.0- 8.0 sec 537 MBytes 4507 Mbits/sec
[ 3] 8.0- 9.0 sec 538 MBytes 4517 Mbits/sec
[ 3] 9.0-10.0 sec 539 MBytes 4524 Mbits/sec
[ 3] 0.0-10.0 sec 5356 MBytes 4492 Mbits/sec

... and the per-CPU load statistics were:
05:31:48 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
05:31:49 PM all 2.44 0.00 0.67 0.00 0.00 7.16 0.00 0.00 0.00 89.72
05:31:49 PM 0 7.14 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 92.86
05:31:49 PM 1 3.03 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 95.96
05:31:49 PM 2 4.12 0.00 3.09 0.00 0.00 0.00 0.00 0.00 0.00 92.78
05:31:49 PM 3 3.92 0.00 1.96 0.98 0.00 0.00 0.00 0.00 0.00 93.14
05:31:49 PM 4 0.00 0.00 0.00 0.00 0.00 91.40 0.00 0.00 0.00 8.60
05:31:49 PM 5 4.04 0.00 1.01 0.00 0.00 0.00 0.00 0.00 0.00 94.95
05:31:49 PM 6 1.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.99
05:31:49 PM 7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:31:49 PM 8 0.00 0.00 1.00 0.00 0.00 0.00 0.00 0.00 0.00 99.00
05:31:49 PM 9 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00
05:31:49 PM 10 2.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 98.00
05:31:49 PM 11 2.97 0.00 0.99 0.00 0.00 0.00 0.00 0.00 0.00 96.04

One of the CPU cores was overloaded by softirq processing:
52 root 20 0 0 0 0 R 49.4 0.0 2:47.76 ksoftirqd/4
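
A quick way to confirm that all NIC interrupts are serviced by a single core is to look at the per-CPU interrupt counters and the IRQ affinity masks; a hypothetical example (interface name eth2 and IRQ number 61 are placeholders, not taken from this node):

# Per-queue interrupt counters for the NIC; one hot column means one core does all the work
grep eth2 /proc/interrupts
# CPU affinity mask currently assigned to that IRQ (IRQ 61 is a placeholder)
cat /proc/irq/61/smp_affinity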

Hence, won't fix. This is not a RabbitMQ/Mnesia issue; the real problem is that the interrupt/softirq load has to be spread across CPU cores, see http://natsys-lab.blogspot.ru/2012/09/linux-scaling-softirq-among-many-cpu.html
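
One way to spread that softirq load, along the lines of the article above, is Receive Packet Steering (RPS) and/or explicit IRQ affinity; a minimal sketch, assuming interface eth2 with a single RX queue on a 12-core node (interface name and IRQ number are placeholders):

# Allow RX packet processing to be steered to CPUs 0-11 (bitmask fff)
echo fff > /sys/class/net/eth2/queues/rx-0/rps_cpus
# Alternatively, pin the NIC's hardware IRQ to a specific core,
# e.g. IRQ 61 (placeholder) to CPU 2 (bitmask 0x4):
echo 4 > /proc/irq/61/smp_affinity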