Checkbox test run terminates if SSH session is disconnected without use of "screen" utility

Bug #1767015 reported by Alec Duroy on 2018-04-26
This bug affects 1 person
Affects: Checkbox
Importance: Low
Assigned to: Maciej Kisielewski

Bug Description

Revised bug description
-----------------------

If a server test run is initiated via an SSH login and that login is terminated, the test run is terminated as well. This is a common occurrence when running network tests, and sometimes also CPU stress tests. Our workaround to date has been the "screen" utility, which maintains a terminal session even if the SSH connection is broken. Running the tests at a physical or KVM terminal also bypasses the problem. Better robustness to a broken SSH session is still desirable, though, in case the user forgets to use "screen."
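
As an illustration of the kind of robustness meant here, the following is a minimal sketch, not taken from Checkbox, of starting a long-running command in its own session so that a hangup of the SSH terminal does not reach it. The function name `launch_detached` and the log file path are assumptions made for this example. Unlike "screen," this approach does not let you reattach to the process; output only goes to the log file.

```python
import os
import subprocess


def launch_detached(cmd, log_path):
    """Start cmd in its own session, detached from the caller's terminal.

    Hypothetical helper for illustration only. The child gets no terminal
    for stdin, and its output goes to log_path instead of the SSH session.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # child calls setsid() before exec
        )
    finally:
        log.close()  # the child keeps its own copy of the descriptor
    return proc
```

Because the child is the leader of a new session with no controlling terminal, `os.getsid(proc.pid)` differs from the caller's `os.getsid(0)`, and the SIGHUP sent when the SSH terminal closes is not delivered to it.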

Original bug description
------------------------

While running the certification tests under the 18.04 beta release, I encountered a persistent Ethernet device test failure. All NICs, whether 1G or 10G Ethernet/SFP, are failing Multi-NIC Iperf3 stress testing. At the IPERF server, the message “the client has terminated” appears on screen, even though the NIC interface did not lose its connection: it still responds to a ping command from the IPERF server. The test does not continue; it terminates unexpectedly. This error always happens during Test No. 2, when testing the 2nd port connection of any NIC (either 1G or 10G interfaces). IPERF output as follows:
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 151.00-152.00 sec 1.09 GBytes 9.36 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 152.00-153.00 sec 1.09 GBytes 9.38 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 153.00-154.00 sec 1.09 GBytes 9.39 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 154.00-155.00 sec 1.09 GBytes 9.39 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 155.00-156.00 sec 1.09 GBytes 9.36 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 156.00-157.00 sec 1.10 GBytes 9.41 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 157.00-158.00 sec 1.08 GBytes 9.32 Gbits/sec
tcpi_snd_cwnd 10 tcpi_snd_mss 1448
[ 5] 158.00-159.00 sec 1.10 GBytes 9.41 Gbits/sec
[ 5] 158.00-159.00 sec 1.10 GBytes 9.41 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval Transfer Bandwidth
[ 5] 0.00-159.00 sec 0.00 Bytes 0.00 bits/sec sender
[ 5] 0.00-159.00 sec 174 GBytes 9.41 Gbits/sec receiver
CPU Utilization: local/receiver 29.9% (1.6%u/28.4%s), remote/sender 23.2% (0.6%u/22.6%s)
iperf3: the client has terminated
iperf 3.0.11
Linux iperf-Super-Server 4.13.0-38-generic #43~16.04.1-Ubuntu SMP Wed Mar 14 17:48:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------
Server listening on 5201
-----------------------------------------------------------

My configuration as shown below. Other information in the attached files:
Description: Ubuntu Bionic Beaver (development branch)
Release: 18.04
Codename: bionic
Kernel rev 4.15.0-19-generic

Alec Duroy (acduroy) wrote :

syslog and dmesg added

Alec Duroy (acduroy) wrote :

syslog added

Alec Duroy (acduroy) wrote :

As an additional test, I used the standalone iperf3 tool to run a network performance test between the node and the IPERF server. To imitate the server certification test, I ran 4 iterations of iperf3, each 900 seconds long, from the node side, using the following command:

iperf3 -c <IPERF_SERVER_IP_ADDRESS> -t <TIME_INTERVAL>

Below is sample output. In summary, every NIC interface (1G/10G) completed 1 hour of iperf testing without stoppage.

[ 5] 888.00-889.00 sec 1.09 GBytes 9.38 Gbits/sec 0 691 KBytes
[ 5] 889.00-890.00 sec 1.09 GBytes 9.37 Gbits/sec 0 691 KBytes
[ 5] 890.00-891.00 sec 1.09 GBytes 9.38 Gbits/sec 0 691 KBytes
[ 5] 891.00-892.00 sec 1.09 GBytes 9.38 Gbits/sec 0 699 KBytes
[ 5] 892.00-893.00 sec 1.09 GBytes 9.37 Gbits/sec 0 721 KBytes
[ 5] 893.00-894.00 sec 1.09 GBytes 9.34 Gbits/sec 0 721 KBytes
[ 5] 894.00-895.00 sec 1.09 GBytes 9.33 Gbits/sec 0 734 KBytes
[ 5] 895.00-896.00 sec 1.09 GBytes 9.40 Gbits/sec 0 737 KBytes
[ 5] 896.00-897.00 sec 1.09 GBytes 9.32 Gbits/sec 58 612 KBytes
[ 5] 897.00-898.00 sec 1.09 GBytes 9.40 Gbits/sec 0 669 KBytes
[ 5] 898.00-899.00 sec 1.07 GBytes 9.15 Gbits/sec 15 724 KBytes
[ 5] 899.00-900.00 sec 1.09 GBytes 9.33 Gbits/sec 0 724 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bandwidth Retr
[ 5] 0.00-900.00 sec 980 GBytes 9.35 Gbits/sec 6677 sender
[ 5] 0.00-900.00 sec 980 GBytes 9.35 Gbits/sec receiver

iperf Done.
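
The four-iteration run described in this comment could be scripted roughly as follows. This is a sketch under stated assumptions: the helper names `build_iperf3_command` and `run_iterations` are invented for illustration, and only the `-c` and `-t` options shown in the command above are used.

```python
import subprocess


def build_iperf3_command(server_ip, seconds):
    """Return the argv for one client run: iperf3 -c <server> -t <seconds>."""
    return ["iperf3", "-c", server_ip, "-t", str(seconds)]


def run_iterations(server_ip, iterations=4, seconds=900, runner=subprocess.run):
    """Run the iperf3 client repeatedly and collect each run's exit code.

    runner is injectable so the loop can be exercised without a real
    iperf3 server; by default it is subprocess.run.
    """
    codes = []
    for _ in range(iterations):
        result = runner(build_iperf3_command(server_ip, seconds))
        codes.append(result.returncode)
    return codes
```

A nonzero entry in the returned list would point at the iteration where the client stopped early.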

Rod Smith (rodsmith) on 2018-04-27
affects: speedsta → checkbox
Rod Smith (rodsmith) wrote :

First, could you please clarify the following statements, which seem contradictory: "All NIC cards either 1G or 10G Ethernet/SFP are failing Multi-NIC Iperf3 stress testing" and "this error always happened during Test No. 2 – when testing the 2nd port connection of any NIC cards." The first clearly states a failure of all interfaces, but the second suggests that the first device tested passes while the second and subsequent tests fail.

If the first device passes, then chances are this bug is a duplicate of bug #1766330, which is now fixed. (The fix was released only a day or two ago, though.)

If NONE of the devices are passing, then this is something else. I seem to recall a screen shot in an e-mail from you showing that the iperf3 server was on a different network segment from the SUT. This won't work; the iperf3 server MUST be on the same network segment as the SUT. The reason is that the network tests take down all the network interfaces except the one being tested, since that's the only reliable way to restrict network traffic to the one interface being tested. Taking down these network devices usually takes down default routes, too, so the iperf3 target must be on the same network segment as the SUT. If I'm not remembering this correctly, then please say so.
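
A quick way to sanity-check the requirement above is to test whether the iperf3 server's address falls inside the subnet configured on the interface under test. The sketch below assumes that "same network segment" can be approximated by "same IP subnet," which may not hold in every lab setup; the function name `on_same_segment` and the example addresses are made up.

```python
import ipaddress


def on_same_segment(sut_interface_cidr, iperf_server_ip):
    """Return True if the server address lies in the interface's subnet.

    sut_interface_cidr is the interface address with prefix length,
    for example "10.0.1.15/24".
    """
    network = ipaddress.ip_interface(sut_interface_cidr).network
    return ipaddress.ip_address(iperf_server_ip) in network
```

For example, `on_same_segment("10.0.1.15/24", "10.0.2.2")` is False: that target would only be reachable through a route, and the route may disappear when the other interfaces are taken down.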

Jeff Lane (bladernr) on 2018-04-27
Changed in checkbox:
status: New → Incomplete
Alec Duroy (acduroy) wrote :

This case is not the same as the one described in bug #1766330. All NIC devices installed on the SUT have failed the Multi-NIC iperf3 stress testing on the 18.04 beta release. The test stopped and exited with the message “packet_write_wait: Connection to <ip_address_primary-network>: Broken pipe” on the 2nd iteration of iperf3 testing. At the IPERF server, the message “the client has terminated” appeared on screen.

Most systems have at least 4 NIC devices installed: two 1Gb/s Ethernet (eno1 and eno2) and two 10Gb/s, either Ethernet or SFP (in this case ens2s0f0 and ens2s0f1). Because of the different socket types, two IPERF subnets were created: one for 1G/10G Ethernet NIC devices and the other for 10G SFP NIC devices. Below is the logical network diagram of my current configuration.

Rod Smith (rodsmith) wrote :

Per e-mail on 30 April, 2018, the problem seems to have been caused by an SSH session disconnecting during the test (which is common) combined with NOT using the "screen" utility. Under these circumstances, the test suite terminated the test run, causing the failure. The solution is to use "screen" or run the tests at the console. I'm going to leave this bug report open, since better robustness to this condition is desirable.

summary: - Multi-NIC Iperf3 stress testing failure on 18.04 beta
+ Checkbox test run terminates if SSH session is disconnected without use
+ of "screen" utility
description: updated
Rod Smith (rodsmith) on 2018-05-01
Changed in checkbox:
importance: Undecided → Low
status: Incomplete → Triaged
Maciej Kisielewski (kissiel) wrote :

This problem is one of many that checkbox-remote aims to solve.
In checkbox-remote, when the controller is disconnected from the SUT, it tries to reconnect and resume testing.

@Rod: If you think that will solve this particular bug, feel free to assign it to me. Also it will serve as an additional channel to broadcast this feature's landing.
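
The reconnect-and-resume idea can be pictured with a generic retry loop. The sketch below is not checkbox-remote code and does not reflect its real design or API; `connect`, `session.run`, and `run_with_reconnect` are hypothetical names used only to show a controller resuming from the first unfinished job after a dropped connection.

```python
import time


def run_with_reconnect(connect, jobs, max_attempts=5, delay=0.0):
    """Run jobs in order; on ConnectionError, reconnect and resume.

    connect() must return an object with a run(job) method. Both are
    hypothetical stand-ins for a controller-to-SUT link.
    """
    results = []
    attempts = 0
    session = connect()
    while len(results) < len(jobs):
        job = jobs[len(results)]  # first job without a result yet
        try:
            results.append(session.run(job))
            attempts = 0  # progress was made, reset the retry counter
        except ConnectionError:
            attempts += 1
            if attempts >= max_attempts:
                raise
            time.sleep(delay)
            session = connect()  # new link, same position in the job list
    return results
```

The design point is that progress is tracked outside the connection, so losing the link costs a retry of one job rather than the whole run.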

Rod Smith (rodsmith) wrote :

Maciej, yes, that sounds like it would fix the problem, so I've assigned this bug to you. Feel free to ping me if you want me to test this, since I've reproduced this specific bug myself.

Changed in checkbox:
assignee: nobody → Maciej Kisielewski (kissiel)

Jeff Lane (bladernr) wrote :

I've been playing with remote... I think it may well solve the
problem, but it's not mature yet.

Maciej, is there a timeline for when this work will be complete and
ready for use by the general public? To implement this, we'll have to
make some major changes to the docs we maintain, as well as spend some
time educating the customers who use this stuff in their test labs
around the world.

I'm still going to address this issue with screen as well, until
Remote is ready, but ultimately I definitely think remote will add a
lot of value and ease to our process.

Jeff

--
Jeff Lane
Technical Partnership and Server Certification Programmes

"Entropy isn't what it used to be."

Maciej Kisielewski (kissiel) wrote :

Jeff,
There's one big branch awaiting landing. There also needs to be a UX-related branch, which I feel requires a week of work plus documentation (1-2 days, tops).
The thing is, I need to devote some time to other projects, so I cannot make any promises. But knowing that it can solve some problems surely bumps its priority.

I'll keep you posted.
