lmbench tcp lib backlog reservation too small, can cause hang.

Bug #1706735 reported by Shay Gal-On
Affects            Status        Importance  Assigned to   Milestone
lmbench (Ubuntu)   Fix Released  Undecided   dann frazier
Xenial             Fix Released  Undecided   dann frazier
Zesty              Fix Released  Undecided   dann frazier

Bug Description

[Impact]
Network-related tests will hang on high-core-count systems.

lib_tcp reserves a backlog of 100. If more than 100 threads are running, the backlog is pretty much guaranteed to overflow; the clients then fail to connect and lmbench hangs. The proposed patch works around this by reserving a backlog of at least 4x the number of processors reported by the system, on the assumption that the common use case is to run lmbench with as many threads as the OS reports CPUs. Alternatively, the backlog reservation could be made a config option.
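
The calculation described above can be sketched in C roughly as follows. This is an illustration, not the patch itself: the helper names backlog_for_cpus and tcp_backlog and both macros are hypothetical, and the exact condition (keep 100 unless 4x the CPU count is larger) is an assumption drawn from the description.

```c
#include <limits.h>
#include <unistd.h>

/* Hypothetical names for illustration; not taken from lmbench. */
#define DEFAULT_BACKLOG 100   /* value lib_tcp reserved before the change */
#define BACKLOG_PER_CPU 4     /* "at least 4x number of processors" */

/* Pure calculation, kept separate so it can be checked without a socket. */
int backlog_for_cpus(long ncpus)
{
    int backlog = DEFAULT_BACKLOG;

    /* Clamp absurd values so the multiplication cannot overflow an int. */
    if (ncpus > INT_MAX / BACKLOG_PER_CPU)
        ncpus = INT_MAX / BACKLOG_PER_CPU;

    /* ncpus <= 0 covers a sysconf() failure (-1): keep the old default. */
    if (ncpus > 0 && ncpus * BACKLOG_PER_CPU > backlog)
        backlog = (int)(ncpus * BACKLOG_PER_CPU);

    return backlog;
}

/* Backlog for the current machine, based on the online CPU count. */
int tcp_backlog(void)
{
    return backlog_for_cpus(sysconf(_SC_NPROCESSORS_ONLN));
}
```

Under these assumptions a 128-CPU system gets a backlog of 512, while systems with 25 or fewer CPUs keep the previous value of 100.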

[Test Case]
ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/lat_select -P 128 tcp
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out

With the patch:
ubuntu@ubuntu:~/lmbench$ /usr/lib/lmbench/bin/lat_select -P 128 tcp
Select on 200 tcp fd's: XX.XXXX microseconds

[Regression Risk]
The patch is careful to preserve the previous behavior for lower core count systems. Perhaps if there were a bug in the platform's sysconf(_SC_NPROCESSORS_ONLN) function, we could end up with a regression that skewed results.

Raghuram Kota (rkota)
tags: added: cn99xx
Shay Gal-On (sgalon) wrote :

Small change, to maintain the same behavior as before with <100 cores:
int sock,np,backlog;
should be
int sock,np,backlog=100;
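
A minimal sketch of why the initializer matters, assuming the backlog is later handed to listen(): if the core-count branch is skipped, an uninitialized backlog would be passed through as garbage, whereas initializing it to 100 preserves the old behavior. The function open_listener is hypothetical and is not lmbench's server code; binding to loopback on an ephemeral port is only there to keep the example self-contained.

```c
#include <limits.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: returns a listening socket, or -1 on error. */
int open_listener(int np)
{
    /* Initialized at declaration, so listen() sees the old default
     * whenever the branch below does not run. */
    int sock, backlog = 100;
    struct sockaddr_in addr;

    /* Assumed condition: only grow the backlog, never shrink it. */
    if (np > 0 && np <= INT_MAX / 4 && np * 4 > backlog)
        backlog = np * 4;

    sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel pick a free port */

    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(sock, backlog) < 0) {
        close(sock);
        return -1;
    }
    return sock;
}
```

Note that the kernel may still cap the value given to listen(); on Linux the limit is net.core.somaxconn.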

dann frazier (dannf) wrote :

@Shay: thanks. Do you happen to have a reproducer that just involves running an lmbench subtest? I tried the following, but didn't observe a hang:

Host A:
ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/bw_tcp -s

Host B:
ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/bw_tcp -P 224 <HostA> 2048M
0.065536 82.65 MB/sec

Changed in lmbench (Ubuntu):
status: New → Incomplete
Shay Gal-On (sgalon) wrote :

try lat_select -P 128 tcp

dann frazier (dannf)
Changed in lmbench (Ubuntu):
status: Incomplete → Confirmed
dann frazier (dannf)
Changed in lmbench (Ubuntu):
status: Confirmed → In Progress
dann frazier (dannf) wrote :

Thanks. I can reproduce:

ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/lat_select -P 128 tcp
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out

With the patch:
ubuntu@ubuntu:~/lmbench$ /usr/lib/lmbench/bin/lat_select -P 128 tcp
Select on 200 tcp fd's: XX.XXXX microseconds

dann frazier (dannf)
Changed in lmbench (Ubuntu):
status: In Progress → Fix Committed
assignee: nobody → dann frazier (dannf)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lmbench - 3.0-a9+debian.1-1

---------------
lmbench (3.0-a9+debian.1-1) unstable; urgency=medium

  * Acknowledge NMUs, thanks Andreas!
  * Use upstream tarball as orig.tar.gz. This appends a +debian.1 to the
    upstream version to avoid conflicting with the previous orig.tar.gz.
    The previous orig.tar.gz was a tarball that contained the upstream
    tarball, which had to be unpacked prior to build in debian/rules.
  * Switch to dpkg-source 3.0 (quilt) format from dpatch, which is
    deprecated. (Closes: #875644)
  * Add myself to Uploaders.
  * debian/rules: Remove unused H_ARCH/OPTIMIZATION variables.
  * debian/info: Remove. The listed file does not exist in the source.
  * Turn debhelper compat level up to 11.
  * Allow lmbench-docs to install most documentation in the main package,
    as recommended by §12.3 since 3.9.7.
  * Do not generate lmbench-run manpage if DEB_BUILD_OPTIONS contains
    'nodocs'.
  * Add a "lmbench-" prefix to manpages that conflict with other packages.
    Since this allows us to include all of the manpages in lmbench, stop
    including separate copies in lmbench-doc.
  * d/p/dynamic-tcp-backlog.patch: Dynamically increase the TCP backlog on
    high core-count systems. (LP: #1706735)
  * Bump Standards-Version to 4.1.0.

 -- dann frazier <email address hidden> Wed, 13 Sep 2017 16:39:57 -0600

Changed in lmbench (Ubuntu):
status: Fix Committed → Fix Released
dann frazier (dannf)
Changed in lmbench (Ubuntu Xenial):
status: New → In Progress
assignee: nobody → dann frazier (dannf)
Changed in lmbench (Ubuntu Zesty):
assignee: nobody → dann frazier (dannf)
status: New → In Progress
description: updated
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Shay, or anyone else affected,

Accepted lmbench into zesty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/lmbench/3.0-a9-1.3ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-zesty to verification-done-zesty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-zesty. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in lmbench (Ubuntu Zesty):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-zesty
Changed in lmbench (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed-xenial
Brian Murray (brian-murray) wrote :

Hello Shay, or anyone else affected,

Accepted lmbench into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/lmbench/3.0-a9-1.1ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Shay Gal-On (sgalon) wrote :

Tested on xenial on ThunderX2:

shay@t99-1s-03:~
$ /usr/lib/lmbench/bin/lat_select -P 128 tcp
Select on 200 tcp fd's: 31.0622 microseconds
shay@t99-1s-03:~
$ apt-cache policy lmbench
lmbench:
  Installed: 3.0-a9-1.1ubuntu0.1
  Candidate: 3.0-a9-1.1ubuntu0.1
  Version table:
 *** 3.0-a9-1.1ubuntu0.1 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial-proposed/multiverse arm64 Packages
        100 /var/lib/dpkg/status
     3.0-a9-1.1 500
        500 http://ports.ubuntu.com/ubuntu-ports xenial/multiverse arm64 Packages

tags: added: verification-done-xenial
removed: verification-needed-xenial
dann frazier (dannf) wrote :

Verified in zesty as well:

ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/lat_select -P 110 tcp
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
connect: Connection timed out
^C
ubuntu@ubuntu:~$ sudo apt install lmbench -y
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages will be upgraded:
  lmbench
1 upgraded, 0 newly installed, 0 to remove and 6 not upgraded.
Need to get 329 kB of archives.
After this operation, 1,024 B of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports zesty-proposed/multiverse arm64 lmbench arm64 3.0-a9-1.3ubuntu0.1 [329 kB]
Fetched 329 kB in 0s (818 kB/s)
(Reading database ... 64393 files and directories currently installed.)
Preparing to unpack .../lmbench_3.0-a9-1.3ubuntu0.1_arm64.deb ...
Unpacking lmbench (3.0-a9-1.3ubuntu0.1) over (3.0-a9-1.3) ...
Setting up lmbench (3.0-a9-1.3ubuntu0.1) ...
Processing triggers for man-db (2.7.6.1-2) ...
ubuntu@ubuntu:~$ /usr/lib/lmbench/bin/lat_select -P 110 tcp
Select on 200 tcp fd's: 17.5206 microseconds

tags: added: verification-done verification-done-zesty
removed: verification-needed verification-needed-zesty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lmbench - 3.0-a9-1.3ubuntu0.1

---------------
lmbench (3.0-a9-1.3ubuntu0.1) zesty; urgency=medium

  * d/p/dynamic-tcp-backlog.dpatch: Dynamically increase the TCP backlog on
    high core-count systems. (LP: #1706735)

 -- dann frazier <email address hidden> Mon, 16 Oct 2017 15:37:03 -0600

Changed in lmbench (Ubuntu Zesty):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for lmbench has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package lmbench - 3.0-a9-1.1ubuntu0.1

---------------
lmbench (3.0-a9-1.1ubuntu0.1) xenial; urgency=medium

  * d/p/dynamic-tcp-backlog.dpatch: Dynamically increase the TCP backlog on
    high core-count systems. (LP: #1706735)

 -- dann frazier <email address hidden> Mon, 16 Oct 2017 15:37:03 -0600

Changed in lmbench (Ubuntu Xenial):
status: Fix Committed → Fix Released