Chrony test hangs with libpcap2 1:2.31-1

Bug #1863590 reported by Christian Ehrhardt 
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
bubblewrap (Debian)
Fix Released
Unknown
bubblewrap (Ubuntu)
Fix Released
High
Unassigned
libcap2 (Debian)
Fix Released
Unknown
libcap2 (Ubuntu)
Fix Released
High
Unassigned

Bug Description

Tests are blocking and failing eventually with a timeout.
Hits all architectures and is reproducible.

Locally reproducible in autopkgtest:

sudo ~/work/autopkgtest/autopkgtest/runner/autopkgtest --no-built-binaries --apt-upgrade --apt-pocket=proposed=src:libpcap2 --shell-fail chrony_3.5-5ubuntu1.dsc -- qemu --qemu-options='-cpu host' --ram-size=2048 --cpus 2 ~/work/autopkgtest-focal-amd64.img

Seems to be a block on network on __skb_wait_for_more_packets

$cat /proc/3833/wchan
__skb_wait_for_more_packets

From strace POV that is a wait in accept that does not unblock:

$ sudo strace -rT -p 3833
strace: Process 3833 attached
     0.000000 accept(6, NULL, NULL
<hangs>

The logs are no help:
/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr => empty
/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout => matches the console

$ cat /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout
make: Entering directory '/tmp/autopkgtest.epy1pq/autopkgtest_tmp'
cc -O2 -Wall -g -fPIC -c -o client.o client.c
cc -O2 -Wall -g -fPIC -shared -o clknetsim.so client.o -ldl -lm
g++ -O2 -Wall -g -fPIC -c -o clock.o clock.cc
g++ -O2 -Wall -g -fPIC -c -o node.o node.cc
g++ -O2 -Wall -g -fPIC -c -o generator.o generator.cc
g++ -O2 -Wall -g -fPIC -c -o network.o network.cc
g++ -O2 -Wall -g -fPIC -c -o server.o server.cc
g++ -O2 -Wall -g -fPIC -c -o stats.o stats.cc
g++ -O2 -Wall -g -fPIC -o clknetsim clock.o node.o generator.o network.o server.o stats.o
make: Leaving directory '/tmp/autopkgtest.epy1pq/autopkgtest_tmp'
001-defaults
<hang>

Noisy PS output:

4 0 506 1 20 0 2600 1920 - Ss ttyS1 0:00 /bin/sh
1 0 3690 506 20 0 2600 128 - S+ ttyS1 0:00 \_ /bin/sh
4 0 3691 3690 20 0 8172 3660 - S+ ttyS1 0:00 \_ su -s /bin/bash ubuntu -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile
 >/dev/null 2>&1 || true; buildtree="/tmp/autopkgtest.epy1pq/build.KhE/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export AUTOPKGTEST_ARTIFA
CTS="/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export AUTOP
KGTEST_TMP="/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset
 LANGUAGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pi
d; set -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; chmod +x /tmp/autopkgtest.epy1pq/build.KhE/src/debian/
tests/upstream-simulation-test-suite; touch /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr; /tmp/autopkgtest.epy1
pq/build.KhE/src/debian/tests/upstream-simulation-test-suite 2> >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr >&2) > >(tee -a /tmp/autopkgtest.epy1pq/upstream-simula
tion-test-suite-stdout);
4 1000 3720 3691 20 0 8608 3852 do_wai Ss ? 0:00 \_ bash -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev/null 2>
&1 || true; buildtree="/tmp/autopkgtest.epy1pq/build.KhE/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export AUTOPKGTEST_ARTIFACTS="/tmp/auto
pkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export AUTOPKGTEST_TMP="/t
mp/autopkgtest.epy1pq/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGUAGE LC_C
TYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set -C; ech
o $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; chmod +x /tmp/autopkgtest.epy1pq/build.KhE/src/debian/tests/upstream
-simulation-test-suite; touch /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr; /tmp/autopkgtest.epy1pq/build.KhE/s
rc/debian/tests/upstream-simulation-test-suite 2> >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr >&2) > >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suit
e-stdout);
0 1000 3728 3720 20 0 2600 732 do_wai S ? 0:00 \_ /bin/sh /tmp/autopkgtest.epy1pq/build.KhE/src/debian/tests/upstream-simulation-test-suite
1 1000 3729 3728 20 0 8608 1896 do_wai S ? 0:00 \_ bash -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev
/null 2>&1 || true; buildtree="/tmp/autopkgtest.epy1pq/build.KhE/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export AUTOPKGTEST_ARTIFACTS="/
tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export AUTOPKGTEST
_TMP="/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGU
AGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set
 -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; chmod +x /tmp/autopkgtest.epy1pq/build.KhE/src/debian/tests/
upstream-simulation-test-suite; touch /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr; /tmp/autopkgtest.epy1pq/bui
ld.KhE/src/debian/tests/upstream-simulation-test-suite 2> >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr >&2) > >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-t
est-suite-stdout);
0 1000 3730 3729 20 0 7228 848 pipe_w S ? 0:00 | \_ tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr
1 1000 3731 3728 20 0 8608 1756 do_wai S ? 0:00 \_ bash -c set -e; export USER=`id -nu`; . /etc/profile >/dev/null 2>&1 || true; . ~/.profile >/dev
/null 2>&1 || true; buildtree="/tmp/autopkgtest.epy1pq/build.KhE/src"; mkdir -p -m 1777 -- "/tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export AUTOPKGTEST_ARTIFACTS="/
tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-artifacts"; export ADT_ARTIFACTS="$AUTOPKGTEST_ARTIFACTS"; mkdir -p -m 755 "/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export AUTOPKGTEST
_TMP="/tmp/autopkgtest.epy1pq/autopkgtest_tmp"; export ADTTMP="$AUTOPKGTEST_TMP"; export DEBIAN_FRONTEND=noninteractive; export LANG=C.UTF-8; export DEB_BUILD_OPTIONS=parallel=2; unset LANGU
AGE LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES LC_PAPER LC_NAME LC_ADDRESS LC_TELEPHONE LC_MEASUREMENT LC_IDENTIFICATION LC_ALL;rm -f /tmp/autopkgtest_script_pid; set
 -C; echo $$ > /tmp/autopkgtest_script_pid; set +C; trap "rm -f /tmp/autopkgtest_script_pid" EXIT INT QUIT PIPE; cd "$buildtree"; chmod +x /tmp/autopkgtest.epy1pq/build.KhE/src/debian/tests/
upstream-simulation-test-suite; touch /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr; /tmp/autopkgtest.epy1pq/bui
ld.KhE/src/debian/tests/upstream-simulation-test-suite 2> >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stderr >&2) > >(tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-t
est-suite-stdout);
0 1000 3732 3731 20 0 7228 780 pipe_w S ? 0:00 | \_ tee -a /tmp/autopkgtest.epy1pq/upstream-simulation-test-suite-stdout
0 1000 3794 3728 20 0 8608 3300 do_wai S ? 0:00 \_ /bin/bash ./run -i 20 -m 2
0 1000 3798 3794 20 0 9136 3992 do_wai S ? 0:00 \_ /bin/bash ./001-defaults
0 1000 3833 3798 20 0 5928 1864 skb_wa S ? 0:00 \_ /tmp/autopkgtest.epy1pq/autopkgtest_tmp/clknetsim -o tmp/log.offset -f tmp/log.freq -p tmp/log.packets -R 1 -r 210 -l 10000 -s tmp/sock tmp/conf 2

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The clknetsim hangs here:
(gdb) bt
#0 0x00007f5061169307 in __libc_accept (fd=fd@entry=6, addr=addr@entry=..., len=len@entry=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
#1 0x0000564411e91dad in Network::prepare_clients (this=0x564413923eb0) at network.cc:131
#2 0x0000564411e8ceff in main (argc=<optimized out>, argv=0x7fff72e0eb08) at server.cc:206

tags: added: update-excuse
description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Installed two test environments with proposed pinned so that only what I want from proposed comes from there.

Then I was installing dependencies and the test as d/t/control and the test script would run it. Therefore the difference between good/bad environment is just:

$ sudo apt install libcap-dev/focal-proposed libcap2/focal-proposed libcap2-bin/focal-proposed

The test rebuilds chrony and it does so with the new libcap2, so the issue might be on that side and not on clknetsim which doesn't even link to libcap2. I build it the same (after having the new libcap2 installed) and ran the test manually.

$ export CLKNETSIM_PATH="/home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac/"
$ ./run -i 1 -m 0 001-defaults

Good case:
$ ./run -i 1 -m 0 001-defaults
001-defaults Testing default test settings:
  network with 1*1 servers and 1 clients:
    non-default settings:
    starting node 1: OK
    starting node 2: OK
    running simulation: OK
    checking chronyd exit:
      node 1: OK
      node 2: OK
    checking source selection:
      node 2: OK
    checking mean/min incoming/outgoing packet interval:
      node 1: 2.06e+02 2.06e+02 6.41e+01 6.41e+01 OK
      node 2: 2.06e+02 2.06e+02 6.41e+01 6.41e+01 OK
    checking clock sync time, max/rms time/freq error:
      node 2: 131 7.06e-05 1.61e-06 3.36e-05 1.89e-07 OK
PASS

SUMMARY:
  TOTAL 1
  PASSED 1
  FAILED 0 () ()
  SKIPPED 0 ()

Bad-Case:
$ ./run -i 1 -m 0 001-defaults
001-defaults Testing default test settings:
  network with 1*1 servers and 1 clients:
    non-default settings:
    starting node 1: OK
    starting node 2: OK
    running simulation:

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Formerly broken system with libcap2 1:2.31-1
-> Reverting to former libcap2 1:2.27-1
   $ v=1:2.27-1; sudo apt install libcap-dev=$v libcap2=$v libcap2-bin=$v
   (without rebuilding chrony or the test which was built against 1:2.31-1)
=> Working now

Formerly working system with libcap2 1:2.27-1
-> upgrading to new libcap2 1:2.31-1
   $ sudo apt install libcap-dev/focal-proposed libcap2/focal-proposed libcap2-bin/focal-proposed
   (without rebuilding chrony or the test which was built against 1:2.31-1)
=> Now hanging

TL;DR: it does not depend which version clknetsim or chrony was built against, it just matters which version of libcap2 is installed

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Test generates a config like:

$ cat tmp/conf.1
pidfile tmp/pidfile.1
allow
cmdallow
bindcmdaddress 0.0.0.0
local stratum 1

and

$ cat tmp/conf.2
pidfile tmp/pidfile.2
allow
cmdallow
bindcmdaddress 0.0.0.0
server 192.168.123.1

And then runs the daemons:
$ chronyd -d -f tmp/conf.1
$ chronyd -d -f tmp/conf.2

reproducing that both initialize.

And finally the test
$ /home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac//clknetsim -o tmp/log.offset -f tmp/log.freq -p tmp/log.packets -R 1 -r 210 -l 10000 -s tmp/sock tmp/conf 2
Waiting for 2 clients...

Not saying/doing anything at this stage - but where to debug ...?

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I added some
  read -p "Press enter to continue"
into the test to see if the chronyd processes exist in the real test before reaching some stage.

It starts/hangs/crashes inside this stage:
run_test || test_fail

Debugging into this function leads to start_client from clknetsim

It does all kind of stuff including LD_PRELOAD

LD_PRELOAD=/home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac//clknetsim.so CLKNETSIM_NODE=1 CLKNETSIM_SOCKET=tmp/sock chronyd -d -f tmp/conf.1
cho LD_PRELOAD=/home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac//clknetsim.so CLKNETSIM_NODE=2 CLKNETSIM_SOCKET=tmp/sock chronyd -d -f tmp/conf.2

Here I finally hit the early init fail:

$ export PATH="../../:$PATH"
LD_PRELOAD=/home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac//clknetsim.so CLKNETSIM_NODE=1 CLKNETSIM_SOCKET=tmp/sock chronyd -d -f tmp/conf.1
chronyd: client.c:2162: syscall: Assertion `0' failed.
Aborted (core dumped)

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

$ gdb ../../chronyd
(gdb) set environment LD_PRELOAD /home/ubuntu/clknetsim-156b8e4f63fd8174104e3dd05bda19166b5ab0ac//clknetsim.so
(gdb) set environment CLKNETSIM_NODE 1
(gdb) set environment CLKNETSIM_SOCKET tmp/sock
(gdb) run -d -f tmp/conf.1

But then we found it:

1. avoidance for now via -DCLKNETSIM_DISABLE_SYSCALL
2. long term libcap2 2.32 fixes the issue

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

libcap2 breaks it, libcap2 has a fix.
I'd prefer that we get libcap2 2.32 to fix it.

References:
- https://sites.google.com/site/fullycapable/release-notes-for-libcap
- https://bugzilla.kernel.org/show_bug.cgi?id=206539

Changed in chrony (Ubuntu):
status: New → Confirmed
Changed in libcap2 (Ubuntu):
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since it is a sync I filed a Debian bug for it, but due to our FF it depends how they respond.

=> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=951492

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI: 2.32 was uploaded to Debian, should resolve via autosync&build

Changed in libcap2 (Debian):
status: Unknown → Fix Released
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI - libcap2 2.32 is blocking on bubblewrap (bug 1863733) that needs the tests updated (https://github.com/containers/bubblewrap/issues/353) to pass proposed migration.

FYI No current chrony tests in excuses that are blocking anyone (and if so testing vs libcap2 from proposed makes it work) - we can wait on bubblewrap a few days (it is a test-only bugfix, that isn't feature freeze related).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Both migrated, issue resolved.
 libcap2 | 1:2.32-1 | focal | source, amd64, arm64, armhf, i386, ppc64el, s390x
 bubblewrap | 0.4.0-1ubuntu3 | focal | source, amd64, arm64, armhf, i386, ppc64el, s390x

Changed in libcap2 (Ubuntu):
status: New → Fix Released
Changed in chrony (Ubuntu):
status: Confirmed → Fix Released
Changed in bubblewrap (Debian):
status: Unknown → Fix Released
Revision history for this message
Mathew Hodson (mhodson) wrote :

bubblewrap (0.4.0-1ubuntu3) focal; urgency=medium

  * d/p/update-output-patterns-libcap-2.29.patch: cherry-pick fix proposed
    fix to capability drop-related tests, which broke with newer libcap2.

 -- Łukasz 'sil2100' Zemczak <email address hidden> Wed, 26 Feb 2020 21:39:11 +0100

no longer affects: chrony (Ubuntu)
Changed in bubblewrap (Ubuntu):
importance: Undecided → High
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.