Activity log for bug #2002910

Date Who What changed Old value New value Message
2023-01-15 17:31:03 Bryce Harrington bug added bug
2023-01-15 17:36:13 Bryce Harrington description

Old value:

On armhf (only), chrony's autopkgtest intermittently fails, due to the 007-cmdmon test case of the 099-scfilter test. Typically after 1-3 retries it'll work properly and the test will pass.

This same failure occurs with 4.3-1ubuntu1 and 4.2-2ubuntu2 in lunar, and on a few jammy versions. There have been recurring flaky test issues on focal and other releases as well but the test output differs, indicating test timeouts.

While retrying has been an effective brute force way of getting the package migrated, this bug report aims to understand why it is failing and find a better solution to work around or fix it.

From the logs, the failing test output is as follows:

099-scfilter Testing system call filter in non-destructive tests:
level -1:
001-minimal OK
002-extended OK
003-memlock OK
004-priority OK
006-privdrop OK
007-cmdmon BAD
FAIL

The 099-scfilter is a wrapper program that runs the 007-cmdmon script with different values for $TEST_SCFILTER. If there is any sort of failure in 007-cmdmon it is marked BAD and the 099-scfilter test set to FAIL. (Note there are 10 total subtest cases, but it bails out immediately when 007 fails.)

What 007-cmdmon does is invoke chronyc for a number of different commands such as 'refresh', 'add server', 'allow', 'reset sources', 'serverstats', 'settime now', etc., verify the command's exit code, and check that the command's output is as expected, such as "200 OK", etc.

Debian does not show this failure on armhf, but does hit it (or something quite similar) on s390x and ppc64el:
* https://ci.debian.net/data/autopkgtest/unstable/s390x/c/chrony/28870427/log.gz
* https://ci.debian.net/data/autopkgtest/unstable/ppc64el/c/chrony/28870171/log.gz

Debian does not have a bug report about this test failure. There is one existing bug report about a UNIX socket issue when running cmd allow / deny that might be relevant:
* https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995201

Redhat also hits this failure in their CI:
* https://bugzilla.redhat.com/show_bug.cgi?id=1990589
"It seems this due to the glibc update. New system calls need to be allowed in the chrony seccomp filter. https://git.tuxfamily.org/chrony/chrony.git/commit/?id=bbbd80bf03223f181d4abf5c8e5fe6136ab6129a"
* However, the patch they flagged as fixing it, is already present in chrony 4.3.
* But it suggests there may be a similar issue due to a new syscall in glibc 2.36?

Note that without more log detail as to what is failing inside 007-cmdmon, it's certainly possible that two 007-cmdmon failures could be failing on completely different commands, and thus may be entirely unrelated issues. It may be necessary to run the test case locally on armhf to reproduce, and if necessary add debugging to isolate where exactly the failure occurs.

New value:

On armhf (only), chrony's autopkgtest intermittently fails, due to the 007-cmdmon test case of the 099-scfilter test. Typically after 1-3 retries it'll work properly and the test will pass.

This same failure occurs with 4.3-1ubuntu1 and 4.2-2ubuntu2 in lunar, and on a few jammy versions. There have been recurring flaky test issues on focal and other releases as well but the test output differs, indicating test timeouts.

While retrying has been an effective brute force way of getting the package migrated, this bug report aims to understand why it is failing and find a better solution to work around or fix it.

From the logs, the failing test output is as follows:

099-scfilter Testing system call filter in non-destructive tests:
  level -1:
    001-minimal OK
    002-extended OK
    003-memlock OK
    004-priority OK
    006-privdrop OK
    007-cmdmon BAD
FAIL

The 099-scfilter is a wrapper program that runs the 007-cmdmon script with different values for $TEST_SCFILTER. If there is any sort of failure in 007-cmdmon it is marked BAD and the 099-scfilter test set to FAIL. (Note there are 10 total subtest cases, but it bails out immediately when 007 fails.)

What 007-cmdmon does is invoke chronyc for a number of different commands such as 'refresh', 'add server', 'allow', 'reset sources', 'serverstats', 'settime now', etc., verify the command's exit code, and check that the command's output is as expected, such as "200 OK", etc.

Debian does not show this failure on armhf, but does hit it (or something quite similar) on s390x and ppc64el:
  * https://ci.debian.net/data/autopkgtest/unstable/s390x/c/chrony/28870427/log.gz
  * https://ci.debian.net/data/autopkgtest/unstable/ppc64el/c/chrony/28870171/log.gz

Debian does not have a bug report about this test failure. There is one existing bug report about a UNIX socket issue when running cmd allow / deny that might be relevant:
  * https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=995201

Redhat also hits this failure in their CI:
  * https://bugzilla.redhat.com/show_bug.cgi?id=1990589
    "It seems this due to the glibc update. New system calls need to be allowed in the chrony seccomp filter. https://git.tuxfamily.org/chrony/chrony.git/commit/?id=bbbd80bf03223f181d4abf5c8e5fe6136ab6129a"
  * However, the patch they flagged as fixing it, is already present in chrony 4.3.
  * But it suggests there may be a similar issue due to a new syscall in glibc 2.36?
    https://lists.gnu.org/archive/html/info-gnu/2022-08/msg00000.html

Note that without more log detail as to what is failing inside 007-cmdmon, it's certainly possible that two 007-cmdmon failures could be failing on completely different commands, and thus may be entirely unrelated issues. It may be necessary to run the test case locally on armhf to reproduce, and if necessary add debugging to isolate where exactly the failure occurs.
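
The description ends by suggesting extra debugging to isolate which command fails inside 007-cmdmon. As a rough illustration only, the Python sketch below runs a list of chronyc commands in order and reports the first one whose exit status or output is unexpected. It is not the upstream test, which is a shell script. The names run_chronyc, first_failure and CHECKS are hypothetical, the pairing of "200 OK" with 'refresh' is an assumption, and the sketch assumes chronyc is on PATH and can reach a running chronyd.

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# (exit status, combined stdout/stderr) for one chronyc invocation.
Result = Tuple[int, str]


def run_chronyc(command: str) -> Result:
    """Run one chronyc command (hypothetical helper).

    Assumes chronyc is on PATH and a chronyd instance is reachable.
    """
    proc = subprocess.run(
        ["chronyc", *command.split()],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return proc.returncode, proc.stdout + proc.stderr


def first_failure(
    checks: Sequence[Tuple[str, Optional[str]]],
    runner: Callable[[str], Result] = run_chronyc,
) -> Optional[str]:
    """Return a description of the first failing command, or None.

    Each check is (command, expected substring or None). Like the
    wrapper described above, this stops at the first failure.
    """
    for command, expected in checks:
        status, output = runner(command)
        if status != 0:
            return f"{command!r}: exit status {status}, output {output!r}"
        if expected is not None and expected not in output:
            return f"{command!r}: expected {expected!r} in output {output!r}"
    return None


# Illustrative checks only; the expected strings are assumptions.
CHECKS = [
    ("refresh", "200 OK"),
    ("serverstats", None),  # only the exit status is checked
]

# Example use on a machine with chronyd running:
#     print(first_failure(CHECKS) or "all commands passed")
```

Passing the runner as a parameter lets the loop be exercised with a stub, so the reporting logic can be checked on a machine without chronyd before trying it on armhf.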