context test in ubuntu_stress_smoke_test failed with M-6.5 riscv / starfive instances

Bug #2042388 reported by Po-Hsu Lin
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
ubuntu-kernel-tests       New           Undecided   Unassigned
glibc (Ubuntu)            Invalid       Undecided   Unassigned
  Mantic                  Invalid       Undecided   Unassigned
linux-riscv (Ubuntu)      Invalid       Undecided   Unassigned
  Mantic                  Fix Released  Undecided   Unassigned
linux-starfive (Ubuntu)   Invalid       Undecided   Unassigned
  Mantic                  Fix Released  Undecided   Unassigned

Bug Description

This issue has been present since the very first versions of these two kernels:
* mantic/linux-starfive/6.5.0-1001.2
* mantic/linux-riscv/6.5.0-7.7.2

Test failed with:
 context STARTING
 context RETURNED 2
 context FAILED
 stress-ng: debug: [12644] invoked with './stress-ng -v -t 5 --context 4 --context-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable' by user 0 'root'
 stress-ng: debug: [12644] stress-ng 0.16.05 gaea6f3306f46
 stress-ng: debug: [12644] system: Linux mantic-starfive-riscv64 6.5.0-1001-starfive #2-Ubuntu SMP Fri Oct 6 12:08:59 UTC 2023 riscv64, gcc 13.2.0, glibc 2.38
 stress-ng: debug: [12644] RAM total: 7.7G, RAM free: 6.6G, swap free: 1024.0M
 stress-ng: debug: [12644] temporary file path: '/home/ubuntu/autotest/client/tmp/ubuntu_stress_smoke_test/src/stress-ng', filesystem type: ext2 (4617314 blocks available)
 stress-ng: debug: [12644] 8 processors online, 8 processors configured
 stress-ng: info: [12644] setting to a 5 secs run per stressor
 stress-ng: debug: [12644] cache allocate: using defaults, cannot determine cache level details
 stress-ng: debug: [12644] cache allocate: shared cache buffer size: 2048K
 stress-ng: info: [12644] dispatching hogs: 4 context
 stress-ng: debug: [12644] starting stressors
 stress-ng: debug: [12644] 4 stressors started
 stress-ng: debug: [12645] context: [12645] started (instance 0 on CPU 2)
 stress-ng: debug: [12647] context: [12647] started (instance 2 on CPU 5)
 stress-ng: debug: [12648] context: [12648] started (instance 3 on CPU 7)
 stress-ng: debug: [12646] context: [12646] started (instance 1 on CPU 4)
 stress-ng: debug: [12644] context: [12645] terminated on signal: 11 (Segmentation fault)
 stress-ng: debug: [12644] context: [12645] terminated (success)
 stress-ng: debug: [12644] context: [12646] terminated on signal: 11 (Segmentation fault)
 stress-ng: debug: [12644] context: [12646] terminated (success)
 stress-ng: debug: [12644] context: [12647] terminated on signal: 11 (Segmentation fault)
 stress-ng: debug: [12644] context: [12647] terminated (success)
 stress-ng: debug: [12644] context: [12648] terminated on signal: 11 (Segmentation fault)
 stress-ng: debug: [12644] context: [12648] terminated (success)
 stress-ng: warn: [12644] metrics-check: all bogo-op counters are zero, data may be incorrect
 stress-ng: debug: [12644] metrics-check: all stressor metrics validated and sane
 stress-ng: info: [12644] skipped: 0
 stress-ng: info: [12644] passed: 4: context (4)
 stress-ng: info: [12644] failed: 0
 stress-ng: info: [12644] metrics untrustworthy: 0
 stress-ng: info: [12644] unsuccessful run completed in 9.98 secs

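It looks like the tests passed, but the run is marked as failed with a non-zero return code. The log also shows that every context instance was terminated by SIGSEGV and that all bogo-op counters are zero.

Tested with stress-ng V0.16.05 and V0.17.00; both fail in the same way.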

Po-Hsu Lin (cypressyew) wrote :

Tested with the latest upstream stress-ng (0.17.00 g670cbc3f52a7); the issue still exists. I have reported it upstream.

Po-Hsu Lin (cypressyew) wrote :

The stress-ng maintainer, Colin, has investigated this. Allow me to copy his comment:

It seems that a signal handler running in a swapped context causes a SIGSEGV when we're using an alternative stack (via sigaltstack). Disabling the alternative stack allows the test to run successfully. I've also experimented with ss_flags=SS_AUTODISARM when setting up the alternative stack, and this also fails with a SIGSEGV.

It seems there may be historic issues with Linux and alternative stacks when executing in a swapped context, as the sigaltstack man page states:

       ss.ss_flags
              This field contains either 0, or the following flag:

              SS_AUTODISARM (since Linux 4.7)
                     Clear the alternate signal stack settings on entry to the signal handler. When the signal handler
                     returns, the previous alternate signal stack settings are restored.

                     This flag was added in order to make it safe to switch away from the signal handler with
                     swapcontext(3). Without this flag, a subsequently handled signal will corrupt the state of the
                     switched-away signal handler. On kernels where this flag is not supported, sigaltstack() fails
                     with the error EINVAL when this flag is supplied.

Since this is a regression in behaviour, I think somebody with riscv libc know-how should investigate this further. Meanwhile, I'll push a change that disables the use of the alternative stack for the context switch stressor.

https://github.com/ColinIanKing/stress-ng/issues/331

Changed in glibc (Ubuntu):
status: New → Invalid
Emil Renner Berthing (esmil) wrote :

This sounds like something that might be fixed by this:
https://git.launchpad.net/~esmil/ubuntu/+source/linux-riscv/+git/mantic/commit/?id=86b8ae081f5e18f4d823453a5ae937f4a1fb16a4

That patch has made it to the 6.5 stable tree, but I don't think it's in any of our released kernels yet.
Any chance you could try this kernel:
https://esmil.dk/linux-riscv-6.5.0-9.9.2.tar.gz

Po-Hsu Lin (cypressyew) wrote :

Hi Emil,
I just gave your test kernel a try, and it looks like it solves the issue! Thanks!

$ sudo ./stress-ng -v -t 5 --context 4 --context-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable
stress-ng: debug: [1163] invoked with './stress-ng -v -t 5 --context 4 --context-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable' by user 0 'root'
stress-ng: debug: [1163] stress-ng 0.17.00 gb5f33cfd1d9d
stress-ng: debug: [1163] system: Linux riscv64-mantic 6.5.0-9-generic #9.2 SMP Thu Nov 2 16:20:40 UTC 2023 riscv64, gcc 13.2.0, glibc 2.38
stress-ng: debug: [1163] RAM total: 7.7G, RAM free: 7.2G, swap free: 0.0
stress-ng: debug: [1163] temporary file path: '/home/ubuntu/autotest/client/tmp/ubuntu_stress_smoke_test/src/stress-ng', filesystem type: ext2 (1505198 blocks available)
stress-ng: debug: [1163] 8 processors online, 8 processors configured
stress-ng: info: [1163] setting to a 5 secs run per stressor
stress-ng: debug: [1163] cache allocate: using defaults, cannot determine cache level details
stress-ng: debug: [1163] cache allocate: shared cache buffer size: 2048K
stress-ng: info: [1163] dispatching hogs: 4 context
stress-ng: debug: [1163] starting stressors
stress-ng: debug: [1163] 4 stressors started
stress-ng: debug: [1164] context: [1164] started (instance 0 on CPU 7)
stress-ng: debug: [1165] context: [1165] started (instance 1 on CPU 5)
stress-ng: debug: [1166] context: [1166] started (instance 2 on CPU 0)
stress-ng: debug: [1167] context: [1167] started (instance 3 on CPU 2)
stress-ng: debug: [1165] context: [1165] exited (instance 1 on CPU 5)
stress-ng: debug: [1164] context: [1164] exited (instance 0 on CPU 7)
stress-ng: debug: [1166] context: [1166] exited (instance 2 on CPU 0)
stress-ng: debug: [1163] context: [1164] terminated (success)
stress-ng: debug: [1167] context: [1167] exited (instance 3 on CPU 2)
stress-ng: debug: [1163] context: [1165] terminated (success)
stress-ng: debug: [1163] context: [1166] terminated (success)
stress-ng: debug: [1163] context: [1167] terminated (success)
stress-ng: debug: [1163] metrics-check: all stressor metrics validated and sane
stress-ng: info: [1163] skipped: 0
stress-ng: info: [1163] passed: 4: context (4)
stress-ng: info: [1163] failed: 0
stress-ng: info: [1163] metrics untrustworthy: 0
stress-ng: info: [1163] successful run completed in 5.05 secs
$ echo $?
0
$ uname -a
Linux riscv64-mantic 6.5.0-9-generic #9.2 SMP Thu Nov 2 16:20:40 UTC 2023 riscv64 riscv64 riscv64 GNU/Linux

Colin Ian King (colin-king) wrote :

I'll improve the test in stress-ng for the December release so it can detect these failures automatically and report an issue rather than just segfaulting with no explanation why. See https://github.com/ColinIanKing/stress-ng/issues/334

Emil Renner Berthing (esmil) wrote :

Great, thanks! I'll see if I can push that fix to the RISC-V kernels faster than our regular stable updates.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-starfive/6.5.0-1005.6 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-starfive' to 'verification-done-mantic-linux-starfive'. If the problem still exists, change the tag 'verification-needed-mantic-linux-starfive' to 'verification-failed-mantic-linux-starfive'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-mantic-linux-starfive-v2 verification-needed-mantic-linux-starfive
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is awaiting verification that the linux-riscv/6.5.0-14.14.1 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-mantic-linux-riscv' to 'verification-done-mantic-linux-riscv'. If the problem still exists, change the tag 'verification-needed-mantic-linux-riscv' to 'verification-failed-mantic-linux-riscv'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: kernel-spammed-mantic-linux-riscv-v2 verification-needed-mantic-linux-riscv
Po-Hsu Lin (cypressyew) wrote :

Verified with starfive 6.5.0-1005.6 and riscv 6.5.0-14.14.1; the test passes on both kernels. Thanks!

Changed in glibc (Ubuntu Mantic):
status: New → Invalid
tags: added: verification-done-mantic-linux-riscv verification-done-mantic-linux-starfive
removed: verification-needed-mantic-linux-riscv verification-needed-mantic-linux-starfive
Po-Hsu Lin (cypressyew)
Changed in linux-starfive (Ubuntu):
status: New → Invalid
Changed in linux-starfive (Ubuntu Mantic):
status: New → Fix Committed
Changed in linux-riscv (Ubuntu):
status: New → Invalid
Changed in linux-riscv (Ubuntu Mantic):
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-riscv - 6.5.0-14.14.1

---------------
linux-riscv (6.5.0-14.14.1) mantic; urgency=medium

  * mantic/linux-riscv: 6.5.0-14.14.1 -proposed tracker (LP: #2041534)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] resync update-dkms-versions helper
    - debian/dkms-versions -- update from kernel-versions (main/2023.10.30)

  * disable shiftfs (LP: #2038522)
    - [Config] riscv: disable shiftfs

  * context test in ubuntu_stress_smoke_test failed with M-6.5 riscv / starfive
    instances (LP: #2042388)
    - riscv: signal: fix sigaltstack frame size checking

  [ Ubuntu: 6.5.0-14.14 ]

  * mantic/linux: 6.5.0-14.14 -proposed tracker (LP: #2042660)
  * Boot log print hang on screen, no login prompt on Aspeed 2600 rev 52 BMC
    (LP: #2042850)
    - drm/ast: Add BMC virtual connector
  * arm64 atomic issues cause disk corruption (LP: #2042573)
    - locking/atomic: scripts: fix fallback ifdeffery
  * Packaging resync (LP: #1786013)
    - [Packaging] update annotations scripts

  [ Ubuntu: 6.5.0-12.12 ]

  * mantic/linux: 6.5.0-12.12 -proposed tracker (LP: #2041536)
  * Packaging resync (LP: #1786013)
    - [Packaging] update annotations scripts
    - [Packaging] update helper scripts
    - debian/dkms-versions -- update from kernel-versions (main/2023.10.30)
  * CVE-2023-5633
    - drm/vmwgfx: Keep a gem reference to user bos in surfaces
  * CVE-2023-5345
    - fs/smb/client: Reset password pointer to NULL
  * CVE-2023-39189
    - netfilter: nfnetlink_osf: avoid OOB read
  * CVE-2023-4244
    - netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
  * apparmor restricts read access of user namespace mediation sysctls to root
    (LP: #2040194)
    - SAUCE: apparmor: open userns related sysctl so lxc can check if restriction
      are in place
  * AppArmor spams kernel log with assert when auditing (LP: #2040192)
    - SAUCE: apparmor: fix request field from a prompt reply that denies all
      access
  * apparmor notification files verification (LP: #2040250)
    - SAUCE: apparmor: fix notification header size
  * apparmor oops when racing to retrieve a notification (LP: #2040245)
    - SAUCE: apparmor: fix oops when racing to retrieve notification
  * SMC stats: Wrong bucket calculation for payload of exactly 4096 bytes
    (LP: #2039575)
    - net/smc: Fix pos miscalculation in statistics
  * Support mipi camera on Intel Meteor Lake platform (LP: #2031412)
    - SAUCE: iommu: intel-ipu: use IOMMU passthrough mode for Intel IPUs on Meteor
      Lake
    - SAUCE: platform/x86: int3472: Add handshake GPIO function
  * CVE-2023-45898
    - ext4: fix slab-use-after-free in ext4_es_insert_extent()
  * CVE-2023-31085
    - ubi: Refuse attaching if mtd's erasesize is 0
  * CVE-2023-5717
    - perf: Disallow mis-matched inherited group reads
  * CVE-2023-5178
    - nvmet-tcp: Fix a possible UAF in queue intialization setup
  * CVE-2023-5158
    - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()
  * CVE-2023-5090
    - x86: KVM: SVM: always update the x2avic msr interception
  * [SRU][J/L/M] UBUNTU: [Packaging] Make WWAN driver a loada...


Changed in linux-riscv (Ubuntu Mantic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-starfive - 6.5.0-1005.6

---------------
linux-starfive (6.5.0-1005.6) mantic; urgency=medium

  * mantic/linux-starfive: 6.5.0-1005.6 -proposed tracker (LP: #2041535)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync git-ubuntu-log
    - [Packaging] resync update-dkms-versions helper
    - debian/dkms-versions -- update from kernel-versions (main/2023.10.30)

  * disable shiftfs (LP: #2038522)
    - [Config] starfive: disable shiftfs

  * context test in ubuntu_stress_smoke_test failed with M-6.5 riscv / starfive
    instances (LP: #2042388)
    - riscv: signal: fix sigaltstack frame size checking

  [ Ubuntu: 6.5.0-14.14 ]

  * mantic/linux: 6.5.0-14.14 -proposed tracker (LP: #2042660)
  * Boot log print hang on screen, no login prompt on Aspeed 2600 rev 52 BMC
    (LP: #2042850)
    - drm/ast: Add BMC virtual connector
  * arm64 atomic issues cause disk corruption (LP: #2042573)
    - locking/atomic: scripts: fix fallback ifdeffery
  * Packaging resync (LP: #1786013)
    - [Packaging] update annotations scripts

  [ Ubuntu: 6.5.0-12.12 ]

  * mantic/linux: 6.5.0-12.12 -proposed tracker (LP: #2041536)
  * Packaging resync (LP: #1786013)
    - [Packaging] update annotations scripts
    - [Packaging] update helper scripts
    - debian/dkms-versions -- update from kernel-versions (main/2023.10.30)
  * CVE-2023-5633
    - drm/vmwgfx: Keep a gem reference to user bos in surfaces
  * CVE-2023-5345
    - fs/smb/client: Reset password pointer to NULL
  * CVE-2023-39189
    - netfilter: nfnetlink_osf: avoid OOB read
  * CVE-2023-4244
    - netfilter: nft_set_rbtree: skip sync GC for new elements in this transaction
  * apparmor restricts read access of user namespace mediation sysctls to root
    (LP: #2040194)
    - SAUCE: apparmor: open userns related sysctl so lxc can check if restriction
      are in place
  * AppArmor spams kernel log with assert when auditing (LP: #2040192)
    - SAUCE: apparmor: fix request field from a prompt reply that denies all
      access
  * apparmor notification files verification (LP: #2040250)
    - SAUCE: apparmor: fix notification header size
  * apparmor oops when racing to retrieve a notification (LP: #2040245)
    - SAUCE: apparmor: fix oops when racing to retrieve notification
  * SMC stats: Wrong bucket calculation for payload of exactly 4096 bytes
    (LP: #2039575)
    - net/smc: Fix pos miscalculation in statistics
  * Support mipi camera on Intel Meteor Lake platform (LP: #2031412)
    - SAUCE: iommu: intel-ipu: use IOMMU passthrough mode for Intel IPUs on Meteor
      Lake
    - SAUCE: platform/x86: int3472: Add handshake GPIO function
  * CVE-2023-45898
    - ext4: fix slab-use-after-free in ext4_es_insert_extent()
  * CVE-2023-31085
    - ubi: Refuse attaching if mtd's erasesize is 0
  * CVE-2023-5717
    - perf: Disallow mis-matched inherited group reads
  * CVE-2023-5178
    - nvmet-tcp: Fix a possible UAF in queue intialization setup
  * CVE-2023-5158
    - vringh: don't use vringh_kiov_advance() in vringh_iov_xfer()
  * CVE-2023-5090
    - x86: KVM: SVM: always update the x2avic msr interception
  * [SRU][J/L/M] UBUNTU: [Packaging] Make WWAN drive...


Changed in linux-starfive (Ubuntu Mantic):
status: Fix Committed → Fix Released
thejpster (ubuntu-thejpster) wrote :

For what it's worth, I think https://github.com/rust-lang/rust/issues/117022 was also caused by this issue. Updating to 6.5.0-14 seems to fix it.
