glibc 2.32-0ubuntu5 ADT test failure with linux 5.10.0-7.8

Bug #1907298 reported by Seth Forshee
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
GLibC            Unknown        Medium
glibc (Ubuntu)   Fix Released   Undecided    Unassigned
linux (Ubuntu)   Incomplete     Undecided    Unassigned

Bug Description

Testing failed on:
    ppc64el: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-hirsute-canonical-kernel-team-bootstrap/hirsute/ppc64el/g/glibc/20201208_192946_04f11@/log.gz

----------
FAIL: misc/tst-sigcontext-get_pc
original exit status 1
info: address in signal handler: 0x737faa11db44
info: call stack entry 0: 0x737faa311f58
info: call stack entry 1: 0x737faa3404c4
info: call stack entry 2: 0x0
info: call stack entry 3: 0x737faa312144
info: call stack entry 4: 0x737faa312870
info: call stack entry 5: 0x737faa313264
info: call stack entry 6: 0x737faa311c40
info: call stack entry 7: 0x737faa0f9e5c
info: call stack entry 8: 0x737faa0fa040
error: ../sysdeps/unix/sysv/linux/tst-sigcontext-get_pc.c:60: not true: found
error: 1 test failures
----------
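
The "not true: found" assertion appears to mean that the address printed as "address in signal handler" was not found among the call stack entries printed after it. Below is a minimal sketch of that kind of membership check, fed with the addresses from the log above. The helper name `pc_in_callstack` and the `log_pc`/`log_stack` variables are illustrative assumptions, not taken from the glibc test source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if pc equals one of the n call stack entries. */
static bool pc_in_callstack(uint64_t pc, const uint64_t *stack, size_t n)
{
    assert(stack != NULL);
    for (size_t i = 0; i < n; i++) {
        if (stack[i] == pc)
            return true;
    }
    return false;
}

/* Values copied from the failing ppc64el log. */
static const uint64_t log_pc = 0x737faa11db44;
static const uint64_t log_stack[] = {
    0x737faa311f58, 0x737faa3404c4, 0x0,
    0x737faa312144, 0x737faa312870, 0x737faa313264,
    0x737faa311c40, 0x737faa0f9e5c, 0x737faa0fa040,
};
```

With these values, `pc_in_callstack(log_pc, log_stack, 9)` is false: 0x737faa11db44 is not one of the nine entries, which is consistent with the reported failure.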

Seth Forshee (sforshee)
tags: added: kernel-adt-failure
description: updated
Seth Forshee (sforshee) wrote :

Looking through powerpc changes between Linux 5.8 and 5.10, I see a handful of changes to signal-related code. This in particular jumps out as potentially affecting the register state for signal handling:

commit 0138ba5783ae0dcc799ad401a1e8ac8333790df9
Author: Nicholas Piggin <email address hidden>
Date: Mon May 11 20:19:52 2020 +1000

    powerpc/64/signal: Balance return predictor stack in signal trampoline

    Returning from an interrupt or syscall to a signal handler currently
    begins execution directly at the handler's entry point, with LR set to
    the address of the sigreturn trampoline. When the signal handler
    function returns, it runs the trampoline. It looks like this:

        # interrupt at user address xyz
        # kernel stuff... signal is raised
        rfid
        # void handler(int sig)
        addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        mflr 0
        std 0,16(1)
        stdu 1,-96(1)
        # handler stuff
        ld 0,16(1)
        mtlr 0
        blr
        # __kernel_sigtramp_rt64
        addi r1,r1,__SIGNAL_FRAMESIZE
        li r0,__NR_rt_sigreturn
        sc
        # kernel executes rt_sigreturn
        rfid
        # back to user address xyz

    Note the blr with no matching bl. This can corrupt the return
    predictor.

    Solve this by instead resuming execution at the signal trampoline
    which then calls the signal handler. qtrace-tools link_stack checker
    confirms the entire user/kernel/vdso cycle is balanced after this
    patch, whereas it's not upstream.

    Alan confirms the dwarf unwind info still looks good. gdb still
    recognises the signal frame and can step into parent frames if it
    break inside a signal handler.

    Performance is pretty noisy, not a very significant change on a POWER9
    here, but branch misses are consistently a lot lower on a
    microbenchmark:

     Performance counter stats for './signal':

           13,085.72 msec task-clock # 1.000 CPUs utilized
      45,024,760,101 cycles # 3.441 GHz
      65,102,895,542 instructions # 1.45 insn per cycle
      11,271,673,787 branches # 861.372 M/sec
          59,468,979 branch-misses # 0.53% of all branches

           12,989.09 msec task-clock # 1.000 CPUs utilized
      44,692,719,559 cycles # 3.441 GHz
      65,109,984,964 instructions # 1.46 insn per cycle
      11,282,136,057 branches # 868.585 M/sec
          39,786,942 branch-misses # 0.35% of all branches

    Signed-off-by: Nicholas Piggin <email address hidden>
    Signed-off-by: Michael Ellerman <email address hidden>
    Link: https://<email address hidden>
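
The imbalance described in the commit message ("the blr with no matching bl") can be illustrated with a toy model of a return predictor, in which each branch-and-link pushes a predicted return address and each blr pops one. This is only a sketch under stated assumptions: it does not describe real POWER9 hardware, the addresses are made up, and the names `do_bl`, `do_blr`, `old_flow`, and `new_flow` are hypothetical. The trampoline's call to the handler is modelled as a plain bl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINK_STACK_DEPTH 16

/* Toy return predictor: bl pushes a predicted return address, blr pops one. */
struct link_stack {
    uint64_t entry[LINK_STACK_DEPTH];
    size_t depth;
    unsigned mispredicts;
};

static void do_bl(struct link_stack *ls, uint64_t return_addr)
{
    assert(ls->depth < LINK_STACK_DEPTH);
    ls->entry[ls->depth++] = return_addr;
}

static void do_blr(struct link_stack *ls, uint64_t actual_target)
{
    /* An empty stack yields no useful prediction; count it as a miss. */
    uint64_t predicted = ls->depth ? ls->entry[--ls->depth] : 0;
    if (predicted != actual_target)
        ls->mispredicts++;
}

/* Made-up addresses; SIGTRAMP_RET stands for the instruction after the call. */
enum { CALLER_RET = 0x1000, SIGTRAMP = 0x2000, SIGTRAMP_RET = 0x2004 };

/* Old behaviour: the kernel enters the handler directly with LR = SIGTRAMP. */
static unsigned old_flow(void)
{
    struct link_stack ls = { .depth = 0 };
    do_bl(&ls, CALLER_RET);   /* interrupted function was called normally */
    do_blr(&ls, SIGTRAMP);    /* handler returns: pops CALLER_RET, a miss */
    do_blr(&ls, CALLER_RET);  /* interrupted function returns: stack empty */
    return ls.mispredicts;
}

/* New behaviour: the kernel resumes at the trampoline, which calls the handler. */
static unsigned new_flow(void)
{
    struct link_stack ls = { .depth = 0 };
    do_bl(&ls, CALLER_RET);
    do_bl(&ls, SIGTRAMP_RET); /* trampoline calls the handler */
    do_blr(&ls, SIGTRAMP_RET);/* handler returns: prediction matches */
    do_blr(&ls, CALLER_RET);  /* interrupted function returns: matches */
    return ls.mispredicts;
}
```

In this model the old flow mispredicts twice, once for the handler's return and once more because the interrupted code's entry was consumed, while the new flow stays balanced with no mispredictions.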

Seth Forshee (sforshee) wrote :

Reverting "powerpc/64/signal: Balance return predictor stack in signal trampoline" gets the test passing, so that does appear to be the cause. I'm not sure what needs to happen here to resolve the test failure though.

Balint Reczey (rbalint) wrote :

OK, I'll XFAIL this test on ppc64el in the next upload, in ~2 weeks.

Changed in glibc (Ubuntu):
status: New → Confirmed
Balint Reczey (rbalint) wrote :

Linux 5.10 includes an optimization that changes signal delivery on powerpc64 and breaks sigcontext_get_pc. The breakage is detected by misc/tst-sigcontext-get_pc. The change is commit 0138ba5783ae0dcc799ad401a1e8ac8333790df9 ("powerpc/64/signal: Balance return predictor stack in signal trampoline"), quoted in full in an earlier comment. The failing test output is the same as in the bug description.

Balint Reczey (rbalint) wrote :

Since sigcontextinfo is used internally by glibc only for nice segfault handling, I think it is OK to handle the regression by just XFAILing the test or by adjusting the sigcontext_get_pc implementation to the post-5.10 behaviour, but upstream may have a different opinion.

Andreas Schwab (schwab-linux-m68k) wrote :

Please report that to <email address hidden>.

Changed in glibc:
importance: Unknown → Medium
status: Unknown → New
Florian Weimer (fweimer) wrote :
Changed in glibc:
status: New → Unknown
Balint Reczey (rbalint) wrote :

It seems there may be a kernel-side fix: https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-January/223199.html

I've committed a change that XFAILs the test for now. It should land in Hirsute this week or next.

Changed in glibc (Ubuntu):
status: Confirmed → Fix Committed
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1907298

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package glibc - 2.33-0ubuntu2

---------------
glibc (2.33-0ubuntu2) hirsute; urgency=medium

  * debian/patches/all/local-ldd.diff: Adjust extra safety check
    for changed ld-linux.so return value. LP: #1914860.

 -- Matthias Klose <email address hidden> Sat, 06 Feb 2021 13:32:05 +0100

Changed in glibc (Ubuntu):
status: Fix Committed → Fix Released