Signals and Solaris

Bug #1248181 reported by Stas Boukarev
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
Triaged
Medium
Unassigned

Bug Description

Solaris (both SPARC and x86) can present signal handlers with a
wrong ucontext_t structure when receiving more than one signal at a
time. The correct structure can be found by following
ucontext->uc_link.

This presents particular problems on SPARC, because it uses a trap to
signal that it needs to call a C allocation routine, and it uses the
preceding OR instruction to encode the arguments. Running sb-sprof and
consing at the same time is bound to make the allocation trap handler
to receive bad ucontext.

The C test in http://paste.lisp.org/display/139777 replicates that result.

I was able to get sb-sprof running by following uc_link and comparing
the value of PC register with siginfo->si_addr, if it matches, that's
the current context.

But this is extremely fishy, I haven't found any documentation saying
what is the correct thing to do. And some signal handler receive NULL
as siginfo, so there's nothing to compare against.

Potentially, any usage of context on Solaris is susceptible to this
problem (or even on other OSes).

Attached is the patch which make sb-sprof work.

Tags: os-sunos
Revision history for this message
Stas Boukarev (stassats) wrote :
Revision history for this message
Joshua M. Clulow (jclulow) wrote :
Download full text (3.9 KiB)

Hello!

I work on the illumos project, the open source continuation of OpenSolaris.
I've had a look at your test program, and read through your description of the
signal handling behaviour that you're seeing.

I think there are a few things going on here, so I'll try and lay out a few
suggestions and ask a few questions. I've put some links to our online manual
pages at the end.

1. I'm not sure under what conditions you'll receive a NULL siginfo, but
   the sigaction(2) manual page definitely suggests that it might be NULL
   sometimes. Do you recall which specific signals (e.g., SIGCHLD, etc)
   you were handling when siginfo was NULL?

2. As you've noted, when multiple signals arrive at around the same time,
   their delivery may overlap. When signals overlap, we do not always
   completely unwind the signal handling machinery in libc before
   delivering subsequent signals. In these cases, the context object
   may refer to the state of the signal delivery parts of libc which
   were interrupted by the nested signal, rather than to the part of
   your main program that was interrupted.

   In ucontext.h(3HEAD), the "uc_link" member from the context object is
   described as follows:

       The uc_link member is a pointer to the context that to be resumed
       when this context returns. If uc_link is equal to 0, this context
       is the main context and the process exits when this context returns.

   It sounds like when handling SIGPROF, you're interested in that "main
   context", rather than in any of the signal handling code. This context
   represents the state we preserved when taking the first of the coincident
   signals, and in order to make sure you find it you always need to walk up
   the context chain until "uc_link" is NULL.

3. As noted in siginfo.h(3HEAD), "si_addr" is populated with the address
   of the faulting instruction for SIGILL signals. That's not true of other
   signals, though; e.g., it isn't true for SIGPROF and setitimer(2).
   For SPARC systems where you are using an undefined instruction to
   generate a trap for allocation, using "si_addr" when handling SIGILL is
   definitely the right way to find the instruction in question. If nested
   signal delivery has occurred, the context object you get in your SIGILL
   handler will not match up with the "si_addr" value.

   The context in the nested delivery case will restore execution to the
   previously running signal handler, rather than to your main program.
   If you only need the program counter value, use "si_addr". If you
   need the rest of the context to correctly handle the trap, you'll have
   to walk the "uc_link" chain out to find it. There are two options:

      - Walk up to a context where the program counter matches "si_addr".
      - Walk up to the main context, where "uc_link" is NULL.

   The option that is most correct will depend on the structure of your
   program; e.g., do you expect to generate a SIGILL in any of the
   signal handlers, or just in the main program?

4. To reiterate, the context that signal handlers receive is only
   guaranteed to be right for one purpose: the restoration of execution
   state from before ...

Read more...

Revision history for this message
Stas Boukarev (stassats) wrote :

I don't remember the order of which the signals hit each other to cause the problem.
If uc_link is a reliable path to the original synchronous interrupt, then that might do.

Revision history for this message
Stas Boukarev (stassats) wrote :

But that's only for sigill with its si_addr that can be matched with REG_PC, but we also might need to do the same for sigsegv.

Revision history for this message
Rich Lowe (richlowe) wrote :

But you're not getting a wrong context, you're getting nested signals (presumably in the real case SIGPROF is the first signal, because the sigaction looks like it'd mask prof, but prof not mask anything else.)

I'm not sure you need to match the program counter in the context with anything?
Is SBCL assuming the os_context_t refers to user code, and is just absolutely unprepared to handle nested signals? If that's the case, it needs to ensure they don't happen (not just on SunOS, but on any unix)

Revision history for this message
Rich Lowe (richlowe) wrote :

Oh wow. I missed that the OS plays all kinds of games here to attempt to deliver SIGPROF fast, most of those games involve breaking the rules somewhat terrifyingly.

This means my comment above is almost certainly _super_ wrong, I'm sorry.

Revision history for this message
Stas Boukarev (stassats) wrote :

Does that happen only with SIGPROF? Could SIGUSR2 have the same problem? which is used for stopping threads.

Revision history for this message
Joshua M. Clulow (jclulow) wrote :

Hello again!

I looked into this a bunch more with Rich, and I also apologise for my original wall of text! This does appear to be a SIGPROF-specific bug, which I've filed: https://www.illumos.org/issues/11494

Because SIGUSR2 is generally sent with kill(1)/kill(2) from other processes, it will be process-directed and thus the thread-directed SIGSEGV/SIGILL will always win out and you'll still get the right context.

The easiest way to work around this is to mask all signals when you set up your SIGPROF handler with sigaction(2), but based on the structure of the interpreter it seems like that won't be possible -- given that you need to handle SIGSEGVs from the profiler itself.

I think to work around this, you'll want to do something a bit like this in your SIGSEGV/SIGILL/etc handlers:

        if (uc->uc_link != NULL) {
                Dl_info_t dli;
                void *pc = (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_PC];

                if (dladdr(pc, &dli) != 0) {
                        if (strstr(dli.dli_fname, "libc.so.1") != NULL) {
                                uc = uc->uc_link;
                        }
                }
        }

This will kick in only if there's another context in the chain, and will almost certainly detect the issue: that SIGPROF has been incorrectly delivered first, and the SIGSEGV was delivered shortly after while we were still stuck in the libc handling code. When that condition is detected, we'll basically discard the top-most context in favour of what is likely the correct one.

We'll look at fixing this in the OS as well, but it will presumably be a while before this is fixed on all of the machines you'd like to run on -- and it may _never_ be fixed for legacy bits like Solaris 10.

Apologies for the bugs!

Revision history for this message
Stas Boukarev (stassats) wrote :

If that's only related to SIGPROF then that's not that big of a deal, it's only an optional profiler.

Changed in sbcl:
importance: High → Medium
Revision history for this message
Douglas Katzman (dougk) wrote :

It's likely that sprof.c does not have the same problem that the lisp data recorder had.
Personally I think we should drop support for the Sparc architecture, and sunOS.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.