Comment 8 for bug 1248181

Joshua M. Clulow (jclulow) wrote :

Hello again!

I looked into this a bunch more with Rich, and I apologise for my original wall of text! This does appear to be a SIGPROF-specific bug, which I've filed as https://www.illumos.org/issues/11494

Because SIGUSR2 is generally sent with kill(1)/kill(2) from other processes, it will be process-directed and thus the thread-directed SIGSEGV/SIGILL will always win out and you'll still get the right context.

The easiest way to work around this is to mask all signals when you set up your SIGPROF handler with sigaction(2), but based on the structure of the interpreter it seems like that won't be possible -- given that you need to handle SIGSEGVs from the profiler itself.
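For reference, the "mask everything" variant would look roughly like the sketch below. It only illustrates the sigaction(2) call; prof_handler and install_sigprof_handler are placeholder names I've made up, not anything in your interpreter, and as noted it probably isn't usable there if the profiler has to take its own SIGSEGVs.

```c
#define _POSIX_C_SOURCE 200809L	/* expose sigaction(2) under strict -std modes */

#include <signal.h>
#include <string.h>

/* Placeholder SIGPROF handler; a real one would record a profile sample. */
static void
prof_handler(int sig, siginfo_t *si, void *ctx)
{
	(void) sig;
	(void) si;
	(void) ctx;
}

/*
 * Install the SIGPROF handler with every signal blocked while it runs,
 * so no other handler can be entered on top of it.
 * Returns 0 on success, -1 on failure (as sigaction(2) does).
 */
static int
install_sigprof_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof (sa));
	sa.sa_sigaction = prof_handler;
	sa.sa_flags = SA_SIGINFO | SA_RESTART;
	(void) sigfillset(&sa.sa_mask);	/* block all signals in the handler */

	return (sigaction(SIGPROF, &sa, NULL));
}
```

Note that a synchronous SIGSEGV raised while it is blocked can't be handled, which is exactly why this doesn't fit a profiler that expects to fault and recover.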

Instead, I think you'll want to do something like the following in your SIGSEGV/SIGILL/etc. handlers:

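        /*
         * uc is assumed to be the ucontext_t * passed as the third
         * argument to the SA_SIGINFO handler.
         */
        if (uc->uc_link != NULL) {
                Dl_info_t dli;
                void *pc = (void *)(uintptr_t)uc->uc_mcontext.gregs[REG_PC];

                /* Which object does the interrupted PC belong to? */
                if (dladdr(pc, &dli) != 0) {
                        if (strstr(dli.dli_fname, "libc.so.1") != NULL) {
                                /* Interrupted inside libc: use the previous context. */
                                uc = uc->uc_link;
                        }
                }
        }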

This will kick in only if there's another context in the chain, and will almost certainly detect the issue: that SIGPROF has been incorrectly delivered first, and the SIGSEGV was delivered shortly after while we were still stuck in the libc handling code. When that condition is detected, we'll basically discard the top-most context in favour of what is likely the correct one.
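If it helps to share this between several handlers, the decision itself can be pulled out into a small helper, roughly as sketched here. The names object_is_libc and select_context are made up for illustration, and pc_object is assumed to be the dli_fname you got back from dladdr() for the interrupted PC (or NULL if the lookup failed).

```c
#include <stddef.h>
#include <string.h>
#include <ucontext.h>

/* True if the object name reported for the PC looks like libc. */
static int
object_is_libc(const char *fname)
{
	return (fname != NULL && strstr(fname, "libc.so.1") != NULL);
}

/*
 * Pick the context a fault handler should report from. If there is
 * an older context in the chain and the interrupted PC was inside
 * libc, assume SIGPROF was delivered first and fall back to the
 * older context; otherwise keep the one we were handed.
 */
static ucontext_t *
select_context(ucontext_t *uc, const char *pc_object)
{
	if (uc->uc_link != NULL && object_is_libc(pc_object))
		return (uc->uc_link);
	return (uc);
}
```

Each handler would then do the dladdr() lookup as above and call something like uc = select_context(uc, dli.dli_fname) before pulling registers out of the context.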

We'll look at fixing this in the OS as well, but it will presumably be a while before this is fixed on all of the machines you'd like to run on -- and it may _never_ be fixed for legacy bits like Solaris 10.

Apologies for the bugs!