gu_abort in gcs_core_recv during SST leads to dangling child processes

Bug #1292992 reported by Raghavendra D Prabhu
This bug affects 1 person
Affects                  Status     Importance  Assigned to  Milestone
Galera                   Invalid    Undecided   Unassigned
Percona XtraDB Cluster   Confirmed  Critical    Unassigned
(Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC)

Bug Description

Pasting verbatim from https://bugs.launchpad.net/galera/+bug/1284245/comments/13 (where it was discussed):
==============================================================================================

140316 2:58:09 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 235864)
140316 2:58:09 [Note] WSREP: Requesting state transfer: success, donor: 0
140316 2:58:09 [Warning] WSREP: 0 (Arch1): State transfer to 1 (Arch2) failed: -2 (No such file or directory)
140316 2:58:09 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():719: Will never receive state. Need to abort.
140316 2:58:09 [Note] WSREP: gcomm: terminating thread
140316 2:58:09 [Note] WSREP: gcomm: joining thread
140316 2:58:09 [Note] WSREP: gcomm: closing backend
140316 2:58:10 [Note] WSREP: view(view_id(NON_PRIM,06fbb0eb-ac88-11e3-a0cf-d7c3069129e3,8) memb {
        b22e0783-ac88-11e3-a75a-46c514882f6c,
} joined {
} left {
} partitioned {
        06fbb0eb-ac88-11e3-a0cf-d7c3069129e3,
})
140316 2:58:10 [Note] WSREP: view((empty))
140316 2:58:10 [Note] WSREP: gcomm: closed
140316 2:58:10 [Note] WSREP: /pxcd/bin/mysqld: Terminated.

This is an issue in the gcs layer of Galera. It calls gu_abort, which aborts the process without any shutdown of mysqld (not even ungraceful shutdown).

void
gu_abort (void)
{
    /* avoid coredump */
    struct rlimit core_limits = { 0, 0 };
    setrlimit (RLIMIT_CORE, &core_limits);

    /* restore default SIGABRT handler */
    signal (SIGABRT, SIG_DFL);

#if defined(GU_SYS_PROGRAM_NAME)
    gu_info ("%s: Terminated.", GU_SYS_PROGRAM_NAME);
#else
    gu_info ("Program terminated.");
#endif

    abort();
}

This also means that no signal is sent to the child processes
(wsrep_sst_*), so they keep running.
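
To illustrate the point above outside of Galera, here is a minimal sketch. It is a standalone demo, not Galera or PXC code, and the function name child_survives_parent_abort is hypothetical. A stand-in "mysqld" process forks a sleeping child (playing the role of a wsrep_sst_* script), disables core dumps and restores the default SIGABRT action as gu_abort does, and then calls abort(). The caller then checks whether the child is still running.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo. Returns 1 if the child is still alive after its
 * parent has aborted, 0 if it is gone, -1 on a setup error.
 */
int child_survives_parent_abort(void)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t parent = fork();
    if (parent < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (parent == 0) {
        /* Stand-in for mysqld: start a child, report its pid, abort. */
        close(fds[0]);
        pid_t child = fork();
        if (child == 0) {
            close(fds[1]);
            sleep(30);              /* stand-in for a long-running SST script */
            _exit(0);
        }
        if (write(fds[1], &child, sizeof child) != (ssize_t)sizeof child)
            _exit(1);

        /* Same steps as gu_abort(): no core dump, default SIGABRT, abort. */
        struct rlimit core_limits = { 0, 0 };
        setrlimit(RLIMIT_CORE, &core_limits);
        signal(SIGABRT, SIG_DFL);
        abort();
    }

    /* Observer: learn the child's pid and wait for the parent to die. */
    close(fds[1]);
    pid_t child = -1;
    ssize_t n = read(fds[0], &child, sizeof child);
    close(fds[0]);
    waitpid(parent, NULL, 0);
    if (n != (ssize_t)sizeof child || child <= 0) return -1;

    /* Signal 0 only tests for existence; nothing is delivered. */
    int alive = (kill(child, 0) == 0);
    if (alive) kill(child, SIGKILL);    /* clean up the orphan */
    return alive;
}
```

Calling child_survives_parent_abort() from a test program is expected to show the child outliving the aborted parent, which is the dangling-process behaviour described in this report.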

gu_abort is called from gcs_core_recv:

    if (ret < 0) {
        assert (recv_act->id < 0);

        if (GCS_ACT_TORDERED == recv_act->act.type && recv_act->act.buf) {
            gcs_gcache_free (conn->cache, recv_act->act.buf);
            recv_act->act.buf = NULL;
        }

        if (-ENOTRECOVERABLE == ret) {
            conn->backend.close(&conn->backend);
            gu_abort();
        }

(gdb) bt
#0 0x00007ffff637c389 in raise () from /usr/lib/libc.so.6
#1 0x00007ffff637d788 in abort () from /usr/lib/libc.so.6
#2 0x00007ffff5389477 in gu_abort () at galerautils/src/gu_abort.c:34
#3 0x00007ffff546c2a2 in gcs_core_recv (conn=0x11d3000, recv_act=recv_act@entry=0x7fffe6ffbeb0, timeout=<optimized out>) at gcs/src/gcs_core.c:1074
#4 0x00007ffff5470d6c in gcs_recv_thread (arg=0x11d2de0) at gcs/src/gcs.c:1116
#5 0x00007ffff736f0a2 in start_thread () from /usr/lib/libpthread.so.0
#6 0x00007ffff642cd1d in clone () from /usr/lib/libc.so.6
(gdb) quit
A debugging session is active.

Alex Yurchenko (ayurchen) wrote :

1) what does "ungraceful shutdown" (as in "not even ungraceful shutdown") mean and how is it different from abort?
2) a failure of a child process to react to the untimely death of a parent cannot be a library bug. At best it is an issue with how the child is started (ATM there is one shell too many)

I think this ticket is invalid.

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The call stack never returns to mysqld, so mysqld gets no chance to shut down (even without closing threads, as in unireg_abort). This may not be a bug in galera, but abort() doesn't terminate the child processes (in this case the SST process).

This may be a bug in posix_spawnp (used in wsp::process) or in the
handling of abort signals.

So, it needs to be fixed there or in galera.

Alex Yurchenko (ayurchen) wrote :

Raghu, suppose the mysqld process is killed with -11 or -9 while in JOINER state. If that leaves child processes running, will doing something about gu_abort() in Galera fix that?

As for "not returning", see the comment in the code - there is nowhere to return to. The event is just pushed into the slave queue, which is blocked until SST is over. And how would unireg_abort() help with closing child processes anyways? Am I missing something?

Alex Yurchenko (ayurchen) wrote :

must be fixed on the caller side.
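
One possible shape of a caller-side fix, as a rough sketch only: on Linux, the child can ask the kernel to signal it when its parent dies, via prctl(PR_SET_PDEATHSIG). This covers abort() as well as kill -9 or -11 of the parent. Everything below is an assumption for illustration, not what PXC does: the function names are hypothetical, and plain fork() is used because the prctl call has to run in the child before it would exec the SST script, a step that posix_spawnp does not expose. Note also that the kernel ties this signal to the thread that created the child, which matters in a multi-threaded server.

```c
#include <poll.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (Linux only): fork a child that receives SIGTERM
 * when its parent dies. Returns the child's pid, or -1 on error. */
pid_t spawn_child_tied_to_parent(void)
{
    pid_t expected_parent = getpid();
    pid_t child = fork();
    if (child != 0) return child;       /* parent, or fork failure */

    if (prctl(PR_SET_PDEATHSIG, SIGTERM) != 0) _exit(1);
    /* The parent may already have died before prctl took effect. */
    if (getppid() != expected_parent) _exit(1);

    sleep(30);      /* real code would exec the SST script here */
    _exit(0);
}

/* Hypothetical check: returns 1 if the child exits within timeout_ms
 * after its parent aborts, 0 if it does not, -1 on a setup error. */
int child_dies_with_parent(int timeout_ms)
{
    int fds[2];
    if (pipe(fds) != 0) return -1;

    pid_t parent = fork();
    if (parent < 0) { close(fds[0]); close(fds[1]); return -1; }

    if (parent == 0) {
        close(fds[0]);
        /* The child inherits fds[1]; it closes only when the child exits. */
        if (spawn_child_tied_to_parent() < 0) _exit(1);
        struct rlimit core_limits = { 0, 0 };
        setrlimit(RLIMIT_CORE, &core_limits);
        signal(SIGABRT, SIG_DFL);
        abort();
    }

    close(fds[1]);
    int status = 0;
    waitpid(parent, &status, 0);
    if (!WIFSIGNALED(status) || WTERMSIG(status) != SIGABRT) {
        close(fds[0]);
        return -1;
    }

    /* Hang-up on the pipe means the last writer (the child) is gone. */
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    close(fds[0]);
    return ready < 0 ? -1 : (ready > 0);
}
```

With this arrangement, child_dies_with_parent(5000) is expected to report that the child went away shortly after the parent's abort(), without any change to gu_abort() itself. A portable alternative would be for the child to watch a pipe from the parent and exit on EOF.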

Changed in galera:
status: New → Invalid
Changed in percona-xtradb-cluster:
status: New → Invalid
status: Invalid → Confirmed
importance: Undecided → Critical
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-932
