Galera

mysql shutdown is very slow in Joining state

Bug #1109969 reported by Vadim Tkachenko on 2013-01-30

This bug report is a duplicate of: Bug #1176852: IST is not interrupted properly during shutdown. Edit Remove

This bug affects 1 person

	Status	Importance	Assigned to
Galera	New	Undecided	Unassigned
MySQL patches by Codership	New	Undecided	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	New	Undecided	Unassigned

Bug Description

If a node is in "Joining" state and we execute mysqladmin shutdown, it performs very slow.

It seems that the node still tries to apply all events from cache file, but fails, and throws a lot of spam into error.log:

130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026734, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026778, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026823, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026867, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026913, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7026956, -77 (File descriptor in bad state)
130129 22:47:48 [Warning] WSREP: Failed to report last committed 7027011, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027059, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027103, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027145, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027197, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027235, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027296, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027335, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027389, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027435, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027480, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027529, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027574, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027617, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027668, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027719, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027759, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027808, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027849, -77 (File descriptor in bad state)
130129 22:47:49 [Warning] WSREP: Failed to report last committed 7027896, -77 (File descriptor in bad state)
130129 22:47:50 [Warning] WSREP: Failed to report last committed 7027937, -77 (File descriptor in bad state)

and it goes through all gcache.page.0000* files.

I think it is better to just cancel all apply threads when there is shutdown signal.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-02-19:

1) That is from galera worked thread in galera_service_thd.cpp

=====================
        if (!exit)
        {
            if (data.act_ & A_LAST_COMMITTED)
            {
                ssize_t const ret(st->gcs_.set_last_applied(data.last_committed_));

                if (gu_unlikely(ret < 0))
                {
                    log_warn << "Failed to report last committed "
                             << data.last_committed_ << ", " << ret
                             << " (" << strerror (-ret) << ')';
                    // @todo: figure out what to do in this case
                }
                else
                {
                    log_debug << "Reported last committed: "
                              << data.last_committed_;
                }
            }
        }

==================

kill_mysql and kill_server in turn call wsrep_stop_replication,
which call wsrep_close_client_connections(TRUE), but the TRUE
here indicates a graceful shutdown where it waits for all threads
to exit and then there is also

/* wait until appliers have stopped */
wsrep_wait_appliers_close(thd);

So, may be something can be changed here if possible. (The
unireg_abort calls wsrep_close_client_connections(FALSE) to not
wait for close).

2) Having said that, I noticed that after it transitions to
JOINER and is undergoing IST, it doesn't handle SIGQUIT/SIGKILL
signals well, until it has reached a JOINED state.

You can see it here -- http://sprunge.us/hPGU -- during " InnoDB:
DEBUG: update_statistics for sbtest/sbtest1. " it won't stop with
mysqladmin shutdown or SIGQUIT.