SST fails with misleading error and crash

Bug #808417 reported by Vadim Tkachenko
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
Fix Released
Low
Alex Yurchenko

Bug Description

When sst fails because wsrep_sst_rsync is not found,
we have next error.log:

110710 9:36:49 [Note] WSREP: New cluster view: group UUID: 8d0cd8e8-ab08-11e0-0800-bfeb931854e6, conf# 2: Primary, number of nodes: 2, my index: 1, first seqno: 1, protocol version 1
110710 9:36:49 [Warning] WSREP: Gap in state sequence. Need state transfer.
110710 9:36:51 [Note] WSREP: Running: 'wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err'
110710 9:36:51 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err
        Read: '(null)'
110710 9:36:51 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err: 127 (Key has expired)
110710 9:36:51 [ERROR] WSREP: SST failed: 127 (Key has expired)
110710 9:36:51 [ERROR] Aborting

110710 9:36:51 [Note] WSREP: Closing send monitor...
110710 9:36:51 [ERROR] WSREP: SST preparation failed: 127 (Key has expired)
110710 9:36:51 [Note] WSREP: Closed send monitor.
110710 9:36:51 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
110710 9:36:51 [Note] WSREP: gcomm: terminating thread
110710 9:36:51 [ERROR] WSREP: View callback failed: 127 (Key has expired. This is unrecoverable, restart required.
110710 9:36:51 [Note] WSREP: Closing send monitor...
110710 9:36:51 [Note] WSREP: Closed send monitor.
110710 9:36:51 [Note] WSREP: gcomm: joining thread
110710 9:36:51 [Note] WSREP: gcomm: closing backend
110710 9:36:51 [Note] WSREP: evs::proto(ce771006-ab12-11e0-0800-01b1c5bcbc77, LEAVING, view_id(REG,5502ecb9-ab0c-11e0-0800-20148a141ef2,2)) uuid 5502ecb9-ab0c-11e0-0800-20148a141ef2 missing from install message, assuming partitioned
110710 9:36:51 [Note] WSREP: GMCast::handle_stable_view: view(view_id(NON_PRIM,5502ecb9-ab0c-11e0-0800-20148a141ef2,2) memb {
        ce771006-ab12-11e0-0800-01b1c5bcbc77,
} joined {
} left {
} partitioned {
        5502ecb9-ab0c-11e0-0800-20148a141ef2,
})
110710 9:36:51 [Note] WSREP: GMCast::handle_stable_view: view((empty))
110710 9:36:51 [Note] WSREP: New COMPONENT: primary = no, my_idx = 0, memb_num = 1
110710 9:36:51 [Note] WSREP: gcomm: closed
110710 9:36:51 [Warning] WSREP: gcomm: backend already closed
110710 9:36:51 [ERROR] WSREP: gcs/src/gcs_fifo_lite.c:gcs_fifo_lite_close():70: Trying to close a closed FIFO
110710 9:36:51 [Note] WSREP: recv_thread() joined.
110710 9:36:51 [Warning] WSREP: Broken shutdown sequence: GCS connection state is PRIMARY, expected CLOSED
110710 9:36:51 [Note] WSREP: Shifting PRIMARY -> CLOSED (TO: 0)
110710 9:36:51 [Note] WSREP: Closing slave action queue.
110710 9:36:51 [Note] WSREP: libexec/mysqld: Terminated.
110710 9:36:51 [Note] WSREP: Flow-control interval: [8, 16]
110710 9:36:51 [Note] WSREP: Received NON-PRIMARY.
Aborted (core dumped)

There are two problems:
1. Error message [ERROR] WSREP: SST preparation failed: 127 (Key has expired)
is misleading, as "Key has expired" is totally confusing

2. MySQL ended up with coredump, but not proper exit. I even do not think mysqld should stop,
as later we may try to connect to another cluster or using different SST method

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Vadim,

1. Error code comes from the pclose() system call. Not really sure what to do about it. It is displayed for reference, should it be hidden instead?

2. To begin with, mysqld has not really started yet. In general if SST preparation call failed, there is no certainty about mysql data dir state. It could be wiped out and further initialization might be impossible. But the real problem is that you know that you can't use rsync or xtrabackup (or anything but mysqldump) when server is initialized. So there is little hope for recovery later and it is really better to stop mysqld before it really started.

3. Did it really dump core?

Revision history for this message
Vadim Tkachenko (vadim-tk) wrote : Re: [Bug 808417] Re: SST fails with misleading error and crash
Download full text (5.1 KiB)

Alex,

I use lp:codership-mysql revision
===============
revno: 3086
fixes bug(s): https://launchpad.net/bugs/807493
https://launchpad.net/bugs/808000
committer: Alex Yurchenko <ayurchen@void>
branch nick: trunk
timestamp: Sun 2011-07-10 00:26:39 +0300
message:
  References lp:807493, lp:808000 - merged c3099 from lp:codership-mysql/0.8
===============

Probably it is better to hide error message then show incorrect one.

I am not sure if it is cored... but it definitely aborted.

On Sun, Jul 10, 2011 at 11:33 AM, Alex Yurchenko
<email address hidden> wrote:
> Vadim,
>
> 1. Error code comes from the pclose() system call. Not really sure what
> to do about it. It is displayed for reference, should it be hidden
> instead?
>
> 2. To begin with, mysqld has not really started yet. In general if SST
> preparation call failed, there is no certainty about mysql data dir
> state. It could be wiped out and further initialization might be
> impossible. But the real problem is that you know that you can't use
> rsync or xtrabackup (or anything but mysqldump) when server is
> initialized. So there is little hope for recovery later and it is really
> better to stop mysqld before it really started.
>
> 3. Did it really dump core?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/808417
>
> Title:
>  SST fails with misleading error and crash
>
> Status in Galera Replicator:
>  New
>
> Bug description:
>  When sst fails because wsrep_sst_rsync is not found,
>  we have next error.log:
>
>  110710  9:36:49 [Note] WSREP: New cluster view: group UUID: 8d0cd8e8-ab08-11e0-0800-bfeb931854e6, conf# 2: Primary, number of nodes: 2, my index: 1, first seqno: 1, protocol version 1
>  110710  9:36:49 [Warning] WSREP: Gap in state sequence. Need state transfer.
>  110710  9:36:51 [Note] WSREP: Running: 'wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err'
>  110710  9:36:51 [ERROR] WSREP: Failed to read 'ready <addr>' from: wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err
>          Read: '(null)'
>  110710  9:36:51 [ERROR] WSREP: Process completed with error: wsrep_sst_rsync 'joiner' '10.11.12.206' 'root:' '/data/bench/gal/' '21614' 2>sst.err: 127 (Key has expired)
>  110710  9:36:51 [ERROR] WSREP: SST failed: 127 (Key has expired)
>  110710  9:36:51 [ERROR] Aborting
>
>  110710  9:36:51 [Note] WSREP: Closing send monitor...
>  110710  9:36:51 [ERROR] WSREP: SST preparation failed: 127 (Key has expired)
>  110710  9:36:51 [Note] WSREP: Closed send monitor.
>  110710  9:36:51 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
>  110710  9:36:51 [Note] WSREP: gcomm: terminating thread
>  110710  9:36:51 [ERROR] WSREP: View callback failed: 127 (Key has expired. This is unrecoverable, restart required.
>  110710  9:36:51 [Note] WSREP: Closing send monitor...
>  110710  9:36:51 [Note] WSREP: Closed send monitor.
>  110710  9:36:51 [Note] WSREP: gcomm: joining thread
>  110710  9:36:51 [Note] WSREP: gcomm: closing backend
>  110710  9:36:51 [Note] WSREP: evs::proto(ce771006-ab12-11e0-0800-01b1c5bcbc77, LE...

Read more...

affects: galera → codership-mysql
Changed in codership-mysql:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → Low
status: New → Fix Committed
milestone: none → 0.8.1
Changed in codership-mysql:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.