Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

pc.recovery fails with unclear error when gvwstate.dat is empty

Bug #1669333 reported by Przemek on 2017-03-02

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.6	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.35-26.20
5.7	Fix Released	Undecided	Unassigned

Bug Description

In case when the gvwstate.dat ends up empty, the PC recovery procedure claims the restore from disk is successful, and later fails with confusing error. Why the file got truncated is unclear, but it was observed in some cases.
First of all - when the file is empty or not complete - it should not return "WSREP: restore pc from disk successfully" message, there should be some initial check.
Secondly, IMHO when the gvwstate.dat is wrong, pc.recovery procedure should be aborted, and normal joining should be attempted.

This is example err log sequence when the file is present, but empty (PXC 5.7.16):

2017-03-02T09:40:40.231158Z 0 [Note] WSREP: restore pc from disk successfully
2017-03-02T09:40:40.231275Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:40:40.231608Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:40:40.231616Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:40:40.231687Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 131: invalid UUID: 00000000 (FATAL)
at gcomm/src/pc.cpp:PC():267
2017-03-02T09:40:40.231702Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
2017-03-02T09:40:40.231731Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -131 (State not recoverable)
2017-03-02T09:40:40.231737Z 0 [ERROR] WSREP: gcs connect failed: State not recoverable
2017-03-02T09:40:40.231740Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:40:40.231743Z 0 [ERROR] Aborting

2017-03-02T09:40:40.231747Z 0 [Note] Forcefully disconnecting 0 remaining clients
2017-03-02T09:40:40.231758Z 0 [Note] WSREP: Service disconnected.
2017-03-02T09:40:41.231829Z 0 [Note] WSREP: Some threads may fail to exit.
2017-03-02T09:40:41.231862Z 0 [Note] Binlog end
2017-03-02T09:40:41.231980Z 0 [Note] /usr/sbin/mysqld: Shutdown complete

Worth noting, that grastate.dat was still OK:

[root@11310f5a0bf9 /]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 3efe5400-aa6d-11e6-b772-625f9abee4ba
seqno: 750385
safe_to_bootstrap: 0

Error messages are less confusing when the gvwstate.dat is simply gone, also node attempts to connect and is looking for PC to join:

2017-03-02T09:57:35.258700Z 0 [Note] WSREP: restore pc from disk failed
2017-03-02T09:57:35.259193Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:57:35.259525Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:57:35.259532Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:57:35.259838Z 0 [Note] WSREP: EVS version 0
2017-03-02T09:57:35.259927Z 0 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '172.17.0.4:,172.17.0.5:,172.17.0.2:,172.17.0.3:'
2017-03-02T09:57:35.261257Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 7a337039 tcp://172.17.0.3:4567
2017-03-02T09:57:35.261362Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-03-02T09:57:35.261391Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 597ad126 tcp://172.17.0.2:4567
2017-03-02T09:57:35.387108Z 0 [Note] WSREP: declaring 597ad126 at tcp://172.17.0.2:4567 stable
2017-03-02T09:57:35.387144Z 0 [Note] WSREP: declaring 7a337039 at tcp://172.17.0.3:4567 stable
2017-03-02T09:57:35.387653Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-02T09:57:35.387690Z 0 [Note] WSREP: view(view_id(NON_PRIM,597ad126,328) memb {
        597ad126,1
        7a337039,3
        a922cb8f,0
} joined {
} left {
} partitioned {
        bc8f69cc,0
})
2017-03-02T09:57:38.761226Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-03-02T09:58:05.766360Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
2017-03-02T09:58:05.766386Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-02T09:58:05.766493Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -110 (Connection timed out)
2017-03-02T09:58:05.766502Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-02T09:58:05.766507Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:58:05.766511Z 0 [ERROR] Aborting

Tags:

Revision history for this message

Przemek (pmalkowski) wrote on 2017-03-02:

Related upstream bug: https://github.com/codership/galera/issues/354

Revision history for this message

Krunal Bauskar (krunal-bauskar) wrote on 2017-03-07:

* Issues reproduced and I agree with solution

- If file has invalid/empty content then we should raise a warning and avoid restoring of pc from file.
- This will cause flow to fall back to normal joining process.

User will get notified about invalid file through a warning message but server will continue to restore using normal joining process. (This will avoid user from manually correcting or deleting the file).

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-784

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

auto-github-codership-galera #354
[open] Edit

Bug watches keep track of this bug in other bug trackers.