pc.recovery fails with unclear error when gvwstate.dat is empty

Bug #1669333 reported by Przemek on 2017-03-02
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.6
Fix Released
Undecided
Unassigned
5.7
Fix Released
Undecided
Unassigned

Bug Description

In case when the gvwstate.dat ends up empty, the PC recovery procedure claims the restore from disk is successful, and later fails with confusing error. Why the file got truncated is unclear, but it was observed in some cases.
First of all - when the file is empty or not complete - it should not return "WSREP: restore pc from disk successfully" message, there should be some initial check.
Secondly, IMHO when the gvwstate.dat is wrong, pc.recovery procedure should be aborted, and normal joining should be attempted.

This is example err log sequence when the file is present, but empty (PXC 5.7.16):

2017-03-02T09:40:40.231158Z 0 [Note] WSREP: restore pc from disk successfully
2017-03-02T09:40:40.231275Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:40:40.231608Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:40:40.231616Z 0 [Note] WSREP: (00000000, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:40:40.231687Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 131: invalid UUID: 00000000 (FATAL)
         at gcomm/src/pc.cpp:PC():267
2017-03-02T09:40:40.231702Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -131 (State not recoverable)
2017-03-02T09:40:40.231731Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -131 (State not recoverable)
2017-03-02T09:40:40.231737Z 0 [ERROR] WSREP: gcs connect failed: State not recoverable
2017-03-02T09:40:40.231740Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:40:40.231743Z 0 [ERROR] Aborting

2017-03-02T09:40:40.231747Z 0 [Note] Forcefully disconnecting 0 remaining clients
2017-03-02T09:40:40.231758Z 0 [Note] WSREP: Service disconnected.
2017-03-02T09:40:41.231829Z 0 [Note] WSREP: Some threads may fail to exit.
2017-03-02T09:40:41.231862Z 0 [Note] Binlog end
2017-03-02T09:40:41.231980Z 0 [Note] /usr/sbin/mysqld: Shutdown complete

Worth noting, that grastate.dat was still OK:

[root@11310f5a0bf9 /]# cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 3efe5400-aa6d-11e6-b772-625f9abee4ba
seqno: 750385
safe_to_bootstrap: 0

Error messages are less confusing when the gvwstate.dat is simply gone, also node attempts to connect and is looking for PC to join:

2017-03-02T09:57:35.258700Z 0 [Note] WSREP: restore pc from disk failed
2017-03-02T09:57:35.259193Z 0 [Note] WSREP: GMCast version 0
2017-03-02T09:57:35.259525Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2017-03-02T09:57:35.259532Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2017-03-02T09:57:35.259838Z 0 [Note] WSREP: EVS version 0
2017-03-02T09:57:35.259927Z 0 [Note] WSREP: gcomm: connecting to group 'galera_cluster', peer '172.17.0.4:,172.17.0.5:,172.17.0.2:,172.17.0.3:'
2017-03-02T09:57:35.261257Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 7a337039 tcp://172.17.0.3:4567
2017-03-02T09:57:35.261362Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2017-03-02T09:57:35.261391Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') connection established to 597ad126 tcp://172.17.0.2:4567
2017-03-02T09:57:35.387108Z 0 [Note] WSREP: declaring 597ad126 at tcp://172.17.0.2:4567 stable
2017-03-02T09:57:35.387144Z 0 [Note] WSREP: declaring 7a337039 at tcp://172.17.0.3:4567 stable
2017-03-02T09:57:35.387653Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-02T09:57:35.387690Z 0 [Note] WSREP: view(view_id(NON_PRIM,597ad126,328) memb {
        597ad126,1
        7a337039,3
        a922cb8f,0
} joined {
} left {
} partitioned {
        bc8f69cc,0
})
2017-03-02T09:57:38.761226Z 0 [Note] WSREP: (a922cb8f, 'tcp://0.0.0.0:4567') turning message relay requesting off
2017-03-02T09:58:05.766360Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
         at gcomm/src/pc.cpp:connect():158
2017-03-02T09:58:05.766386Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-02T09:58:05.766493Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1407: Failed to open channel 'galera_cluster' at 'gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3': -110 (Connection timed out)
2017-03-02T09:58:05.766502Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-02T09:58:05.766507Z 0 [ERROR] WSREP: wsrep::connect(gcomm://172.17.0.4,172.17.0.5,172.17.0.2,172.17.0.3) failed: 7
2017-03-02T09:58:05.766511Z 0 [ERROR] Aborting

Przemek (pmalkowski) wrote :
Krunal Bauskar (krunal-bauskar) wrote :

* Issues reproduced and I agree with solution

- If file has invalid/empty content then we should raise a warning and avoid restoring of pc from file.
- This will cause flow to fall back to normal joining process.

User will get notified about invalid file through a warning message but server will continue to restore using normal joining process. (This will avoid user from manually correcting or deleting the file).

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-784

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.