After crash the node refuses to start in SYNC

Bug #908021 reported by Vadim Tkachenko
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
Confirmed
Medium
Teemu Ollakka

Bug Description

The master crashed ( report will be in different bug)
and I am trying to start it, but it can't start in SYNCED

/usr/sbin/mysqld
111223 1:47:27 [Note] Flashcache bypass: disabled
111223 1:47:27 [Note] Flashcache setup error is : ioctl failed

111223 1:47:27 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
111223 1:47:27 [Note] WSREP: wsrep_load(): Galera 2.0beta(r99) by Codership Oy <email address hidden> loaded succesfully.
111223 1:47:27 [Note] WSREP: Reusing existing '/mnt/data//galera.cache'.
111223 1:47:27 [Note] WSREP: Passing config to GCS: gcache.dir = /mnt/data/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /mnt/data//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; replicator.commit_order = 3
111223 1:47:27 [Note] WSREP: wsrep_sst_grab()
111223 1:47:27 [Note] WSREP: Start replication
111223 1:47:27 [Note] WSREP: Found saved state: a11cef9b-2d31-11e1-0800-695615454002:0
111223 1:47:27 [Note] WSREP: Assign initial position for certification: 0, protocol version: -1
111223 1:47:27 [Note] WSREP: Setting initial position to a11cef9b-2d31-11e1-0800-695615454002:0
111223 1:47:27 [Note] WSREP: protonet asio version 0
111223 1:47:27 [Note] WSREP: backend: asio
111223 1:47:27 [Note] WSREP: GMCast version 0
111223 1:47:27 [Note] WSREP: (fa996d02-2d31-11e1-0800-e95116f8b4c3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
111223 1:47:27 [Note] WSREP: (fa996d02-2d31-11e1-0800-e95116f8b4c3, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
111223 1:47:27 [Note] WSREP: EVS version 0
111223 1:47:27 [Note] WSREP: PC version 0
111223 1:47:27 [Note] WSREP: gcomm: connecting to group 'trimethylxanthine', peer ''
111223 1:47:27 [Note] WSREP: GMCast::handle_stable_view: view(view_id(PRIM,fa996d02-2d31-11e1-0800-e95116f8b4c3,1) memb {
        fa996d02-2d31-11e1-0800-e95116f8b4c3,
} joined {
} left {
} partitioned {
})
111223 1:47:27 [Note] WSREP: gcomm: connected
111223 1:47:27 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
111223 1:47:27 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
111223 1:47:27 [Note] WSREP: Opened channel 'trimethylxanthine'
111223 1:47:27 [Note] WSREP: New COMPONENT: primary = yes, my_idx = 0, memb_num = 1
111223 1:47:27 [Note] WSREP: Waiting for SST to complete.
111223 1:47:27 [Note] WSREP: STATE_EXCHANGE: sent state UUID: fa9a2969-2d31-11e1-0800-c6c36efcc325
111223 1:47:27 [Note] WSREP: STATE EXCHANGE: sent state msg: fa9a2969-2d31-11e1-0800-c6c36efcc325
111223 1:47:27 [Note] WSREP: STATE EXCHANGE: got state msg: fa9a2969-2d31-11e1-0800-c6c36efcc325 from 0 (node1)
111223 1:47:27 [Note] WSREP: Quorum results:
        version = 2,
        component = PRIMARY,
        conf_id = 0,
        members = 1/1 (joined/total),
        act_id = 0,
        last_appl. = -1,
        protocols = 0/2/1 (gcs/repl/appl),
        group UUID = a11cef9b-2d31-11e1-0800-695615454002
111223 1:47:27 [Note] WSREP: Flow-control interval: [8, 16]
111223 1:47:27 [Note] WSREP: Restored state OPEN -> JOINED (0)
111223 1:47:27 [Note] WSREP: Member 0 (node1) synced with group.
111223 1:47:27 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 0)
111223 1:47:27 [Note] WSREP: New cluster view: global state: a11cef9b-2d31-11e1-0800-695615454002:0, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 1
111223 1:47:27 [Note] WSREP: SST complete, seqno: 0
111223 1:47:27 [Note] Plugin 'FEDERATED' is disabled.
111223 1:47:27 InnoDB: The InnoDB memory heap is disabled
111223 1:47:27 InnoDB: Mutexes and rw_locks use GCC atomic builtins
111223 1:47:27 InnoDB: Compressed tables use zlib 1.2.3
111223 1:47:27 InnoDB: Using Linux native AIO
111223 1:47:27 InnoDB: Initializing buffer pool, size = 128.0M
111223 1:47:27 InnoDB: Completed initialization of buffer pool
111223 1:47:27 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
111223 1:47:27 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
111223 1:47:27 InnoDB: Waiting for the background threads to start
111223 1:47:28 Percona XtraDB (http://www.percona.com) 1.1.8-20.1 started; log sequence number 1605425
111223 1:47:28 [Note] Event Scheduler: Loaded 0 events
111223 1:47:28 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.17' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), Release alpha22.1, Revision 3673 wsrep_22.3.r3673
111223 1:47:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
111223 1:47:28 [Note] WSREP: Assign initial position for certification: 0, protocol version: 1
111223 1:47:28 [Note] WSREP: Synchronized with group, ready for connections
111223 1:47:28 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
111223 1:47:29 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=fa996d02-2d31-11e1-0800-e95116f8b4c3,source_view_id=view_id(REG,6e8a535a-2d31-11e1-0800-c9012ea14f6c,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=16,node_list=()
} 64
111223 1:47:29 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=6e8a535a-2d31-11e1-0800-c9012ea14f6c,source_view_id=view_id(REG,6e8a535a-2d31-11e1-0800-c9012ea14f6c,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=337,node_list=()
}
111223 1:47:29 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(fa996d02-2d31-11e1-0800-e95116f8b4c3, OPERATIONAL, view_id(REG,6e8a535a-2d31-11e1-0800-c9012ea14f6c,16)), OPERATIONAL) {
current_view=view(view_id(REG,6e8a535a-2d31-11e1-0800-c9012ea14f6c,16) memb {
        6e8a535a-2d31-11e1-0800-c9012ea14f6c,
        fa996d02-2d31-11e1-0800-e95116f8b4c3,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} ,msg_index= (1,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=fa996d02-2d31-11e1-0800-e95116f8b4c3,source_view_id=view_id(REG,6e8a535a-2d31-11e1-0800-c9012ea14f6c,16),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=16,node_list=()
}
,recovery_index= (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=6e8a535a-2d31-11e1-0800-c9012ea14f6c,source_view_id=view_id(REG,6e8a535a-2d31-11e1-0800-c9012e
111223 1:47:29 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state.get_to_seq() == max_to_seq: fa996d02-2d31-11e1-0800-e95116f8b4c3 node fa996d02-2d31-11e1-0800-e95116f8b4c3 to seq not consistent with local state: max to seq 39 msg state to seq 2 (FATAL)
         at gcomm/src/pc_proto.cpp:validate_state_msgs():463
111223 1:47:29 [Note] WSREP: Received self-leave message.
111223 1:47:29 [Note] WSREP: Flow-control interval: [0, 0]
111223 1:47:29 [Note] WSREP: Received SELF-LEAVE. Closing connection.
111223 1:47:29 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 0)
111223 1:47:29 [Note] WSREP: RECV thread exiting 0: Success
111223 1:47:29 [Note] WSREP: New cluster view: global state: a11cef9b-2d31-11e1-0800-695615454002:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 1
111223 1:47:29 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
111223 1:47:29 [Note] WSREP: applier thread exiting (code:0)

When I try to shutdown it, it hung in
111223 1:49:26 [Note] /usr/sbin/mysqld: Normal shutdown

111223 1:49:26 [Note] WSREP: Stop replication
111223 1:49:26 [Note] WSREP: Closing send monitor...
111223 1:49:26 [Note] WSREP: Closed send monitor.

config is:

[mysqld]
datadir=/mnt/data
user=mysql

binlog_format=ROW

wsrep_provider=/usr/lib64/libgalera_smm.so

wsrep_cluster_address=gcomm://

wsrep_slave_threads=2
wsrep_cluster_name=trimethylxanthine
wsrep_sst_method=rsync
wsrep_node_name=node1

innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

affects: codership-mysql → galera
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

This is a consequence of lp:908025 - Node B didn't crash and was trying to reconnect to Node A. However Node B was in non-primary component at the moment, whereas Node A was explicitly primary and in this particular case it would be desirable that Node B obeys Node A.

Changed in galera:
importance: Undecided → Medium
milestone: none → 22.1.2
status: New → Confirmed
assignee: nobody → Teemu Ollakka (teemu-ollakka)
Changed in galera:
milestone: 23.1.2 → 23.2.1
Changed in galera:
milestone: 23.2.1 → 24.2.5
milestone: 24.2.5 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.