If donor is killed during xtrabackup SST, some processes keep running

Bug #1138439 reported by Vadim Tkachenko on 2013-03-01
This bug affects 2 people
Bug Description

If I kill the donor during xtrabackup SST, these processes are still running:

on donor:

mysql 25318 0.0 0.0 106056 1488 pts/0 S 14:46 0:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup --role donor --address --auth (null) --socket /var/lib/mysql/mysql.sock --datadir /mnt/data/mysql/ --defaults-file /etc/my.cnf --gtid 866edaf2-7fab-11e2-0800-68035b54c6d2:1644544
mysql 25334 0.1 0.1 147788 11460 pts/0 S 14:46 0:00 perl /usr//bin/innobackupex --galera-info --tmpdir=/tmp --stream=tar --defaults-file=/etc/my.cnf --socket=/var/lib/mysql/mysql.sock /tmp
mysql 25335 22.9 0.0 7540 620 pts/0 S 14:46 0:16 nc 4444
mysql 25373 36.1 0.5 242672 42272 pts/0 Sl 14:46 0:22 xtrabackup_55 --defaults-file=/etc/my.cnf --defaults-group=mysqld --backup --suspend-at-end --target-dir=/tmp --stream=tar

on joiner:
mysql 1110 0.0 0.0 106056 1504 pts/1 S 14:46 0:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup --role joiner --address --auth --datadir /mnt/data/mysql/ --defaults-file /etc/my.cnf --parent 1085
mysql 1123 44.4 0.0 7540 624 pts/1 R 14:46 0:40 nc -dl 4444
mysql 1128 39.6 0.0 116000 1172 pts/1 S 14:46 0:36 tar xfi - -C /mnt/data/mysql/

The problem is that we can't restart DONOR and JOINER in a clean way,
and these processes have to be killed manually.

Handling these processes should be automatic.

Vadim Tkachenko (vadim-tk) wrote :

I think the problem here is that when mysqld dies it does not send a signal to its child process
"wsrep_sst_xtrabackup", but it should.

So I am adding this bug to the codership-mysql project.
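Until the caller signals its child, a helper script could watch its parent itself. The following is only a minimal sketch of that idea, not the actual wsrep_sst_xtrabackup code: it assumes the script knows the PID of mysqld (the joiner listing above shows a `--parent` argument) and the PIDs of the helpers it started. The function names `parent_alive` and `watch_parent` are hypothetical.

```bash
#!/bin/bash
# Hypothetical sketch; not taken from wsrep_sst_xtrabackup.

# Return 0 while a process with the given PID exists.
# "kill -0" sends no signal, it only checks that the PID can be signalled.
# Assumption: watcher and parent run as the same user (here: mysql),
# otherwise the check fails even though the parent is alive.
parent_alive() {
    kill -0 "$1" 2>/dev/null
}

# Poll the parent every <interval> seconds. Once it is gone,
# send SIGTERM to the remaining PIDs (for example nc, tar, innobackupex).
# Usage: watch_parent <parent_pid> <interval> <child_pid>...
watch_parent() {
    local parent_pid interval
    parent_pid=$1
    interval=$2
    shift 2
    while parent_alive "$parent_pid"; do
        sleep "$interval"
    done
    kill -TERM "$@" 2>/dev/null
}
```

A script could start the watcher in the background right after launching its helpers, for example `watch_parent "$PARENT_PID" 5 "$NC_PID" "$TAR_PID" &`, where the variable names are again placeholders. Polling cannot rule out PID reuse, so a signal sent by the caller would still be the more robust fix.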

Vadim Tkachenko (vadim-tk) wrote :

Ok, partially this problem is related to

where innobackupex should handle the failure of mysqld better.

But to properly solve the issue, the caller (mysqld) should send an appropriate signal to the child.

I am not able to reproduce this (and have not seen it) with the donor, but I can reproduce it with the receiver.

130306 17:06:16 [Warning] WSREP: Gap in state sequence. Need state transfer.
130306 17:06:18 [Note] WSREP: Running: 'wsrep_sst_xtrabackup --role 'joiner' --address '' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4600''
130306 17:06:18 [Note] WSREP: Prepared SST request: xtrabackup|
130306 17:06:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130306 17:06:18 [Note] WSREP: Assign initial position for certification: 0, protocol version: 2
130306 17:06:18 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (e9f4b618-8651-11e2-0800-96555ca76cdb): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():442. IST will be unavailable.
130306 17:06:18 [Note] WSREP: Node 0 (Bxc2) requested state transfer from '*any*'. Selected 1 (Bxc1)(SYNCED) as donor.
130306 17:06:18 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 0)
130306 17:06:18 [Note] WSREP: Requesting state transfer: success, donor: 1
raghu Buu:/var/lib [338:137]% p wsrep
mysql 4620 0.0 0.0 4400 616 pts/0 S 17:06 0:00 sh -c wsrep_sst_xtrabackup --role 'joiner' --address '' --auth '' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --parent '4600'
mysql 4621 0.0 0.1 12372 1596 pts/0 S 17:06 0:00 /bin/bash -ue /usr//bin/wsrep_sst_xtrabackup --role joiner --address --auth --datadir /var/lib/mysql/ --defaults-file /etc/my.cnf --parent 4600
raghu 4656 0.0 0.0 9384 924 pts/0 S+ 17:06 0:00 grep -i --color wsrep
raghu Buu:/var/lib [343]% p inno
raghu 4658 0.0 0.0 9384 924 pts/0 S+ 17:06 0:00 grep -i --color inno
raghu Buu:/var/lib [344]% /usr//bin/wsrep_sst_common: line 94: /dev/stderr: Permission denied

It dies after a while, though; the 'Permission denied' message comes from that.

In the case of the donor, does it continue running or die after some time?

Vadim Tkachenko (vadim-tk) wrote :

On the donor it continues to run indefinitely.
What happens in this case is that innobackupex dies, but it does not send a signal to xtrabackup_55,
so we hit the bug https://bugs.launchpad.net/percona-xtrabackup/+bug/1135441
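The missing step described above, a wrapper passing a termination signal on to the process it started, can be sketched in shell. This is an illustration only: innobackupex is a Perl script and does not use this code, and `run_with_forwarding` is a hypothetical name. It also cannot help when the wrapper is killed with SIGKILL, because that signal cannot be trapped.

```bash
#!/bin/bash
# Illustrative sketch only; not code from innobackupex or xtrabackup.

# Start a command in the background and pass SIGTERM/SIGINT on to it,
# so that stopping the wrapper also stops the child.
# Usage: run_with_forwarding <command> [args...]
run_with_forwarding() {
    forwarded=0
    "$@" &
    child_pid=$!
    trap 'forwarded=1; kill -TERM "$child_pid" 2>/dev/null' TERM INT
    status=0
    wait "$child_pid" || status=$?
    # A trapped signal makes "wait" return early, before the child is gone.
    # Wait a second time to collect the child's real exit status.
    if [ "$forwarded" -eq 1 ]; then
        status=0
        wait "$child_pid" || status=$?
    fi
    trap - TERM INT
    return "$status"
}
```

With such a wrapper, a SIGTERM sent to the wrapping process reaches the child as well, instead of leaving it running as in the listing from the donor.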

Waiting for xtrabackup 2.0.7 to be released before testing further.

Changed in percona-xtradb-cluster:
status: New → Triaged

It seems the fix has been moved to PXB 2.0.8/2.1.4.

No longer reproducible.

Changed in percona-xtradb-cluster:
status: Triaged → Fix Committed
Changed in percona-xtradb-cluster:
importance: Undecided → Medium

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1054

