socat and xbstream left running after IST

Bug #1451524 reported by Laimonas Anusauskas on 2015-05-04
This bug report is a duplicate of:  Bug #1441762: IST Fails with sst script error. Edit Remove
28
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Confirmed
Undecided
Unassigned

Bug Description

After upgrade to 5.6.22, every time node does IST it leaves socat and xbstream running on the receiver end. Then next time node tries to do IST it fails because address is already in use.

Percona-XtraDB-Cluster-server-56-5.6.22-25.8.978.el6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.22-25.8.978.el6.x86_64
Percona-XtraDB-Cluster-client-56-5.6.22-25.8.978.el6.x86_64
Percona-XtraDB-Cluster-garbd-3-3.9-1.3494.rhel6.x86_64
percona-xtrabackup-2.2.10-1.el6.x86_64
Percona-XtraDB-Cluster-galera-3-3.9-1.3494.rhel6.x86_64

Log entries from startup and IST:
----------------------
150504 10:55:02 mysqld_safe Starting mysqld daemon with databases from /var/somehost/mysql/data
150504 10:55:02 mysqld_safe Skipping wsrep-recover for ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535 pair
150504 10:55:02 mysqld_safe Assigning ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535 to wsrep_start_position
2015-05-04 10:55:02 0 [Note] WSREP: wsrep_start_position var submitted: 'ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535'
2015-05-04 10:55:02 8731 [Note] WSREP: Read nil XID from storage engines, skipping position init
2015-05-04 10:55:02 8731 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
2015-05-04 10:55:02 8731 [Note] WSREP: wsrep_load(): Galera 3.9(r93aca2d) by Codership Oy <email address hidden> loaded successfully.
2015-05-04 10:55:02 8731 [Note] WSREP: CRC-32C: using hardware acceleration.
2015-05-04 10:55:02 8731 [Note] WSREP: Found saved state: ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535
2015-05-04 10:55:02 8731 [Note] WSREP: Passing config to GCS: base_dir = /var/somehost/mysql/data/; base_host = 10.180.19.32; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 1024; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 512; evs.view_forget_timeout = PT24H; gcache.dir = /var/somehost/mysql/data/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/somehost/mysql/data//galera.cache; gcache.page_size = 128M; gcache.size = 5G; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 100; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 1; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false;
2015-05-04 10:55:02 8731 [Note] WSREP: Service thread queue flushed.
2015-05-04 10:55:02 8731 [Note] WSREP: Assign initial position for certification: 651434535, protocol version: -1
2015-05-04 10:55:02 8731 [Note] WSREP: wsrep_sst_grab()
2015-05-04 10:55:02 8731 [Note] WSREP: Start replication
2015-05-04 10:55:02 8731 [Note] WSREP: Setting initial position to ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535
2015-05-04 10:55:02 8731 [Note] WSREP: protonet asio version 0
2015-05-04 10:55:02 8731 [Note] WSREP: Using CRC-32C for message checksums.
2015-05-04 10:55:02 8731 [Note] WSREP: backend: asio
2015-05-04 10:55:02 8731 [Warning] WSREP: access file(/var/somehost/mysql/data//gvwstate.dat) failed(No such file or directory)
2015-05-04 10:55:02 8731 [Note] WSREP: restore pc from disk failed
2015-05-04 10:55:02 8731 [Note] WSREP: GMCast version 0
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
2015-05-04 10:55:02 8731 [Note] WSREP: EVS version 0
2015-05-04 10:55:02 8731 [Note] WSREP: gcomm: connecting to group 'host_prod', peer 'host01.dca.somehost.com:,host02.dca.somehost.com:,host01.vgs.somehost.com:,host02.vgs.somehost.com:'
2015-05-04 10:55:02 8731 [Warning] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' points to own listening address, blacklisting
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.180.19.30:4567
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:02 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') address 'tcp://10.180.19.32:4567' pointing to uuid b064c0fb is blacklisted, skipping
2015-05-04 10:55:03 8731 [Note] WSREP: declaring 0e5bf8ad at tcp://10.180.19.30:4567 stable
2015-05-04 10:55:03 8731 [Note] WSREP: declaring 0e7ac434 at tcp://10.180.19.31:4567 stable
2015-05-04 10:55:03 8731 [Note] WSREP: declaring 63999580 at tcp://10.170.19.31:4567 stable
2015-05-04 10:55:03 8731 [Note] WSREP: declaring e137440c at tcp://10.170.19.32:4567 stable
2015-05-04 10:55:03 8731 [Note] WSREP: Node 0e5bf8ad state prim
2015-05-04 10:55:03 8731 [Note] WSREP: view(view_id(PRIM,0e5bf8ad,126) memb {
        0e5bf8ad,1
        0e7ac434,1
        63999580,0
        b064c0fb,1
        e137440c,0
} joined {
} left {
} partitioned {
})
2015-05-04 10:55:03 8731 [Note] WSREP: save pc into disk
2015-05-04 10:55:03 8731 [Note] WSREP: gcomm: connected
2015-05-04 10:55:03 8731 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2015-05-04 10:55:03 8731 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2015-05-04 10:55:03 8731 [Note] WSREP: Opened channel 'host_prod'
2015-05-04 10:55:03 8731 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 3, memb_num = 5
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2015-05-04 10:55:03 8731 [Note] WSREP: Waiting for SST to complete.
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: sent state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: got state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99 from 0 (garb)
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: got state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99 from 1 (host01.vgs.somehost.com)
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: got state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99 from 2 (host01.dca.somehost.com)
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: got state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99 from 4 (host02.dca.somehost.com)
2015-05-04 10:55:03 8731 [Note] WSREP: STATE EXCHANGE: got state msg: b0c0c04f-f286-11e4-9f8e-6bdb67e4be99 from 3 (host02.vgs.somehost.com)
2015-05-04 10:55:03 8731 [Note] WSREP: Quorum results:
        version = 3,
        component = PRIMARY,
        conf_id = 117,
        members = 4/5 (joined/total),
        act_id = 651448977,
        last_appl. = -1,
        protocols = 0/7/3 (gcs/repl/appl),
        group UUID = ba462d41-bd10-11e4-9338-8b421b0fe52b
2015-05-04 10:55:03 8731 [Note] WSREP: Flow-control interval: [224, 224]
2015-05-04 10:55:03 8731 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 651448977)
2015-05-04 10:55:03 8731 [Note] WSREP: State transfer required:
        Group state: ba462d41-bd10-11e4-9338-8b421b0fe52b:651448977
        Local state: ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535
2015-05-04 10:55:03 8731 [Note] WSREP: New cluster view: global state: ba462d41-bd10-11e4-9338-8b421b0fe52b:651448977, view# 118: Primary, number of nodes: 5, my index: 3, protocol version 3
2015-05-04 10:55:03 8731 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-05-04 10:55:03 8731 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.180.19.32' --auth 'app_xtrabackup:****' --datadir '/var/somehost/mysql/data/' --defaults-file '/etc/my.cnf' --parent '8731' '' '
WSREP_SST: [INFO] Streaming with xbstream (20150504 10:55:04.889)
WSREP_SST: [INFO] Using socat as streamer (20150504 10:55:04.893)
WSREP_SST: [INFO] Evaluating timeout -s9 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150504 10:55:05.001)
2015-05-04 10:55:05 8731 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.180.19.32:4444/xtrabackup_sst//1
2015-05-04 10:55:05 8731 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-05-04 10:55:05 8731 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-05-04 10:55:05 8731 [Note] WSREP: Service thread queue flushed.
2015-05-04 10:55:05 8731 [Note] WSREP: Assign initial position for certification: 651448977, protocol version: 3
2015-05-04 10:55:05 8731 [Note] WSREP: Service thread queue flushed.
2015-05-04 10:55:05 8731 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.180.19.32:4568
2015-05-04 10:55:05 8731 [Note] WSREP: Member 3.1 (host02.vgs.somehost.com) requested state transfer from '*any*'. Selected 1.1 (host01.vgs.somehost.com)(SYNCED) as donor.
2015-05-04 10:55:05 8731 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 651449297)
2015-05-04 10:55:05 8731 [Note] WSREP: Requesting state transfer: success, donor: 1
2015-05-04 10:55:06 8731 [Note] WSREP: (b064c0fb, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150504 10:55:07.380)
WSREP_SST: [INFO] xtrabackup_ist received from donor: Running IST (20150504 10:55:07.391)
WSREP_SST: [INFO] Galera co-ords from recovery: ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535 (20150504 10:55:07.397)
WSREP_SST: [INFO] Total time on joiner: 0 seconds (20150504 10:55:07.403)
WSREP_SST: [INFO] Removing the sst_in_progress file (20150504 10:55:07.407)
2015-05-04 10:55:07 8731 [Note] WSREP: SST complete, seqno: 651434535
2015-05-04 10:55:07 8731 [Note] Plugin 'FEDERATED' is disabled.
2015-05-04 10:55:07 8731 [Note] WSREP: 1.1 (host01.vgs.somehost.com): State transfer to 3.1 (host02.vgs.somehost.com) complete.
2015-05-04 10:55:07 8731 [Note] InnoDB: Using atomics to ref count buffer pool pages
2015-05-04 10:55:07 8731 [Note] InnoDB: The InnoDB memory heap is disabled
2015-05-04 10:55:07 8731 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2015-05-04 10:55:07 8731 [Note] InnoDB: Memory barrier is not used
2015-05-04 10:55:07 8731 [Note] InnoDB: Compressed tables use zlib 1.2.3
2015-05-04 10:55:07 8731 [Note] InnoDB: Using Linux native AIO
2015-05-04 10:55:07 8731 [Note] InnoDB: Using CPU crc32 instructions
2015-05-04 10:55:07 8731 [Note] InnoDB: Initializing buffer pool, size = 30.0G
2015-05-04 10:55:07 8731 [Note] WSREP: Member 1.1 (host01.vgs.somehost.com) synced with group.
2015-05-04 10:55:10 8731 [Note] InnoDB: Completed initialization of buffer pool
2015-05-04 10:55:11 8731 [Note] InnoDB: Highest supported file format is Barracuda.
2015-05-04 10:55:13 8731 [Note] InnoDB: 128 rollback segment(s) are active.
2015-05-04 10:55:13 8731 [Note] InnoDB: Waiting for purge to start
2015-05-04 10:55:13 8731 [Note] InnoDB: Percona XtraDB (http://www.percona.com) 5.6.22-72.0 started; log sequence number 6939710639795
2015-05-04 10:55:13 8731 [Note] RSA private key file not found: /var/somehost/mysql/data//private_key.pem. Some authentication plugins will not work.
2015-05-04 10:55:13 8731 [Note] RSA public key file not found: /var/somehost/mysql/data//public_key.pem. Some authentication plugins will not work.
2015-05-04 10:55:13 8731 [Note] Server hostname (bind-address): '*'; port: 3306
2015-05-04 10:55:13 8731 [Note] IPv6 is not available.
2015-05-04 10:55:13 8731 [Note] - '0.0.0.0' resolves to '0.0.0.0';
2015-05-04 10:55:13 8731 [Note] Server socket created on IP: '0.0.0.0'.
2015-05-04 10:55:13 8731 [Note] Event Scheduler: Loaded 0 events
2015-05-04 10:55:13 8731 [Note] WSREP: Signalling provider to continue.
2015-05-04 10:55:13 8731 [Note] WSREP: inited wsrep sidno 1
2015-05-04 10:55:13 8731 [Note] WSREP: SST received: ba462d41-bd10-11e4-9338-8b421b0fe52b:651434535
2015-05-04 10:55:13 8731 [Note] WSREP: Receiving IST: 14442 writesets, seqnos 651434535-651448977
2015-05-04 10:55:13 8731 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.22-72.0-56-log' socket: '/var/tmp/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150
2015-05-04 10:55:38 8731 [Note] WSREP: IST received: ba462d41-bd10-11e4-9338-8b421b0fe52b:651448977
2015-05-04 10:55:38 8731 [Note] WSREP: 3.1 (host02.vgs.somehost.com): State transfer from 1.1 (host01.vgs.somehost.com) complete.
2015-05-04 10:55:38 8731 [Note] WSREP: Shifting JOINER -> JOINED (TO: 651453564)
2015-05-04 10:55:48 8731 [Note] WSREP: Member 3.1 (host02.vgs.somehost.com) synced with group.
2015-05-04 10:55:48 8731 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 651454811)
2015-05-04 10:55:48 8731 [Note] WSREP: Synchronized with group, ready for connections
2015-05-04 10:55:48 8731 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification
----------------

So everything seems normal and there are no errors. However, socat and xbstream processes are left running:

root 8079 1 0 10:55 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/untd/mysql/data --pid-file=/var/untd/mysql/log/mysqld.pid
mysql 8731 8079 13 10:55 pts/0 00:01:20 /usr/sbin/mysqld --basedir=/usr --datadir=/var/untd/mysql/data --plugin-dir=/usr/lib64/mysql/plugin --u
mysql 9021 1 0 10:55 ? 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 9022 1 0 10:55 ? 00:00:00 xbstream -x

Limus

CentOS release 6.6 (Final)

Looks like it only happens when there is significant load during IST. Here is how I re-produced it on my workstation.

Setup - 2 node PXC. Galera settings in my.cnf:
# GALERA #
wsrep_cluster_name = test
wsrep_provider = /usr/lib64/libgalera_smm.so
wsrep_cluster_address = gcomm://node1,node2
#wsrep_cluster_address = gcomm://node2
wsrep_sst_method = xtrabackup-v2
wsrep_sst_auth = "xtrabackup:xtrabackup"
wsrep_provider_options = "gcache.size=500M;gmcast.segment=1;evs.send_window=1024;evs.user_s
end_window=512"
wsrep_slave_threads = 10
innodb_autoinc_lock_mode = 2
binlog_format = ROW

1. Shutdown one of the nodes
2. Start mysqlslap on other node
mysqlslap -u root -a -i 1000
3. While mysqlslap is running, startup the node that was down.
4. After IST is done, verify that socat and xbstream are still running

Limus

Unable to reproduce the same with above steps ON PXC 5.6.22

## PXC configuration
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_cluster_address=gcomm://192.168.1.187,192.168.1.70
binlog_format=ROW
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
wsrep_node_address=192.168.1.70
wsrep_sst_method=xtrabackup-v2
wsrep_cluster_name=my_centos_cluster
wsrep_sst_auth="sstuser:s3cret"
#wsrep_on=OFF
wsrep_provider_options="gmcast.segment=1;evs.send_window=1024;evs.user_send_window=512"
wsrep_slave_threads = 10

[root@percona-pxc56-2 ~]#

Started mysqlslap on node1, (while node 2 was already shutdown)

[root@percona-pxc56-1 mysql]# mysqlslap -u root -a -i 1000
mysqlslap: Error when connecting to server: Access denied for user 'root'@'localhost' (using password: NO)
[root@percona-pxc56-1 mysql]# mysqlslap -u root -p -a -i 1000
Enter password:
Benchmark
 Average number of seconds to run all queries: 0.021 seconds
 Minimum number of seconds to run all queries: 0.010 seconds
 Maximum number of seconds to run all queries: 0.067 seconds
 Number of clients running queries: 1
 Average number of queries per client: 0

[root@percona-pxc56-1 mysql]#

Started mysql on node2, It is gone for IST....when Its done I checked.

[root@percona-pxc56-2 ~]# /etc/init.d/mysql start
Starting MySQL (Percona XtraDB Cluster)....State transfer in progress, setting sleep higher
. SUCCESS!
[root@percona-pxc56-2 ~]#

[root@percona-pxc56-2 ~]# ps -ef | grep socat
root 2208 1474 0 14:45 pts/1 00:00:00 grep socat
[root@percona-pxc56-2 ~]# ps -ef | grep xbstream
root 2213 1474 0 14:45 pts/1 00:00:00 grep xbstream
[root@percona-pxc56-2 ~]#

Am I missing here anything ?

Changed in percona-xtradb-cluster:
status: New → Incomplete

Was the OS Centos 6.x ?

Also, mysqlslap has to be running during IST on second node

I just retried it on my workstation and the problem happens intermittently . In production however it happens 100% of the time.

I will see if I can create use case where it happens all the time.

Yes, I have tested on CentOS 6.4 and mysqlslap was running during IST on second node.
It would be really great help if you can provide some test case. Thanks.

Stefan Schlesinger (sts-2) wrote :

We have got the same issues. I can reproduce the issue with very few database load at about every second database restart.

 - Node is running normally
 - Do a 'service mysql restart'
 - Node joins normally, but socat keeps running
 - Do a 'service mysql restart'
 - Node cannot join the cluster because of an address already in use error

Please find the two logs here:

 - log output from after the first 'service mysql restart', which left a socat running: https://gist.github.com/sts/977d43a63c16939ab2e8

- what happens when the socat + mysql were still running (mysql operating normally), and I do a 'service mysql restart' https://gist.github.com/sts/b9fb4363fd8746eeab4b

Stefan Schlesinger (sts-2) wrote :

I'd also be available for debugging this further. Just hit me a mail to sts[-]ono.at

Download full text (3.4 KiB)

Verified with PXC 5.6.22.

 [root@percona1 ~]# tail -f /data/myisam/percona1_error.log
2015-05-12 12:00:25 15798 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.22-72.0-56' socket: '/var/lib/mysql/mysql.sock' port: 3306 Percona XtraDB Cluster (GPL), Release rel72.0, Revision 978, WSREP version 25.8, wsrep_25.8.r4150
2015-05-12 12:00:25 15798 [Note] WSREP: Receiving IST: 78 writesets, seqnos 299786-299864
2015-05-12 12:00:38 15798 [Note] WSREP: IST received: fadf885a-490b-11e4-9e13-27f574d828b6:299864
2015-05-12 12:00:38 15798 [Note] WSREP: 2.0 (percona1): State transfer from 0.0 (percona2) complete.
2015-05-12 12:00:38 15798 [Note] WSREP: Shifting JOINER -> JOINED (TO: 299864)
2015-05-12 12:00:38 15798 [Note] WSREP: Member 2.0 (percona1) synced with group.
2015-05-12 12:00:38 15798 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 299864)
2015-05-12 12:00:38 15798 [Note] WSREP: Synchronized with group, ready for connections
2015-05-12 12:00:38 15798 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
^C
[root@percona1 ~]# ps -ef|egrep "mysql|xbs|socat"
root 15243 1 0 11:59 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/data/myisam --pid-file=/data/myisam/percona1.pid
mysql 15798 15243 58 11:59 pts/0 00:00:41 /usr/sbin/mysqld --basedir=/usr --datadir=/data/myisam --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/libgalera_smm.so --log-error=/data/myisam/percona1_error.log --open-files-limit=65535 --pid-file=/data/myisam/percona1.pid --wsrep_start_position=fadf885a-490b-11e4-9e13-27f574d828b6:299786
mysql 16048 1 0 12:00 ? 00:00:00 socat -u TCP-LISTEN:4444,reuseaddr stdio
mysql 16049 1 0 12:00 ? 00:00:00 xbstream -x

Also with STOP/START IST

[root@percona3 ~]# ps -ef|egrep "mysql|xbs|socat"
root 3489 1 0 11:11 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/data/myisam --pid-file=/data/myisam/percona3.pid
mysql 4104 3489 0 11:11 pts/0 00:01:25 /usr/sbin/mysqld --basedir=/usr --datadir=/data/myisam --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/libgalera_smm.so --log-error=/data/myisam/percona3_error.log --open-files-limit=65535 --pid-file=/data/myisam/percona3.pid --wsrep_start_position=fadf885a-490b-11e4-9e13-27f574d828b6:299786
root 15343 3258 0 13:37 pts/0 00:00:00 egrep mysql|xbs|socat
[root@percona3 ~]# service mysql stop
Shutting down MySQL (Percona XtraDB Cluster)..... SUCCESS!
[root@percona3 ~]# service mysql start
Starting MySQL (Percona XtraDB Cluster)..........State transfer in progress, setting sleep higher
. SUCCESS!
[root@percona3 ~]# ps -ef|egrep "mysql|xbs|socat"
root 15396 1 0 13:40 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe --datadir=/data/myisam --pid-file=/data/myisam/percona3.pid
mysql 15949 15396 24 13:40 pts/0 00:00:18 /usr/sbin/mysqld --basedir=/usr --datadir=/data/myisam --plugin-dir=/usr/lib64/mysql/plugin --user=mysql --wsrep-provider=/usr/lib64/libgalera_smm.so --log-error=/data/myisam/percona3_error.log --open-files-limit=65535 --pid-file=/data/myisam/percona3.pid --wsrep_start_position=fadf885a-490b-11e4-9e13-27f57...

Read more...

Changed in percona-xtradb-cluster:
status: Incomplete → Confirmed
Stefan Schlesinger (sts-2) wrote :

I think this could be related to recent changes in scripts/wsrep_sst_xtrabackup-v2. I cannot say how long it dates back, but the issue has been known on our side since a couple of weeks already.

https://github.com/percona/percona-xtradb-cluster/commits/262d88d73b6da1230a969e6148a59aeb408e7107/scripts/wsrep_sst_xtrabackup-v2.sh

I am seeing this as well on Ubuntu Trusty, socat remains running after any IST/SST:

2015-05-18 11:18:57 21604 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.10.50.212:4444/xtrabackup_sst//1
2015-05-18 11:18:57 21604 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-05-18 11:18:57 21604 [Note] WSREP: REPL Protocols: 6 (3, 2)
2015-05-18 11:18:57 21604 [Note] WSREP: Service thread queue flushed.
2015-05-18 11:18:57 21604 [Note] WSREP: Assign initial position for certification: 424819546, protocol version: 3
2015-05-18 11:18:57 21604 [Note] WSREP: Service thread queue flushed.
2015-05-18 11:18:57 21604 [Note] WSREP: Prepared IST receiver, listening at: tcp://10.10.50.212:4568
2015-05-18 11:18:57 21604 [Note] WSREP: Member 1.0 (atl-userdb4) requested state transfer from '*any*'. Selected 2.0 (atl-userdb1)(SYNCED) as donor.
2015-05-18 11:18:57 21604 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 424819546)
2015-05-18 11:18:57 21604 [Note] WSREP: Requesting state transfer: success, donor: 2
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20150518 11:18:57.335)
2015/05/18 11:18:57 socat[21854] E bind(3, {AF=2 0.0.0.0:4444}, 16): Address already in use
WSREP_SST: [ERROR] Error while getting data from donor node: exit codes: 1 0 (20150518 11:18:57.341)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20150518 11:18:57.344)
2015-05-18 11:18:57 21604 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.10.50.212' --auth 'root:XXXXXX' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '21604' '' : 32 (Broken pipe)
2015-05-18 11:18:57 21604 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-05-18 11:18:57 21604 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2015-05-18 11:18:57 21604 [ERROR] Aborting

^Gxbstream: Can't create/write to file './xtrabackup_galera_info' (Errcode: 2 - No such file or directory)
xbstream: failed to create file.

helodron (helodron) wrote :
Download full text (3.7 KiB)

I think the problem with Timeout and PIPE.
Good trace:
188229 22:29:27.741703379 0 timeout (1035120) < execve res=-2(ENOENT) exe=socat args=-u.TCP-LISTEN:17601,reuseaddr,pf=ipv6.stdio. tid=1035120(timeout) pid=1035120(timeout) ptid=1035119(timeout) cwd=/root fdlimit=1024 pgft_maj=0 pgft_min=21 vm_size=11376 vm_rss=124 vm_swap=0 comm=timeout cgroups=cpuset=/.cpu=/.cpuacct=/.memory=/.devices=/.freezer=/.net_cls=/.blkio=/.perf_... env=TERM=xterm.SHELL=/bin/bash.SSH_CLIENT=2a02:...::154 35644 22.SSH_TTY=/...
188230 22:29:27.741713876 0 timeout (1035120) > execve
188231 22:29:27.741715388 0 timeout (1035120) < execve res=-2(ENOENT) exe=socat args=-u.TCP-LISTEN:17601,reuseaddr,pf=ipv6.stdio. tid=1035120(timeout) pid=1035120(timeout) ptid=1035119(timeout) cwd=/root fdlimit=1024 pgft_maj=0 pgft_min=21 vm_size=11376 vm_rss=124 vm_swap=0 comm=timeout cgroups=cpuset=/.cpu=/.cpuacct=/.memory=/.devices=/.freezer=/.net_cls=/.blkio=/.perf_... env=TERM=xterm.SHELL=/bin/bash.SSH_CLIENT=2a02:...::154 35644 22.SSH_TTY=/...
188232 22:29:27.741725946 0 timeout (1035120) > execve
188233 22:29:27.741732668 0 timeout (1035120) > switch next=10 pgft_maj=0 pgft_min=21 vm_size=11376 vm_rss=124 vm_swap=0
634013 22:29:38.741679491 14 timeout (1035119) < wait4
634014 22:29:38.741681931 14 timeout (1035119) > signaldeliver spid=0 dpid=1035119(timeout) sig=14(SIGALRM)
634015 22:29:38.741705993 14 timeout (1035119) > kill pid=1035120(socat) sig=9(SIGKILL)
634016 22:29:38.741710572 14 timeout (1035119) < kill res=0
634017 22:29:38.741711237 14 timeout (1035119) > kill pid=0 sig=9(SIGKILL)
634018 22:29:38.741712195 14 timeout (1035119) < kill res=0
634019 22:29:38.741712534 14 timeout (1035119) > signaldeliver spid=0 dpid=1035119(timeout) sig=9(SIGKILL)
634023 22:29:38.741818809 14 timeout (1035119) > procexit status=9

Bad session:
8365786 22:32:08.467565817 26 timeout (1039023) > execve
8365787 22:32:08.467570304 26 timeout (1039023) < execve res=-2(ENOENT) exe=socat args=-u.TCP-LISTEN:17601,reuseaddr,pf=ipv6.stdio. tid=1039023(timeout) pid=1039023(timeout) ptid=1039021(timeout) cwd=/tmp/tmp.bpv7VUY5Gt fdlimit=32768 pgft_maj=0 pgft_min=24 vm_size=24016 vm_rss=1048 vm_swap=0 comm=timeout cgroups=cpuset=/.cpu=/.cpuacct=/.memory=/.devices=/.freezer=/.net_cls=/.blkio=/.perf_... env=SHELL=/bin/bash.TERM=screen.SSH_CLIENT=2a02:....::149 37197 22.LD_PRELO...
8365792 22:32:08.467588363 26 timeout (1039023) > execve
8365793 22:32:08.467589419 26 timeout (1039023) < execve res=-2(ENOENT) exe=socat args=-u.TCP-LISTEN:17601,reuseaddr,pf=ipv6.stdio. tid=1039023(timeout) pid=1039023(timeout) ptid=1039021(timeout) cwd=/tmp/tmp.bpv7VUY5Gt fdlimit=32768 pgft_maj=0 pgft_min=24 vm_size=24016 vm_rss=1048 vm_swap=0 comm=timeout cgroups=cpuset=/.cpu=/.cpuacct=/.memory=/.devices=/.freezer=/.net_cls=/.blkio=/.perf_... env=SHELL=/bin/bash.TERM=screen.SSH_CLIENT=2a02:...::149 37197 22.LD_PRELO...
8365797 22:32:08.467606225 26 timeout (1039023) > execve
8365799 22:32:08.467611947 26 timeout (1039023) > switch next=139 pgft_maj=0 pgft_min=24 vm_size=24016 vm_rss=1048 vm_swap=0
8428418 22:32:09.045090061 12 timeout (1039021) < wait4
8428420 22:32:09.045103083 12 timeout (1039021) > c...

Read more...

helodron (helodron) wrote :

plz delete my up comment. I am wrong diagnostic problem.

helodron (helodron) wrote :

This fix helped me:
%%
--- /tmp/test/usr/bin/wsrep_sst_xtrabackup-v2 2015-03-26 11:32:57.000000000 +0300
+++ /usr/bin/wsrep_sst_xtrabackup-v2 2015-10-07 11:54:28.724848346 +0300
@@ -952,6 +952,7 @@ then
     else
         # || true if it has already exited
         kill $jpid || true
+ echo 'EOF' | nc localhost ${TSST_PORT} || true
         rm -rf $DATA/.sst
         wsrep_log_info "${IST_FILE} received from donor: Running IST"
     fi
%%

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers