Not full cleanup at crash

Bug #797396 reported by Vadim Tkachenko
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
MySQL patches by Codership | Fix Released | Medium | Alex Yurchenko |
codership-maria | Fix Committed | Medium | Alex Yurchenko |

Bug Description

When the JOINER crashes during the rsync process, it does not kill bash and rsync, so they stay in memory.

This is the netstat -anp output I see after the crash:

 netstat -anp | grep -P "(bash|rsync)"
tcp 0 0 0.0.0.0:4567 0.0.0.0:* LISTEN 1776/bash
tcp 0 0 0.0.0.0:4444 0.0.0.0:* LISTEN 1793/rsync
tcp 0 0 10.11.12.220:44326 10.11.12.234:4567 CLOSE_WAIT 1776/bash
tcp 0 0 :::4444 :::* LISTEN 1793/rsync
unix 2 [ ] DGRAM 401833 1793/rsync

With these processes still running I can't start the node; it complains "address is in use".

Here is more of the log showing how I got the crash:
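
As a stopgap until the leftover processes are cleaned up automatically, the stale listeners can be picked out of the same netstat output. The following is only a sketch: the function name stale_listener_pids is hypothetical, and it assumes the column layout shown in the netstat output above (state in the sixth column, PID/program in the seventh).

```shell
# Hypothetical helper, not part of any wsrep script.
# Reads `netstat -anp` output on stdin and prints the PIDs of bash or
# rsync processes that still hold a socket in the LISTEN state.
stale_listener_pids() {
    awk '$6 == "LISTEN" && $7 ~ /^[0-9]+\/(bash|rsync)$/ {
             split($7, a, "/")   # a[1] is the PID, a[2] the program name
             print a[1]
         }' | sort -un           # one line per PID, duplicates removed
}

# Example use (review the PIDs before killing anything):
#   netstat -anp 2>/dev/null | stale_listener_pids
```

The function only prints PIDs; whether to kill them is left to the operator, since an unrelated bash or rsync process could also be listening on the host.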

110614 12:54:00 [Note] WSREP: Requesting state transfer: success, donor: 1
110614 12:54:00 [Warning] WSREP: 1 (localhost.localdomain): State transfer to 0 (localhost.localdomain) failed: -12 (Cannot allocate memory)
110614 12:54:00 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():621: Will never receive state. Need to abort.
110614 12:54:00 [Note] WSREP: gcomm: terminating thread
110614 12:54:00 [Note] WSREP: gcomm: joining thread
110614 12:54:00 [Note] WSREP: gcomm: closing backend
110614 12:54:00 [Note] WSREP: evs::proto(0a06b77c-96c0-11e0-0800-ae72dfc6785b, LEAVING, view_id(REG,0a06b77c-96c0-11e0-0800-ae72dfc6785b,10)) uuid 212b9aa2-9658-11e0-0800-6dde59d39152 missing from install message, assuming partitioned
110614 12:54:00 [Note] WSREP: GMCast::handle_stable_view: view(view_id(NON_PRIM,0a06b77c-96c0-11e0-0800-ae72dfc6785b,10) memb {
        0a06b77c-96c0-11e0-0800-ae72dfc6785b,
} joined {
} left {
} partitioned {
        212b9aa2-9658-11e0-0800-6dde59d39152,
})
110614 12:54:00 [Note] WSREP: GMCast::handle_stable_view: view((empty))
110614 12:54:00 [Note] WSREP: gcomm: closed
110614 12:54:00 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary

Changed in codership-mysql:
status: New → Confirmed
Vadim Tkachenko (vadim-tk) wrote :

On a related note:

When the JOINER gets the error
"110614 12:54:00 [Warning] WSREP: 1 (localhost.localdomain): State transfer to 0 (localhost.localdomain) failed: -12 (Cannot allocate memory)" from the DONOR,

I guess it would make sense to try another DONOR rather than crash mysqld.

Vadim Tkachenko (vadim-tk) wrote :

However, in this case there was only one DONOR, so aborting was probably the correct decision.

Changed in codership-mysql:
importance: Undecided → Medium
assignee: nobody → Alex Yurchenko (ayurchen)
milestone: none → 0.8.1
Changed in codership-maria:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Alex Yurchenko (ayurchen)
Changed in codership-mysql:
status: Confirmed → Fix Committed
Changed in codership-maria:
status: Confirmed → Fix Committed
Vadim Tkachenko (vadim-tk) wrote :

I still have this problem using revision 3099:

110714 19:21:59 [Note] [DEBUG] WSREP: Prepared SST request: xtrabackup|192.168.0.99/xtrabackup_sst
110714 19:21:59 [Warning] WSREP: wsrep_notify_cmd is not defined, skipping notification.
110714 19:21:59 [Note] WSREP: Assign initial position for certification: 67681588, protocol version: 1
110714 19:21:59 [Note] WSREP: State transfer required:
        Group state: 8d0cd8e8-ab08-11e0-0800-bfeb931854e6:67681588
        Local state: 00000000-0000-0000-0000-000000000000:-1
110714 19:21:59 [Note] WSREP: Node 0 (cisco.office.percona.com) requested state transfer from '*any*'. Selected 1 (r815.office.percona.com)(SYNCED) as donor.
110714 19:21:59 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 67681588)
110714 19:21:59 [Note] WSREP: Requesting state transfer: success, donor: 1
110714 19:21:59 [Warning] WSREP: 1 (r815.office.percona.com): State transfer to 0 (cisco.office.percona.com) failed: -12 (Cannot allocate memory)
110714 19:21:59 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():645: Will never receive state. Need to abort.
110714 19:21:59 [Note] WSREP: gcomm: terminating thread
110714 19:21:59 [Note] WSREP: gcomm: joining thread
110714 19:21:59 [Note] WSREP: gcomm: closing backend
110714 19:22:00 [Note] WSREP: evs::proto(360eb770-ae89-11e0-0800-bbd6923083d1, LEAVING, view_id(REG,360eb770-ae89-11e0-0800-bbd6923083d1,18)) uuid ce90c4f4-ae2f-11e0-0800-d841c420f3bf missing from install message, assuming partitioned
110714 19:22:00 [Note] WSREP: GMCast::handle_stable_view: view(view_id(NON_PRIM,360eb770-ae89-11e0-0800-bbd6923083d1,18) memb {
        360eb770-ae89-11e0-0800-bbd6923083d1,
} joined {
} left {
} partitioned {
        ce90c4f4-ae2f-11e0-0800-d841c420f3bf,
})
110714 19:22:00 [Note] WSREP: GMCast::handle_stable_view: view((empty))
110714 19:22:00 [Note] WSREP: gcomm: closed
110714 19:22:00 [Note] WSREP: libexec/mysqld: Terminated.
Aborted (core dumped)

Then, on the next start:

110714 19:25:13 [Note] WSREP: wsrep_load(): loading provider library '/data/opt/data/vadim/src/galera/libgalera_smm.so'
110714 19:25:13 [Note] WSREP: wsrep_load(): Galera 0.8.1 by Codership Oy <email address hidden> loaded succesfully.
110714 19:25:13 [Note] WSREP: Passing config to GCS: gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; replicator.commit_order = 3
110714 19:25:13 [Note] WSREP: wsrep_sst_grab()
110714 19:25:13 [Note] WSREP: Start replication
110714 19:25:13 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
110714 19:25:13 [Note] WSREP: Assign initial position for certification: -1, protocol version: 1
110714 19:25:13 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
110714 19:25:13 [Note] WSREP: protonet asio version 0
110714 19:25:13 [Note] WSREP: backend: asio
110714 19:25:13 [Note] WSREP: GMCast version 0
110714 19:25:13 [Note] WSREP: (ab7a341f-ae89-11e0-0800-fd6780d48ba3, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
110714 19:25:13 [Note] WSREP: (ab7a341f-ae89-11e0-0800-fd...


Alex Yurchenko (ayurchen) wrote :

Vadim,

You need to update the xtrabackup SST script so that it monitors the state of the parent mysqld process. Check how this is done in wsrep_sst_rsync.sh.
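
For illustration, one way such monitoring could look is sketched below. This is not the code from wsrep_sst_rsync.sh. The function name watch_parent is hypothetical, and the sketch assumes the SST script is started directly by mysqld (so $PPID is the mysqld PID) and runs as the same user, so that kill -0 can probe both processes.

```shell
# Sketch only; watch_parent is a hypothetical name.
# Usage: watch_parent PARENT_PID CHILD_PID
# Polls once per second while the child is alive. If the parent has
# disappeared, the child is terminated so it stops holding its port.
# Returns 1 if the child had to be killed, 0 if it exited by itself.
watch_parent() {
    parent_pid=$1
    child_pid=$2
    while kill -0 "$child_pid" 2>/dev/null; do
        if ! kill -0 "$parent_pid" 2>/dev/null; then
            kill "$child_pid" 2>/dev/null   # parent is gone: clean up
            return 1
        fi
        sleep 1
    done
    return 0
}

# Example use inside an SST script (the transfer command is a placeholder):
#   some_transfer_command &
#   watch_parent "$PPID" "$!"
```

Polling by PID is simple but coarse: it reacts within about a second, and it relies on the parent PID not being reused by another process in the meantime.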

Changed in codership-mysql:
status: Fix Committed → Fix Released