disk full log-bin node causes cluster hang

Bug #1372508 reported by Jay Janssen
Affects                        Status   Importance  Assigned to
MySQL patches by Codership     New      Undecided   Unassigned
Percona XtraDB Cluster 5.5     Invalid  Undecided   Unassigned
Percona XtraDB Cluster 5.6     Invalid  Undecided   Unassigned

Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC (status tracked in 5.6)

Bug Description

A node that fills its disk, with log-bin disabled, will generate RBR apply errors:

140922 15:08:09 [ERROR] Slave SQL: Could not execute Write_rows event on table sbtest.sbtest1; The table 'sbtest1' is full, Error_code: 1114; handler error HA_ERR_RECORD_FILE_FULL; the event's master log FIRST, end_log_pos 273, Error_code: 1114
140922 15:08:09 [Warning] WSREP: RBR event 2 Write_rows apply warning: 135, 40012

The node then aborts and the cluster continues. However, I question whether this is good behavior in the case of multiple nodes filling up disk simultaneously.

If that node instead has log-bin enabled, it does not abort on the RBR error; it hangs with this message:

140922 15:00:18 [Warning] Disk is full writing './mysql-bin.000001' (Errcode: 28). Waiting for someone to free space... (Expect up to 60 secs delay for server to continue after freeing disk space)
140922 15:00:18 [Warning] Retry in 60 secs. Message reprinted in 600 secs

In this case the node just blocks writes and triggers flow control indefinitely -- a total cluster hang.
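Since a full binlog volume stalls the whole cluster via flow control, the practical mitigation is to alert on disk usage well before ENOSPC (errno 28) is hit. A minimal sketch of such a check follows; the helper name, the 90% threshold, and the demo path are illustrative assumptions, not part of PXC or MySQL:

```shell
#!/bin/sh
# Illustrative sketch: report percent-used for the filesystem holding a
# given path, so a cron job can alert before the datadir/binlog volume
# fills and the node stalls (log-bin on) or aborts (log-bin off).
disk_pct() {
    # POSIX df: column 5 is "Use%"; strip the trailing '%'
    df -P "$1" | awk 'NR==2 { gsub("%", "", $5); print $5 }'
}

PCT=$(disk_pct /)    # demo: check the root volume
echo "volume is ${PCT}% full"
if [ "$PCT" -ge 90 ]; then    # 90% is an assumed alert threshold
    echo "WARNING: free space now, before the node stalls or aborts"
fi
```

In practice the path argument would be the MySQL datadir or the volume holding `mysql-bin.*`, and the script would run from cron or a monitoring agent on every node.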

These are both stock MySQL behaviors, I think. The question is whether either behavior is correct or incorrect, and whether we can or should do anything to modify it.

I tested this on 5.5, but I think it applies to 5.6 as well.

Revision history for this message
Jay Janssen (jay-janssen) wrote :

I tested 5.6 and couldn't reproduce this. I'm guessing some behavior changed in 5.6 that causes the node to abort instead of hang. This may be a 'wontfix', but at least the situation is documented now.

Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

Unable to reproduce this with 5.5.39, and with 5.6 of course. I'm seeing that the node aborts instead of hangs when it runs out of disk space. But yes, even though the node aborts, the cluster continues; it may still be a problem if this happens on all nodes.
So, going to set this to Confirmed for both.

The behavior is the same with and without the binlog enabled:

140910 4:11:09 [ERROR] Slave SQL: Could not execute Write_rows event on table test.nil; Error writing file '/tmp/ML6yIoI4' (Errcode: 28), Error_code: 3; Error writing file 'mysql-bin' (errno: 28), Error_code: 1026; Writing one row to the row-based binary log failed, Error_code: 1534; handler error HA_ERR_RBR_LOGGING_FAILED; the event's master log FIRST, end_log_pos 34167, Error_code: 3
140910 4:11:09 [Warning] WSREP: RBR event 34 Write_rows apply warning: 161, 70031
140910 4:11:09 [Warning] WSREP: Failed to apply app buffer: seqno: 70031, status: 1
         at galera/src/replicator_smm.cpp:apply_wscoll():55
Retrying 10th time
140910 4:11:09 [ERROR] Slave SQL: Could not execute Write_rows event on table test.nil; Error writing file '/tmp/ML6yIoI4' (Errcode: 28), Error_c
140910 4:11:09 [ERROR] WSREP: Failed to apply trx: source: 1123aa17-383d-11e4-ab92-ae35b2c19fa2 version: 2 local: 0 state: APPLYING flags: 1 conn_id: 5 trx_id: 136449 seqnos (l: 5, g: 70031, s: 70030, d: 70030, ts: 1410279820834811118)
140910 4:11:09 [ERROR] WSREP: Failed to apply trx 70031 10 times
140910 4:11:09 [ERROR] WSREP: Node consistency compromized, aborting...
140910 4:11:09 [Note] WSREP: Closing send monitor...
140910 4:11:09 [Note] WSREP: Closed send monitor.
140910 4:11:09 [Note] WSREP: gcomm: terminating thread
140910 4:11:09 [Note] WSREP: gcomm: joining thread
140910 4:11:09 [Note] WSREP: gcomm: closing backend
140910 4:11:10 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
140910 4:11:10 [Note] WSREP: gcomm: closed
140910 4:11:10 [Note] WSREP: Flow-control interval: [16, 16]
140910 4:11:10 [Note] WSREP: Received NON-PRIMARY.
140910 4:11:10 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 70031)
140910 4:11:10 [Note] WSREP: Received self-leave message.
140910 4:11:10 [Note] WSREP: Flow-control interval: [0, 0]
140910 4:11:10 [Note] WSREP: Received SELF-LEAVE. Closing connection.
140910 4:11:10 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 70031)
140910 4:11:10 [Note] WSREP: RECV thread exiting 0: Success
140910 4:11:10 [Note] WSREP: recv_thread() joined.
140910 4:11:10 [Note] WSREP: Closing replication queue.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Nil,

Were you not able to reproduce the hang when the disk is full on 5.5 as well? The bug is about the hang, not the abort.

Revision history for this message
Nilnandan Joshi (nilnandan-joshi) wrote :

Hi Raghu,

Yes, the node was not hanging but aborting on both 5.5 and 5.6.

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

So aborting is the expected behavior. A hang would suggest something to look at.

Note: the cluster node aborts, but the cluster continues to operate.

Based on this, marking it as Not a Bug.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1742
