Percona XtraDB Cluster - HA scalable solution for MySQL

FTWRL misbehaves

Reported by Raghavendra D Prabhu on 2013-04-19
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
High
Alex Yurchenko
Percona XtraDB Cluster
High
Raghavendra D Prabhu

Bug Description

As per https://bugs.launchpad.net/percona-xtrabackup/+bug/1166752/comments/5 FTWRL doesn't behave as expected. The behavior with stock MySQL is different than that.

This looks like an issue with galera:

=======================================================
#ifdef WITH_WSREP
  long long ret = wsrep->pause(wsrep);
  if (ret >= 0)
  {
    wsrep_locked_seqno= ret;
  }
  else if (ret != -ENOSYS) /* -ENOSYS - no provider */
  {
    WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

    /* m_mdl_blocks_commits_lock is always NULL here */
    wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
    my_error(ER_LOCK_DEADLOCK, MYF(0));
    DBUG_RETURN(TRUE);
  }
#endif /* WITH_WSREP */

=============================================================

So with second FTWRL it tries to pause galera again

=================
wsrep_seqno_t galera::ReplicatorSMM::pause()
{
    gu_trace(local_monitor_.lock());

    wsrep_seqno_t const ret(cert_.position());

    apply_monitor_.drain(ret);
    assert (apply_monitor_.last_left() >= ret);

    if (co_mode_ != CommitOrder::BYPASS)
    {
        commit_monitor_.drain(ret);
        assert (commit_monitor_.last_left() >= ret);
    }

    st_.set(state_uuid_, ret);

    log_info << "Provider paused at " << state_uuid_ << ':' << ret;

    return ret;
}
===========================================

So, this looks like the second pause fails or returns zero.

=============================================================================================================
To conclude, this needs to be handled specially otherwise not just xtrabackup but other applications which rely on FTWRL will also break.

There should be a way to detect if galera is already paused in WSREP, so that if already paused not pause it again.

This is how the logs look:
130420 9:56:26 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:13:37 [Note] WSREP: Provider resumed.
130420 10:21:19 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
         at galera/src/monitor.hpp:lock():271
130420 10:21:42 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)

.................................................................................................................................................................................................
To detect if it is alreay paused, does it suffice to check if wsrep_locked_seqno is not WSREP_SEQNO_UNDEFINED?

I tested with this patch: https://gist.github.com/ronin13/ee60ccbec4893b8fc1e1

It seems to be working.

Patch:
=== modified file 'Percona-Server/sql/lock.cc'
--- Percona-Server/sql/lock.cc 2012-11-26 13:57:13 +0000
+++ Percona-Server/sql/lock.cc 2013-04-20 05:19:27 +0000
@@ -1085,19 +1085,22 @@
   m_state= GRL_ACQUIRED_AND_BLOCKS_COMMIT;

 #ifdef WITH_WSREP
- long long ret = wsrep->pause(wsrep);
- if (ret >= 0)
- {
- wsrep_locked_seqno= ret;
- }
- else if (ret != -ENOSYS) /* -ENOSYS - no provider */
- {
- WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));
+ /* Pause the provider given that it is not paused already */
+ if (wsrep_locked_seqno == WSREP_SEQNO_UNDEFINED) {
+ long long ret = wsrep->pause(wsrep);
+ if (ret >= 0)
+ {
+ wsrep_locked_seqno= ret;
+ }
+ else if (ret != -ENOSYS) /* -ENOSYS - no provider */
+ {
+ WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

- /* m_mdl_blocks_commits_lock is always NULL here */
- wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
- my_error(ER_LOCK_DEADLOCK, MYF(0));
- DBUG_RETURN(TRUE);
+ /* m_mdl_blocks_commits_lock is always NULL here */
+ wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
+ my_error(ER_LOCK_DEADLOCK, MYF(0));
+ DBUG_RETURN(TRUE);
+ }
   }
 #endif /* WITH_WSREP */
   DBUG_RETURN(FALSE);

Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-24.8
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status: New → Triaged

Now, it shows as:

=========
130420 10:59:52 [Note] WSREP: Provider paused at bff239d3-a979-11e2-0800-d80cfb7cf3c6:0
130420 10:59:55 [ERROR] WSREP: Provider paused already with wsrep_locked_seqno: 0
=========

and second FTWRL goes fine.

Alex Yurchenko (ayurchen) wrote :

Raghu, the patch you proposed is not thread-safe. I think this should be fixed in Galera by using a lock counter.

Yes, fixing this in galera would be much better.

@Alexey,

Added galera to this bug for fixing in galera.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → Medium
milestone: none → 23.2.6
status: New → Confirmed
Changed in percona-xtradb-cluster:
importance: Undecided → High

One more thing I observed here is that, the provider is resumed even if FTWRL fails.

This is how it looks:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

130531 9:25:27 [Note] WSREP: Provider paused at 64790715-c156-11e2-0800-d014f1bda320:47141031
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
         at galera/src/monitor.hpp:lock():271
130531 9:25:28 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)
130531 9:25:28 [Note] WSREP: Provider resumed.
130531 9:25:33 [Warning] WSREP: Attempt to unlock an already unlocked monitor.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This is why you see " WSREP: Attempt to unlock an already unlocked monitor. " later

Alex Yurchenko (ayurchen) wrote :

this is bloody dangerous

Changed in galera:
importance: Medium → High
Changed in galera:
status: Confirmed → In Progress
Alex Yurchenko (ayurchen) wrote :

Fix committed in r152

no longer affects: codership-mysql
Changed in galera:
status: In Progress → Fix Committed
Changed in percona-xtradb-cluster:
status: Triaged → Fix Released
Changed in galera:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers