FTWRL misbehaves

Bug #1170821 reported by Raghavendra D Prabhu
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Galera
Fix Released
High
Alex Yurchenko
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Fix Released
High
Raghavendra D Prabhu

Bug Description

As per https://bugs.launchpad.net/percona-xtrabackup/+bug/1166752/comments/5 FTWRL doesn't behave as expected. The behavior with stock MySQL is different than that.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

This looks like an issue with galera:

=======================================================
#ifdef WITH_WSREP
  long long ret = wsrep->pause(wsrep);
  if (ret >= 0)
  {
    wsrep_locked_seqno= ret;
  }
  else if (ret != -ENOSYS) /* -ENOSYS - no provider */
  {
    WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

    /* m_mdl_blocks_commits_lock is always NULL here */
    wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
    my_error(ER_LOCK_DEADLOCK, MYF(0));
    DBUG_RETURN(TRUE);
  }
#endif /* WITH_WSREP */

=============================================================

So with second FTWRL it tries to pause galera again

=================
wsrep_seqno_t galera::ReplicatorSMM::pause()
{
    gu_trace(local_monitor_.lock());

    wsrep_seqno_t const ret(cert_.position());

    apply_monitor_.drain(ret);
    assert (apply_monitor_.last_left() >= ret);

    if (co_mode_ != CommitOrder::BYPASS)
    {
        commit_monitor_.drain(ret);
        assert (commit_monitor_.last_left() >= ret);
    }

    st_.set(state_uuid_, ret);

    log_info << "Provider paused at " << state_uuid_ << ':' << ret;

    return ret;
}
===========================================

So, this looks like the second pause fails or returns zero.

=============================================================================================================
To conclude, this needs to be handled specially otherwise not just xtrabackup but other applications which rely on FTWRL will also break.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

There should be a way to detect if galera is already paused in WSREP, so that if already paused not pause it again.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

This is how the logs look:
130420 9:56:26 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:13:37 [Note] WSREP: Provider resumed.
130420 10:21:19 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
         at galera/src/monitor.hpp:lock():271
130420 10:21:42 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)

.................................................................................................................................................................................................
To detect if it is alreay paused, does it suffice to check if wsrep_locked_seqno is not WSREP_SEQNO_UNDEFINED?

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I tested with this patch: https://gist.github.com/ronin13/ee60ccbec4893b8fc1e1

It seems to be working.

Patch:
=== modified file 'Percona-Server/sql/lock.cc'
--- Percona-Server/sql/lock.cc 2012-11-26 13:57:13 +0000
+++ Percona-Server/sql/lock.cc 2013-04-20 05:19:27 +0000
@@ -1085,19 +1085,22 @@
   m_state= GRL_ACQUIRED_AND_BLOCKS_COMMIT;

 #ifdef WITH_WSREP
- long long ret = wsrep->pause(wsrep);
- if (ret >= 0)
- {
- wsrep_locked_seqno= ret;
- }
- else if (ret != -ENOSYS) /* -ENOSYS - no provider */
- {
- WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));
+ /* Pause the provider given that it is not paused already */
+ if (wsrep_locked_seqno == WSREP_SEQNO_UNDEFINED) {
+ long long ret = wsrep->pause(wsrep);
+ if (ret >= 0)
+ {
+ wsrep_locked_seqno= ret;
+ }
+ else if (ret != -ENOSYS) /* -ENOSYS - no provider */
+ {
+ WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

- /* m_mdl_blocks_commits_lock is always NULL here */
- wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
- my_error(ER_LOCK_DEADLOCK, MYF(0));
- DBUG_RETURN(TRUE);
+ /* m_mdl_blocks_commits_lock is always NULL here */
+ wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
+ my_error(ER_LOCK_DEADLOCK, MYF(0));
+ DBUG_RETURN(TRUE);
+ }
   }
 #endif /* WITH_WSREP */
   DBUG_RETURN(FALSE);

Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-24.8
assignee: nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status: New → Triaged
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Now, it shows as:

=========
130420 10:59:52 [Note] WSREP: Provider paused at bff239d3-a979-11e2-0800-d80cfb7cf3c6:0
130420 10:59:55 [ERROR] WSREP: Provider paused already with wsrep_locked_seqno: 0
=========

and second FTWRL goes fine.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Raghu, the patch you proposed is not thread-safe. I think this should be fixed in Galera by using a lock counter.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Yes, fixing this in galera would be much better.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Alexey,

Added galera to this bug for fixing in galera.

Changed in galera:
assignee: nobody → Alex Yurchenko (ayurchen)
importance: Undecided → Medium
milestone: none → 23.2.6
status: New → Confirmed
Changed in percona-xtradb-cluster:
importance: Undecided → High
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

One more thing I observed here is that, the provider is resumed even if FTWRL fails.

This is how it looks:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

130531 9:25:27 [Note] WSREP: Provider paused at 64790715-c156-11e2-0800-d014f1bda320:47141031
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
         at galera/src/monitor.hpp:lock():271
130531 9:25:28 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)
130531 9:25:28 [Note] WSREP: Provider resumed.
130531 9:25:33 [Warning] WSREP: Attempt to unlock an already unlocked monitor.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This is why you see " WSREP: Attempt to unlock an already unlocked monitor. " later

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

this is bloody dangerous

Changed in galera:
importance: Medium → High
Changed in galera:
status: Confirmed → In Progress
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Fix committed in r152

no longer affects: codership-mysql
Changed in galera:
status: In Progress → Fix Committed
Changed in percona-xtradb-cluster:
status: Triaged → Fix Released
Changed in galera:
status: Fix Committed → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-959

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.