Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

FTWRL misbehaves

Bug #1170821 reported by Raghavendra D Prabhu on 2013-04-19

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Galera	Fix Released	High	Alex Yurchenko	Galera 23.2.6
	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Fix Released	High	Raghavendra D Prabhu	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.31-23.7.5

Bug Description

As per https://bugs.launchpad.net/percona-xtrabackup/+bug/1166752/comments/5 FTWRL doesn't behave as expected. The behavior with stock MySQL is different than that.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

This looks like an issue with galera:

=======================================================
#ifdef WITH_WSREP
  long long ret = wsrep->pause(wsrep);
  if (ret >= 0)
  {
    wsrep_locked_seqno= ret;
  }
  else if (ret != -ENOSYS) /* -ENOSYS - no provider */
  {
    WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

    /* m_mdl_blocks_commits_lock is always NULL here */
    wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
    my_error(ER_LOCK_DEADLOCK, MYF(0));
    DBUG_RETURN(TRUE);
  }
#endif /* WITH_WSREP */

=============================================================

So with second FTWRL it tries to pause galera again

=================
wsrep_seqno_t galera::ReplicatorSMM::pause()
{
gu_trace(local_monitor_.lock());

wsrep_seqno_t const ret(cert_.position());

apply_monitor_.drain(ret);
assert (apply_monitor_.last_left() >= ret);

    if (co_mode_ != CommitOrder::BYPASS)
    {
        commit_monitor_.drain(ret);
        assert (commit_monitor_.last_left() >= ret);
    }

st_.set(state_uuid_, ret);

log_info << "Provider paused at " << state_uuid_ << ':' << ret;

return ret;
}
===========================================

So, this looks like the second pause fails or returns zero.

=============================================================================================================
To conclude, this needs to be handled specially otherwise not just xtrabackup but other applications which rely on FTWRL will also break.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

There should be a way to detect if galera is already paused in WSREP, so that if already paused not pause it again.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

This is how the logs look:
130420 9:56:26 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:13:37 [Note] WSREP: Provider resumed.
130420 10:21:19 [Note] WSREP: Provider paused at 11264cec-06e6-11e2-0800-61616b1fc754:3
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.
130420 10:21:42 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
at galera/src/monitor.hpp:lock():271
130420 10:21:42 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)

.................................................................................................................................................................................................
To detect if it is alreay paused, does it suffice to check if wsrep_locked_seqno is not WSREP_SEQNO_UNDEFINED?

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

I tested with this patch: https://gist.github.com/ronin13/ee60ccbec4893b8fc1e1

It seems to be working.

Patch:
=== modified file 'Percona-Server/sql/lock.cc'
--- Percona-Server/sql/lock.cc 2012-11-26 13:57:13 +0000
+++ Percona-Server/sql/lock.cc 2013-04-20 05:19:27 +0000
@@ -1085,19 +1085,22 @@
m_state= GRL_ACQUIRED_AND_BLOCKS_COMMIT;

#ifdef WITH_WSREP
- long long ret = wsrep->pause(wsrep);
- if (ret >= 0)
- {
- wsrep_locked_seqno= ret;
- }
- else if (ret != -ENOSYS) /* -ENOSYS - no provider */
- {
- WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));
+ /* Pause the provider given that it is not paused already */
+ if (wsrep_locked_seqno == WSREP_SEQNO_UNDEFINED) {
+ long long ret = wsrep->pause(wsrep);
+ if (ret >= 0)
+ {
+ wsrep_locked_seqno= ret;
+ }
+ else if (ret != -ENOSYS) /* -ENOSYS - no provider */
+ {
+ WSREP_ERROR("Failed to pause provider: %lld (%s)", -ret, strerror(-ret));

- /* m_mdl_blocks_commits_lock is always NULL here */
- wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
- my_error(ER_LOCK_DEADLOCK, MYF(0));
- DBUG_RETURN(TRUE);
+ /* m_mdl_blocks_commits_lock is always NULL here */
+ wsrep_locked_seqno= WSREP_SEQNO_UNDEFINED;
+ my_error(ER_LOCK_DEADLOCK, MYF(0));
+ DBUG_RETURN(TRUE);
+ }
}
#endif /* WITH_WSREP */
DBUG_RETURN(FALSE);

Changed in percona-xtradb-cluster:
milestone:	none → 5.5.30-24.8
assignee:	nobody → Raghavendra D Prabhu (raghavendra-prabhu)
status:	New → Triaged

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

https://gist.github.com/2d1394a36fae5020a326

Added a message.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-20:

Now, it shows as:

=========
130420 10:59:52 [Note] WSREP: Provider paused at bff239d3-a979-11e2-0800-d80cfb7cf3c6:0
130420 10:59:55 [ERROR] WSREP: Provider paused already with wsrep_locked_seqno: 0
=========

and second FTWRL goes fine.

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-04-20:

Raghu, the patch you proposed is not thread-safe. I think this should be fixed in Galera by using a lock counter.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-04-22:

Yes, fixing this in galera would be much better.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-07:

@Alexey,

Added galera to this bug for fixing in galera.

Alex Yurchenko (ayurchen) on 2013-05-18

Changed in galera:
assignee:	nobody → Alex Yurchenko (ayurchen)
importance:	Undecided → Medium
milestone:	none → 23.2.6
status:	New → Confirmed

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-05-31

Changed in percona-xtradb-cluster:
importance:	Undecided → High

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-06-04:

#10

One more thing I observed here is that, the provider is resumed even if FTWRL fails.

This is how it looks:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

130531 9:25:27 [Note] WSREP: Provider paused at 64790715-c156-11e2-0800-d014f1bda320:47141031
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.
130531 9:25:28 [ERROR] WSREP: Attempt to lock an already locked monitor.: 35 (Resource deadlock avoided)
at galera/src/monitor.hpp:lock():271
130531 9:25:28 [ERROR] WSREP: Failed to pause provider: 35 (Resource deadlock avoided)
130531 9:25:28 [Note] WSREP: Provider resumed.
130531 9:25:33 [Warning] WSREP: Attempt to unlock an already unlocked monitor.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

This is why you see " WSREP: Attempt to unlock an already unlocked monitor. " later

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-06-04:

#11

this is bloody dangerous

Changed in galera:
importance:	Medium → High

Alex Yurchenko (ayurchen) on 2013-06-11

Changed in galera:
status:	Confirmed → In Progress

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-06-13:

#12

Fix committed in r152

no longer affects:	codership-mysql
Changed in galera:
status:	In Progress → Fix Committed

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-06-13

Changed in percona-xtradb-cluster:
status:	Triaged → Fix Released

Alex Yurchenko (ayurchen) on 2013-06-28

Changed in galera:
status:	Fix Committed → Fix Released

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

#13

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-959

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.