Comment 14 for bug 1206008

Yura Sorokin (yura-sorokin) wrote :

The problem turned out to be a race condition between "srv_error_monitor_thread" in srv0srv.cc:
while (trx) {
  if (!trx_state_eq(trx, TRX_STATE_NOT_STARTED)
       && trx_state_eq(trx, TRX_STATE_ACTIVE)
       && trx->mysql_thd
       && innobase_thd_is_idle(trx->mysql_thd))
  {...}
  ...
}

and a query execution thread like the one in comment #11, in particular "trx_commit_in_memory" in trx0trx.cc:
...
ut_ad(trx_state_eq(trx, TRX_STATE_ACTIVE));
trx->state = TRX_STATE_NOT_STARTED;
read_view_remove(trx->global_read_view, false);
MONITOR_INC(MONITOR_TRX_NL_RO_COMMIT);
...

The sequence that triggers the assertion (and the crash) is the following:
1. thread 1 ("srv_error_monitor_thread") is running with trx->state == TRX_STATE_ACTIVE
2. thread 1 executes the first part in the "if" statement condition
      !trx_state_eq(trx, TRX_STATE_NOT_STARTED)
3. trx_state_eq() in thread 1 returns "false", so the negated operand is true and the next operand of the && operator is going to be evaluated.
4. context switches to thread 2 (query execution thread)
5. thread 2 changes transaction state to TRX_STATE_NOT_STARTED
      trx->state = TRX_STATE_NOT_STARTED;
6. at some point context switches back to thread 1 ("srv_error_monitor_thread")
7. the next part of the && condition of the "if" statement is evaluated
      && trx_state_eq(trx, TRX_STATE_ACTIVE)
      with trx->state == TRX_STATE_NOT_STARTED (changed by thread 2)
8. inside trx_state_eq() we run into an assertion
      switch (trx->state) {
        ...
        case TRX_STATE_NOT_STARTED:
          /* This state is not allowed for running transactions. */
          ut_a(state == TRX_STATE_NOT_STARTED);
          ...
       }
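
The interleaving in steps 1-8 can be replayed deterministically in a single thread. This is only a minimal sketch and not the InnoDB code: trx_t here is reduced to its state field, trx_state_eq_model() is a hypothetical simplified stand-in for trx_state_eq(), and an exception takes the place of the ut_a() abort so that the outcome can be observed.

```cpp
#include <stdexcept>

// Simplified, hypothetical stand-ins for the InnoDB types.
enum trx_state_t { TRX_STATE_NOT_STARTED, TRX_STATE_ACTIVE };

struct trx_t {
    trx_state_t state;
};

// Model of trx_state_eq(): throws where the real function would hit ut_a().
bool trx_state_eq_model(const trx_t* trx, trx_state_t state) {
    switch (trx->state) {
    case TRX_STATE_ACTIVE:
        return trx->state == state;
    case TRX_STATE_NOT_STARTED:
        // Mirrors ut_a(state == TRX_STATE_NOT_STARTED).
        if (state != TRX_STATE_NOT_STARTED) {
            throw std::logic_error("ut_a(state == TRX_STATE_NOT_STARTED)");
        }
        return true;
    }
    return false;
}

// Replays steps 1-8 in order; returns true if the modelled assertion fires.
bool replay_interleaving() {
    trx_t trx{TRX_STATE_ACTIVE};                          // step 1

    // Steps 2-3: the first operand of && is evaluated while the state is ACTIVE.
    bool first = !trx_state_eq_model(&trx, TRX_STATE_NOT_STARTED);

    // Steps 4-5: the "query thread" commits and resets the state.
    trx.state = TRX_STATE_NOT_STARTED;

    // Steps 6-8: the second operand is evaluated against the changed state.
    try {
        bool second = first && trx_state_eq_model(&trx, TRX_STATE_ACTIVE);
        (void)second;
    } catch (const std::logic_error&) {
        return true;                                      // assertion hit
    }
    return false;
}
```

In this model replay_interleaving() returns true: the first check passes, and the second call trips the modelled assertion because the state was changed between the two evaluations.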

At the same time the following two assertions
  ut_ad(!trx->in_rw_trx_list);
  ut_ad(!trx->in_ro_trx_list);
pass without any problems.

In other words it was
  ut_a(state == TRX_STATE_NOT_STARTED);
causing the crash, not the following two "in_rw_trx_list" and "in_ro_trx_list" checks.

The suggested fix is to simplify the
  if (!trx_state_eq(trx, TRX_STATE_NOT_STARTED)
       && trx_state_eq(trx, TRX_STATE_ACTIVE)
       && ...)
statement to just
  if (trx_state_eq(trx, TRX_STATE_ACTIVE)
      && ...)

and to remove the
  ut_a(state == TRX_STATE_NOT_STARTED);
check from trx_state_eq().
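
A rough sketch of what the suggested change amounts to, again using simplified, hypothetical stand-ins rather than the real InnoDB definitions: trx_state_eq_relaxed() models trx_state_eq() without the ut_a() check, and monitor_should_inspect() models only the state part of the monitor's "if" condition (the mysql_thd and idle checks are left out).

```cpp
// Simplified, hypothetical stand-ins for the InnoDB types.
enum trx_state_t { TRX_STATE_NOT_STARTED, TRX_STATE_ACTIVE };

struct trx_t {
    trx_state_t state;
};

// Model of trx_state_eq() with the ut_a() check removed:
// a state mismatch is reported as "false" instead of aborting.
bool trx_state_eq_relaxed(const trx_t* trx, trx_state_t state) {
    return trx->state == state;
}

// Model of the reworked condition: a single state comparison, so there is
// no window between two separate evaluations of trx->state.
bool monitor_should_inspect(const trx_t* trx) {
    return trx_state_eq_relaxed(trx, TRX_STATE_ACTIVE);
}
```

With this shape, a transaction that has already moved to TRX_STATE_NOT_STARTED is simply skipped by the monitor instead of triggering an assertion.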