innodb_flush_method=O_DSYNC | ALL_O_DIRECT leads to log writes with log_sys->mutex locked

Bug #1075129 reported by Alexey Kopytov
This bug affects 3 people
Affects          Status     Importance  Assigned to  Milestone
Percona Server (moved to https://jira.percona.com/projects/PS; status tracked in 5.7)
  5.1            Won't Fix  Medium      Unassigned
  5.5            Triaged    Medium      Unassigned
  5.6            Triaged    Medium      Unassigned
  5.7            Triaged    Medium      Unassigned

Bug Description

When innodb_flush_method has the default (empty) value or is O_DIRECT, InnoDB does buffered log writes with log_sys->mutex locked, and then calls fsync() after releasing the mutex, i.e. the actual I/O happens with the mutex unlocked.

With O_DSYNC or ALL_O_DIRECT, the actual I/O happens while the mutex is held, which makes log_sys->mutex very hot in some workloads.
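A simplified sketch of the two paths described above, assuming POSIX pwrite()/fsync() as stand-ins for InnoDB's own I/O layer. The names log_mutex, open_log() and write_log_current() are hypothetical; only O_DSYNC is modelled (O_DIRECT, used by ALL_O_DIRECT, needs aligned buffers and is omitted). In the real code the fsync() happens later in log_write_up_to() and only when flush_to_disk is set; here it directly follows the write for brevity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for log_sys->mutex. */
static pthread_mutex_t log_mutex = PTHREAD_MUTEX_INITIALIZER;

/* With use_dsync, every write on the returned fd completes only after the
   data has reached stable storage (O_DSYNC semantics). */
static int open_log(const char *path, bool use_dsync)
{
    int flags = O_RDWR | O_CREAT;
    if (use_dsync)
        flags |= O_DSYNC;
    return open(path, flags, 0600);
}

/* Simplified current behaviour: the write is issued with the mutex held. */
static int write_log_current(int fd, bool use_dsync,
                             const char *buf, size_t len, off_t off)
{
    assert(buf != NULL && len > 0);
    pthread_mutex_lock(&log_mutex);
    /* Default: copies into the page cache and returns quickly.
       O_DSYNC: blocks here, mutex held, until the device has the data. */
    ssize_t w = pwrite(fd, buf, len, off);
    pthread_mutex_unlock(&log_mutex);
    if (w < 0 || (size_t) w != len)
        return -1;
    /* Default: the wait for the device happens here, mutex released. */
    if (!use_dsync && fsync(fd) != 0)
        return -1;
    return 0;
}
```

In both cases the same amount of data reaches the disk; what differs is whether the slow part (waiting for the device) overlaps with the critical section.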

We can fix this by queuing the writes while holding the mutex, and then processing the queue after releasing the mutex and before returning from log_write_up_to().

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Consider the following fragment from log_write_up_to():

 group = UT_LIST_GET_FIRST(log_sys->log_groups);

 /* Do the write to the log files */

 while (group) {
  log_group_write_buf(
   group, log_sys->buf + area_start,
   area_end - area_start,
   ut_uint64_align_down(log_sys->written_to_all_lsn,
          OS_FILE_LOG_BLOCK_SIZE),
   start_offset - area_start);

  log_group_set_fields(group, log_sys->write_lsn);

  group = UT_LIST_GET_NEXT(log_groups, group);
 }

 mutex_exit(&(log_sys->mutex));

 if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC
     || srv_unix_file_flush_method == SRV_UNIX_ALL_O_DIRECT) {
  /* O_DSYNC means the OS did not buffer the log file at all:
  so we have also flushed to disk what we have written */

  log_sys->flushed_to_disk_lsn = log_sys->write_lsn;

 } else if (flush_to_disk) {

  group = UT_LIST_GET_FIRST(log_sys->log_groups);

  fil_flush(group->space_id, FALSE);
  log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
 }

There is already a log_do_write check in log_group_write_buf():

 if (log_do_write) {
  log_sys->n_log_ios++;

  srv_os_log_pending_writes++;

  fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id, 0,
         next_offset / UNIV_PAGE_SIZE,
         next_offset % UNIV_PAGE_SIZE, write_len, buf, group);

  srv_os_log_pending_writes--;

  srv_os_log_written+= write_len;
  srv_log_writes++;
 }

However, log_do_write is unconditionally set to TRUE in non-UNIV_DEBUG
builds (and is never set to FALSE in UNIV_DEBUG builds either).

Moreover, the same variable cannot be reused to skip the write, since
incrementing log_sys->n_log_ios, among other counters, requires
log_sys->mutex.

So one may want to replace the fil_io() call there with in-memory
buffering of the pending write, so that the counters are still updated
under the mutex (the worst that can happen on a crash is that the
counters are incorrect), and then do the actual I/O after mutex_exit()
in log_write_up_to(), but before the if condition that checks for
SRV_UNIX_O_DSYNC.
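A minimal model of this proposal, under stated assumptions: POSIX pwrite()/fsync() stand in for fil_io()/fil_flush(), and each log group is a separate file descriptor. Every name here (log_model, group_write_buf_deferred(), write_up_to_model(), open_log_file(), FLUSH_*) is hypothetical and only loosely mirrors InnoDB. The sketch assumes the buffer area stays valid and unmodified until the deferred writes finish; the real code would have to guarantee that for log_sys->buf.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical model; names only loosely mirror InnoDB's. */
enum flush_method { FLUSH_DEFAULT, FLUSH_O_DSYNC, FLUSH_ALL_O_DIRECT };

#define MAX_GROUPS 4

typedef struct {
    int         fd;    /* file backing one log group */
    const char *buf;   /* area to write; must stay valid until written */
    size_t      len;
    off_t       off;
} deferred_io_t;

typedef struct {
    pthread_mutex_t mutex;            /* stand-in for log_sys->mutex */
    uint64_t        write_lsn;
    uint64_t        flushed_to_disk_lsn;
    unsigned long   n_log_ios;        /* counters: updated under the mutex */
    unsigned long   log_writes;
    unsigned long   log_written;
    deferred_io_t   pending[MAX_GROUPS];
    int             n_pending;
} log_sys_model_t;

static log_sys_model_t log_model = { .mutex = PTHREAD_MUTEX_INITIALIZER };

/* O_DSYNC makes each pwrite() return only once the data is on stable
   storage. ALL_O_DIRECT would add O_DIRECT (aligned buffers), omitted. */
static int open_log_file(const char *path, enum flush_method method)
{
    int flags = O_RDWR | O_CREAT;
    if (method == FLUSH_O_DSYNC)
        flags |= O_DSYNC;
    return open(path, flags, 0600);
}

/* Stand-in for log_group_write_buf(), called with the mutex held. It only
   records the I/O; counters are bumped before the write actually happens,
   so after a crash they may overcount. */
static void group_write_buf_deferred(int fd, const char *buf, size_t len,
                                     off_t off)
{
    assert(log_model.n_pending < MAX_GROUPS);
    log_model.pending[log_model.n_pending++] = (deferred_io_t){ fd, buf, len, off };
    log_model.n_log_ios++;
    log_model.log_writes++;
    log_model.log_written += len;
}

/* Stand-in for log_write_up_to(): returns 0 on success, -1 on I/O error. */
static int write_up_to_model(const int *fds, int n_groups, const char *buf,
                             size_t len, off_t off, uint64_t lsn,
                             enum flush_method method, bool flush_to_disk)
{
    deferred_io_t io[MAX_GROUPS];
    int n;

    pthread_mutex_lock(&log_model.mutex);
    for (int g = 0; g < n_groups; g++)
        group_write_buf_deferred(fds[g], buf, len, off);
    log_model.write_lsn = lsn;
    n = log_model.n_pending;          /* detach the queue while locked */
    memcpy(io, log_model.pending, (size_t) n * sizeof io[0]);
    log_model.n_pending = 0;
    pthread_mutex_unlock(&log_model.mutex);   /* mutex_exit() */

    /* Deferred I/O: after mutex_exit(), before the O_DSYNC check. */
    for (int i = 0; i < n; i++) {
        ssize_t w = pwrite(io[i].fd, io[i].buf, io[i].len, io[i].off);
        if (w < 0 || (size_t) w != io[i].len)
            return -1;
    }

    if (method == FLUSH_O_DSYNC || method == FLUSH_ALL_O_DIRECT) {
        /* As in the original branch: the writes count as already flushed. */
    } else if (flush_to_disk) {
        for (int g = 0; g < n_groups; g++)
            if (fsync(fds[g]) != 0)
                return -1;
    } else {
        return 0;                     /* written, but not flushed */
    }
    pthread_mutex_lock(&log_model.mutex);
    log_model.flushed_to_disk_lsn = lsn;
    pthread_mutex_unlock(&log_model.mutex);
    return 0;
}
```

The queue is copied out before unlocking so the thread does not need to reacquire the mutex just to find its own writes. Unlike the quoted fragment, the sketch publishes flushed_to_disk_lsn under the mutex; that is a choice of this model, not something the report proposes.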

Even though this should benefit O_DSYNC / ALL_O_DIRECT the most, it will
also benefit the default case, since it avoids the overhead of
_fil_aio under the mutex.

tags: added: xtradb
Laurynas Biveinis (laurynas-biveinis) wrote :
tags: added: performance
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-1277
