Percona Server moved to https://jira.percona.com/projects/PS

innodb_flush_method=O_DSYNC | ALL_O_DIRECT leads to log writes with log_sys->mutex locked

Bug #1075129 reported by Alexey Kopytov on 2012-11-05

This bug affects 3 people

Status tracked in 5.7

Series	Status	Importance	Assigned to
5.1	Won't Fix	Medium	Unassigned
5.5	Triaged	Medium	Unassigned
5.6	Triaged	Medium	Unassigned
5.7	Triaged	Medium	Unassigned

Bug Description

When innodb_flush_method has the default (empty) value or is O_DIRECT, InnoDB does buffered log writes with log_sys->mutex locked, and then calls fsync() after releasing the mutex, i.e. the actual I/O happens with the mutex unlocked.

With O_DSYNC or ALL_O_DIRECT, the log writes are synchronous, so the actual I/O happens while the mutex is held. This makes log_sys->mutex very hot in some workloads.

We can fix this by queuing the writes inside the lock, and then processing the queue after releasing the mutex and before returning from log_write_up_to().
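
A rough sketch of the queuing idea, written as standalone C rather than as a patch against InnoDB. Everything here (struct log_sketch, struct pending_write, log_queue_write, log_write_up_to_sketch, the fixed queue sizes) is hypothetical. It also ignores the coordination InnoDB would need so that concurrent callers complete their writes in a well-defined order.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_PENDING   16    /* hypothetical queue capacity */
#define MAX_WRITE_LEN 512   /* hypothetical max bytes per queued write */

struct pending_write {
    off_t         offset;
    size_t        len;
    unsigned char data[MAX_WRITE_LEN];
};

struct log_sketch {
    pthread_mutex_t      mutex;      /* stand-in for log_sys->mutex */
    unsigned long        n_log_ios;  /* counter updated under the mutex */
    struct pending_write queue[MAX_PENDING];
    size_t               n_pending;
};

/* Caller must hold log->mutex. Copies the data and bumps the counter,
   but performs no file I/O. */
static int log_queue_write(struct log_sketch *log, off_t offset,
                           const void *buf, size_t len)
{
    struct pending_write *w;

    assert(log != NULL && buf != NULL);
    if (log->n_pending == MAX_PENDING || len > MAX_WRITE_LEN)
        return -1;
    w = &log->queue[log->n_pending++];
    w->offset = offset;
    w->len = len;
    memcpy(w->data, buf, len);
    log->n_log_ios++;
    return 0;
}

/* Queue the write under the mutex, detach the queue into a local copy,
   release the mutex, and only then issue the pwrite() calls. */
int log_write_up_to_sketch(struct log_sketch *log, int fd, off_t offset,
                           const void *buf, size_t len)
{
    struct pending_write local[MAX_PENDING];
    size_t n, i;

    pthread_mutex_lock(&log->mutex);
    if (log_queue_write(log, offset, buf, len) != 0) {
        pthread_mutex_unlock(&log->mutex);
        return -1;
    }
    n = log->n_pending;
    memcpy(local, log->queue, n * sizeof local[0]);
    log->n_pending = 0;
    pthread_mutex_unlock(&log->mutex);

    for (i = 0; i < n; i++) {       /* I/O happens with the mutex released */
        if (pwrite(fd, local[i].data, local[i].len, local[i].offset)
            != (ssize_t)local[i].len)
            return -1;
    }
    return 0;
}

/* Writes two records through the queue and reads them back; 0 on success. */
int log_queue_demo(const char *path)
{
    struct log_sketch log;
    char out[8];
    int fd, ok;

    memset(&log, 0, sizeof log);
    pthread_mutex_init(&log.mutex, NULL);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    ok = log_write_up_to_sketch(&log, fd, 0, "abcd", 4) == 0
        && log_write_up_to_sketch(&log, fd, 4, "efgh", 4) == 0
        && log.n_log_ios == 2 && log.n_pending == 0
        && pread(fd, out, sizeof out, 0) == 8
        && memcmp(out, "abcdefgh", 8) == 0;
    close(fd);
    unlink(path);
    pthread_mutex_destroy(&log.mutex);
    return ok ? 0 : -1;
}
```

Copying the data into the queue entry is a deliberate simplification: it lets the source buffer change once the mutex is released, at the cost of an extra memcpy under the lock.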

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2012-11-26:

Considering the following fragment from log_write_up_to:

group = UT_LIST_GET_FIRST(log_sys->log_groups);

/* Do the write to the log files */

while (group) {
	log_group_write_buf(
		group, log_sys->buf + area_start,
		area_end - area_start,
		ut_uint64_align_down(log_sys->written_to_all_lsn,
				     OS_FILE_LOG_BLOCK_SIZE),
		start_offset - area_start);

	log_group_set_fields(group, log_sys->write_lsn);

	group = UT_LIST_GET_NEXT(log_groups, group);
}

mutex_exit(&(log_sys->mutex));

if (srv_unix_file_flush_method == SRV_UNIX_O_DSYNC
    || srv_unix_file_flush_method == SRV_UNIX_ALL_O_DIRECT) {
	/* O_DSYNC means the OS did not buffer the log file at all:
	so we have also flushed to disk what we have written */

	log_sys->flushed_to_disk_lsn = log_sys->write_lsn;

} else if (flush_to_disk) {

	group = UT_LIST_GET_FIRST(log_sys->log_groups);

	fil_flush(group->space_id, FALSE);
	log_sys->flushed_to_disk_lsn = log_sys->write_lsn;
}

There already is a log_do_write in log_group_write_buf:

if (log_do_write) {
	log_sys->n_log_ios++;

	srv_os_log_pending_writes++;

	fil_io(OS_FILE_WRITE | OS_FILE_LOG, TRUE, group->space_id, 0,
	       next_offset / UNIV_PAGE_SIZE,
	       next_offset % UNIV_PAGE_SIZE, write_len, buf, group);

	srv_os_log_pending_writes--;

	srv_os_log_written += write_len;
	srv_log_writes++;
}

However, it is unconditionally set to TRUE in non-UNIV_DEBUG builds (and
is never set to FALSE in UNIV_DEBUG builds either).

Moreover, the same variable cannot be reused, since incrementing
log_sys->n_log_ios, among other counters, requires the log_sys mutex.

So, one may want to replace the fil_io call there with in-memory
buffering, so that the counters are still updated under the mutex (the
worst that can happen on a crash is that the counters are incorrect),
and then do the I/O after mutex_exit in log_write_up_to but before the
if condition that checks SRV_UNIX_O_DSYNC.

Even though this should benefit O_DSYNC / ALL_O_DIRECT the most, it will
also benefit the normal case, since it avoids the overhead of
_fil_aio while under the mutex.

Laurynas Biveinis (laurynas-biveinis) on 2013-02-06

tags:

added: xtradb

Laurynas Biveinis (laurynas-biveinis) wrote on 2016-03-17:

Related: https://bugs.mysql.com/bug.php?id=77094

tags:

added: performance

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-25:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PS-1277