Percona XtraBackup moved to https://jira.percona.com/projects/PXB

xtrabackup 2.4 log copy thread hangs if there is no enough CPU

Bug #1706342 reported by Rick Pizzi on 2017-07-25

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Percona XtraBackup moved to https://jira.percona.com/projects/PXB	New	Undecided	Unassigned

Bug Description

We have been trying to take backups of a large MySQL 5.7.18 instance using xtrabackup 2.4.7.
We have observed that if cpu usage hits 100% the log copy threads hangs, and after a little while xtrabackup dies with an error saying that it is unable to cope with the redo log growth:

xtrabackup: error: log block numbers mismatch:
xtrabackup: error: expected log block no. 212999933, but got no. 215097077 from the log file.
xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or log files being too small.

Actually, depending on when xtrabackup realizes the problem, other errors may show up too, including assertions:

InnoDB: Last flushed lsn: 97554425340507 load_index lsn 97554732668223
[FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
PXB will not be able take a consistent backup. Retry the backup operation
2017-07-25 10:17:48 0x7f38933d6700 InnoDB: Assertion failure in thread 139880965170944 in file ut0ut.cc line 916

From some quick troubleshooting, the log copy threads appears to hang in:

Thread 16 (Thread 0x7fde978b4700 (LWP 12487)):
#0 0x000000399a80ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000092525e in os_event::timed_wait(timespec const*) ()
#2 0x000000000092557e in os_event::wait_time_low(unsigned long, long) ()
#3 0x000000000075ce54 in log_copying_thread(void*) ()
#4 0x000000399a807a51 in start_thread () from /lib64/libpthread.so.0
#5 0x000000399a0e893d in clone () from /lib64/libc.so.6

We have been trying different options and settings until we realized it all depends on the fact CPU on the test machine is available or not.

Turns out that this only happens if you hit 100% utilisation, and we were able to make xtrabackup work correctly by reducing CPU usage (we removed compression to save CPU, and reduced parallelism in xtrabackup itself).

We made several tests and as long as there is available idle CPU the log copy thread works fine, but it hangs if there is no CPU left for it and never recovers.

Rick Pizzi (pizzi) on 2017-07-25

affects:

percona-server → percona-xtrabackup

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2017-07-25:

Are you running DDL run during backup? If yes then please see also this reports:
https://bugs.launchpad.net/percona-xtrabackup/+bug/1555626
https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878

According the error message it is similar as described here:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/3

And now you can try to use 2.4.8 with new options as shown here:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/13

Revision history for this message

Rick Pizzi (pizzi) wrote on 2017-07-25: Re: [Bug 1706342] xtrabackup 2.4 log copy thread hangs if there is no enough CPU

Download full text (3.9 KiB)

It doesn’t seem to be related with any DDL. Most of the time I get the other error.
Question, should the “log scanned up to … (LSN)” appear once every second (by default)?
It seems to get “stuck" in our tests quite often, like in:

170725 22:49:32 >> log scanned up to (97599947036309)
170725 22:49:33 >> log scanned up to (97599947908874)
170725 22:49:34 >> log scanned up to (97599948863500)
170725 22:49:35 >> log scanned up to (97599949768269)
170725 22:49:36 >> log scanned up to (97599950590202)

(pauses several seconds)

170725 22:49:43 >> log scanned up to (97599956699160)
170725 22:49:44 >> log scanned up to (97599957736494)

—
Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 25/lug/2017, at 19:41, Shahriyar Rzayev <email address hidden> wrote:

> Are you running DDL run during backup? If yes then please see also this reports:
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1555626
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878
>
> According the error message it is similar as described here:
>
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/3
>
> And now you can try to use 2.4.8 with new options as shown here:
>
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/13
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1706342
>
> Title:
> xtrabackup 2.4 log copy thread hangs if there is no enough CPU
>
> Status in Percona XtraBackup:
> New
>
> Bug description:
> We have been trying to take backups of a large MySQL 5.7.18 instance using xtrabackup 2.4.7.
> We have observed that if cpu usage hits 100% the log copy threads hangs, and after a little while xtrabackup dies with an error saying that it is unable to cope with the redo log growth:
>
> xtrabackup: error: log block numbers mismatch:
> xtrabackup: error: expected log block no. 212999933, but got no. 215097077 from the log file.
> xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or log files being too small.
>
> Actually, depending on when xtrabackup realizes the problem, other
> errors may show up too, including assertions:
>
> InnoDB: Last flushed lsn: 97554425340507 load_index lsn 97554732668223
> [FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
> PXB will not be able take a consistent backup. Retry the backup operation
> 2017-07-25 10:17:48 0x7f38933d6700 InnoDB: Assertion failure in thread 139880965170944 in file ut0ut.cc line 916
>
>
> From some quick troubleshooting, the log copy threads appears to hang in:
>
> Thread 16 (Thread 0x7fde978b4700 (LWP 12487)):
> #0 0x000000399a80ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x000000000092525e in os_event::timed_wait(timespec const*) ()
> #2 0x000000000092557e in os_event::wait_time_low(unsigned long, long) ()
> #3 0x000000000075ce54 in log_copying_thread(void*) ()
> #4 0x000000399a807a51 in s...

(pauses several seconds)

170725 22:49:43 >> log scanned up to (97599956699160)
170725 22:49:44 >> log scanned up to (97599957736494)

— 
Riccardo Pizzi
pizzi@leopardus.com
http://www.pinballowners.com/webmaster

On 25/lug/2017, at 19:41, Shahriyar Rzayev <shahriyar.rzayev@percona.com> wrote:

> Are you running DDL run during backup? If yes then please see also this reports:
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1555626
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878
> 
> According the error message it is similar as described here:
> 
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/3
> 
> And now you can try to use 2.4.8 with new options as shown here:
> 
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/13
> 
> -- 
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1706342
> 
> Title:
>  xtrabackup 2.4 log copy thread hangs if there is no enough CPU
> 
> Status in Percona XtraBackup:
>  New
> 
> Bug description:
>  We have been trying to take backups of a large MySQL 5.7.18 instance using xtrabackup 2.4.7.
>  We have observed that if cpu usage hits 100% the log copy threads hangs, and after a little while xtrabackup dies with an error saying that it is unable to cope with the redo log growth:
> 
>  xtrabackup: error: log block numbers mismatch:
>  xtrabackup: error: expected log block no. 212999933, but got no. 215097077 from the log file.
>  xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or  log files being too small.
> 
>  Actually, depending on when xtrabackup realizes the problem, other
>  errors may show up too, including assertions:
> 
>  InnoDB: Last flushed lsn: 97554425340507 load_index lsn 97554732668223
>  [FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
>      PXB will not be able take a consistent backup. Retry the backup operation
>  2017-07-25 10:17:48 0x7f38933d6700  InnoDB: Assertion failure in thread 139880965170944 in file ut0ut.cc line 916
> 
> 
>  From some quick troubleshooting, the log copy threads appears to hang in:
> 
>  Thread 16 (Thread 0x7fde978b4700 (LWP 12487)):
>  #0  0x000000399a80ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>  #1  0x000000000092525e in os_event::timed_wait(timespec const*) ()
>  #2  0x000000000092557e in os_event::wait_time_low(unsigned long, long) ()
>  #3  0x000000000075ce54 in log_copying_thread(void*) ()
>  #4  0x000000399a807a51 in start_thread () from /lib64/libpthread.so.0
>  #5  0x000000399a0e893d in clone () from /lib64/libc.so.6
> 
> 
>  We have been trying different options and settings until we realized it all depends on the fact CPU on the test machine is available or not.
> 
>  Turns out that this only happens if you hit 100% utilisation, and we
>  were able to make xtrabackup work correctly by reducing CPU usage (we
>  removed compression to save CPU, and reduced parallelism in xtrabackup
>  itself).
> 
>  We made several tests and as long as there is available idle CPU the
>  log copy thread works fine, but it hangs if there is no CPU left for
>  it and never recovers.
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1706342/+subscriptions
>

Revision history for this message

Rick Pizzi (pizzi) wrote on 2017-08-18:

I have noticed that even if the "log scanned" log entry stops often for several seconds the backup runs fine since we have set old_alter_table=ON

Revision history for this message

Sergei Glushchenko (sergei.glushchenko) wrote on 2017-08-19:

Have you try to tune log copying thread with --log-copy-interval?

Revision history for this message

Rick Pizzi (pizzi) wrote on 2017-08-19:

yes, nothing changes

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-20:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-1457

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.