xtrabackup 2.4 log copy thread hangs if there is no enough CPU

Bug #1706342 reported by Rick Pizzi
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Percona XtraBackup moved to https://jira.percona.com/projects/PXB
New
Undecided
Unassigned

Bug Description

We have been trying to take backups of a large MySQL 5.7.18 instance using xtrabackup 2.4.7.
We have observed that if cpu usage hits 100% the log copy threads hangs, and after a little while xtrabackup dies with an error saying that it is unable to cope with the redo log growth:

xtrabackup: error: log block numbers mismatch:
xtrabackup: error: expected log block no. 212999933, but got no. 215097077 from the log file.
xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or log files being too small.

Actually, depending on when xtrabackup realizes the problem, other errors may show up too, including assertions:

InnoDB: Last flushed lsn: 97554425340507 load_index lsn 97554732668223
[FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
    PXB will not be able take a consistent backup. Retry the backup operation
2017-07-25 10:17:48 0x7f38933d6700 InnoDB: Assertion failure in thread 139880965170944 in file ut0ut.cc line 916

From some quick troubleshooting, the log copy threads appears to hang in:

Thread 16 (Thread 0x7fde978b4700 (LWP 12487)):
#0 0x000000399a80ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000000000092525e in os_event::timed_wait(timespec const*) ()
#2 0x000000000092557e in os_event::wait_time_low(unsigned long, long) ()
#3 0x000000000075ce54 in log_copying_thread(void*) ()
#4 0x000000399a807a51 in start_thread () from /lib64/libpthread.so.0
#5 0x000000399a0e893d in clone () from /lib64/libc.so.6

We have been trying different options and settings until we realized it all depends on the fact CPU on the test machine is available or not.

Turns out that this only happens if you hit 100% utilisation, and we were able to make xtrabackup work correctly by reducing CPU usage (we removed compression to save CPU, and reduced parallelism in xtrabackup itself).

We made several tests and as long as there is available idle CPU the log copy thread works fine, but it hangs if there is no CPU left for it and never recovers.

Rick Pizzi (pizzi)
affects: percona-server → percona-xtrabackup
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Are you running DDL run during backup? If yes then please see also this reports:
https://bugs.launchpad.net/percona-xtrabackup/+bug/1555626
https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878

According the error message it is similar as described here:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/3

And now you can try to use 2.4.8 with new options as shown here:

https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/13

Revision history for this message
Rick Pizzi (pizzi) wrote : Re: [Bug 1706342] xtrabackup 2.4 log copy thread hangs if there is no enough CPU
Download full text (3.9 KiB)

It doesn’t seem to be related with any DDL. Most of the time I get the other error.
Question, should the “log scanned up to … (LSN)” appear once every second (by default)?
It seems to get “stuck" in our tests quite often, like in:

170725 22:49:32 >> log scanned up to (97599947036309)
170725 22:49:33 >> log scanned up to (97599947908874)
170725 22:49:34 >> log scanned up to (97599948863500)
170725 22:49:35 >> log scanned up to (97599949768269)
170725 22:49:36 >> log scanned up to (97599950590202)

(pauses several seconds)

170725 22:49:43 >> log scanned up to (97599956699160)
170725 22:49:44 >> log scanned up to (97599957736494)


Riccardo Pizzi
<email address hidden>
http://www.pinballowners.com/webmaster

On 25/lug/2017, at 19:41, Shahriyar Rzayev <email address hidden> wrote:

> Are you running DDL run during backup? If yes then please see also this reports:
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1555626
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878
>
> According the error message it is similar as described here:
>
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/3
>
> And now you can try to use 2.4.8 with new options as shown here:
>
> https://bugs.launchpad.net/percona-xtrabackup/+bug/1532878/comments/13
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1706342
>
> Title:
> xtrabackup 2.4 log copy thread hangs if there is no enough CPU
>
> Status in Percona XtraBackup:
> New
>
> Bug description:
> We have been trying to take backups of a large MySQL 5.7.18 instance using xtrabackup 2.4.7.
> We have observed that if cpu usage hits 100% the log copy threads hangs, and after a little while xtrabackup dies with an error saying that it is unable to cope with the redo log growth:
>
> xtrabackup: error: log block numbers mismatch:
> xtrabackup: error: expected log block no. 212999933, but got no. 215097077 from the log file.
> xtrabackup: error: it looks like InnoDB log has wrapped around before xtrabackup could process all records due to either log copying being too slow, or log files being too small.
>
> Actually, depending on when xtrabackup realizes the problem, other
> errors may show up too, including assertions:
>
> InnoDB: Last flushed lsn: 97554425340507 load_index lsn 97554732668223
> [FATAL] InnoDB: An optimized(without redo logging) DDLoperation has been performed. All modified pages may not have been flushed to the disk yet.
> PXB will not be able take a consistent backup. Retry the backup operation
> 2017-07-25 10:17:48 0x7f38933d6700 InnoDB: Assertion failure in thread 139880965170944 in file ut0ut.cc line 916
>
>
> From some quick troubleshooting, the log copy threads appears to hang in:
>
> Thread 16 (Thread 0x7fde978b4700 (LWP 12487)):
> #0 0x000000399a80ba0e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1 0x000000000092525e in os_event::timed_wait(timespec const*) ()
> #2 0x000000000092557e in os_event::wait_time_low(unsigned long, long) ()
> #3 0x000000000075ce54 in log_copying_thread(void*) ()
> #4 0x000000399a807a51 in s...

Read more...

Revision history for this message
Rick Pizzi (pizzi) wrote :

I have noticed that even if the "log scanned" log entry stops often for several seconds the backup runs fine since we have set old_alter_table=ON

Revision history for this message
Sergei Glushchenko (sergei.glushchenko) wrote :

Have you try to tune log copying thread with --log-copy-interval?

Revision history for this message
Rick Pizzi (pizzi) wrote :

yes, nothing changes

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-1457

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.