XtraBackup stuck in prepare stage waiting for pending reads

Bug #1709495 reported by Ovais Tariq on 2017-08-08
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Percona XtraBackup moved to https://jira.percona.com/projects/PXB
Status tracked in 2.4
2.4
New
Undecided
Unassigned

Bug Description

PXB version
/usr/bin/xtrabackup version 2.4.7 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 05f1fcf)

PXB log entries
xtrabackup: cd to /var/lib/mysql/
xtrabackup: This target seems to be not prepared yet.
InnoDB: Number of pools: 1
xtrabackup: xtrabackup_logfile detected: size=1449263104, start_lsn=(14747305820062)
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = .
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 1449263104
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend
xtrabackup: innodb_log_group_home_dir = .
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 1449263104
xtrabackup: Starting InnoDB instance for recovery.
xtrabackup: Using 37133221888 bytes for buffer pool (set by --use-memory parameter)
InnoDB: PUNCH HOLE support available
InnoDB: Mutexes and rw_locks use GCC atomic builtins
InnoDB: Uses event mutexes
InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
InnoDB: Compressed tables use zlib 1.2.8
InnoDB: Number of pools: 1
InnoDB: Using CPU crc32 instructions
InnoDB: Initializing buffer pool, total size = 34.625G, instances = 1, chunk size = 128M
InnoDB: Completed initialization of buffer pool
InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
InnoDB: Highest supported file format is Barracuda.
InnoDB: Log scan progressed past the checkpoint lsn 14747305820062
InnoDB: Doing recovery: scanned up to log sequence number 14747311062528 (0%)
InnoDB: Doing recovery: scanned up to log sequence number 14747316305408 (0%)
InnoDB: Doing recovery: scanned up to log sequence number 14747321548288 (1%)
InnoDB: Doing recovery: scanned up to log sequence number 14747326791168 (1%)
...
InnoDB: Doing recovery: scanned up to log sequence number 14748593339904 (99%)
InnoDB: Doing recovery: scanned up to log sequence number 14748594007176 (99%)
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 InnoDB: Waited for 10 seconds for 128 pending reads
InnoDB: Waited for 20 seconds for 128 pending reads
InnoDB: Waited for 30 seconds for 128 pending reads
InnoDB: Waited for 40 seconds for 128 pending reads
InnoDB: Waited for 50 seconds for 128 pending reads
InnoDB: Waited for 60 seconds for 128 pending reads
InnoDB: Waited for 70 seconds for 128 pending reads
InnoDB: Waited for 80 seconds for 128 pending reads
InnoDB: Waited for 90 seconds for 128 pending reads
InnoDB: Waited for 100 seconds for 128 pending reads
...
InnoDB: Waited for 93730 seconds for 128 pending reads
InnoDB: Waited for 93740 seconds for 128 pending reads
InnoDB: Waited for 93750 seconds for 128 pending reads
InnoDB: Waited for 93760 seconds for 128 pending reads
InnoDB: Waited for 93770 seconds for 128 pending reads

And then it just keeps waiting on the pending reads until PXB is killed.

We hit a similar bug in PS sometime ago where the pending reads would not finish and we ended up using innodb_empty_free_list_algorithm=legacy to fix the problem:
https://github.com/uber/percona-server/pull/2

Hi Ovais,

I am afraid your suggested fix does not apply to xtrabackup. Can you recommend any specific conditions to try to reproduce this issue? If you are able to reproduce it in your environment, can you try to get gdb stacktrace of xtrabackup in stuck state?

Ovais Tariq (ovais-tariq) wrote :

Sergei,

I will get the stacktraces when it happens again.

Evgeniy Firsov (firs1978) wrote :

Here is stacktrace for the issue:

 6 pthread_cond_wait,wait(os0event.cc:165),os_event::wait_low(os0event.cc:165),os_event_wait_low(os0event.cc:534),os_aio_simulated_handler(os0file.cc:7968),os_aio_handler(os0file.cc:7968),fil_aio_wait(fil0fil.cc:6094),io_handler_thread(srv0start.cc:309),start_thread(libpthread.so.0),clone(libc.so.6)
          4 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:254),recv_apply_hashed_log_recs(log0recv.cc:2756),log_checkpoint(log0log.cc:1938),log_margin_checkpoint_age(log0log.cc:341),mtr_t::Command::prepare_write(mtr0mtr.cc:869),mtr_t::Command::execute(mtr0mtr.cc:949),mtr_t::commit(mtr0mtr.cc:576),ibuf_mtr_commit(ibuf0ibuf.ic:60),ibuf_merge_or_delete_for_page(ibuf0ibuf.ic:60),buf_page_io_complete(buf0buf.cc:5894),fil_aio_wait(fil0fil.cc:6128),io_handler_thread(srv0start.cc:309),start_thread(libpthread.so.0),clone(libc.so.6)
          1 pthread_cond_wait,wait(os0event.cc:165),os_event::wait_low(os0event.cc:165),os_event_wait_low(os0event.cc:534),buf_flush_page_cleaner_coordinator(buf0flu.cc:3161),start_thread(libpthread.so.0),clone(libc.so.6)
          1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:254),recv_writer_thread(log0recv.cc:962),start_thread(libpthread.so.0),clone(libc.so.6)
          1 nanosleep(libpthread.so.0),os_thread_sleep(os0thread.cc:254),buf_read_recv_pages(buf0rea.cc:850),recv_read_in_area(log0recv.cc:2722),recv_apply_hashed_log_recs(log0recv.cc:2722),innobase_start_or_create_for_mysql(srv0start.cc:2217),innodb_init(xtrabackup.cc:2004),xtrabackup_prepare_func(xtrabackup.cc:7585),main(xtrabackup.cc:8704)

Vishal kasle (vishal.percona) wrote :

Hello,
             Can you please provide more information on the bug report. We will require the complete xtrabackup command that you executed, version of the server that is being backed up and its configuration. It will also be helpful if you could send us the complete pmp file.
            Are you able to reproduce the issue with smaller value at --user-memory, like 4GB?
It could be important to get a coredump with gcore command in future to investigate locks.
     We would also like to know the specific conditions in which the bug can be reproduced so that we can work on confirming it and escalating the bug. I will also update the bug report.

Changed in percona-xtrabackup:
status: New → Incomplete
Ovais Tariq (ovais-tariq) wrote :
Download full text (1.4 MiB)

XtraBackup version: 2.4.8
XtraBackup command: /usr/bin/xtrabackup --prepare --use-memory=99388227584 --target-dir=/var/lib/mysql

Here is the excerpt from the log generated by XtraBackup

y: scanned up to log sequence number 4989864730112 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:13.731390926Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989869972992 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:13.850566635Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989875215872 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:13.970826346Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989880458752 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.075342972Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989885701632 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.151425148Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989890944512 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.226469111Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989896187392 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.300669689Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989901430272 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.389361263Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989906673152 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.46385536Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989911916032 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.54271289Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989917158912 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.61438211Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989922401792 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.689529148Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989927644672 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.767624259Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989932887552 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.842746832Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989938130432 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.916134859Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989943373312 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:14.989721719Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989948616192 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:15.067548354Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989953859072 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:15.137438745Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989959101952 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:15.208959886Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989964344832 (2%)\n","stream":"stderr","time":"2018-01-15T21:26:15.280494648Z"}
{"log":"InnoDB: Doing recovery: scanned up to log sequence number 4989969587712 (2%)\n","stream":"stderr...

Ovais Tariq (ovais-tariq) wrote :
Download full text (4.0 KiB)

Here is the excerpt of the log related to start of the XtraBackup process

{"log":"/usr/bin/xtrabackup version 2.4.8 based on MySQL server 5.7.13 Linux (x86_64) (revision id: 97330f7)\n","stream":"stderr","time":"2018-01-15T21:26:01.075981396Z"}
{"log":"xtrabackup: cd to /var/lib/mysql/\n","stream":"stderr","time":"2018-01-15T21:26:01.076032974Z"}
{"log":"xtrabackup: This target seems to be not prepared yet.\n","stream":"stderr","time":"2018-01-15T21:26:01.076055973Z"}
{"log":"InnoDB: Number of pools: 1\n","stream":"stderr","time":"2018-01-15T21:26:01.076276245Z"}
{"log":"xtrabackup: xtrabackup_logfile detected: size=15111028736, start_lsn=(4989581614901)\n","stream":"stderr","time":"2018-01-15T21:26:02.740604422Z"}
{"log":"xtrabackup: using the following InnoDB configuration for recovery:\n","stream":"stderr","time":"2018-01-15T21:26:02.741569031Z"}
{"log":"xtrabackup: innodb_data_home_dir = .\n","stream":"stderr","time":"2018-01-15T21:26:02.741585401Z"}
{"log":"xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend\n","stream":"stderr","time":"2018-01-15T21:26:02.741590127Z"}
{"log":"xtrabackup: innodb_log_group_home_dir = .\n","stream":"stderr","time":"2018-01-15T21:26:02.741598946Z"}
{"log":"xtrabackup: innodb_log_files_in_group = 1\n","stream":"stderr","time":"2018-01-15T21:26:02.741604487Z"}
{"log":"xtrabackup: innodb_log_file_size = 15111028736\n","stream":"stderr","time":"2018-01-15T21:26:02.741619202Z"}
{"log":"xtrabackup: using the following InnoDB configuration for recovery:\n","stream":"stderr","time":"2018-01-15T21:26:02.741689111Z"}
{"log":"xtrabackup: innodb_data_home_dir = .\n","stream":"stderr","time":"2018-01-15T21:26:02.741694776Z"}
{"log":"xtrabackup: innodb_data_file_path = ibdata1:12M:autoextend\n","stream":"stderr","time":"2018-01-15T21:26:02.741698443Z"}
{"log":"xtrabackup: innodb_log_group_home_dir = .\n","stream":"stderr","time":"2018-01-15T21:26:02.741720873Z"}
{"log":"xtrabackup: innodb_log_files_in_group = 1\n","stream":"stderr","time":"2018-01-15T21:26:02.741726736Z"}
{"log":"xtrabackup: innodb_log_file_size = 15111028736\n","stream":"stderr","time":"2018-01-15T21:26:02.741729925Z"}
{"log":"xtrabackup: Starting InnoDB instance for recovery.\n","stream":"stderr","time":"2018-01-15T21:26:02.741733111Z"}
{"log":"xtrabackup: Using 99388227584 bytes for buffer pool (set by --use-memory parameter)\n","stream":"stderr","time":"2018-01-15T21:26:02.741737845Z"}
{"log":"InnoDB: PUNCH HOLE support available\n","stream":"stderr","time":"2018-01-15T21:26:02.741754754Z"}
{"log":"InnoDB: Mutexes and rw_locks use GCC atomic builtins\n","stream":"stderr","time":"2018-01-15T21:26:02.741770948Z"}
{"log":"InnoDB: Uses event mutexes\n","stream":"stderr","time":"2018-01-15T21:26:02.741775591Z"}
{"log":"InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier\n","stream":"stderr","time":"2018-01-15T21:26:02.74178021Z"}
{"log":"InnoDB: Compressed tables use zlib 1.2.8\n","stream":"stderr","time":"2018-01-15T21:26:02.741785883Z"}
{"log":"InnoDB: Number of pools: 1\n","stream":"stderr","time":"2018-01-15T21:26:02.741926008Z"}
{"log":"InnoDB: Using CPU crc32 instructions\n","stream":"stderr","time":"20...

Read more...

Ovais Tariq (ovais-tariq) wrote :

Complete XtraBackup log attached

Ovais Tariq (ovais-tariq) wrote :

pt-pmp samples attached

Changed in percona-xtrabackup:
status: Incomplete → New
Ovais Tariq (ovais-tariq) wrote :

I don't think using less memory for the prepare step may actually help here.

Nickolay Ihalainen (ihanick) wrote :

Steps to reproduce (on HDD):
docker run --name=p57 -it -e MYSQL_ALLOW_EMPTY_PASSWORD=1 percona:latest --innodb_buffer_pool_size=2G --innodb_log_file_size=2G

Inside docker exec -it p57 bash shell:
apt-get install -y percona-xtrabackup-24 sysbench

while true ; do sysbench --test=/usr/share/sysbench/oltp_insert.lua --mysql-host=localhost --threads=3 --time=600 run ; done

Open another terminal and wait until total database size will be 10GB.

Without stopping insert load, run backup:
xtrabackup --backup

Run prepare:
xtrabackup --prepare --use-memory=4G
xtrabackup version 2.4.9 based on MySQL server 5.7.13 Linux (x86_64) (revision id: a467167cdd4)
xtrabackup: cd to /xtrabackup_backupfiles/
xtrabackup: This target seems to be not prepared yet.
InnoDB: Number of pools: 1
xtrabackup: xtrabackup_logfile detected: size=1443102720, start_lsn=(8805379375)
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup: innodb_log_group_home_dir = .
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 1443102720
xtrabackup: using the following InnoDB configuration for recovery:
xtrabackup: innodb_data_home_dir = .
xtrabackup: innodb_data_file_path = ibdata1:10M:autoextend
xtrabackup: innodb_log_group_home_dir = .
xtrabackup: innodb_log_files_in_group = 1
xtrabackup: innodb_log_file_size = 1443102720
xtrabackup: Starting InnoDB instance for recovery.
xtrabackup: Using 4294967296 bytes for buffer pool (set by --use-memory parameter)
InnoDB: PUNCH HOLE support available
InnoDB: Mutexes and rw_locks use GCC atomic builtins
InnoDB: Uses event mutexes
InnoDB: GCC builtin __atomic_thread_fence() is used for memory barrier
InnoDB: Compressed tables use zlib 1.2.8
InnoDB: Number of pools: 1
InnoDB: Using CPU crc32 instructions
InnoDB: Initializing buffer pool, total size = 4G, instances = 1, chunk size = 128M
.....
InnoDB: Doing recovery: scanned up to log sequence number 10088119076 (99%)
InnoDB: Database was not shutdown normally!
InnoDB: Starting crash recovery.
InnoDB: 2 transaction(s) which must be rolled back or cleaned up in total 2 row operations to undo
InnoDB: Trx id counter is 16438528
InnoDB: Starting an apply batch of log records to the database...
InnoDB: Progress in percent: 0 InnoDB: Waited for 10 seconds for 128 pending reads
InnoDB: Waited for 20 seconds for 128 pending reads
....
InnoDB: Waited for 370 seconds for 128 pending reads

At this point there is no IO from xtrabackup at all, according to pidstat -d 1

Ovais Tariq (ovais-tariq) wrote :

Nickolay,

Thanks for the test case. Would you mind testing with the proposed fix?

Evgeniy Firsov (firs1978) wrote :

Original fix breaks --export functionality, so as it leaves some insert buffer pages unmerged.

This fix might be cleaner: https://github.com/efirs/percona-xtrabackup/commit/2edf965d6ba490f5e1155976287b8c5c3429a64d

So as it invalidates buffer pool at the end of recovery.

Nickolay, thank you for easy test case.

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXB-1467

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers