Server hangs after running xtrabackup with MariaDB 10.0 and innodb_flush_method = O_DIRECT

Bug #1625984 reported by Tomas Mozes
This bug affects 1 person.

Affects: Percona XtraBackup (moved to https://jira.percona.com/projects/PXB)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

After backing up roughly 15-25 GB of data with xtrabackup:
$ xtrabackup --backup --stream=xbstream --parallel=4 --compress --compress-threads=12

the server starts acting strange. The xtrabackup process only prints "log scanned up to" messages with the same LSN:

...
160921 09:09:54 >> log scanned up to (7313323134056)
160921 09:09:55 >> log scanned up to (7313323134056)
160921 09:09:56 >> log scanned up to (7313323134056)
160921 09:09:57 >> log scanned up to (7313323134056)
160921 09:09:58 >> log scanned up to (7313323134056)
160921 09:09:59 >> log scanned up to (7313323134056)
...

MariaDB cannot be stopped, nor can xtrabackup be killed. Some files cannot be read:
$ cat /var/lib/mysql/backup-my.cnf -> hangs.

During this time the server iowait (and load) is high, since some files cannot even be read. iotop shows no activity for xtrabackup (normally it reads around 200-300 MB/s). There is nothing in the system logs, nor in the MariaDB log. The machine runs virtualized under Xen, and there is nothing in the dom0 logs either.

A normal shutdown fails; only a forced machine shutdown works. After a reboot, the files are readable normally.

Under normal circumstances, all MariaDB data files can be read:
$ find /var/lib/mysql -type f -exec cat {} \; > /dev/null

After unsetting innodb_flush_method in the MariaDB my.cnf, the backup completes and the system continues to work normally.
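For reference, the relevant my.cnf change amounts to commenting the option out so MariaDB falls back to the default flush method (fsync on Linux); the section name is the standard one, not quoted from the report:

```ini
[mysqld]
# innodb_flush_method = O_DIRECT   ; commented out -> default (fsync)
```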

$ xfs_info /var/lib/mysql/
meta-data=/dev/xvda7 isize=512 agcount=23, agsize=7864256 blks
         = sectsz=512 attr=2, projid32bit=1
         = crc=1 finobt=1 spinodes=0
data = bsize=4096 blocks=178257920, imaxpct=25
         = sunit=64 swidth=64 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=61440, version=2
         = sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

$ find /var/lib/mysql -type f | wc -l
624

$ du -hs /var/lib/mysql/
454G /var/lib/mysql/

The server is virtualized as a Xen domU: 230 GB RAM, Linux kernel 4.4.21, hardware RAID 10 on HDDs, XtraBackup 2.4.4.

Tomas Mozes (hydrapolic) wrote:
Sergei Glushchenko (sergei.glushchenko) wrote:

Looks like --parallel=4 completely saturates your I/O. Please try reducing the number of parallel threads.

Sergei Glushchenko (sergei.glushchenko) wrote:

Or there is some other issue with IO in your setup. Does changing innodb_flush_method help?

Tomas Mozes (hydrapolic) wrote:

Forgot to mention, this behaves the same:
$ xtrabackup --backup --stream=xbstream

Yes, after switching to default innodb_flush_method it works properly.

Tomas Mozes (hydrapolic) wrote:

Reproduced on other hardware, steps taken:
- boot up the virtual machine
- restore a 450GB database (from mysqldump - took 32 hours)
- xtrabackup --backup --stream=xbstream

xtrabackup is in the uninterruptible sleep (D) state.
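As an aside (not from the original report): processes in the D state can be listed directly, which is a quick way to confirm which tasks are stuck blocked inside the kernel. Such processes ignore signals, which is why xtrabackup cannot be killed:

```shell
# List processes currently in uninterruptible sleep (state "D").
# The second column is the process state reported by ps; a leading
# "D" means the task is blocked in the kernel and cannot be signalled.
ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/'
```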

Tomas Mozes (hydrapolic) wrote:

echo l > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo l > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo m > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo w > /proc/sysrq-trigger
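For context (a summary, not part of the original report): the sysrq dumps gathered above mean l = backtrace of all active CPUs, m = dump of current memory usage, and w = list of tasks in uninterruptible (blocked) state. Root is required, and the output lands in the kernel ring buffer:

```shell
# Trigger the "blocked tasks" dump and read the result from the
# ring buffer. Guarded so it degrades gracefully without root or
# when sysrq is disabled.
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger
    dmesg 2>/dev/null | tail -n 100
else
    echo "sysrq-trigger not writable (need root, or sysrq disabled)"
fi
```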

Sergei Glushchenko (sergei.glushchenko) wrote:

Can you please attach GDB to the xtrabackup and collect the stack trace while it is in this state? Just make sure you have debug symbols installed.
strace output would also be useful.

Tomas Mozes (hydrapolic) wrote:

I've been doing some tests to narrow down which part has the problem. It seems to be xfs on kernels newer than 4.1:

kernel 4.1 + xfs = pass

kernel 4.4 + ext4 = pass
kernel 4.4 + xfs - O_DIRECT = pass
kernel 4.4 + xfs + O_DIRECT = FAIL
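The failing combination can also be probed without xtrabackup at all; a rough sketch using dd, which supports direct I/O via iflag=direct (my suggestion, not from the report). On the affected kernel/filesystem combination this read would hang in D state; note that some filesystems (e.g. tmpfs) reject O_DIRECT outright:

```shell
# Create a small test file, then read it back with O_DIRECT,
# mimicking xtrabackup's read path on the data files.
dd if=/dev/zero of=direct_test.bin bs=4k count=256 2>/dev/null
dd if=direct_test.bin of=/dev/null bs=4k iflag=direct \
    || echo "direct I/O not supported on this filesystem"
rm -f direct_test.bin
```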

Tomas Mozes (hydrapolic) wrote:

While trying to get the backtrace, gdb itself hung:
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof xtrabackup)
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

Then strace -f -p $(pidof xtrabackup) didn't work either.

Attaching a new echo w > /proc/sysrq-trigger dump.

Tomas Mozes (hydrapolic) wrote:

It also happens with ext4, but it took longer to show up. So on kernel 4.4, both xfs and ext4 exhibit this problem with O_DIRECT. On kernel 4.1 it cannot be reproduced.

I also tried the latest kernel 4.4.55 which has a fix for DM deadlocks, but it didn't help.

Tomas Mozes (hydrapolic) wrote:

Cannot reproduce any more with MariaDB 10.1 and Linux kernel 4.12+.

Changed in percona-xtrabackup:
status: New → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report was migrated to: https://jira.percona.com/browse/PXB-1403
