Server hangs after running xtrabackup with MariaDB 10.0 and innodb_flush_method = O_DIRECT

Bug #1625984 reported by Tomas Mozes
This bug affects 1 person.

Affects: Percona XtraBackup (moved to https://jira.percona.com/projects/PXB)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

After backing up roughly 15-25 GB of data with xtrabackup:
$ xtrabackup --backup --stream=xbstream --parallel=4 --compress --compress-threads=12

the server starts acting strange. The xtrabackup process only prints "log scanned up to" messages with the same LSN:

...
160921 09:09:54 >> log scanned up to (7313323134056)
160921 09:09:55 >> log scanned up to (7313323134056)
160921 09:09:56 >> log scanned up to (7313323134056)
160921 09:09:57 >> log scanned up to (7313323134056)
160921 09:09:58 >> log scanned up to (7313323134056)
160921 09:09:59 >> log scanned up to (7313323134056)
...

MariaDB cannot be stopped, nor can xtrabackup be killed. Some files cannot be read:
$ cat /var/lib/mysql/backup-my.cnf -> hangs.

During this time the server iowait (and load) is high, since some files cannot even be read. iotop shows no activity for xtrabackup (normally it reads around 200-300 MB/s). There is nothing in the system logs, nor in the MariaDB log. The machine runs virtualized under Xen, and there is nothing in the dom0 logs either.

A normal shutdown fails; only a forced machine shutdown works. After a reboot, the files are readable normally.

Under normal circumstances, all MariaDB data files can be read:
$ find /var/lib/mysql -type f -exec cat {} \; > /dev/null

After unsetting innodb_flush_method in the MariaDB my.cnf, the backup completes and the system continues to work normally.
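For reference, the relevant my.cnf change amounts to commenting the option out so MariaDB falls back to the default flush method (fsync on Linux); the section name is the standard one, not quoted from the report:

```ini
[mysqld]
# innodb_flush_method = O_DIRECT   ; commented out -> default (fsync)
```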

$ xfs_info /var/lib/mysql/
meta-data=/dev/xvda7 isize=512 agcount=23, agsize=7864256 blks
         = sectsz=512 attr=2, projid32bit=1
         = crc=1 finobt=1 spinodes=0
data = bsize=4096 blocks=178257920, imaxpct=25
         = sunit=64 swidth=64 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal bsize=4096 blocks=61440, version=2
         = sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

$ find /var/lib/mysql -type f | wc -l
624

$ du -hs /var/lib/mysql/
454G /var/lib/mysql/

The server is virtualized as a Xen domU: 230 GB RAM, Linux kernel 4.4.21, hardware RAID 10 on HDDs, XtraBackup 2.4.4.

Tomas Mozes (hydrapolic) wrote:
Sergei Glushchenko (sergei.glushchenko) wrote:

Looks like --parallel=4 completely saturates your I/O. Please try reducing the number of parallel threads.

Sergei Glushchenko (sergei.glushchenko) wrote:

Or there is some other issue with IO in your setup. Does changing innodb_flush_method help?

Tomas Mozes (hydrapolic) wrote:

Forgot to mention, this behaves the same:
$ xtrabackup --backup --stream=xbstream

Yes, after switching to default innodb_flush_method it works properly.

Tomas Mozes (hydrapolic) wrote:

Reproduced on other hardware, steps taken:
- boot up the virtual machine
- restore a 450GB database (from mysqldump - took 32 hours)
- xtrabackup --backup --stream=xbstream

xtrabackup is in the uninterruptible sleep (D) state.
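As an aside (not from the original report): processes in the D state can be listed directly, which is a quick way to confirm which tasks are stuck blocked inside the kernel. Such processes ignore signals, which is why xtrabackup cannot be killed:

```shell
# List processes currently in uninterruptible sleep (state "D").
# The second column is the process state reported by ps; a leading
# "D" means the task is blocked in the kernel and cannot be signalled.
ps -eo pid,stat,comm --no-headers | awk '$2 ~ /^D/'
```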

Tomas Mozes (hydrapolic) wrote:

echo l > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo l > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo m > /proc/sysrq-trigger

Tomas Mozes (hydrapolic) wrote:

echo w > /proc/sysrq-trigger
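For context (a summary, not part of the original report): the sysrq dumps gathered above mean l = backtrace of all active CPUs, m = dump of current memory usage, and w = list of tasks in uninterruptible (blocked) state. Root is required, and the output lands in the kernel ring buffer:

```shell
# Trigger the "blocked tasks" dump and read the result from the
# ring buffer. Guarded so it degrades gracefully without root or
# when sysrq is disabled.
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger
    dmesg 2>/dev/null | tail -n 100
else
    echo "sysrq-trigger not writable (need root, or sysrq disabled)"
fi
```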

Sergei Glushchenko (sergei.glushchenko) wrote:

Can you please attach GDB to the xtrabackup and collect the stack trace while it is in this state? Just make sure you have debug symbols installed.
strace output would also be useful.

Tomas Mozes (hydrapolic) wrote:

I've been doing some tests to narrow down which part has the problem. It seems to be xfs on kernels newer than 4.1:

kernel 4.1 + xfs = pass

kernel 4.4 + ext4 = pass
kernel 4.4 + xfs - O_DIRECT = pass
kernel 4.4 + xfs + O_DIRECT = FAIL
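The failing combination can also be probed without xtrabackup at all; a rough sketch using dd, which supports direct I/O via iflag=direct (my suggestion, not from the report). On the affected kernel/filesystem combination this read would hang in D state; note that some filesystems (e.g. tmpfs) reject O_DIRECT outright:

```shell
# Create a small test file, then read it back with O_DIRECT,
# mimicking xtrabackup's read path on the data files.
dd if=/dev/zero of=direct_test.bin bs=4k count=256 2>/dev/null
dd if=direct_test.bin of=/dev/null bs=4k iflag=direct \
    || echo "direct I/O not supported on this filesystem"
rm -f direct_test.bin
```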

Tomas Mozes (hydrapolic) wrote:

While trying to get the backtrace, gdb itself hung:
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof xtrabackup)
gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld)

Then strace -f -p $(pidof xtrabackup) didn't work either.

Attaching a new echo w > /proc/sysrq-trigger dump.

Tomas Mozes (hydrapolic) wrote:

It also happens with ext4, but it took longer to show up. So on kernel 4.4, both xfs and ext4 exhibit this problem with O_DIRECT. On kernel 4.1 it cannot be reproduced.

I also tried the latest kernel 4.4.55 which has a fix for DM deadlocks, but it didn't help.

Tomas Mozes (hydrapolic) wrote:

Cannot reproduce any more with MariaDB 10.1 and Linux kernel 4.12+.

Changed in percona-xtrabackup:
status: New → Fix Released
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report was migrated to: https://jira.percona.com/browse/PXB-1403
