nvme: avoid cqe corruption

Bug #1788035 reported by Kamal Mostafa on 2018-08-20
This bug affects 4 people.

Affects: linux (Ubuntu) | Importance: Undecided | Assigned to: Kamal Mostafa
Affects: linux (Ubuntu) Xenial | Importance: Undecided | Unassigned

Bug Description

To address a customer-reported NVMe issue with instance types that expose EBS volumes as NVMe devices (notably c5 and m5), this commit from mainline v4.6 should be backported to Xenial:

d783e0bd02e700e7a893ef4fa71c69438ac1c276 nvme: avoid cqe corruption when update at the same time as read

dmesg sample:

[Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, aborting
[Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 9 QID 1 timeout, aborting
[Wed Aug 15 01:11:21 2018] nvme 0000:00:1f.0: I/O 21 QID 2 timeout, aborting
[Wed Aug 15 01:11:32 2018] nvme 0000:00:1f.0: I/O 10 QID 1 timeout, aborting
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: I/O 8 QID 1 timeout, reset controller
[Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 21 QID 2
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887751
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 2
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887767
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 2
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 83887769
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 8 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 9 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 41943136
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 10 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: 0007
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 6976
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 22 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 23 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 24 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 25 QID 1
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: Cancelling I/O 2 QID 0
[Wed Aug 15 01:11:51 2018] nvme nvme1: Abort status: 0x7
[Wed Aug 15 01:11:51 2018] nvme 0000:00:1f.0: completing aborted command with status: fffffffc
[Wed Aug 15 01:11:51 2018] blk_update_request: I/O error, dev nvme1n1, sector 96
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000687 ("xlog_iodone") error 5 numblks 64
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Log I/O Error Detected. Shutting down filesystem
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): Please umount the filesystem and rectify the problem(s)
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 872, lost async page write
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_imap_to_bp: xfs_trans_read_buf() returned error -5.
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_iunlink_remove: xfs_imap_to_bp returned error -5.
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 873, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 874, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 875, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 876, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 877, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 878, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 879, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 880, lost async page write
[Wed Aug 15 01:11:51 2018] Buffer I/O error on dev nvme1n1, logical block 881, lost async page write
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000697 ("xlog_iodone") error 5 numblks 64
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): metadata I/O error: block 0x5000699 ("xlog_iodone") error 5 numblks 64
[Wed Aug 15 01:11:51 2018] XFS (nvme1n1): xfs_do_force_shutdown(0x2) called from line 1197 of file /build/linux-c2Z51P/linux-4.4.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc075d428
[Wed Aug 15 01:12:20 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, aborting
[Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 23 QID 1 timeout, aborting
[Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 1 timeout, aborting
[Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 25 QID 1 timeout, aborting
[Wed Aug 15 01:12:22 2018] nvme 0000:00:1f.0: I/O 24 QID 2 timeout, aborting
[Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:12:22 2018] nvme nvme1: Abort status: 0x2
[Wed Aug 15 01:12:50 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:12:52 2018] nvme 0000:00:1f.0: I/O 22 QID 1 timeout, reset controller
[Wed Aug 15 01:13:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:13:51 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.
[Wed Aug 15 01:14:21 2018] XFS (nvme1n1): xfs_log_force: error -5 returned.

Changed in linux (Ubuntu Xenial):
status: New → In Progress
tags: added: kernel-da-key xenial
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-135.161

---------------
linux (4.4.0-135.161) xenial; urgency=medium

  * linux: 4.4.0-135.161 -proposed tracker (LP: #1788766)

  * [Regression] APM Merlin boards fail to recover link after interface down/up
    (LP: #1785739)
    - net: phylib: fix interrupts re-enablement in phy_start
    - net: phy: fix phy_start to consider PHY_IGNORE_INTERRUPT

  * qeth: don't clobber buffer on async TX completion (LP: #1786057)
    - s390/qeth: don't clobber buffer on async TX completion

  * nvme: avoid cqe corruption (LP: #1788035)
    - nvme: avoid cqe corruption when update at the same time as read

  * CacheFiles: Error: Overlong wait for old active object to go away.
    (LP: #1776254)
    - cachefiles: Fix missing clear of the CACHEFILES_OBJECT_ACTIVE flag
    - cachefiles: Wait rather than BUG'ing on "Unexpected object collision"

  * fscache cookie refcount updated incorrectly during fscache object allocation
    (LP: #1776277) // fscache cookie refcount updated incorrectly during fscache
    object allocation (LP: #1776277)
    - fscache: Fix reference overput in fscache_attach_object() error handling

  * FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false (LP: #1774336)
    - Revert "UBUNTU: SAUCE: CacheFiles: fix a read_waiter/read_copier race"
    - fscache: Allow cancelled operations to be enqueued
    - cachefiles: Fix refcounting bug in backing-file read monitoring

  * linux-cloud-tools-common: Ensure hv-kvp-daemon.service starts before
    walinuxagent.service (LP: #1739107)
    - [Debian] hyper-v -- Ensure that hv-kvp-daemon.service starts before
      walinuxagent.service

 -- Khalid Elmously <email address hidden> Sun, 26 Aug 2018 23:56:50 -0400

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Tanguy Moal (tuxnco) wrote :

Hello, it seems that this issue, or a similar one, still occurs despite the fix.

Please find the syslog output below.

Best regards

Greg Frank (greglarious) wrote :

I am also still seeing this bug after the fix.

Marco (userneky) wrote :

I am trying to reproduce it.

Has anyone tried it?

Greg Frank (greglarious) wrote :

We were reproducing this multiple times a day on several of our EC2 M5 instances. An interesting anecdote: our least loaded instances produced the bug more often than our heavily loaded ones.

We've since switched to M4 servers and do not have time to flip back and help test this right now.

Marco (userneky) wrote :

I see the same pattern.

The less loaded instances in my production environment crash, while the environment where I am stressing instances to reproduce the issue does not crash.

I am still looking for a way to reproduce it.

How did you reproduce it?

Marco (userneky) wrote :

So far the best I have managed is to capture the kernel call trace, without crashing the node yet.

Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.306845] INFO: task java:1932 blocked for more than 120 seconds.
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.308573] Not tainted 4.4.0-1069-aws #79-Ubuntu
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.309944] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312015] java D ffff88032dc57db0 0 1932 1 0x00000000
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312019] ffff88032dc57db0 0000000000000246 ffff88041077e900 ffff88032dcae900
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312022] ffff88032dc58000 0000000000074da7 ffff88040dbdf888 ffff88040dbdf824
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312024] ffff88032dc57df8 ffff88032dc57dc8 ffffffff81823825 ffff88040dbdf800
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312027] Call Trace:
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312035] [<ffffffff81823825>] schedule+0x35/0x80
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312042] [<ffffffff812f8309>] jbd2_log_wait_commit+0xa9/0x130
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312045] [<ffffffff810c77b0>] ? wake_atomic_t_function+0x60/0x60
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312048] [<ffffffff812fae21>] __jbd2_journal_force_commit+0x61/0xb0
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312049] [<ffffffff812faeb1>] jbd2_journal_force_commit+0x21/0x30
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312053] [<ffffffff812c71c9>] ext4_force_commit+0x29/0x30
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312057] [<ffffffff8129a7c1>] ext4_sync_file+0x1a1/0x350
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312065] [<ffffffff81247bae>] vfs_fsync_range+0x4e/0xb0
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312072] [<ffffffff81247c6d>] do_fsync+0x3d/0x70
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312074] [<ffffffff81247f10>] SyS_fsync+0x10/0x20
Oct 12 15:33:41 ip-10-16-21-10 kernel: [10919.312077] [<ffffffff818279ce>] entry_SYSCALL_64_fastpath+0x22/0xc1
Oct 12 15:35:01 ip-10-16-21-10 CRON[31352]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

I used the simple script below, which writes and reads the same file in parallel from different processes, and let it run for a while; in my case I got the first call trace after 3 hours.

#!/bin/bash
# Usage: ./mytest1.sh <file-suffix> <count-of-1KiB-blocks>
while true
do
    dd iflag=nocache if=/dev/zero of=/tmp/myfilesource$1 bs=1k count=$2
    dd iflag=nocache if=/tmp/myfilesource$1 of=/tmp/myfileout$1 bs=1k count=$2
done

root@ip-10-16-21-10:~# jobs
[1] Running nohup ./mytest1.sh 1 1024000&
[2]- Running nohup ./mytest1.sh 1 1024 &
[3]+ Running nohup ./mytest1.sh 1 102400 &

Marco (userneky) wrote :

So far I have not found a way to reproduce the issue systematically.

It doesn't seem load related, as nodes with lower load crash more often than ones with high load.

But I can confirm that the fix released with the Ubuntu 4.4.0-135 kernel doesn't fix the issue.

This morning (17/10/28) we faced the issue again on a node with kernel 4.4.0-1069-aws, which already includes the patch.

The AWS kernel has included the patch released in the Ubuntu 4.4.0-135 kernel since 4.4.0-1067-aws.

Greg Frank (greglarious) wrote :

We did not have a series of steps to reproduce this; we just left a server running without much happening, and boom.

Scott Emmons (lscotte) wrote :

We can confirm that this patch does not solve the issue as we are still seeing the same dmesg pattern with the 4.4.0-1069-aws kernel.

Marco (userneky) wrote :

As this issue seems far from being solved, and I don't see any progress coming from Canonical or AWS (which I find quite annoying considering the impact for them), today we switched our instances back to m4.

Brian Moyles (bmoyles) wrote :

We encountered an instance that had an NVMe failure very early in boot today. I've updated our internal Canonical case as well as our Amazon case on this, but am posting the relevant details here as well for consistency:

# uname -a
Linux XXX 4.4.0-1069-aws #79-Ubuntu SMP Mon Sep 24 15:01:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"

# echo type $EC2_INSTANCE_TYPE
type m5.xlarge

# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 10G 0 disk /

# ls -al /dev/nvme* /dev/xvd* /dev/sd*
ls: cannot access '/dev/xvd*': No such file or directory
crw------- 1 root root 248, 0 Oct 31 15:02 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Oct 31 15:02 /dev/nvme0n1
lrwxrwxrwx 1 root root 7 Oct 31 15:02 /dev/sda1 -> nvme0n1

# dmesg | grep '63\.'
[ 63.401466] nvme 0000:00:1f.0: I/O 0 QID 0 timeout, disable controller
[ 63.505790] nvme 0000:00:1f.0: Cancelling I/O 0 QID 0
[ 63.505812] nvme 0000:00:1f.0: Identify Controller failed (-4)
[ 63.507536] nvme 0000:00:1f.0: Removing after probe failure
[ 63.507604] iounmap: bad address ffffc90001b40000
[ 63.508941] CPU: 1 PID: 351 Comm: kworker/1:3 Tainted: P O 4.4.0-1069-aws #79-Ubuntu
[ 63.508943] Hardware name: Amazon EC2 m5.xlarge/, BIOS 1.0 10/16/2017
[ 63.508948] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 63.508950] 0000000000000286 3501e2639044a4d2 ffff8800372bfce0 ffffffff923ffe03
[ 63.508952] ffff88040dd878f0 ffffc90001b40000 ffff8800372bfd00 ffffffff9206d3af
[ 63.508954] ffff88040dd878f0 ffff88040dd87a58 ffff8800372bfd10 ffffffff9206d3ec
[ 63.508956] Call Trace:
[ 63.508961] [<ffffffff923ffe03>] dump_stack+0x63/0x90
[ 63.508965] [<ffffffff9206d3af>] iounmap.part.1+0x7f/0x90
[ 63.508967] [<ffffffff9206d3ec>] iounmap+0x2c/0x30
[ 63.508969] [<ffffffffc039abfa>] nvme_dev_unmap.isra.35+0x1a/0x30 [nvme]
[ 63.508972] [<ffffffffc039bd1e>] nvme_remove+0xce/0xe0 [nvme]
[ 63.508976] [<ffffffff92441e0e>] pci_device_remove+0x3e/0xc0
[ 63.508980] [<ffffffff9254f654>] __device_release_driver+0xa4/0x150
[ 63.508982] [<ffffffff9254f723>] device_release_driver+0x23/0x30
[ 63.508986] [<ffffffff9243abda>] pci_stop_bus_device+0x7a/0xa0
[ 63.508988] [<ffffffff9243ad3a>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 63.508990] [<ffffffffc039a62c>] nvme_remove_dead_ctrl_work+0x3c/0x50 [nvme]
[ 63.508994] [<ffffffff9209d86b>] process_one_work+0x16b/0x490
[ 63.508996] [<ffffffff9209dbdb>] worker_thread+0x4b/0x4d0
[ 63.508998] [<ffffffff9209db90>] ? process_one_work+0x490/0x490
[ 63.509001] [<ffffffff920a3e47>] kthread+0xe7/0x100
[ 63.509005] [<ffffffff92823301>] ? __schedule+0x301/0x7f0
[ 63.509007] [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[ 63.509009] [<ffffffff92827e35>] ret_from_fork+0x55/0x80
[ 63.509011] [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[ 63.509013] Trying to free nonexistent resource <00000000febf8000-00000000febfbfff>

# modinfo nvme
filename: /lib/modules/4.4.0-1069-aws/kernel/drivers/nvme/host/nvme.ko
version: 1.0
lic...


I'm investigating this issue, and built a kernel with the following two patches:

a) https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7776db1ccc1
b) A debug patch present in http://lists.infradead.org/pipermail/linux-nvme/2017-February/008498.html

The idea of the first patch, which was merged upstream in Linux 4.12, is to poll the completion queue of the device in the event of a timeout: if the poll finds the completion, the device did post it and the interrupt was likely missed; if not, the device never posted a completion, which could be an adapter issue.

The idea of the second patch is just to provide debug information in case of a mismatch in the choice of the blk-mq hardware queue in the nvme driver; it's a debug patch proposed on the mailing list to address a similar bug report in the past.

The kernel with the debug patches is available in PPA - to install it, one can follow the below instructions:

a) sudo add-apt-repository ppa:gpiccoli/test-nvme-182638
b) sudo apt-get update
c) sudo apt-get install linux-image-4.4.0-1073-aws

After installation is complete, please reboot the instance and, once it has restarted,
check the "uname -rv" output, which should be:

"4.4.0-1073-aws #83+hf182638v20181129b1-Ubuntu SMP Fri Nov 30 17:09:30 UTC 2018"

Please note this is a test kernel: it shouldn't be used in any production environment, nor is it officially supported in any form.

Testing by anybody able to do so would be much appreciated. Please post the complete dmesg if the issue is triggered.
Thanks,

Guilherme

Some updates here: the patch was released in the -proposed pocket and is available in kernel 4.4.0-1075-aws; to enable the proposed repository, please see https://wiki.ubuntu.com/Testing/EnableProposed. The plan is to have this kernel released in the first week of February, after all tests/validations finish on the proposed package.

With this kernel, if a timeout occurs the driver will poll the completion queue to check whether the timed-out I/O has actually completed. Our tests showed that for this bug the I/O is there, which seems to indicate a missed interrupt. A kernel with the patch will therefore mitigate the effects of the timeouts, no longer escalating to aborts. The following message will be observed in dmesg:

[39630.417191] nvme 0000:00:04.0: I/O 0 QID 2 timeout, completion polled

Thanks,

Guilherme
