Amazon I3 Instance Buffer I/O error on dev nvme0n1

Bug #1668129 reported by Pete Cheslock on 2017-02-27
This bug affects 26 people
Affects: linux (Ubuntu), Importance: Critical, Assigned to: Dan Streetman
Affects: linux (Ubuntu Xenial), Importance: Critical, Assigned to: Dan Streetman
Affects: linux-aws (Ubuntu), Importance: Critical, Assigned to: Dan Streetman
Affects: linux-aws (Ubuntu Xenial), Importance: Critical, Assigned to: Dan Streetman

Bug Description

On the AWS i3 instance class, when putting the new NVMe storage disks under high I/O load, we see data corruption and errors in dmesg:

[ 662.884390] blk_update_request: I/O error, dev nvme0n1, sector 120063912
[ 662.887824] Buffer I/O error on dev nvme0n1, logical block 14971093, lost async page write
[ 662.891254] Buffer I/O error on dev nvme0n1, logical block 14971094, lost async page write
[ 662.895591] Buffer I/O error on dev nvme0n1, logical block 14971095, lost async page write
[ 662.899873] Buffer I/O error on dev nvme0n1, logical block 14971096, lost async page write
[ 662.904179] Buffer I/O error on dev nvme0n1, logical block 14971097, lost async page write
[ 662.908458] Buffer I/O error on dev nvme0n1, logical block 14971098, lost async page write
[ 662.912287] Buffer I/O error on dev nvme0n1, logical block 14971099, lost async page write
[ 662.916047] Buffer I/O error on dev nvme0n1, logical block 14971100, lost async page write
[ 662.920285] Buffer I/O error on dev nvme0n1, logical block 14971101, lost async page write
[ 662.924565] Buffer I/O error on dev nvme0n1, logical block 14971102, lost async page write
[ 663.645530] blk_update_request: I/O error, dev nvme0n1, sector 120756912
<snip>
[ 1012.752265] blk_update_request: I/O error, dev nvme0n1, sector 3744
[ 1012.755396] buffer_io_error: 194552 callbacks suppressed
[ 1012.755398] Buffer I/O error on dev nvme0n1, logical block 20, lost async page write
[ 1012.759248] Buffer I/O error on dev nvme0n1, logical block 21, lost async page write
[ 1012.763368] Buffer I/O error on dev nvme0n1, logical block 22, lost async page write
[ 1012.767271] Buffer I/O error on dev nvme0n1, logical block 23, lost async page write
[ 1012.771314] Buffer I/O error on dev nvme0n1, logical block 24, lost async page write

This is reproducible with a bonnie++ stress test:

bonnie++ -d /mnt/test/ -r 1000

Linux i-0d76e144d85f487cf 4.4.0-64-generic #85-Ubuntu SMP Mon Feb 20 11:50:30 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Feb 27 02:12 seq
 crw-rw---- 1 root audio 116, 33 Feb 27 02:12 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
DistroRelease: Ubuntu 16.04
Ec2AMI: ami-bc62b2aa
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1d
Ec2InstanceType: i3.2xlarge
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory
JournalErrors:
 Error: command ['journalctl', '-b', '--priority=warning', '--lines=1000'] failed with exit code 1: Hint: You are currently not seeing messages from other users and the system.
       Users in the 'systemd-journal' group can see all messages. Pass -q to
       turn off this notice.
 No journal files were opened due to insufficient permissions.
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-64-generic root=UUID=cfda0544-9803-41e7-badb-43563085ff3a ro console=tty1 console=ttyS0
ProcVersionSignature: Ubuntu 4.4.0-64.85-generic 4.4.44
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-64-generic N/A
 linux-backports-modules-4.4.0-64-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial ec2-images
Uname: Linux 4.4.0-64-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

WifiSyslog:

_MarkForUpload: True
dmi.bios.date: 12/12/2016
dmi.bios.vendor: Xen
dmi.bios.version: 4.2.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:bvr4.2.amazon:bd12/12/2016:svnXen:pnHVMdomU:pvr4.2.amazon:cvnXen:ct1:cvr:
dmi.product.name: HVM domU
dmi.product.version: 4.2.amazon
dmi.sys.vendor: Xen

Pete Cheslock (pete-cheslock) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1668129

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected ec2-images xenial
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Ricky Ramirez (rram) wrote :

I tested this on a few i3.2xlarge instances using bonnie++ and an ext4-mounted filesystem:

ami-cc10c1da (precise) - does not recognize the ephemeral device
ami-822bfa94 (trusty) - does NOT appear to be affected
ami-1ac0120c (xenial) - is affected
ami-e600d1f0 (yakkety) - is affected

Dan Streetman (ddstreet) wrote :

@rram what region are you using?

Ricky Ramirez (rram) wrote :

@ddstreet us-east-1

Dan Streetman (ddstreet) wrote :

@rram can you try to reproduce with us-west-2? I'm able to repro in east-1 but not in west-2.

Matt Billenstein (matt-e) wrote :

I've had this issue on 4 different instances in us-west-2 -- two I still have running -- can I help?

Dan Streetman (ddstreet) wrote :

@matt-e do you have a quicker reproducer I can use? I've been trying bonnie++ from the description, but that doesn't repro at all in west-2 for me, and only once in east-1 so far. Also what AMI are you using in west-2 that shows the problem?

Matt Billenstein (matt-e) wrote :

I don't know that I do -- I'm finding these errors when rsync'ing a larger database from another machine.

I'm using ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20170221 (ami-a58d0dc5)

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in linux (Ubuntu):
status: Confirmed → Triaged
tags: added: kernel-key
Anne-Marie (eg-ubuntu) wrote :

I find it is easy to reproduce by using "dd if=/dev/zero of=big bs=4096" -- I generally get errors within a few minutes.

Anne-Marie (eg-ubuntu) wrote :

(This was in eu-west-1, with ami-405f7226)

Changed in linux (Ubuntu):
importance: High → Critical
Changed in linux (Ubuntu Xenial):
importance: High → Critical
Ricky Ramirez (rram) wrote :

@ddstreet I can replicate the issue in us-west-2. All i3.2xls using ext4 as the filesystem I'm testing on.

ami-4e98182e (precise) - doesn't recognize device
ami-17ac2c77 (trusty) - no error
ami-edf6758d (xenial) - I/O errors
ami-a49b1bc4 (yakkety) - I/O errors

Anne-Marie (eg-ubuntu) wrote :

FYI: it doesn't occur on RHEL 7.3 or Amazon Linux.

Dan Streetman (ddstreet) on 2017-02-28
Changed in linux (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
Dan Streetman (ddstreet) wrote :

I have a quick reproducer now:

NCPUS=...whatever...
for n in $(seq 1 "$NCPUS"); do
    ( dd if=/dev/zero of=/mnt/test/out$n bs=1024k count=1024k ) &
done

Dan Streetman (ddstreet) wrote :

This is reproducible with the latest upstream kernel as well (4.10), so this isn't a bug in the Ubuntu kernel; it will require an upstream fix and a backport into xenial/yakkety.

Dan Streetman (ddstreet) wrote :

On an i3 instance in east-1, where I can reproduce fairly easily, the errors I'm getting unfortunately don't help. The NVMe controller is failing some requests, but it isn't providing any useful info about why it doesn't like them. For example, here is some debug output I added:

[ 1464.634709] nvme nvme0: invalid field command_id 3eb qid 5 cmd_type 1 cmd_flags 4001 data_dir 1 status 2002

The controller is failing a request with error 2, "invalid field", and sets the "more error data" flag 0x2000. So I pulled the error log page, which is supposed to provide more data about why the request failed:

[ 1464.634836] nvme nvme0: error log entry: count 5d5281a qid 5 command_id 3eb status 2002 byte ff bit ff lba 0 ns 1 vendor 0 csi 0

The NVMe controller error log is supposed to provide details about the failure, but this provides no new info. The qid and command_id match the failure above, but the error location fields (byte and bit), which are supposed to point to the specific byte/bit in the request that the controller doesn't like, are set to 0xffff, which per the spec means "If the error is not specific to a particular command then this field shall be set to FFFFh" - so that's totally unhelpful.

Instead of trying to determine what the controller's unhappy about, I'll try bisecting with an older kernel, to find the commit that introduces the failure.

Matt Wilson (msw-amazon) wrote :

Dan,

It appears that the requests that are being submitted refer to DMA addresses that exceed the guest physical memory range, and this is why the requests are being failed. The address seen is outside the E820 map:

[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009dfff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009e000-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007fffffff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000fc000000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x0000000fbfffffff] usable
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] e820: last_pfn = 0xfc0000 max_arch_pfn = 0x400000000
[ 0.000000] e820: last_pfn = 0x80000 max_arch_pfn = 0x400000000
[ 0.000000] e820: [mem 0x80000000-0xfbffffff] available for PCI devices
[ 5.595004] e820: reserve RAM buffer [mem 0x0009e000-0x0009ffff]

We see an address of 0xfc7ffb000
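A minimal sketch of the boundary comparison being described, using shell arithmetic with the two addresses quoted in this thread (the end of the last usable e820 range and the observed DMA address):

```shell
# Last usable byte from the BIOS-provided e820 map above, and the DMA
# address observed in the rejected NVMe request.
E820_END=$(( 0xfbfffffff ))
DMA_ADDR=$(( 0xfc7ffb000 ))

if [ "$DMA_ADDR" -gt "$E820_END" ]; then
    verdict="outside e820 map"
else
    verdict="inside e820 map"
fi
echo "$verdict"
```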

Dan Streetman (ddstreet) wrote :

> We see an address of 0xfc7ffb000

Hi Matt,

I don't think you're accounting for the additional pages due to the Xen balloon, are you? That increases physical memory after boot. If you check /proc/zoneinfo, look at the Normal zone's spanned pages and start pfn, e.g.:

Node 0, zone Normal
  pages free 15116671
        min 7661
        low 22873
        high 38085
   node_scanned 0
        spanned 15499264
        present 15499264
        managed 15212161
...
  start_pfn: 1048576

and so,
$ printf "%x\n" $[ 1048576 + 15499264 ]
fc8000

meaning that address you see is part of the pages in the balloon memory region...
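The arithmetic above can be scripted; here is a sketch that pulls start_pfn and spanned out of zoneinfo-formatted text (the sample values are inlined from this comment; on a live instance, feed /proc/zoneinfo to the same awk program):

```shell
# Compute the Normal zone's end pfn from zoneinfo-style output.
end_pfn_hex=$(awk '
    $1 == "spanned"    { spanned = $2 }   # pages in the zone, incl. balloon pages
    $1 == "start_pfn:" { start = $2 }     # first pfn of the Normal zone
    END { printf "%x", start + spanned }
' <<'EOF'
Node 0, zone Normal
        spanned 15499264
  start_pfn: 1048576
EOF
)
echo "$end_pfn_hex"   # past the boot-time last_pfn of 0xfc0000
```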

I disabled Ubuntu's memory hotadd (commented it out in /lib/udev/rules.d/40-vm-hotadd.rules) and rebooted; the Normal zone's present pages were reduced so that the zone ends at fc0000, matching the boot-time max pfn. I then tried to reproduce the problem, and it seems gone!

So I think that must be the issue; the hypervisor's NVMe driver isn't expecting any pages from the Xen ballooned region. I checked on Amazon Linux, and saw why it isn't affected:

$ grep XEN_BALLOON /boot/config-4.4.41-36.55.amzn1.x86_64
# CONFIG_XEN_BALLOON is not set

I suspect that skips quite a lot of problems for Amazon Linux, as the Xen ballooning is quite annoying (see bug 1518457 comment 126, for example).

Maybe Ubuntu should disable Xen ballooning for AWS also? If not, then this seems to be a hypervisor bug: it needs to allow pages from the ballooned region as well.

Matt Wilson (msw-amazon) wrote :

Yes, ballooning has been a constant source of problems which is why it is disabled in Amazon Linux AMI.

We do not currently support DMA to/from guest physical addresses outside of the E820 map for ENA networking or NVMe storage interfaces. This effectively means that ballooning needs to be disabled, or perhaps some changes would need to be made in the Xen swiotlb code to bounce data that resides in guest physical addresses that are outside of the E820 map.

Anne-Marie (eg-ubuntu) wrote :

FYI: RHEL 7.3 does not suffer from this problem and appears to have ballooning enabled:

$ grep CONFIG_XEN_BALLOON /boot/config-3.10.0-514.el7.x86_64
CONFIG_XEN_BALLOON=y
# CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is not set

$ uname -a
Linux cassandra-a-2 3.10.0-514.el7.x86_64 #1 SMP Wed Oct 19 11:24:13 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.3 (Maipo)

Matt Wilson (msw-amazon) wrote :

I imagine CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is set for the Ubuntu kernel?

Dan Streetman (ddstreet) wrote :

>> FYI: RHEL 7.3 does not suffer from this problem and appears to have ballooning enabled:
> I imagine CONFIG_XEN_BALLOON_MEMORY_HOTPLUG is set for the Ubuntu kernel?

Yes, exactly: CONFIG_XEN_BALLOON_MEMORY_HOTPLUG must be enabled for the balloon driver to actually increase the physical memory region, which is what the problem is in this case.

Dan Streetman (ddstreet) wrote :

For those watching this bug, to work around this until there is an AMI available that fixes it, you can disable udev memory hotadd by changing the /lib/udev/rules.d/40-vm-hotadd.rules file to comment out the memory hotadd rule, like this:

--- /lib/udev/rules.d/40-vm-hotadd.rules.old 2017-03-01 22:02:39.905314616 +0000
+++ /lib/udev/rules.d/40-vm-hotadd.rules 2017-03-01 22:02:46.797002312 +0000
@@ -6,7 +6,7 @@
 LABEL="vm_hotadd_apply"

 # Memory hotadd request
-SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"
+#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

 # CPU hotadd request
 SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"
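For scripting the workaround, the same edit can be made with sed. This is a sketch run against a temporary copy; on a real instance, point RULES at /lib/udev/rules.d/40-vm-hotadd.rules, run the sed with sudo, and reboot afterwards:

```shell
# Build a temp copy of the rules file with the stock Ubuntu 16.04 rule text.
RULES=$(mktemp)
cat > "$RULES" <<'EOF'
# Memory hotadd request
SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"
EOF

# Prefix the memory hotadd rule with '#', leaving everything else alone.
sed -i 's/^SUBSYSTEM=="memory"/#&/' "$RULES"

commented=$(grep -c '^#SUBSYSTEM=="memory"' "$RULES")
echo "$commented"
```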

Changed in linux-aws (Ubuntu):
assignee: nobody → Dan Streetman (ddstreet)
Changed in linux-aws (Ubuntu Xenial):
assignee: nobody → Dan Streetman (ddstreet)
status: New → Fix Committed
Changed in linux-aws (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
Changed in linux-aws (Ubuntu Xenial):
importance: Undecided → Critical
Jacob Scott (snakescott) wrote :

@ddstreet a few quick questions

* When would you (roughly) expect an AMI to be available?
* How high is your confidence in the 40-vm-hotadd.rules change workaround? Sounds like very high?
* For those of us who are not knowledgeable about this subsystem, are there any drawbacks or things to watch out for if we change 40-vm-hotadd.rules?

Thanks!

Dan Streetman (ddstreet) wrote :

Note on the above: once hotadd is disabled, the Xen balloon driver will still perform the memory hotplug, but the added pages won't be available for use. You can check /proc/zoneinfo and look at the Normal zone, e.g.:

with hotadd enabled (the default in Ubuntu):

Node 0, zone Normal
  pages free 15116671
        min 7661
        low 22873
        high 38085
   node_scanned 0
        spanned 15499264
        present 15499264
        managed 15212161

Notice that 'spanned' and 'present' are the same; 'spanned' includes the physical pages added by the Xen balloon driver, and 'present' indicates they're available for use (some of them are, based on how inflated the balloon is).

With memory hotadd disabled (commented out in the udev rules file, as shown in above comment):

Node 0, zone Normal
  pages free 15104522
        min 16356
        low 31567
        high 46778
   node_scanned 0
        spanned 15499264
        present 15466496
        managed 15212150

Notice that 'spanned' is the same as before, meaning the Xen balloon driver still added the physical pages, but 'present' is lower, indicating the extra balloon pages aren't available for the system to use. That means they won't be sent to the NVMe controller, which works around this bug.
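A quick scripted version of this check (sample Normal-zone values inlined from the output above; on a live system, feed /proc/zoneinfo to the awk program instead):

```shell
# Compare spanned vs present pages in the Normal zone: present < spanned
# means the hot-added balloon pages are not online.
status=$(awk '
    /zone +Normal/ { in_normal = 1 }
    in_normal && $1 == "spanned" { spanned = $2 }
    in_normal && $1 == "present" { present = $2; exit }
    END {
        if (present < spanned) print "balloon pages offline (workaround active)"
        else                   print "balloon pages online"
    }
' <<'EOF'
Node 0, zone Normal
        spanned 15499264
        present 15466496
EOF
)
echo "$status"
```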

Changed in linux-aws (Ubuntu):
status: Triaged → In Progress
Changed in linux-aws (Ubuntu Xenial):
status: Fix Committed → In Progress
Dan Streetman (ddstreet) wrote :

> * When would you (roughly) expect an AMI to be available?

It's too early to say; there are various steps before the fix lands in an AMI.

> * How high is your confidence in the 40-vm-hotadd.rules change workaround? Sounds like very high?

100%. If it doesn't work for you, please let me know.

> * For those of us who are not knowledgeable about this subsystem, are there any drawbacks or things to watch out for if we change 40-vm-hotadd.rules?

Not for this specific case. AWS doesn't use ballooning or any kind of memory hotplug (that I am aware of... Matt can correct me if that is wrong), so there is no issue with disabling it inside an AWS instance.

Jacob Scott (snakescott) wrote :

Great, thanks very much!

Ricky Ramirez (rram) wrote :

Excellent. Sanity check here: This also means that trusty is not affected because the udev rules don't match.

I have /lib/udev/rules.d/40-hyperv-hotadd.rules:

# On Hyper-V Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}!="Microsoft Corporation", GOTO="hyperv_hotadd_end"
ATTR{[dmi/id]product_name}!="Virtual Machine", GOTO="hyperv_hotadd_end"

# Memory hotadd request
SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"

LABEL="hyperv_hotadd_end"

Whereas in xenial the file moved to /lib/udev/rules.d/40-vm-hotadd.rules and includes ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply" which does trigger the bug.

Dan Streetman (ddstreet) wrote :

> Excellent. Sanity check here: This also means that trusty is not affected because the udev
> rules don't match.
...
> in xenial...
> and includes ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply" which does trigger the bug.

That's correct: on 14.04 the Xen balloon memory is not switched online, because the rule doesn't match on Xen systems, as you pointed out. In an AWS instance, commenting out the memory hotadd udev rule in 16.04 changes its behavior (re: memory hotadd) to match 14.04, which works around the bug.
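A sketch of the rule-matching difference being described here, with the DMI vendor string hardcoded for illustration (on a live instance it comes from /sys/class/dmi/id/sys_vendor; AWS HVM guests report "Xen"):

```shell
vendor="Xen"   # hardcoded; live value: cat /sys/class/dmi/id/sys_vendor

case "$vendor" in
    "Xen")
        match="xenial rule matches: hot-added balloon memory gets onlined" ;;
    "Microsoft Corporation")
        match="trusty hyperv rule would match instead" ;;
    *)
        match="no hotadd rule matches: hot-added memory stays offline" ;;
esac
echo "$match"
```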

Dan Streetman (ddstreet) on 2017-03-02
Changed in linux-aws (Ubuntu):
status: In Progress → Fix Committed
Changed in linux-aws (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: removed: kernel-key
Patrick (skyshard) wrote :

I've applied the udev rules change and it doesn't seem to make a difference on the instances I'm testing with:

(after applying the change and reloading udev, rebooting, etc)
ubuntu@hot-i3-muguasak:~$ cat /proc/zoneinfo
...
Node 0, zone Normal
  pages free 14714755
        min 7663
        low 22874
        high 38085
   node_scanned 0
        spanned 15499264
        present 15499264
        managed 15212046

ubuntu@hot-i3-muguasak:~$ cat /lib/udev/rules.d/40-vm-hotadd.rules
# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
ATTR{[dmi/id]sys_vendor}=="Microsoft Corporation", ATTR{[dmi/id]product_name}=="Virtual Machine", GOTO="vm_hotadd_apply"
ATTR{[dmi/id]sys_vendor}=="Xen", GOTO="vm_hotadd_apply"
GOTO="vm_hotadd_end"

LABEL="vm_hotadd_apply"

# Memory hotadd request
#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

# CPU hotadd request
SUBSYSTEM=="cpu", ACTION=="add", DEVPATH=="/devices/system/cpu/cpu[0-9]*", TEST=="online", ATTR{online}="1"

LABEL="vm_hotadd_end"

Errors are the same:
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.319668] EXT4-fs warning (device nvme0n1): ext4_end_bio:314: I/O error -5 writing to inode 108921589 (offset 4185915392 size 8388608 starting block 95900672)
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.319670] buffer_io_error: 246 callbacks suppressed
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.319671] Buffer I/O error on device nvme0n1, logical block 95900416
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.322738] Buffer I/O error on device nvme0n1, logical block 95900417
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.325826] Buffer I/O error on device nvme0n1, logical block 95900418
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.329083] Buffer I/O error on device nvme0n1, logical block 95900419
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.332017] Buffer I/O error on device nvme0n1, logical block 95900420
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.334949] Buffer I/O error on device nvme0n1, logical block 95900421
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.337913] Buffer I/O error on device nvme0n1, logical block 95900422
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.340918] Buffer I/O error on device nvme0n1, logical block 95900423
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.343829] Buffer I/O error on device nvme0n1, logical block 95900424
Mar 20 22:51:03 ip-172-30-4-8 kernel: [ 6797.346815] Buffer I/O error on device nvme0n1, logical block 95900425
Mar 20 22:51:04 ip-172-30-4-8 kernel: [ 6797.826561] JBD2: Detected IO errors while flushing file data on nvme0n1-8
Mar 20 22:51:26 ip-172-30-4-8 kernel: [ 6820.697487] JBD2: Detected IO errors while flushing file data on nvme0n1-8
Mar 20 22:51:36 ip-172-30-4-8 kernel: [ 6830.697208] JBD2: Detected IO errors while flushing file data on nvme0n1-8

Am I missing something obvious?

Dan Streetman (ddstreet) wrote :

Patrick, does this command return any results:

$ grep 0 /sys/devices/system/memory/memory*/online

Dan Streetman (ddstreet) wrote :

Also,

$ grep memory /lib/udev/rules.d/* /etc/udev/rules.d/*

Dan Streetman (ddstreet) wrote :

And, rebuild your initramfs, to make sure it doesn't have a stale udev rule in it (although mine doesn't contain the memory hotadd udev rule):

$ sudo update-initramfs -u

Patrick (skyshard) wrote :

$ grep 0 /sys/devices/system/memory/memory*/online
/sys/devices/system/memory/memory504/online:0

$ grep memory /lib/udev/rules.d/* /etc/udev/rules.d/*
/lib/udev/rules.d/40-vm-hotadd.rules:# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
/lib/udev/rules.d/40-vm-hotadd.rules:#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

I ran update-initramfs and rebooted again, but was still able to reproduce the error with dd

Patrick (skyshard) wrote :

Just to be thorough:

# find / -xdev -type f -name '*.rules' -print0 | xargs -0 fgrep memory
/lib/udev/rules.d/40-vm-hotadd.rules:# On Hyper-V and Xen Virtual Machines we want to add memory and cpus as soon as they appear
/lib/udev/rules.d/40-vm-hotadd.rules:#SUBSYSTEM=="memory", ACTION=="add", DEVPATH=="/devices/system/memory/memory[0-9]*", TEST=="state", ATTR{state}="online"

Dan Streetman (ddstreet) wrote :

Patrick,

Can you attach your /proc/zoneinfo file? Also, which image type is this?

Dan Streetman (ddstreet) wrote :

Ok, I figured out the problem.

You're using the yakkety kernel, 4.8. In the Xenial 4.4 kernel, memory hotplug auto-onlining is disabled; in the 4.8 kernel it is enabled, so disabling the udev rule does nothing - the kernel has already onlined the balloon memory region.

Edit your /etc/default/grub.d/50-cloudimg-settings.cfg file to add a kernel boot param "memhp_default_state=offline", e.g.:

--- /etc/default/grub.d/50-cloudimg-settings.cfg.orig 2017-03-21 17:52:26.604389516 +0000
+++ /etc/default/grub.d/50-cloudimg-settings.cfg 2017-03-21 17:52:46.564462247 +0000
@@ -8,7 +8,7 @@
 GRUB_TIMEOUT=0

 # Set the default commandline
-GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"
+GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0 memhp_default_state=offline"

 # Set the grub console type
 GRUB_TERMINAL=console

Then, run:

$ sudo update-grub

Make sure you see the new boot param in your grub.cfg, e.g.:

$ grep memhp_default_state /boot/grub/grub.cfg
 linux /boot/vmlinuz-4.8.0-41-generic root=UUID=765f00af-531a-44bc-a083-66143320d408 ro console=tty1 console=ttyS0 memhp_default_state=offline

Then reboot, and when it comes back up, check to make sure memory auto-onlining is disabled now:

$ cat /sys/devices/system/memory/auto_online_blocks
offline

Note, you still need to disable the udev memory hotplug online rule, as mentioned in previous comments.
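The grub edit above can also be scripted. This is a sketch against a temporary copy; on a real instance, apply the same sed to /etc/default/grub.d/50-cloudimg-settings.cfg, then run sudo update-grub and reboot:

```shell
# Temp copy with the stock default cmdline from the cloud image settings.
CFG=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="console=tty1 console=ttyS0"' > "$CFG"

# Append the boot param inside the closing quote of the default cmdline.
sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT=.*\)"$/\1 memhp_default_state=offline"/' "$CFG"

cmdline=$(cat "$CFG")
echo "$cmdline"
```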

Patrick (skyshard) wrote :

Thanks for figuring that out! This was using the 16.04 HVM image in us-east-1 ami-2757f631 + hardware enablement (linux-generic-hwe-16.04)

Patrick (skyshard) on 2017-03-21
description: updated
Amanpreet Singh (aps-sids) wrote :

I think I changed the status by mistake (Did not know I could do that), and I'm unable to revert it :/

Changed in linux-aws (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux-aws (Ubuntu Xenial):
status: Fix Committed → In Progress
status: In Progress → Fix Committed
Dan Streetman (ddstreet) on 2017-03-29
Changed in linux-aws (Ubuntu):
status: Fix Released → Fix Committed
Guido Iaquinti (ilmerovingio) wrote :

Hi, is there any ETA for the fix release?

Dan Streetman (ddstreet) wrote :

I'm not sure when a new AMI build is scheduled, but you can 'sudo apt install linux-aws' in an existing xenial instance now to upgrade to the AWS-specific kernel, which has Xen ballooning disabled and so fixes this problem.

Andrew Lau (alau) wrote :

Is there any intention of backporting either linux-aws or any of the NVMe bug fixes from linux-aws into linux-image-virtual-lts-xenial for Ubuntu 14.04 since they're both 4.4.0 kernels?

Andrew Lau (alau) on 2017-04-13
no longer affects: linux-lts-xenial (Ubuntu)
no longer affects: linux-lts-xenial (Ubuntu Xenial)
Dan Streetman (ddstreet) wrote :

> Is there any intention of backporting either linux-aws or any of the NVMe bug fixes from
> linux-aws into linux-image-virtual-lts-xenial for Ubuntu 14.04 since they're both
> 4.4.0 kernels?

The fix for this bug is to change the kernel config param CONFIG_XEN_BALLOON from y to n, disabling Xen ballooning in the linux-aws kernel. This avoids the problems associated with DMA to/from the memory pages added by the balloon driver, which are physically located outside the e820 region.

As such, there is no actual code change to backport into the generic xenial (or other release) kernel, and disabling Xen ballooning in the generic kernel is inappropriate, as it would remove functionality from any other Ubuntu-under-Xen users who do want to use Xen ballooning.

An upstream discussion on this particular issue is here:
https://lkml.org/lkml/2017/3/22/878

We're consulting Amazon to see whether they can update the AWS hypervisor to reject any requests by the Linux guest to register hot-added memory pages, which would also fix this problem.

Mark Rose (markrose) wrote :

This bug is still present on 14.04 using linux-generic-lts-xenial kernel 4.4.0-87-generic.

Dan Streetman (ddstreet) wrote :

> This bug is still present on 14.04 using linux-generic-lts-xenial kernel 4.4.0-87-generic.

That's correct, and there is no planned change for the standard kernel. Only the linux-aws kernel is being changed to address this issue, by disabling Xen memory ballooning, as described in comment 50.

A bit more detail on the issue:

1. AWS Xen hypervisor boots linux and provides e820 map, and Xen balloon target.
2. Ubuntu kernel boots and sets up all memory listed in the e820 map.
3. Xen balloon driver notices total memory doesn't quite match its target, and so requests some pages from Xen hypervisor.
4. AWS Xen hypervisor allows Ubuntu kernel balloon driver to have exactly 11 more pages, which are registered with the Ubuntu kernel as hotplugged memory (hypervisor rejects requests for any more balloon pages).
5. The new balloon hotplugged pages are enabled (via udev or kernel config or sysfs), which makes them available for general use.
6. If any NVMe I/O operation uses any of those 11 balloon pages for DMA, the hypervisor sees that the page physical address is outside its e820 map address range (because it was a hotplugged page) and fails the NVMe I/O.

The problem here lies either in #4 or #6 above, meaning that the hypervisor either should reject all requests for additional hotplugged memory pages (step 4) or it should allow DMA using hotplugged memory pages (step 6). Any change to the Ubuntu kernel is only working around this hypervisor problem by not enabling any hotplugged pages.

AWS is well aware of this and is investigating what changes can be made to their hypervisor, but I am not part of those discussions and so I can't provide any more detail on if/when AWS might fix either #4 and/or #6. I will note that the Amazon Linux kernel has Xen ballooning disabled, and I believe the RHEL kernel does as well, so they have both only worked around this issue.

Until the AWS hypervisor is changed, there are various options to work around the issue:

Trusty:
The trusty 14.04 release does have Xen ballooning enabled, and it does hotplug memory, however the udev rules do not enable the hotplugged memory, so this issue does not exist in trusty (unless the hotplugged memory is manually enabled).

Xenial with 4.4 kernel:
The standard 4.4 kernel in Xenial does have Xen ballooning enabled, because it may be desired under non-AWS Xen hypervisors. The recommended way to work around the issue is to edit the 40-vm-hotadd.rules as described in comment 29.

Xenial with HWE kernel, or Zesty:
Starting with the 4.8 kernel, hotplug memory is automatically onlined, so in addition to editing the udev rule as described above (in Xenial with 4.4 kernel), you also must add a kernel boot param as described in comment 44.

Xenial linux-aws:
The linux-aws kernel has Xen ballooning disabled in the kernel configuration, so it will not cause any memory to be hotplugged, thus avoiding the problem; no other workaround is required when using the linux-aws kernel.
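The per-release guidance above can be sketched as a lookup; the kernel version patterns here are illustrative assumptions, not an exhaustive mapping:

```shell
# Map a kernel version string to the applicable workaround described above.
workaround_for() {
    case "$1" in
        *-aws*)       echo "none needed: linux-aws disables Xen ballooning" ;;
        3.13.*)       echo "none needed: trusty udev rules do not match Xen" ;;
        4.4.*)        echo "comment out memory hotadd in 40-vm-hotadd.rules" ;;
        4.8.*|4.1?.*) echo "udev rule edit plus memhp_default_state=offline" ;;
        *)            echo "check whether hotplug memory is being onlined" ;;
    esac
}

workaround_for "4.8.0-41-generic"
```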

I am marking this as "wont fix" for the standard Xenial kernel.

Changed in linux (Ubuntu Xenial):
status: Triaged → Won't Fix
Changed in linux (Ubuntu):
status: Triaged → Won't Fix
Dan Streetman (ddstreet) wrote :

> This bug is still present on 14.04 using linux-generic-lts-xenial kernel 4.4.0-87-generic.

Sorry, I misread your statement - unless you have edited your udev rule to enable the hotplug memory, you should not encounter this issue using Trusty with either the 3.13 or 4.4 kernel. If you are, I suggest you check your udev rule (comment 29).

Dan Streetman (ddstreet) on 2017-10-13
Changed in linux-aws (Ubuntu):
status: Fix Committed → Fix Released
Changed in linux-aws (Ubuntu Xenial):
status: Fix Committed → Fix Released