Hung tasks on UEC cloud images with EBS volumes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | Undecided | Unassigned |
Bug Description
On the UEC images on Amazon, from time to time people see "task blocked for more than 120 seconds" messages. From previous experience with Amazon, EXT4 file systems are prone to these messages due to the way the virtual (/dev/xvd*) EBS disks are presented to the DomU and EXT4's delayed commits. EBS volumes are presented to DomUs as physical disks attached in Dom0; the actual disk is a network device. During periods of high I/O, flushing of dirty pages to the network-backed disk can take long enough to trigger the hung-task warnings.
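For reference, the warnings in question follow the standard kernel hung-task format. A minimal sketch of what to look for (the task name and PID below are illustrative, not taken from an affected host):

```shell
# Sample of the kernel's hung-task warning format; the task name/PID
# here ("flush-202:1", 374) are made-up examples.
sample='INFO: task flush-202:1:374 blocked for more than 120 seconds.'

# The same pattern used against a live system would be:
#   dmesg | grep 'blocked for more than 120 seconds'
echo "$sample" | grep -q 'blocked for more than 120 seconds' && echo "hung task seen"
```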
Generally, this affects m2.* and cc1.4xlarge instance types (the expensive premium instances). Adjusting the "vm.dirty_*" settings appears to mitigate the problem.
This bug has been filed to see about getting guidance for the community from the kernel team on tuning of vm.dirty* settings to prevent hung tasks.
vm.dirty_
vm.dirty_
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_
vm.dirty_
vm.drop_caches = 0
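To see why the percentage-based defaults bite hardest on large-memory instances, here is a rough sketch of how they translate into bytes of dirty data. The memory size and the background ratio below are assumptions for illustration (roughly an m2.4xlarge), not measurements from the affected hosts:

```shell
# Sketch: convert the ratio-based dirty limits into megabytes for a
# large-memory instance. All inputs are assumed example values.
mem_kb=71680000          # ~68 GiB of RAM, roughly an m2.4xlarge (assumed)
dirty_ratio=20           # vm.dirty_ratio, per the listing above
background_ratio=10      # assumed vm.dirty_background_ratio

background_mb=$(( mem_kb * background_ratio / 100 / 1024 ))
dirty_limit_mb=$(( mem_kb * dirty_ratio / 100 / 1024 ))

echo "background writeback starts at ~${background_mb} MiB of dirty pages"
echo "writers are throttled at ~${dirty_limit_mb} MiB of dirty pages"
```

With numbers like these, several gigabytes of dirty pages can accumulate before writeback even starts, and flushing them over a network-backed EBS device can easily exceed the 120-second hung-task timeout.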
---
ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-virtual 2.6.38.8.22
ProcVersionSign
Uname: Linux 2.6.38-8-virtual x86_64
AlsaDevices:
total 0
crw------- 1 root root 116, 1 2011-07-06 21:41 seq
crw------- 1 root root 116, 33 2011-07-06 21:41 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg: [ 21.600011] eth0: no IPv6 routers present
Date: Mon Jul 11 16:01:31 2011
Ec2AMI: ami-6463980d
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: t1.micro
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
Lspci:
Lsusb: Error: command ['lsusb'] failed with exit code 1:
ProcEnviron:
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcKernelCmdLine: root=LABEL=
ProcModules: acpiphp 24097 0 - Live 0x0000000000000000
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
---
AlsaDevices:
total 0
crw------- 1 root root 116, 1 2011-07-25 20:32 seq
crw------- 1 root root 116, 33 2011-07-25 20:32 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg:
DistroRelease: Ubuntu 11.04
Ec2AMI: ami-c55b9cac
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: t1.micro
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
Lspci:
Lsusb: Error: command ['lsusb'] failed with exit code 1:
Package: linux (not installed)
ProcEnviron:
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcKernelCmdLine: root=LABEL=
ProcModules: acpiphp 24097 0 - Live 0x0000000000000000
ProcVersionSign
Tags: natty ec2-images
Uname: Linux 2.6.38-10-virtual x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm admin audio cdrom dialout dip floppy plugdev video
Generating I/O faster than the storage can cope with will generally catch up with the system at some point. I am not sure whether I missed it or there is actually nothing there, but the question is: which values have actually been tested? And maybe there is no single value that handles both large-memory and small-memory systems well.
Otherwise, yes, it could be worth adjusting dirty_background_ratio downwards (to start background writeout sooner, though if the percentage is too small on small systems, writes potentially get more fragmented) and moving dirty_ratio up (to give a bigger window before processes start waiting on flushed I/O).
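A minimal sketch of the tuning direction described above: lower the background threshold, raise the foreground one. The specific values (5 and 40) are illustrative assumptions, not tested recommendations, and the file is staged locally so the sketch runs without root:

```shell
# Sketch of the suggested direction: earlier background writeout,
# bigger foreground window. Values 5/40 are assumed examples only.
conf=./sysctl-ebs.conf   # stage locally; install to /etc/sysctl.d/ as root

cat > "$conf" <<'EOF'
vm.dirty_background_ratio = 5
vm.dirty_ratio = 40
EOF

# To apply immediately (requires root):
#   sudo sysctl -p ./sysctl-ebs.conf
cat "$conf"
```

Whether these values help without side effects (e.g. more fragmented writes on small-memory instances, as noted above) is exactly what this bug asks the kernel team to weigh in on.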