Hung tasks on UEC cloud images with EBS volumes

Bug #808872 reported by Ben Howard
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

On the UEC images on Amazon, people occasionally see "task blocked for more than 120 seconds" hung-task messages. From previous experience with Amazon, EXT4 file systems are prone to these messages because of the way the virtual (/dev/xvd*) EBS disks are presented to the DomU, combined with EXT4's delayed commits. EBS volumes are presented to DomUs as physical disks attached in Dom0, but the actual disk is a network device. During periods of high I/O, flushing dirty pages to the network disk can take long enough to trigger the hung-task messages.
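
One way to confirm these symptoms on an instance is to scan the kernel log for the hung-task watchdog lines. The sketch below is illustrative only: it assumes the usual "INFO: task <name>:<pid> blocked for more than N seconds" wording, and find_hung_tasks is a name made up for this example.

```python
import re
from collections import Counter

# Assumed wording of the kernel's hung-task watchdog message.
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Count hung-task reports per task name in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[match.group("name")] += 1
    return counts
```

Passing it the lines of dmesg output (for example, sys.stdin with dmesg piped in) gives a per-task count, which helps show whether the stalls cluster around journal and flusher threads.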

Generally, this affects the m2.* and cc1.4xlarge instance types (the expensive premium instances). Adjusting "vm.dirty_background_ratio" and "vm.dirty_expire_centisecs" has been shown to mitigate these symptoms. The problem, however, is that adjusting these values can result in poor system performance, depending on the workload and the instance type. For example, on an m2.4xlarge, which has 72G of RAM, the number of dirty pages can be significantly larger than on a t1.micro, which has only 604M of RAM.

This bug has been filed to get guidance for the community from the kernel team on tuning the vm.dirty* settings to prevent hung tasks.

vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 500
vm.dirty_expire_centisecs = 3000
vm.drop_caches = 0
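
To make the size difference concrete, the ratios listed above can be turned into rough byte counts. This is a back-of-the-envelope sketch only: the kernel applies the ratios to dirtyable memory (roughly free plus reclaimable pages), not to total RAM, so using total RAM gives an upper bound. The RAM figures are the ones quoted in the description, and dirty_thresholds is a name made up for this example.

```python
def dirty_thresholds(ram_bytes, background_ratio=10, dirty_ratio=20):
    """Return (background, hard) dirty-page limits in bytes.

    Uses total RAM as a stand-in for dirtyable memory, so the
    results are upper bounds rather than exact kernel thresholds.
    """
    background = ram_bytes * background_ratio // 100
    hard = ram_bytes * dirty_ratio // 100
    return background, hard


GIB = 1024 ** 3
MIB = 1024 ** 2

# RAM sizes as quoted in the bug description.
for name, ram in (("m2.4xlarge", 72 * GIB), ("t1.micro", 604 * MIB)):
    background, hard = dirty_thresholds(ram)
    print(f"{name}: background writeback above ~{background / MIB:.0f} MiB, "
          f"writers throttled above ~{hard / MIB:.0f} MiB")
```

With the same percentages, the larger instance can accumulate over a hundred times more dirty data before background writeback even starts, all of which then has to drain to the same kind of network-backed disk.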

---

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-virtual 2.6.38.8.22
ProcVersionSignature: User Name 2.6.38-8.42-virtual 2.6.38.2
Uname: Linux 2.6.38-8-virtual x86_64
AlsaDevices:
 total 0
 crw------- 1 root root 116, 1 2011-07-06 21:41 seq
 crw------- 1 root root 116, 33 2011-07-06 21:41 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg: [ 21.600011] eth0: no IPv6 routers present
Date: Mon Jul 11 16:01:31 2011
Ec2AMI: ami-6463980d
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1d
Ec2InstanceType: t1.micro
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: root=LABEL=uec-rootfs ro console=hvc0
ProcModules: acpiphp 24097 0 - Live 0x0000000000000000
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh insta
---
AlsaDevices:
 total 0
 crw------- 1 root root 116, 1 2011-07-25 20:32 seq
 crw------- 1 root root 116, 33 2011-07-25 20:32 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
CurrentDmesg:

DistroRelease: Ubuntu 11.04
Ec2AMI: ami-c55b9cac
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1a
Ec2InstanceType: t1.micro
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
Package: linux (not installed)
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: root=LABEL=uec-rootfs ro console=hvc0
ProcModules: acpiphp 24097 0 - Live 0x0000000000000000
ProcVersionSignature: User Name 2.6.38-10.46-virtual 2.6.38.7
Tags: natty ec2-images
Uname: Linux 2.6.38-10-virtual x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm admin audio cdrom dialout dip floppy plugdev video

Ben Howard (darkmuggle-deactivatedaccount) wrote :
summary: - Hung tasks on UEC cloud images
+ Hung tasks on UEC cloud images with EBS volumes
description: updated
description: updated
Stefan Bader (smb) wrote :

Creating I/O faster than the storage can cope with will generally catch up with the system at some point. I am not sure whether I missed it or it is simply not mentioned, but the question is: which values have been tested? And maybe there is no single good value that handles both large-memory and small-memory systems.

Otherwise, yes, it could be worth adjusting dirty_background_ratio downwards (to start background writeout sooner, though if the percentage is too small on small systems, writes potentially become more fragmented) and moving dirty_ratio up (to get a bigger window before processes start waiting for flushed I/O).
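
As a starting point for experiments along these lines, the current values can be read from /proc/sys/vm and compared with a candidate pair. This is a sketch, not a recommendation: the candidate numbers 5 and 40 are placeholders chosen only to show the direction of the change (background ratio down, dirty ratio up), and read_vm_setting, render_sysctl and compare are names made up for this example.

```python
from pathlib import Path


def read_vm_setting(name, base="/proc/sys/vm"):
    """Read one integer vm.* setting, e.g. name='dirty_ratio'."""
    return int(Path(base, name).read_text().strip())


def render_sysctl(settings):
    """Format a {name: value} mapping as sysctl.conf-style lines."""
    return "\n".join(
        f"vm.{name} = {value}" for name, value in sorted(settings.items())
    )


# Placeholder values for illustration only; suitable numbers
# depend on the workload and the instance type.
CANDIDATE = {"dirty_background_ratio": 5, "dirty_ratio": 40}


def compare(base="/proc/sys/vm"):
    """Return (name, current, candidate) tuples for each candidate setting."""
    return [
        (name, read_vm_setting(name, base), value)
        for name, value in sorted(CANDIDATE.items())
    ]
```

Printing compare() next to render_sysctl(CANDIDATE) on each instance type under test makes it easier to record exactly which values were tried, which is the information asked for above.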

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 808872

and then change the status of the bug back to 'New'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ben Howard (darkmuggle-deactivatedaccount) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Ben Howard (darkmuggle-deactivatedaccount) wrote : ProcCpuinfo.txt

apport information

Ben Howard (darkmuggle-deactivatedaccount) wrote : ProcCpuinfo_.txt

apport information

Ben Howard (darkmuggle-deactivatedaccount) wrote : ProcInterrupts.txt

apport information

Ben Howard (darkmuggle-deactivatedaccount) wrote : UdevDb.txt

apport information

Ben Howard (darkmuggle-deactivatedaccount) wrote : UdevLog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → New
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Added apport information and reset status to new per comment 3.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 808872

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Ben Howard (darkmuggle-deactivatedaccount) wrote :

Log files were added.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed