cfq triggers smbd timeouts

Bug #627380 reported by Joshua Coombs
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I have Samba installed on an Ubuntu 10.04 amd64 system with a single share on an ext4 volume. Multiple Windows systems hit this share nightly as they store ntbackup dumps on the box. I've been chasing random backup job failures on this machine for a while now. I suspected a bug in Samba, but I never found an error in the logs or saw a crash dump. If I run the dumps manually during the day, there's no problem. The failures only occur at night, at random times. When there are multiple failures, they all happen at the same time. Windows only reports a write error.

Digging around, I found some discussions online noting that cfq can cause IO starvation in some workloads, making processes appear to hang for two minutes or more. My Samba logs show pauses in activity of over a minute right before Windows logs a failure. Changing to noop has so far cleared up the failures.
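For anyone wanting to confirm which scheduler a device is actually using, here is a minimal sketch in Python. It assumes the usual sysfs layout, where /sys/block/<device>/queue/scheduler lists the available schedulers with the active one in square brackets. The function names are made up for illustration, and the cciss!c0d0 device name in the comment is only a guess at how a cciss volume might appear on a given box.

```python
import re
from pathlib import Path


def active_scheduler(text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


def read_scheduler(device, sysfs_root="/sys/block"):
    """Read the active scheduler for a block device from sysfs.

    Assumption: sysfs replaces "/" in device names with "!",
    so a device such as cciss/c0d0 would appear as cciss!c0d0.
    """
    path = Path(sysfs_root) / device.replace("/", "!") / "queue" / "scheduler"
    return active_scheduler(path.read_text())


if __name__ == "__main__":
    # Parsing a typical line as it might appear on a 2.6.35 kernel.
    print(active_scheduler("noop deadline [cfq]"))  # prints "cfq"
```

Switching schedulers at runtime is normally done by writing the new name (for example noop) to the same sysfs file as root. That change does not persist across reboots unless it is also set on the kernel command line.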

I'm currently running 2.6.35-999-generic #201008021608 (mainline kernel) due to bug 474089, and have upgraded to the Maverick samba packages and related libs as part of trying to track this issue down. I'm writing to a 2TB SATA drive behind a cciss controller, no RAID. The problem is definitely load related so I can only really get one viable test in per night, and for obvious reasons I can't stick with a broken config for too many nights in a row, but I'm more than willing to try and gather whatever test data is needed to improve things.
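One way to gather data on the pauses is to scan the Samba log for gaps between consecutive timestamps. The sketch below is only an illustration: it assumes each log header line starts with a timestamp of the form [YYYY/MM/DD HH:MM:SS, which may not match every Samba configuration. The function name and the 60-second threshold are my own choices, and the sample lines are invented, not taken from the affected system.

```python
import re
from datetime import datetime, timedelta

# Assumed header format: "[2010/08/30 02:10:05,  1] source/file.c:123(function)"
STAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold=timedelta(seconds=60)):
    """Return (previous, current) timestamp pairs at least `threshold` apart."""
    gaps = []
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # continuation lines carry no timestamp
        current = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if previous is not None and current - previous >= threshold:
            gaps.append((previous, current))
        previous = current
    return gaps


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    sample = [
        "[2010/08/30 02:10:00,  1] smbd/example.c:1(example)",
        "  message body without a timestamp",
        "[2010/08/30 02:10:05,  1] smbd/example.c:1(example)",
        "[2010/08/30 02:11:30,  1] smbd/example.c:1(example)",
    ]
    for start, end in find_gaps(sample):
        print(f"{start} -> {end}: {(end - start).total_seconds():.0f}s of silence")
```

Comparing the reported gaps against the times Windows logged its write errors would show whether the pauses and the failures really line up.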

Jeremy Foshee (jeremyfoshee) wrote :

Hi Joshua,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/daily/current/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 627380

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Joshua Coombs (josh-coombs-gmail) wrote :

I'm working on setting up a test system as we speak. Since our setup has stabilized after switching to noop, I unfortunately can no longer tinker with it directly. Hopefully I'll have data in a few days.

Joshua Coombs (josh-coombs-gmail) wrote :

So far I haven't been able to hit my test box with enough load to trigger a repeat. My production system has glitched a few more times even with noop, so now I'm really lost.

I can put a newer mainline kernel on my production system, or I can keep trying to increase the load on my test box. How would you prefer I proceed?

Joshua Coombs (josh-coombs-gmail) wrote :

I don't have the resources to replicate this, or permission to alter my production system, at this time. I'm requesting that this bug be closed until I can reliably reproduce the problem again, if that ever happens.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired