reading data from firewire drive stalls, ieee1394 errors appear in messages

Bug #440338 reported by Ethan Tira-Thompson
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
pm-utils (Ubuntu)
Invalid
Undecided
Unassigned

Bug Description

When playing video from an external ieee1394 drive with mplayer (unsure if it occurs other times too), the video will stall at an irregular interval. After a second or two it will start playing again.

During this time, 'top' reports ~100% CPU utilization waiting for I/O (%wa), and the following lines are dumped in /var/log/messages:
Oct 1 23:13:42 ginger kernel: [260307.000180] ieee1394: sbp2: aborting sbp2 command
Oct 1 23:13:42 ginger kernel: [260307.000213] sd 2:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 40 a1 00
Oct 1 23:13:42 ginger kernel: [260362.000084] ieee1394: sbp2: aborting sbp2 command
Oct 1 23:13:42 ginger kernel: [260362.000103] sd 2:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 40 ec 00
Oct 1 23:13:42 ginger kernel: [260370.000194] ieee1394: sbp2: aborting sbp2 command
Oct 1 23:13:42 ginger kernel: [260370.000221] sd 2:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 2e 00 00 00 00 00 00 00 00 00 00 40 a1 00

This happens about once every 1-20 minutes, averaging about 4 or 5 minutes (pretty irregular, I don't see a pattern).
Also, going back in the log there is a steady stream of these messages, even when the machine is otherwise idle. So it is not access via mplayer, that is simply a symptom that makes it obvious (and annoying).

The external drive is a 1TB western digital caviar green WD10EADS living in a Macally G-S350SUA firewire 400 enclosure. I presume the ubuntu-bug has already collected by other system information.

ProblemType: Bug
Architecture: i386
Date: Thu Oct 1 23:14:03 2009
DistroRelease: Ubuntu 9.10
Package: linux-generic 2.6.31.11.22
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-11.36-generic
SourcePackage: linux-meta
Uname: Linux 2.6.31-11-generic i686

Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :
Revision history for this message
kernel-janitor (kernel-janitor) wrote :

[This is an automated message. Apologies if it has reached you inappropriately.]

This bug was reported against the linux-meta package when it likely should have been reported against the linux package instead. We are automatically transitioning this to the linux kernel package so that the appropriate teams are notified and made aware of this issue.

If this bug really is a bug in the linux-meta package you can move it back to linux-meta and set the Status to Confirmed, or contact us on the #ubuntu-kernel channel on the FreeNode IRC server. Thanks.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :

I've done some research, it looks like the ATA command which is being passed through the SCSI-over-firewire connection is IDENTIFY PACKET DEVICE (for errors with "a1" in the next to last byte of the hex dump) and IDENTIFY DEVICE (for errors with "ec" as the next to last byte). So something is trying to directly query the device (SMART check for drive health?), and apparently my enclosure chipset either doesn't support the passthrough, or the drive doesn't respond and is giving the abort.

I will attempt to followup with the enclosure manufacturer, but also if we could figure out what is triggering this query and disable it that would be helpful so I can use my media center again :)

Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :
Download full text (4.6 KiB)

I should also add this may be related to bug #438605.

I replicated the problem with another disk (500MB seagate barracuda 7200.10) in another enclosure of the same model (Macally G-S350SUA).

I tried running smartctl -a -d sat /dev/sdb and it locked up. ctrl-C and kill -9 were unable to kill the process, but when I turned off the drive (already unmounted) then it exited, with the following in /var/log/messages:

Oct 4 15:24:07 ginger kernel: [ 9467.000117] ieee1394: sbp2: aborting sbp2 command
Oct 4 15:26:09 ginger kernel: [ 9467.000146] sd 7:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Oct 4 15:26:09 ginger kernel: [ 9467.007586] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
Oct 4 15:26:09 ginger kernel: [ 9528.000141] ieee1394: sbp2: aborting sbp2 command
Oct 4 15:26:09 ginger kernel: [ 9528.000170] sd 7:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Oct 4 15:26:09 ginger kernel: [ 9589.000410] ieee1394: sbp2: aborting sbp2 command
Oct 4 15:26:20 ginger kernel: [ 9589.000439] sd 7:0:0:0: [sdb] CDB: ATA command pass through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00
Oct 4 15:26:20 ginger kernel: [ 9600.424528] INFO: task smartctl:4707 blocked for more than 120 seconds.
Oct 4 15:26:20 ginger kernel: [ 9600.424547] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 4 15:26:20 ginger kernel: [ 9600.424561] smartctl D c080b3a0 0 4707 4574 0x00000004
Oct 4 15:26:20 ginger kernel: [ 9600.424583] db7d1c38 00000082 db4ab960 c080b3a0 c1925a88 c080b3a0 35a0fd6c 0000089c
Oct 4 15:26:20 ginger kernel: [ 9600.424610] c080b3a0 c080b3a0 c1925a88 c080b3a0 359e31f9 0000089c c080b3a0 cbaf6a80
Oct 4 15:26:20 ginger kernel: [ 9600.424635] c19257f0 db7d1ce8 7fffffff db7d1cec db7d1c94 c0569b15 db7d1c4c 00000096
Oct 4 15:26:20 ginger kernel: [ 9600.424660] Call Trace:
Oct 4 15:26:20 ginger kernel: [ 9600.424700] [<c0569b15>] schedule_timeout+0x185/0x200
Oct 4 15:26:20 ginger kernel: [ 9600.424728] [<c02fe95e>] ? __blk_run_queue+0x6e/0x120
Oct 4 15:26:20 ginger kernel: [ 9600.424750] [<c02f4206>] ? elv_insert+0x116/0x1f0
Oct 4 15:26:20 ginger kernel: [ 9600.424767] [<c0569832>] wait_for_common+0xa2/0x120
Oct 4 15:26:20 ginger kernel: [ 9600.424783] [<c0137c50>] ? default_wake_function+0x0/0x10
Oct 4 15:26:20 ginger kernel: [ 9600.424799] [<c0569942>] wait_for_completion+0x12/0x20
Oct 4 15:26:20 ginger kernel: [ 9600.424814] [<c03017f5>] blk_execute_rq+0x75/0xd0
Oct 4 15:26:20 ginger kernel: [ 9600.424827] [<c03016b0>] ? blk_end_sync_rq+0x0/0x30
Oct 4 15:26:20 ginger kernel: [ 9600.424842] [<c0301a3e>] ? blk_recount_segments+0x1e/0x40
Oct 4 15:26:20 ginger kernel: [ 9600.424857] [<c02fd81a>] ? blk_rq_bio_prep+0x6a/0x80
Oct 4 15:26:20 ginger kernel: [ 9600.424876] [<c03012ef>] ? blk_rq_append_bio+0x1f/0x60
Oct 4 15:26:20 ginger kernel: [ 9600.424892] [<c03013ee>] ? blk_rq_map_kern+0xbe/0x120
Oct 4 15:26:20 ginger kernel: [ 9600.424909] [<c030500f>] sg_scsi_ioctl+0x20f/0x330
Oct 4 15:26:20 ginger kernel: [ 9600.424926] [<c030534f>] scsi_cmd_ioc...

Read more...

Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :

I have tracked down the source of the sbp2 passthrough commands. ps ax -F during 'normal' operation, then again during a stall, then see what's different:
root 5313 1017 0 459 480 0 21:35 ? S 0:00 /bin/sh -c /etc/acpi/power.sh
root 5314 5313 0 459 508 0 21:35 ? S 0:00 /bin/sh /etc/acpi/power.sh
root 5319 5314 0 459 564 0 21:35 ? S 0:00 /bin/sh /usr/sbin/pm-powersave
root 5336 5319 0 459 552 0 21:35 ? S 0:00 /bin/sh /usr/lib/pm-utils/power.d/95hdparm-apm false
root 5346 5336 0 426 400 0 21:35 ? D 0:00 hdparm -i /dev/sdb
root 5347 5336 0 463 560 0 21:35 ? S 0:00 grep -q AdvancedPM=yes

These are consistently found every time. Running the hdparm command manually does indeed trigger the error. Stopping acpid fixes the video hiccups. I have a feeling someone more familiar with acpid configuration could figure out the 'proper' fix so it doesn't requery hdparm so often. Moving to the acpid package.

Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :

Looks like the 95hdparm-apm script in particular needs to do something about not calling hdparm unnecessarily... on drives where it is not supported (e.g. my external firewire enclosure) this causes a stall on access to the drive which interrupts things like video playback.

affects: linux (Ubuntu) → pm-utils (Ubuntu)
Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :

I see that this bug is still alive and well in the release candidate... this bug makes realtime tasks (like watching videos) from (some?) external firewire drives unusable due to frequent skips and pauses. Perhaps should be fixed for the release?

Revision history for this message
Ethan Tira-Thompson (ejtttje) wrote :

This continues to be a problem in the 9.10 release. I find the easiest solution for myself is to disable ACPI since this is an always-on media center. Any thoughts on whether the ATA command passthrough hiccup is specific to this model of hard drive enclosure, or potentially a more widespread problem? If so, this could be causing subtle performance problems with firewire drives.

Perhaps we can request that the ieee1394 driver be patched to avoid the long block associated with this error, but also pm-utils could try to detect and avoid failed calls to hdparm... (the driver blockage may be unavoidable?)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Please reply if this is still an issue on a supported release.

Changed in pm-utils (Ubuntu):
status: New → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.