Flaky firewire disks Fail Feisty but Do Dapper

Bug #134396 reported by Tommy Trussell on 2007-08-24
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have an external Firewire/USB enclosure -- a MacAlly PHR-100AC enclosure containing a Maxtor DiamondMax Plus 9 200GB ATA/133 HDD. I have it formatted with three ext3 partitions (two large, and one small).

I copied my working data from my old PowerBook G3 Series (Wallstreet II/PDQ) running Dapper to this drive, then mounted it on an eMonster 550r running Feisty. This worked OK, though I would occasionally notice the FireWire drive wouldn't mount on the first try and I would have to power it down or log out and in before it would come up. I should have taken this as a warning!

This morning it didn't come up at all. I examined dmesg and noticed (among other error messages) a complaint that the drive had been mounted more than some number of times (50?) without an fsck. Like an idiot, I decided to run fsck manually, and of course it found some problems and corrected them. (Unfortunately I didn't keep the log of its corrections.)

After running fsck, the drive mounted in Feisty, but lots of data was missing and the logs were filling with all sorts of dire error messages. I tried various things but could not copy any of the data from the drive without encountering file errors.

For some reason I moved the drive back to the PowerBook and discovered that Dapper can read the drive perfectly. So I copied the data over the network to a new firewire drive on the Feisty box. I'm going to gather some of the relevant data and add it to this report, but I have to go at the moment....

Tommy Trussell (tommy-trussell) wrote :

information from Feisty system where firewire drive fails after being plugged in:

twt@emonster:~$ uname -a
Linux emonster 2.6.20-16-generic #2 SMP Thu Jun 7 20:19:32 UTC 2007 i686 GNU/Linux
twt@emonster:~$ lspci
00:00.0 Host bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX Host bridge (rev 03)
00:01.0 PCI bridge: Intel Corporation 440BX/ZX/DX - 82443BX/ZX/DX AGP bridge (rev 03)
00:07.0 ISA bridge: Intel Corporation 82371AB/EB/MB PIIX4 ISA (rev 02)
00:07.1 IDE interface: Intel Corporation 82371AB/EB/MB PIIX4 IDE (rev 01)
00:07.2 USB Controller: Intel Corporation 82371AB/EB/MB PIIX4 USB (rev 01)
00:07.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 02)
00:0b.0 Multimedia audio controller: Cirrus Logic CS 4614/22/24/30 [CrystalClear SoundFusion Audio Accelerator] (rev 01)
00:10.0 Ethernet controller: D-Link System Inc RTL8139 Ethernet (rev 10)
00:12.0 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:12.1 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:12.2 USB Controller: ALi Corporation USB 1.1 Controller (rev 03)
00:12.3 USB Controller: ALi Corporation USB 2.0 Controller (rev 01)
00:12.4 FireWire (IEEE 1394): ALi Corporation M5253 P1394 OHCI 1.1 Controller
01:00.0 VGA compatible controller: nVidia Corporation NV6 [Vanta/Vanta LT] (rev 15)
twt@emonster:~$ tail -30 /var/log/syslog
Aug 23 20:49:09 emonster -- MARK --
Aug 23 21:09:09 emonster -- MARK --
Aug 23 21:17:01 emonster /USR/SBIN/CRON[17508]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Aug 23 21:29:09 emonster -- MARK --
Aug 23 21:41:05 emonster kernel: [19979.040000] ieee1394: Error parsing configrom for node 0-00:1023
Aug 23 21:41:05 emonster kernel: [19979.040000] ieee1394: Node changed: 0-00:1023 -> 0-01:1023
Aug 23 21:41:05 emonster kernel: [19979.040000] ieee1394: Node changed: 0-01:1023 -> 0-02:1023
Aug 23 21:41:05 emonster kernel: [19979.048000] ieee1394: sbp2: Reconnected to SBP-2 device
Aug 23 21:41:05 emonster kernel: [19979.048000] ieee1394: sbp2: Node 0-01:1023: Max speed [S400] - Max payload [2048]
Aug 23 21:41:19 emonster kernel: [19992.760000] ieee1394: Node resumed: ID:BUS[0-00:1023] GUID[001010054000080c]
Aug 23 21:41:19 emonster kernel: [19992.768000] ieee1394: sbp2: Reconnected to SBP-2 device
Aug 23 21:41:19 emonster kernel: [19992.768000] ieee1394: sbp2: Node 0-01:1023: Max speed [S400] - Max payload [2048]
Aug 23 21:41:19 emonster kernel: [19992.768000] scsi6 : SBP-2 IEEE-1394
Aug 23 21:41:19 emonster kernel: [19992.768000] ieee1394: sbp2: Workarounds for node 0-00:1023: 0x2 (firmware_revision 0x000265, vendor_id 0x001010, model_id 0x000201)
Aug 23 21:41:20 emonster kernel: [19993.772000] ieee1394: sbp2: Logged into SBP-2 device
Aug 23 21:41:20 emonster kernel: [19993.772000] ieee1394: sbp2: Node 0-00:1023: Max speed [S400] - Max payload [2048]
Aug 23 21:41:26 emonster kernel: [19999.272000] ieee1394: sbp2: aborting sbp2 command
Aug 23 21:41:26 emonster kernel: [19999.272000] scsi 6:0:0:0:
Aug 23 21:41:26 emonster kernel: [19999.272000] command: Inquiry: 12 00 00 00 24 00
Aug 23 21:41:36 emonster kernel: [20009.272000] ieee1394: sbp2: aborting sbp2 command
Aug 23 21:41:36 emonster kernel: [20009.27...


Tommy Trussell (tommy-trussell) wrote :

Here's the information from the PowerBook running Dapper after connecting and mounting the Firewire drive successfully.

twt@pbg3:~ $ uname -a
Linux pbg3 2.6.15-28-powerpc #1 Wed Jul 18 22:51:07 UTC 2007 ppc GNU/Linux
twt@pbg3:~ $ lspci
0000:00:00.0 Host bridge: Motorola MPC106 [Grackle] (rev 40)
0000:00:0d.0 ff00: Apple Computer Inc. Heathrow Mac I/O (rev 01)
0000:00:10.0 ff00: Apple Computer Inc. Heathrow Mac I/O (rev 01)
0000:00:11.0 Display controller: ATI Technologies Inc 3D Rage LT Pro (rev dc)
0000:00:13.0 CardBus bridge: Texas Instruments PCI1131 (rev 01)
0000:00:13.1 CardBus bridge: Texas Instruments PCI1131 (rev 01)
0000:01:00.0 FireWire (IEEE 1394): Texas Instruments TSB12LV23 IEEE-1394 Controller
0000:05:00.0 Ethernet controller: Abocom Systems Inc ADMtek Centaur-C rev 17 [D-Link DFE-680TX] CardBus Fast Ethernet Adapter (rev 11)
twt@pbg3:~ $

twt@pbg3:~ $ tail -30 /var/log/syslog
Aug 23 21:34:25 localhost ntpd[4728]: frequency error 512 PPM exceeds tolerance 500 PPM
Aug 23 21:35:29 localhost ntpd[4728]: synchronized to 82.211.81.145, stratum 2
Aug 23 21:35:33 localhost ntpd[4728]: frequency error 512 PPM exceeds tolerance 500 PPM
Aug 23 21:43:02 localhost ntpd[4728]: frequency error 512 PPM exceeds tolerance 500 PPM
Aug 23 21:47:08 localhost kernel: [824457.052777] ohci1394: fw-host0: SelfID received, but NodeID invalid (probably new bus reset occurred): 0000FFC0
Aug 23 21:47:17 localhost kernel: [824465.487705] ieee1394: The root node is not cycle master capable; selecting a new root node and resetting...
Aug 23 21:47:17 localhost kernel: [824465.763722] ieee1394: Node resumed: ID:BUS[0-00:1023] GUID[001010054000080c]
Aug 23 21:47:17 localhost kernel: [824465.764860] ieee1394: Node changed: 0-00:1023 -> 0-01:1023
Aug 23 21:47:17 localhost kernel: [824465.765858] scsi4 : SCSI emulation for IEEE-1394 SBP-2 Devices
Aug 23 21:47:17 localhost kernel: [824465.768731] ieee1394: sbp2: Node 0-00:1023: Using 36byte inquiry workaround
Aug 23 21:47:18 localhost kernel: [824466.875385] ieee1394: sbp2: Logged into SBP-2 device
Aug 23 21:47:18 localhost kernel: [824466.875563] ieee1394: Node 0-00:1023: Max speed [S400] - Max payload [2048]
Aug 23 21:47:18 localhost kernel: [824466.877595] Vendor: PI-101 Model: 1394/USB20 Drive Rev: 2.65
Aug 23 21:47:18 localhost kernel: [824466.877660] Type: Direct-Access ANSI SCSI revision: 00
Aug 23 21:47:18 localhost kernel: [824466.881570] SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
Aug 23 21:47:18 localhost kernel: [824466.883163] SCSI device sda: drive cache: write back
Aug 23 21:47:18 localhost kernel: [824466.885099] SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
Aug 23 21:47:18 localhost kernel: [824466.886670] SCSI device sda: drive cache: write back
Aug 23 21:47:18 localhost kernel: [824466.886702] sda: sda1 sda3 sda4
Aug 23 21:47:18 localhost kernel: [824466.912607] sd 4:0:0:0: Attached scsi disk sda
Aug 23 21:47:18 localhost kernel: [824466.913917] sd 4:0:0:0: Attached scsi generic sg0 type 0
Aug 23 21:47:21 localhost kernel: [824470.187231] kjournald starting. Commit interval 5 seconds
Aug 23 21:47:22 localhost kernel: [824470.189419...


Tommy Trussell (tommy-trussell) wrote :

One additional comment/question -- I don't know how a "server" installation of Ubuntu would handle a firewire drive. It might be relevant because that's actually how I'm using this one, and it's a bit annoying to have to log in to get whatever process in Gnome kicks off the removable drive detection. (I'm not a newbie but I've not bothered to learn the "proper" way to do it otherwise.)

SO, if ext3 wants to fsck the removable drive occasionally OR finds another problem with the filesystem, should there be a dialog that comes up in Gnome to tell you about it? I will enter that as a wishlist item...

Tommy Trussell (tommy-trussell) wrote :

While trying to get back up and running using my restored data, the NEW drive on Feisty started acting flaky. As I was making a copy of the data (first with cp then with rsync so I could monitor it more easily) the drive kept seeming slower and slower. I saw no relevant messages on dmesg or syslog. HOWEVER, when I cancelled the copy operation and umounted then remounted the drive, syslog showed complaints about filesystem errors and that fsck was needed. OH great.

I think I had better stop using Firewire on this system until I get this figured out. This system seemed reliable until this week.... but it's definitely not good now!

I have not re-mounted the drive, and dmesg shows no hardware errors:

twt@emonster:~$ dmesg | grep ieee
[ 68.110716] ieee1394: Initialized config rom entry `ip1394'
[ 12.276000] ieee1394: Node added: ID:BUS[0-00:1023] GUID[0090a992e0000001]
[ 12.276000] ieee1394: Host added: ID:BUS[0-01:1023] GUID[0090e63900000725]
[ 34.752000] ieee1394: sbp2: Logged into SBP-2 device
[ 34.752000] ieee1394: sbp2: Node 0-00:1023: Max speed [S400] - Max payload [2048]
twt@emonster:~$

Since I'm working from a copy anyway I decided to try fsck...

twt@emonster:~$ fsck /dev/sda2
fsck 1.40-WIP (14-Nov-2006)
e2fsck 1.40-WIP (14-Nov-2006)
mybook2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 7, i_blocks is 128648, should be 136720. Fix<y>? yes

[this was taking TOO LONG -- about an hour -- so I ctrl-c canceled and rebooted]

mybook2: e2fsck canceled.

mybook2: ***** FILE SYSTEM WAS MODIFIED *****

mybook2: ********** WARNING: Filesystem still has errors **********

twt@emonster:~$
Broadcast message from root@emonster
        (unknown) at 23:22 ...

The system is going down for reboot NOW!
Rebooted from gdm menu.
Connection to emonster closed by remote host.
Connection to emonster closed.

twt@pbg3:~ $ ssh -X twt@emonster
Linux emonster 2.6.20-16-generic #2 SMP Thu Jun 7 20:19:32 UTC 2007 i686

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.
Last login: Thu Aug 23 23:10:54 2007 from pbg3.twt-lan.conwaycorp.net

twt@emonster:~$ fsck -Cy /dev/sda2
fsck 1.40-WIP (14-Nov-2006)
e2fsck 1.40-WIP (14-Nov-2006)
mybook2 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix? yes

Inode 532417 was part of the orphaned inode list. FIXED.
Inode 532417 has imagic flag set. Clear? yes

Inode 532417, i_blocks is 33820648, should be 0. Fix? yes

Inode 1064897 was part of the orphaned inode list. FIXED.
Inode 1064897 has imagic flag set. Clear? yes

Inode 1064897 has compression flag set on filesystem without compression support. Clear? yes
Inode 1064897 has illegal block(s). Clear? yes

Illegal block #0 (758133038) in inode 1064897. CLEARED.
Illegal block #1 (1987208563) in inode 1064897. CLEARED.
Illegal block #2 (1882026597) in inode 1064897. CLEARED.
Illegal bloc...

Tommy Trussell (tommy-trussell) wrote :

Another data point -- on June 15, 2007 I had another Firewire drive die on Feisty (30GB laptop drive in a MacAlly PHR-250CC enclosure), but I assumed it was just its time to go. It suddenly went from working perfectly to being totally unresponsive on the same Feisty system, so I assumed the drive died and I set it aside.

Based on my experience today, I just plugged it into the Dapper system and it works! It says I need to run fsck, which I will NOT do until I learn more....

This small drive is formatted with one ext2 partition, and it just contains a bunch of iso files.

Tommy Trussell (tommy-trussell) wrote :

I just mounted the "new" drive (Western Digital MyBook Premium) using the USB port instead of firewire, and it shows no problem. I ran fsck and it said "no errors." So I'm again trying to get back up and running using my data, this time using USB. I'm copying the files now, and it's copying MUCH faster (maybe 50x faster) than it was running with Firewire on Feisty.

Tommy Trussell (tommy-trussell) wrote :

I recently discovered https://wiki.ubuntu.com/DebuggingRemovableDevices so I figured I would add that information to the bug report.

Strangely, TODAY the firewire drive is working fine in Feisty....

But I will add the information as listed with the note

Information from a WORKING test:

twt@emonster:~$ id
uid=1000(twt) gid=1000(twt) groups=4(adm),20(dialout),21(fax),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),104(scanner),112(netdev),113(lpadmin),115(powerdev),117(admin),1000(twt)
twt@emonster:~$ id hal
id: hal: No such user
twt@emonster:~$ id haldaemon
uid=107(haldaemon) gid=114(haldaemon) groups=114(haldaemon),24(cdrom),25(floppy),46(plugdev),115(powerdev)
twt@emonster:~$ uname -a
Linux emonster 2.6.20-16-generic #2 SMP Thu Jun 7 20:19:32 UTC 2007 i686 GNU/Linux
twt@emonster:~$

Tommy Trussell (tommy-trussell) wrote :

Earlier today I used the "flaky" firewire drive and it worked fine. Tonight (hours later) it didn't come up. I logged out and back in but have NOT rebooted linux.

I am attaching information for the same drive when it's FAILING.

twt@emonster:~$ id
uid=1000(twt) gid=1000(twt) groups=4(adm),20(dialout),21(fax),24(cdrom),25(floppy),29(audio),30(dip),44(video),46(plugdev),104(scanner),112(netdev),113(lpadmin),115(powerdev),117(admin),1000(twt)
twt@emonster:~$ id hal
id: hal: No such user
twt@emonster:~$ id haldaemon
uid=107(haldaemon) gid=114(haldaemon) groups=114(haldaemon),24(cdrom),25(floppy),46(plugdev),115(powerdev)
twt@emonster:~$ uname -a
Linux emonster 2.6.20-16-generic #2 SMP Fri Aug 31 00:55:27 UTC 2007 i686 GNU/Linux
twt@emonster:~$

NOTE I'm following the directions at https://wiki.ubuntu.com/DebuggingRemovableDevices but for the FAILING device, the udev log was empty, and there were no /dev/sd* devices, so I'm not attaching those empty files.

As I mentioned, the Ubuntu box had been running all day without a reboot. The Firewire drive has been turned off most of the day. (I used it earlier, umounted the partitions, and turned it off. When I turned it on tonight, it would not mount.)

I just rebooted the Ubuntu box (without powering it down) and when I turned on the Firewire drive all the partitions mounted correctly.

Stefan Richter (stefan-r-ubz) wrote :

So you have one machine with a 2.6.15 based kernel where it works and one with a 2.6.20 based kernel where it is unstable.

What does "cat /sys/module/sbp2/parameters/serialize_io" say? If it is 1, then it's OK.
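The check above can be wrapped in a small script. This is a minimal sketch, assuming the old sbp2 driver is loaded so that the sysfs file named in the comment exists. The function name `check_serialize_io` and its optional path argument are hypothetical, added only so the logic can be tried against an ordinary file.

```shell
#!/bin/sh
# Report whether sbp2's serialize_io parameter is enabled.
# The optional argument overrides the sysfs path (illustrative only).
check_serialize_io() {
    param_file="${1:-/sys/module/sbp2/parameters/serialize_io}"
    if [ ! -r "$param_file" ]; then
        echo "cannot read $param_file (is the sbp2 module loaded?)"
        return 2
    fi
    value=$(cat "$param_file")
    # Depending on how a module parameter is declared,
    # sysfs shows it as 1/0 or as Y/N.
    case "$value" in
        1|Y|y) echo "serialize_io is enabled ($value)"; return 0 ;;
        *)     echo "serialize_io is disabled ($value)"; return 1 ;;
    esac
}
```

Calling `check_serialize_io` with no argument reads the real sysfs file; a return status of 0 corresponds to the "it's OK" case described above.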

Do you use a long cable or a front panel connector or something like that on the unstable PC?

Do live CDs of Ubuntu Feisty and Dapper exist to easily try them out without installing them? Then you could try Feisty on the so far working machine and/or Dapper on the unstable machine.

Stefan Richter (stefan-r-ubz) wrote :

As a side note:

The ext2 and ext3 filesystems can count how often a partition has been mounted and record when it was last checked with fsck, and can enforce an fsck after the partition has been mounted a certain number of times or was last checked more than a certain time ago. This is configured per partition with the tune2fs command; see the -c and -i options in the tune2fs manual page.
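To make those counters concrete: `tune2fs -l` prints them as the "Mount count" and "Maximum mount count" fields, and `-c` / `-i` change the thresholds. The sketch below is illustrative only. `/dev/sda2` is borrowed from the fsck run earlier in this report, the threshold values are arbitrary, and the helper name `mounts_until_check` is hypothetical.

```shell
#!/bin/sh
# Read `tune2fs -l` output on stdin and print how many more mounts are
# allowed before a check is forced.
mounts_until_check() {
    awk -F: '
        /^Mount count:/         { count = $2 + 0 }
        /^Maximum mount count:/ { max = $2 + 0 }
        END {
            if (max <= 0) print "mount-count check disabled"
            else          print max - count
        }'
}

# Typical use (as root):
#   tune2fs -l /dev/sda2 | mounts_until_check
# Changing the thresholds, e.g. check every 30 mounts or every 6 months:
#   tune2fs -c 30 -i 6m /dev/sda2
```

A zero or negative result means the next mount or boot-time check is likely to trigger the "mounted too many times without fsck" warning.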

Of course, like any large-scale filesystem manipulation, an fsck is a very bad idea to run on a system whose hardware or kernel doesn't work reliably.
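One way to act on that advice is to inspect a suspect filesystem without letting fsck write to it. The following is a sketch, not a procedure anyone in this thread prescribed: `e2fsck -n` opens the filesystem read-only and answers "no" to every question. The wrapper name `readonly_check` and its optional mounts-file argument are hypothetical.

```shell
#!/bin/sh
# Run a non-modifying e2fsck, refusing if the device appears mounted.
# $1: block device; $2: mounts table (defaults to /proc/mounts).
readonly_check() {
    dev="$1"
    mounts="${2:-/proc/mounts}"
    if awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$mounts"; then
        echo "$dev is mounted; unmount it first" >&2
        return 1
    fi
    # -n: open read-only and assume "no" to all questions.
    e2fsck -n "$dev"
}

# Typical use (as root):
#   readonly_check /dev/sda2
```

If the read-only pass reports errors on one machine but not on another, that points at the machine rather than the disk, and no data has been rewritten in the meantime.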

Tommy Trussell (tommy-trussell) wrote :

Sorry for the delayed response --

Unfortunately the Dapper system is totally different hardware from the newer system -- it's Dapper 6.06.1 running on a 266 MHz PowerBook G3 and the newer system (since upgraded to Gutsy) is a 550 MHz Pentium PC.

If anything the Firewire cable may be LONGER on the working setup than on the one that fails, but on the Pentium PC I added a generic PCI USB/Firewire card, and on the laptop I use a PCMCIA Firewire card made by IBM. Not certain of either chipset.

I like the suggestion of using live CDs to test. Unfortunately that won't work on the laptop, but if I can scrounge together the necessary parts I will attempt to set up an Intel PC that I can try different versions of Ubuntu and see if I can replicate the issue. I've held back one of the problematic drives so I'm not risking live data.

P.S.: I gather very few people use FireWire hard drives on linux... if I can work up a test case, what would happen next?

Stefan Richter (stefan-r-ubz) wrote :

The test with different OSs would serve to find out whether the problem is influenced by the kernel or by the hardware (or both). The different types of failures that your logs show point towards flaky hardware. If so, then it could be the card, the cable, or the combination of PC (with all its internal noise) and cable and disk.
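When comparing runs under different kernels or live CDs, it can help to count the failure signatures rather than eyeball the logs. This is a minimal sketch, assuming the syslog format shown earlier in this report. The function name `count_sbp2_errors` is hypothetical, and the two patterns are simply the messages visible in the Feisty log above.

```shell
#!/bin/sh
# Count ieee1394/sbp2 failure messages in a syslog stream read from stdin.
# Note: grep -c prints 0 and returns status 1 when nothing matches.
count_sbp2_errors() {
    grep -c -e 'sbp2: aborting sbp2 command' \
            -e 'ieee1394: Error parsing configrom'
}

# Typical use:
#   count_sbp2_errors < /var/log/syslog
```

Running it after the same plug-in and copy sequence on each setup gives a rough number to compare between kernel and hardware combinations.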

BTW, I read that one large harddisk vendor sold series of FireWire disks with defective cables. I don't remember whether it was WD or Seagate or Maxtor.

Tommy Trussell (tommy-trussell) wrote :

I have a Western Digital MyBook Premium that I bought at Target. The included cable destroyed the Firewire ports, and they sent me a new cable in the mail, so I exchanged the original drive at the store and have used only the cable they mailed me with that drive. My failures have occurred on OTHER enclosures (two newish MacAlly ones and a generic Oxford 911 one I bought at OtherWorld Computing years ago). After having the bizarre flakiness I have avoided using Firewire at all... all my new enclosures have high-speed USB which so far has been acceptable. Some of the old enclosures could very well have buggy firmware or something, but I was shocked to discover they would work great for a while, then fail dramatically on Feisty (looking very much dead), yet completely resurrect on the Dapper setup.

I have some business issues to attend to but soon I will try to establish a test setup so I can rule out the hardware. Thank you for your responses!

Hi Tommy,

If and when you get a chance to test, could you verify whether this is still an issue with the latest Hardy Heron 8.04 LTS release (http://www.ubuntu.com/download)? Please let us know your results. Thanks.

Changed in linux:
status: New → Incomplete

The Ubuntu Kernel Team is planning to move to the 2.6.27 kernel for the upcoming Intrepid Ibex 8.10 release. As a result, the kernel team would appreciate it if you could please test this newer 2.6.27 Ubuntu kernel. There are two ways you should be able to test:

1) If you are comfortable installing packages on your own, the linux-image-2.6.27-* package is currently available for you to install and test.

--or--

2) The upcoming Alpha5 for Intrepid Ibex 8.10 will contain this newer 2.6.27 Ubuntu kernel. Alpha5 is set to be released Thursday Sept 4. Please watch http://www.ubuntu.com/testing for Alpha5 to be announced. You should then be able to test via a LiveCD.

Please let us know immediately if this newer 2.6.27 kernel resolves the bug reported here or if the issue remains. More importantly, please open a new bug report for each new bug/regression introduced by the 2.6.27 kernel and tag the bug report with 'linux-2.6.27'. Also, please specifically note if the issue does or does not appear in the 2.6.26 kernel. Thanks again, we really appreciate your help and feedback.

Michele Mangili (mangilimic) wrote :

We are closing this bug report because it lacks the information we need to investigate the problem, as described in the previous comments. Please reopen it if you can give us the missing information, and don't hesitate to submit bug reports in the future. To reopen the bug report you can click on the current status, under the Status column, and change the Status back to "New". Thanks again!

Changed in linux:
status: Incomplete → Invalid