Bug #430333 “beta installer left ASUS EeePC 900 unbootable” : Bugs : grub2 package : Ubuntu

Revision history for this message

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#1

I'm learning grub's CLI now. Getting some cues from grub bug (DebianBug#430333) for things to try...

rescue:grub> probe -f hd0,1
ext2

[shouldn't that say ext4?]

probe -d hd0,1
biosdisk

probe -p hd0,1
part_msdos

probe -u hd0,1
e88f90e1-950e-f0f6-947a-4baa97945122

----SO it's reporting ext2 instead of ext4, and acting as if it cannot read anything on the disk.
Booting again from the USB stick... going into parted:

$ sudo parted /dev/sda
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print

Model: ATA Patriot Memory 3 (scsi)
Disk /dev/sda: 32.3GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start   End     Size    Type      File system      Flags
 1      32.3kB  30.9GB  30.9GB  primary   ext4             boot
 2      30.9GB  32.3GB  1374MB  extended
 5      30.9GB  32.3GB  1374MB  logical   linux-swap(new)

(parted)

The partition table looks OK to me, but what do I know? I don't know why it starts at 32.3kB, and I don't know what (new) means after linux-swap. But it does say the first partition is ext4, and Ubuntu NBR live (from the USB stick) can mount the partition (after about a 15-second delay).
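
On the 32.3kB question: that figure is consistent with the traditional MS-DOS partition layout, where the first partition begins at sector 63. A quick check of the arithmetic, assuming the 512-byte logical sectors parted reports above and assuming the start sector really is 63 (running `unit s print` inside parted would show the actual value):

```shell
sector_size=512      # logical sector size reported by parted
start_sector=63      # assumption: classic DOS-style offset of the first partition
start_bytes=$((sector_size * start_sector))

# parted prints decimal units, so kB here means 1000 bytes
start_kb=$(awk -v b="$start_bytes" 'BEGIN { printf "%.1f", b / 1000 }')
echo "$start_bytes bytes = ${start_kb}kB"
```

Under those assumptions this prints `32256 bytes = 32.3kB`, matching the Start column.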

Is this a grub issue? Maybe it doesn't have a driver it needs to read this sort of drive as ext4?

Let me know if there's anything else needed to debug this. Fortunately it's not my primary machine so I'll leave it broken for a few days.

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#2

I just found the Karmic Koala Testing forum at http://ubuntuforums.org/forumdisplay.php?f=359 and it looks like there's lots of good info there.

Colin Watson (cjwatson) wrote on 2009-09-16: Re: [Bug 430333] Re: beta installer left ASUS EeePC 900 unbootable

#3

It's OK for it to say ext2 - GRUB's ext2 module handles all of ext[234].
You're not the first person to report problems with large ext4
filesystems, though (note that I couldn't reproduce this with a sample
small filesystem), and it may be an overflow of some kind within ext2.c;
this is definitely on our list to try to track down soon.
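
To illustrate why a tool can report "ext2" for an ext4 partition: ext2, ext3 and ext4 share
the same superblock magic number (0xEF53) and differ only in feature flags. The sketch below
is not GRUB's code; it is a rough heuristic with hypothetical helper names (read_le,
ext_flavour), run against a scratch image that it builds itself. On a real system,
`dumpe2fs -h /dev/sda1` reports the feature list authoritatively.

```shell
# Read an unsigned little-endian integer of NBYTES (at most 4) at OFFSET in FILE.
read_le() {
    file=$1 offset=$2 nbytes=$3
    # od prints each byte as a decimal number; combine them low byte first.
    set -- $(dd if="$file" bs=1 skip="$offset" count="$nbytes" 2>/dev/null | od -An -tu1)
    value=0 shift_by=0
    for byte in "$@"; do
        value=$((value + (byte << shift_by)))
        shift_by=$((shift_by + 8))
    done
    echo "$value"
}

# Heuristic: classify by superblock magic and two well-known feature bits.
# The superblock starts 1024 bytes into the partition.
ext_flavour() {
    image=$1
    magic=$(read_le "$image" $((1024 + 56)) 2)        # s_magic
    [ "$magic" -eq $((0xEF53)) ] || { echo "not-ext"; return; }
    compat=$(read_le "$image" $((1024 + 92)) 4)       # s_feature_compat
    incompat=$(read_le "$image" $((1024 + 96)) 4)     # s_feature_incompat
    if [ $((incompat & 0x40)) -ne 0 ]; then
        echo ext4                                     # extents feature set
    elif [ $((compat & 0x4)) -ne 0 ]; then
        echo ext3                                     # has_journal feature set
    else
        echo ext2
    fi
}

# Demo on a 2 KiB scratch file (not a real filesystem, only the fields we read).
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1024 count=2 2>/dev/null
printf '\123\357' | dd of="$img" bs=1 seek=1080 conv=notrunc 2>/dev/null   # magic 0xEF53
plain=$(ext_flavour "$img")
printf '\004' | dd of="$img" bs=1 seek=1116 conv=notrunc 2>/dev/null       # has_journal
journaled=$(ext_flavour "$img")
printf '\100' | dd of="$img" bs=1 seek=1120 conv=notrunc 2>/dev/null       # extents
with_extents=$(ext_flavour "$img")
echo "$plain $journaled $with_extents"
rm -f "$img"
```

The demo prints `ext2 ext3 ext4`: the magic number never changes, only the flags do.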

Can you tell us anything about how this filesystem was created? For
example, was it created from scratch as ext4, or upgraded from ext3?

Changed in grub2 (Ubuntu):
importance:	Undecided → High
Changed in ubiquity:
status:	New → Invalid
Changed in netbook-remix:
status:	New → Invalid

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#4

I just tried to fsck /dev/sda1, and it's reporting lots of errors. LOTS of errors, including missing inodes, etc. I let fsck fix stuff, then tried again. Now grub is completely hosed -- error: invalid extent.

root says "Unknown command 'root'"
help says "Unknown command 'help'"

SO that was the problem.

I will try the install again. If it still fails, I will try the latest daily iso just in case there was some random weirdness in the http://cdimage.ubuntu.com/ubuntu-netbook-remix/daily-live/20090914.1/ image.

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#5

Colin: To create the filesystem, I booted from the cdimage listed above on a USB stick, ran the Install Ubuntu option from the boot menu, and chose to wipe out the existing Ubuntu 9.04 installation with the new 9.10 installation.

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#6

Colin: I will try the reinstall again and put 10GB or less on the root install. It doesn't seem like 32GB should be considered "large" for ext4 but maybe it is too big for this SSD disk??

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#7

I chose "manual partition" & created a 7000 MB (which became 7007 MB) root partition and kept the existing 1373MB swap partition.

I don't understand why, but the manual partitioner says it formatted, fsck'ed, THEN it says it resized (!!) the ext4 partition, and now DRAT! it looks like it filled the empty space??? SO I clicked create a NEW partition table, but the screen didn't change, so I hit the BACK button and HMMM... now it shows the smaller 6.5GB partition there, with the (expected) 22.3GB free space in the middle. So this partitioner has some issues with manual partitioning!

I'm going back through the manual partitioning, and I'm not going to format the partitions this time, because they look right. Now the partitioner is NOT showing the mount points, even if I specify them. It's also being coy about whether I'm choosing to format or not. Hard to tell what's going on. So I turned the "format this partition" checkbox back on and am proceeding.

On the Ready to Install panel, I'm clicking the Advanced... button. It does show Install Boot Loader and It also apparently remembers I had popcon turned on before, because it's still on. (The preference must be stored on the USB stick somewhere.)

I'm starting the installation now. This will take awhile, because this flash SSD is sloooow (just like the 4GB oem SSD, just larger). I'll file an update after it finishes.

Colin Watson (cjwatson) wrote on 2009-09-16:

#8

I don't know exactly where the problem is right now so I'm unable to
advise on what limits are relevant. Indeed, it might not even be a limit
at all, but some other thing that didn't turn up in my tests ...

While I respect that you might need the computer to work, it's actually
a shame that you're getting rid of the evidence of this problem before
we have a chance to debug it properly. :-( If you can still reproduce
it, perhaps you could hop onto #ubuntu-installer on irc.freenode.net at
some time during European working hours and I may be able to have a look
at it with you?

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#9

Sorry I destroyed the evidence.

The fresh install with the smaller partition sizes DID work this time. HOWEVER it was strange... it took a long time to boot up, then after the desktop appeared GNOME complained about some non-functioning applets. I told it not to remove them, though it took numerous attempts to get it to stop asking. I used ctrl-alt-delete to get the shutdown/reboot menu. I chose reboot, and it seemed to boot itself twice, the second time apparently good to go. Everything looks OK now.

A restart -- and everything still OK (except for some cosmetic issues). A fsck didn't show anything significant.

I need to run now, but I will test this install a little while to see what does and doesn't work, then again try the full install all over again.

I will translate European (UTC?) working hours to my local UTC-0500 time and get back to you via irc if necessary.

Tommy Trussell (tommy-trussell) wrote on 2009-09-16:

#10

OK -- the bad news is I got the rescue:grub> prompt on the reinstall. The good news is the bug seems to be repeatable!

If I don't get to you on irc, let me know what to look for. I will leave it alone this time!

Tommy Trussell (tommy-trussell) wrote on 2009-09-18:

#11

[looked at several things via irc yesterday with cjwatson at #ubuntu-installer on irc.freenode.net. The freshly-installed filesystem is trashed in a very strange way.]

I won't be available on irc very much today, but let me know if I should try something else.

Maybe I could create an iso of the trashed 32GB filesystem for further analysis? I think I can plug in a big external drive and do that, then compress it and upload it somewhere.

I could also try some other things like installing to a 31GB partition (in case the partitioner & formatter are getting the size wrong and going haywire when writing into nothingness).

OH and if you think I should try a newer daily build (since I know there were some fixes to the manual partitioner) let me know.

Colin Watson (cjwatson) wrote on 2009-09-18:

#12

On Fri, Sep 18, 2009 at 01:57:18PM -0000, Tommy Trussell wrote:
> Maybe I could create an iso of the trashed 32GB filesystem for further
> analysis? I think I can plug in a big external drive and do that, then
> compress it and upload it somewhere.

I'm not sure where I'd put it :-), so there might be some technical
problems for me, but if the compressed size is vaguely tolerable then
that would be brilliant. I only have sub-1Mbit/s download,
unfortunately.

> OH and if you think I should try a newer daily build (since I know there
> were some fixes to the manual partitioner) let me know.

It's possible that that might make a difference, yes.

Tommy Trussell (tommy-trussell) wrote on 2009-09-18:

#13

I finally figured out how to get dd to read the partition:

$ dd if=/dev/sda1 of=/media/label/asus.20090918.iso conv=noerror,sync

Lots of errors scrolled by (not captured). Should I have directed the errors to a file?

The last error (so far; still copying) was at 6143488 bytes (6.1MB). Once this is done (abt 4 hours if dd was right about 2.2 MB/s) I will see how well it compresses.

If such a file is not useful please don't be shy about telling me... maybe I should use a tool made for this purpose like dd_rescue or GNU ddrescue.
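
On capturing the errors: dd writes its diagnostics and its final statistics to stderr, so redirecting stderr to a file should keep them. A minimal sketch, using scratch files as stand-ins for /dev/sda1 and the image on the external drive (the variable names are made up for the example):

```shell
src=$(mktemp)      # stand-in for /dev/sda1
img=$(mktemp)      # stand-in for the image file on the external drive
log=$(mktemp)      # dd's messages land here instead of scrolling past
dd if=/dev/urandom of="$src" bs=4096 count=16 2>/dev/null

# noerror: keep going after a read error
# sync:    pad each short or failed input block with NULs, so data that
#          follows a bad block keeps its original offset in the image
dd if="$src" of="$img" bs=4096 conv=noerror,sync 2>"$log"

src_size=$(wc -c < "$src" | tr -d ' ')
img_size=$(wc -c < "$img" | tr -d ' ')
log_lines=$(wc -l < "$log" | tr -d ' ')
echo "source: $src_size bytes, image: $img_size bytes, log lines: $log_lines"
cat "$log"
rm -f "$src" "$img" "$log"
```

Because of the NUL padding, the image comes out the same size as the source even when blocks fail to read, so the image size alone does not say how many blocks were bad; the log does.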

Tommy Trussell (tommy-trussell) wrote on 2009-09-19:

#14

The file I created with dd was only 30902344704 bytes ("28.8 GB" according to Nautilus), so the bad blocks were probably just dropped. It did compress down to "3.6 GB". But I'm guessing the interesting stuff is probably not in there.
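
A quick sanity check on those numbers, using only the byte count quoted above. With conv=noerror,sync, dd pads unreadable blocks with NULs rather than dropping them, so a full-size image is what one would expect either way:

```shell
image_bytes=30902344704                 # size of the dd image, from the comment above
sectors=$((image_bytes / 512))
remainder=$((image_bytes % 512))

# Binary units (1024^3 bytes) versus decimal units (10^9 bytes)
gib=$(awk -v b="$image_bytes" 'BEGIN { printf "%.1f", b / (1024 * 1024 * 1024) }')
gb=$(awk -v b="$image_bytes" 'BEGIN { printf "%.1f", b / 1e9 }')
echo "$sectors sectors (remainder $remainder), $gib GiB, $gb GB"
```

This prints `60356142 sectors (remainder 0), 28.8 GiB, 30.9 GB`. So the Nautilus "28.8 GB" figure is in binary units, and the image is a whole number of 512-byte sectors whose decimal size matches the 30.9GB parted shows for the partition.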

Tomorrow I will try today's daily build and see if the filesystem is good.

Tommy Trussell (tommy-trussell) wrote on 2009-09-19:

#15

Sorry to clutter the bug report with my notes, but after lots of research I have decided the utility that MIGHT have worked to create an image of the trashed filesystem is the "forensics" version of dd called dcfldd (available in universe repo).

I'm installing the 20090918 daily build now.

Tommy Trussell (tommy-trussell) wrote on 2009-09-20:

#16

I installed the 20090917 daily build. First the bad news:

When I installed using the full drive, the installation failed, though there was a difference -- at the end of the installation sequence, it put up a dialog saying
"Executing upgrade-grub failed. This is a fatal error."
Sure enough, I got the rescue:grub> prompt.

Here's the partition map for what failed (no changes to suggested map):
/dev/sda1 30902 MB (28.8 GB) ext4 /
/dev/sda5 1373 MB (1.3 GB) swap

SO, on the second attempt, I used manual partitioning.

/dev/sda1 30000 MB ext4 /
/dev/sda5 1500 MB swap
free space 773 MB

*** why when you use the manual partitioner, does it take time to create the filesystem, check the filesystem, RESIZE the filesystem (??), and THEN when the installer starts up it says it's CREATING the filesystem AGAIN??? ***

At the end of this install it reported success. Then after clicking the restart button, there was a strange loop where it switched between the desktop and the console frequently -- X seemed to be stopping and respawning -- maybe 20 times. Then finally whatever it was trying to kill off died and it restarted. Then the first boot was strange, too, with an unusually long blank screen while the SSD had its busy light on.

Subsequent boots have been more normal.

I have not yet run fsck to check this filesystem.

Tommy Trussell (tommy-trussell) wrote on 2009-10-22:

#17

update: I downloaded and installed the 20091020.2 image http://cdimage.ubuntu.com/ubuntu-netbook-remix/daily-live/20091020.2/ -- I chose to use the entire disk (which failed so badly before) but this time the installation completed normally. After the first boot, there are lots of GNOME warnings (four applets are broken, and the gvfs-gdu-volume-monitor closed unexpectedly.) a sudo fsck -fn showed a few problems with the filesystem:

aem@eee-aem:~$ sudo fsck -fn
fsck from util-linux-ng 2.16
e2fsck 1.41.9 (22-Aug-2009)
Warning! /dev/sda1 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 110 has zero dtime. Fix? no

Inodes that were part of a corrupted orphan linked list found. Fix? no

Inode 1120 was part of the orphaned inode list. IGNORED.
Inode 1121 was part of the orphaned inode list. IGNORED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (6919396, counted=6919402).
Fix? no

Inode bitmap differences: -110 -(1120--1121) -1126
Fix? no

Free inodes count wrong for group #0 (1195, counted=1194).
Fix? no

Free inodes count wrong (1762464, counted=1762463).
Fix? no

/dev/sda1: ********** WARNING: Filesystem still has errors **********

/dev/sda1: 126192/1888656 files (0.1% non-contiguous), 625121/7544517 blocks
aem@eee-aem:~$

Tommy Trussell (tommy-trussell) wrote on 2009-11-06:

#18

I am wondering if this is actually bug 459839 (which has been flagged as a duplicate of bug 453579 ). I'm installing onto an SSD and these reports mention some sort of SSD-related bug with ext4.

Soon I will try an "expert" install and see if I can stop it after the creation of the filesystem (before the files get copied) and try to fsck the filesystem exactly as partman has created it.

Tommy Trussell (tommy-trussell) wrote on 2009-11-07:

#19

OK now I've installed the release version several times. http://releases.ubuntu.com/9.10/ubuntu-9.10-netbook-remix-i386.iso ...except of course I used the torrent link. ;-)

The first time I installed the release version I saw errors running fsck on the new root partition. There were broken applets and such, too, corroborating the brokenness.

SO this time I killed the installer after it was through formatting the root partition. I booted again from the install image, and ran fsck and it showed clean (on the mostly empty file system).

SO I restarted the installation and let it complete, and ran fsck before the first boot -- it was clean. Then I ran fsck again after the first boot and it is still completely clean.

So now this has gone from being completely reproduce-able to mostly NOT reproduce-able.

I may try one more install (letting it go through the entire process completely uninterrupted) but apparently some tweak to the code has made this much better.... I cannot explain the first bad install...

Dominik George (natureshadow) wrote on 2009-11-30:

#20

@Tommy:

So, is this bug fixed for you?

Tommy Trussell (tommy-trussell) wrote on 2009-11-30:

#21

I have been using the release for an hour or so daily for several weeks. I believe this problem is much less noticeable but not completely fixed, and probably will blow up worse again with time. This past weekend I noticed Evince wouldn't open. I cannot remember the last time I used Evince on this machine, but a critical .so file was corrupted, and a reinstall of the package made it work again. fsck has shown nothing, but I suspect if I went through and tried to read all the files I would see additional corruption.

Today I was looking around and found a discussion at http://forum.eeeuser.com/viewtopic.php?id=78939 which pointed me to Bug 387272, which is marked as a duplicate of Bug 445852 -- and the erratic behavior of the SSD sounds like what I am seeing. I'm guessing that not only does it occasionally freeze, but it's also corrupting the data. For some reason the early installers triggered the bug more than the later ones, but the symptoms persist.

BY THE WAY I have tried installing ureadahead as described at http://undacuvabrutha.wordpress.com/2009/11/09/still-not-happy-with-the-speed-of-your-boot-in-9-10/ but it made no difference in my boot time. SO whatever problems ureadahead addresses it is NOT an improvement for the kind of upgraded SSD I have in my EeePC 900.

Dominik George (natureshadow) wrote on 2009-11-30:

#22

I think this might be a Linux bug, i.e., not specific to Ubuntu.

As this is a UNIX-style application, no userland application should contain code that can produce the corruption you describe.

This said, the component most likely being responsible is the libscsi driver module doing the real disk operation.

Here are some things I'd propose:

- Test behaviour with default BIOS options
- Test behaviour with other distro
- Burn-in test, that is, writing defined patterns to the drive and check the results

Please remember that there is still a slight chance of a hardware failure. A module with a defect like this is very unlikely to enter the Linux stable branch.

Tommy Trussell (tommy-trussell) wrote on 2009-12-01:

#23

@Dominik: Thanks for the suggestions. I know an expert could learn more about it with some write and read tests, especially using the earlier beta releases where the installation failed every time. I don't think I saw it at any of the links in my comment above, but somewhere I was reading that people were speculating about a bug in the SSD firmware that's getting exposed with the newer driver.

HOWEVER I am a little hesitant to write TOO much to this drive -- it's flash-based, and when I recently looked at its SMART status, it claimed to be at 20% of its rated life already (and I've owned the drive for less than a year). In fact, something about the way Karmic runs -- it very frequently pauses for tens of seconds with the drive busy light on solid, especially when it has resumed from a suspend. I don't know if it's updating atimes, the ext4 journal, or what. Maybe something is wearing out the drive.

My next step will be to try another standard install of Karmic 9.10 NBR, this time choosing to use ext3, just to see if there's a difference from the default ext4.

If its behavior still seems worrisome I will probably revert to Jaunty 9.04, which was faster booting and reasonably good once the updates get installed.

Tommy Trussell (tommy-trussell) wrote on 2009-12-01:

#24

I just completed a fresh install using Ubuntu Karmic 9.10 NBR release version, except I chose to format the root partition as ext3 instead of ext4. On reboot, I get the rescue:grub> prompt.

I imagine the filesystem is as hosed as before, but I'll leave it a couple of days in case someone wants to suggest some things to look at.

Tommy Trussell (tommy-trussell) wrote on 2009-12-01:

#25

@Dominik: If you have a recommended distro that is known to work well on an ASUS netbook, I can certainly try it. All the ones I have tried on it so far (other than its stock Xandros) have been Ubuntu-based. (I started with the one now called Easy-Peasy, and also Eeebuntu, and maybe one other I'm not remembering.) It has been a long time since I've installed stock Debian but it doesn't scare me to try it, for example.

I'm also reminded that the filesystem corruption seems to be worst when I accept the Ubuntu partitioner defaults (fill my 32gig drive with root and swap). I don't know how/why that matters but in this case I suspect that if I again decrease the size of the root partition a bit, I would be able to get the installation to complete with ext3. I don't know if it's worth trying ext2 or Reiser.

I suppose I could also remove the upgraded drive and reinstall the original stock ASUS 4gig flash SSD drive and test it. I wouldn't mind so much if I destroy IT in the name of science. ;-)

Tommy Trussell (tommy-trussell) wrote on 2009-12-01:

#26

@Dominik: OH and I don't know what "Default BIOS options" you are suggesting -- there aren't many I have changed on this unit. It has been a few weeks since I have reverted to Ubuntu Jaunty 9.04 NBR, but I fully expect it to work as it did before. (Boots quickly; No noticeable corruption; Most hardware features work well.)

Tommy Trussell (tommy-trussell) wrote on 2009-12-02:

#27

I just replaced the 32gig upgrade SSD with the original 4gig SSD and ... WOW. It works great (so far)....

Much faster boot. No noticeable corruption. No long pauses with the disk activity light on. Given the good performance I don't anticipate finding any corruption, but I will try a few things to see if I can detect any.

I will contact the manufacturer of the upgrade SSD to see if they can/will acknowledge any reported bugs. (It's a Patriot Lite SSD, 32GB model PL32GPEPCSSDR -- the box and manual say it supports Windows XP and "Linux," which may mean only the stock Xandros.)

Dominik George (natureshadow) wrote on 2009-12-02:

#28

@Tommy

I suggest asking on one of the many forums whether there is anybody with
the same model and have him or her try to reproduce the error.

We can not consider this bug confirmed if a hardware failure is still
the most likely cause.

Dominik George (natureshadow) wrote on 2009-12-02:

#29

I am marking this as invalid for grub 2 as it is either a Linux bug or a hardware failure.

We can not consider this bug confirmed for any package as long as a hardware failure is still the most likely cause of the problem.

I suggest finding someone on one of the many forums who has the exact same drive model and have him or her reproduce the error.

Changed in grub2 (Ubuntu):
status:	New → Invalid

shadowblast101 (shadowblast101) wrote on 2009-12-02:

#30

I have the exact same Patriot SSD in my AsusEEE 900, and have basically the exact same corruption issues.

I'm running vanilla Ubuntu 9.10 right now, and the system will last for about a month before something breaks. Before I had Kubuntu 9.04, and never had a problem. I updated to 9.10 and the system ran fine for about a month, then gave me the grub error. After dicking around trying to fix it, I just reinstalled straight Ubuntu 9.10 on an ext4 partition, using default settings. System worked fine for about two-three months, then had the exact same corruption. Reinstalled with the newest Ubuntu 9.10 iso, and a month later, it's broken again. Except this time I can't even access the drive from a live CD.

I'm on the live CD right now. I can't run a FSCK, because it reports:
--
e2fsck 1.41.9 (22-Aug-2009)
fsck.ext2: Attempt to read block from filesystem resulted in short read while trying to open /dev/sda1
Could this be a zero-length partition?
--

GParted reports this: http://i50.tinypic.com/10omkc0.jpg
Palimpsest Disk Utility reports this: http://i50.tinypic.com/dzugli.jpg
Palimpsest Disk Utility SMART reports this: http://i50.tinypic.com/2vklpia.jpg

(Ignore the attachment, this is my first time using the bugs page.)

Dominik George (natureshadow) wrote on 2009-12-03:

#31

From what we found out, this is a kernel bug. To report it upstream, it should still be tested whether it also resides in other distributions (preferably with a vanilla kernel).

I'd suggest to test Debian first, then perhaps Gentoo with a minimalistic set of kernel modules.

Changed in linux (Ubuntu):
status:	New → Confirmed

shadowblast101 (shadowblast101) wrote on 2009-12-03:

#32

After attempting to install Arch, Debian, and Windows XP, I can safely say that my SSD is entirely corrupted to the point where all of the installations failed. I don't know if it's bad hardware or what, but I can no longer use the SSD for anything.

Tommy Trussell (tommy-trussell) wrote on 2009-12-03:

#33

@shadowblast101: As per the eeeuser thread mentioned above, have you tried "zeroing" out the SSD?

Boot from a Live distro on a USB key or SD card and issue the following command at a terminal prompt:

dd if=/dev/zero of=/dev/sda bs=1M

(Oh, and it might make sense to be absolutely sure your SSD is /dev/sda before you issue the command-- this will wipe everything on whatever device is at /dev/sda.)

Dominik George (natureshadow) wrote on 2009-12-04:

#34

I would suggest to overwrite the device with a random, but defined pattern instead of all zeroes (like DEADBEEF or something :P). This way, you can verify the integrity afterwards. Zero bytes can be produced by accident, DEADBEEF can't. So you will then see where the drive fails, i.e., whether the defined pattern is really written on the device.

As far as the trouble with Arch, Debian and Windoze is concerned, I am pretty sure that no code in Linux is capable of producing a failure like this if it wasn't for (at least another) bug in the drive firmware or hardware.

Please verify that the issue does not occur if Windows (or another distribution) is installed on an entirely drive.

Tommy Trussell (tommy-trussell) wrote on 2009-12-04:

#35

My comment to shadowblast101 was so he might "rescue" his drive by writing zeroes to it (so the regular utilities won't barf when they look at it)...

I looked into how to use dd to write different patterns -- the trick is to pipe the output of /dev/zero through /usr/bin/tr to convert to whatever pattern you want. But surely you then have to go back and read what was written and compare and I haven't learned how to do that.
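
For the read-back-and-compare step, one approach is to regenerate the same pattern stream and let cmp compare it against what is on the target. A minimal sketch, run against a scratch file rather than a real disk; the pattern byte 0xA5, the size and the helper name pattern_stream are arbitrary choices for the example:

```shell
target=$(mktemp)            # stand-in for the device under test; NOT a real disk
size=$((1024 * 1024))       # number of bytes to write and verify

# Emit SIZE bytes of the pattern byte 0xA5 (octal 245) by piping /dev/zero through tr.
pattern_stream() {
    LC_ALL=C tr '\000' '\245' < /dev/zero | head -c "$1"
}

# Write pass
pattern_stream "$size" > "$target"

# Verify pass: regenerate the same stream and compare byte for byte.
if pattern_stream "$size" | cmp -s - "$target"; then
    first_check=ok
else
    first_check=mismatch
fi

# Simulate one corrupted byte at offset 4096, then verify again.
printf 'X' | dd of="$target" bs=1 seek=4096 conv=notrunc 2>/dev/null
if pattern_stream "$size" | cmp -s - "$target"; then
    second_check=ok
else
    second_check=mismatch
fi

echo "clean: $first_check, after corruption: $second_check"
rm -f "$target"
```

This prints `clean: ok, after corruption: mismatch`. Dropping the -s from cmp makes it report the position of the first differing byte. Pointing this at a real device would destroy its contents, and reads may be served from the page cache unless the cache is bypassed, so treat it as an outline only.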

For those of us who aren't coders, maybe we should use the badblocks utility?

I will see if I can try some different kernels when booting from a USB stick or SD card -- it seems like that might be the easiest strategy to narrow down which kernel introduces the troublesome code. For that purpose I think I might start with Debian...

shadowblast101 (shadowblast101) wrote on 2009-12-04:

#36

Hey, thanks. I zeroed the drive, and was able to reformat it afterwards. I'll try a couple of different distros as well. Starting with Arch. (Mainly to see what it's like to have to stick your hands into your system files.) I have some basic linux/coding skill, but nothing much beyond being able to follow directions and basic Java.

Should we try some ubuntu derivatives like Mint to see if the bug carries over?

Tommy Trussell (tommy-trussell) wrote on 2009-12-04:

#37

Rather than going "downstream" from Ubuntu, I think I will go "upstream" to Debian. I was thinking I would create a boot volume on a USB stick or SD card -- enough to at least get me a terminal prompt and networking. If the current working hypothesis is correct, some revision in kernel 2.6.31 (or earlier) caused a serious regression. I believe Jaunty's released kernel was 2.6.28, so the problem should be in between somewhere.

It might be useful to look at Mint or some other distros, but keep track of the kernel version too.

shadowblast101 (shadowblast101) wrote on 2009-12-07:

#38

I finally got Arch up and running with Gnome on the SSD, and so far there's no problems. It's kernel version 2.6.31-ARCH. It may take some time before the bug propagates again on my machine as it was stable for about a month before eating itself.

Tommy Trussell (tommy-trussell) wrote on 2009-12-08:

#39

and I finally got Debian up and running on an SD card -- no desktop, just the terminal. What a PAIN. I'm hoping to start testing with badblocks soon. I just hope I can get the SSD to fail in a way I can test it.

I suspect this may be the same underlying bug as http://bugzilla.kernel.org/show_bug.cgi?id=14583 as mentioned in Bug 445852.

Tommy Trussell (tommy-trussell) wrote on 2009-12-14:

#40

I have finally seen a corrupted block after several hours of activity using badblocks. Unfortunately, the corrupted block wasn't in the place I was TRYING to make one. :-(

A comment: it's blasted hard to make a bootable USB or SD card using the ASUS EEEpc alone. Grub apparently has a bug where it writes everything to the SSD regardless of where you specify it, AND /dev/ sometimes populates the removable devices differently when you have different devices plugged in, or when you have the installer image running, or the phase of the moon, or something. :-P

OK... back to my progress (or lack thereof):

There is a page describing how to make a bootable image on an SD card in Debian 5.0.3 "lenny," so I was able to boot and test two kernels: 2.6.26-2-686, and 2.6.30-bpo.2-686 (2.6.30 backported to Lenny via backports.org). Running badblocks I was able to see the filesystem damage from earlier Ubuntu Karmic installations, HOWEVER, once I "zeroed" out the SSD using dd, the device stayed "clean" through several rounds of badblocks tests. (I did limit myself to five minute runs of the write tests -- in my experience under Karmic installations I felt like I should see the problem by then.)

I was not able to make a grub-bootable SD or USB image to test different Ubuntu kernels, but I have the Ubuntu 9.10 "Karmic" NBR live image, and I created a Kubuntu 9.10 "Karmic" netbook image, too, but of course it seemed pretty similar. I don't tend to see the problem when booting from the SD card. The Karmic kernel I used is 2.6.31-14-generic.

I also used the Ubuntu 9.10 "Karmic" Alternate installer to create a bootable partition on the SSD card. I booted from it, and used badblocks to exercise the other partition I created. Several times, and in several different ways. I NEVER saw badblocks CAUSE any problems on the test partition.

HOWEVER, after all that, the INSTALLED Karmic OS partition developed a bad block at sector 110655. That block is found by badblocks no matter what OS I boot from, though fsck -f does not see it (so it doesn't happen to be one of the nodes fsck looks at). It causes kernel error messages (as described in Bug 445852 ) and parted takes a long time to come up when I try to look at the partition table.

SO I think I can say that JUST writing random patterns using badblocks doesn't make the corruption happen, or at least not quickly. Specifically I used:

# badblocks -sn /dev/sda

After the problem block develops, it's visible using the read-only test:

# badblocks -s /dev/sda

I haven't let badblocks "churn" away on the SSD all day. I would like to come up with something that elicits the filesystem damage almost immediately, like I was seeing with the beta NBR installers. I'm starting to think some other process has to be running at the same time to trigger the corruption.

Maybe it would be enough to read some other place on the SSD at the same time badblocks is reading and writing its random patterns. Maybe another instance of badblocks running in another VT, or something more clever.
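
The concurrent-access idea could be outlined roughly as follows: one background loop reads one region while the foreground does write-then-verify passes on another. This is only a sketch of the structure, run on a scratch file with made-up sizes and names; on a real device the reads would mostly hit the page cache unless it is bypassed, and nothing here is known to trigger the bug.

```shell
target=$(mktemp)            # scratch file standing in for a test partition
pattern=$(mktemp)           # reference copy of the pattern being written
running=$(mktemp)           # the reader loops for as long as this file exists
size_kib=4096
half_kib=$((size_kib / 2))
dd if=/dev/zero of="$target" bs=1024 count="$size_kib" 2>/dev/null
LC_ALL=C tr '\000' '\245' < /dev/zero | head -c $((half_kib * 1024)) > "$pattern"

# Background reader: repeatedly read the first half of the target.
(
    while [ -e "$running" ]; do
        dd if="$target" of=/dev/null bs=1024 count="$half_kib" 2>/dev/null
    done
) &
reader_pid=$!

# Foreground writer: write the pattern to the second half, read it back, compare.
passes=0
failures=0
while [ "$passes" -lt 5 ]; do
    dd if="$pattern" of="$target" bs=1024 seek="$half_kib" conv=notrunc 2>/dev/null
    dd if="$target" bs=1024 skip="$half_kib" 2>/dev/null | cmp -s - "$pattern" \
        || failures=$((failures + 1))
    passes=$((passes + 1))
done

rm -f "$running"            # tell the reader to stop
wait "$reader_pid"
echo "passes: $passes, verify failures: $failures"
rm -f "$target" "$pattern"
```

On a healthy scratch file this reports 5 passes and 0 verify failures; any nonzero failure count on real hardware would be the interesting result.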

I have finally seen a corrupted block after several hours of activity using badblocks. Unfortunately, the corrupted block wasn't in the place I was TRYING to make one. :-(

A comment: it's blasted hard to make a bootable USB or SD card using the ASUS EEEpc alone. Grub apparently has a bug where it writes everything to the SSD regardless of where you specify it, AND /dev/ sometimes populates the removeable devices differently when you have different devices plugged in, or when you have the installer image running, or the phase of the moon, or something. :-P

OK... back to my progress (or lack thereof):

There is a page describing how to make a bootable image on an SD card in Debian 5.0.3 "lenny," so I was able to boot and test two kernels: 2.6.26-2-686, and 2.6.30-bpo.2-686 (2.6.30 backported to Lenny via backports.org). Running badblocks I was able to see the filesystem damage from earlier Ubuntu Karmic installations, HOWEVER, once I "zeroed" out the SSD using dd, the device stayed "clean" through several rounds of badblocks tests. (I did limit myself to five minute runs of the write tests -- in my experience under Karmic installations I felt like I should see the problem by then.)

I was not able to make a grub-bootable SD or USB image to test different Ubuntu kernels, but I have the Ubuntu 9.10 "Karmic" NBR live image, and I created a Kubuntu 9.10 "Karmic" netbook image, too, but of course it seemed pretty similar. I don't tend to see the problem when booting from the SD card. The Karmic kernel I used is 2.6.31-14-generic.

I also used the Ubuntu 9.10 "Karmic" Alternate installer to create a bootable partition on the SSD card. I booted from it, and used badblocks to exercise the other partition I created. Several times, and in several different ways. I NEVER saw badblocks CAUSE any problems on the test partition.

HOWEVER, after all that, the INSTALLED Karmic OS partition developed a bad block at sector 110655. That block is found by badblocks no matter what OS I boot from, though fsck -f does not see it (so it doesn't happen to be one of the nodes fsck looks at). It causes kernel error messages (as described in Bug 445852), and parted takes a long time to come up when I try to look at the partition table.

SO I think I can say that JUST writing random patterns using badblocks doesn't make the corruption happen, or at least not quickly. Specifically I used:

# badblocks -sn /dev/sda

After the problem block develops, it's visible using the read-only test:

# badblocks -s /dev/sda

I haven't let badblocks "churn" away on the SSD all day. I would like to come up with something that elicits the filesystem damage almost immediately, like I was seeing with the beta NBR installers. I'm starting to think some other process has to be running at the same time to trigger the corruption.

Maybe it would be enough to read some other place on the SSD at the same time badblocks is reading and writing its random patterns. Maybe another instance of badblocks running in another VT, or something more clever.
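
One way the "second reader" idea could be tried is sketched below. This is untested against the actual problem: the function name `stress_with_reader`, the `RUN` switch, the device `/dev/sdX`, and the dd offsets (256 MiB read starting 1 GiB into the device) are all arbitrary assumptions for illustration. By default it only prints the two commands.

```shell
# Hypothetical helper: pair a non-destructive badblocks pass with a second
# process that keeps reading a different region of the same device.
# With RUN=1 it actually executes (needs root); otherwise it is a dry run.
stress_with_reader() {
    dev=$1
    writer="badblocks -sn $dev"
    # Offsets are arbitrary: read 256 MiB starting 1 GiB into the device.
    reader="dd if=$dev of=/dev/null bs=1M skip=1024 count=256"
    if [ "${RUN:-0}" = 1 ]; then
        $writer &                      # read-write test in the background
        writer_pid=$!
        # Repeat the read for as long as badblocks is still running.
        while kill -0 "$writer_pid" 2>/dev/null; do
            $reader 2>/dev/null
        done
        wait "$writer_pid"
    else
        printf '%s\n%s\n' "$writer" "$reader"
    fi
}

# Dry run: prints the two commands that would run concurrently.
stress_with_reader /dev/sdX
```

Note that badblocks -n will normally refuse to run on a device with mounted partitions, so this only makes sense when booted from other media.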

Tommy Trussell (tommy-trussell) wrote on 2009-12-14:

#41

I've been so focused on getting an installation running I lost track of Bug 445852 -- lots of new info there the last few days.

Felix Zielcke (fzielcke) wrote on 2009-12-14:

#42

On Monday, 14.12.2009 at 15:52 +0000, Tommy Trussell wrote:
> A comment: it's blasted hard to make a bootable USB or SD card using the
> ASUS EEEpc alone. Grub apparently has a bug where it writes everything
> to the SSD regardless of where you specify it, AND /dev/ sometimes
> populates the removeable devices differently when you have different
> devices plugged in, or when you have the installer image running, or the
> phase of the moon, or something. :-P

grub-install always writes its files to /boot/grub unless you pass the
--root-directory= option, in which case it uses ${root-directory}/boot/grub.
The device you give it only selects where the MBR/boot sector code and
the embedded copy of core.img go (if there is space to embed it). The
file that gets embedded is also created in /boot/grub first.

The grub-pc package runs grub-install on the devices stored for it in
the debconf database.
If you want to change or disable that, run
`sudo dpkg-reconfigure grub-pc'.
But as I just replied to #495423, disabling it can make your system
unbootable if the package gets upgraded and update-grub generates a
grub.cfg that is no longer 100% compatible with the older GRUB 2.
--
Felix Zielcke
Proud Debian Maintainer and GNU GRUB developer
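
The directory rule explained in this comment can be written down as a tiny function. This is a sketch of the rule as stated here, not code taken from grub-install itself; the name `grub_dir` is made up for illustration.

```shell
# Hypothetical helper mirroring the rule above: with no root directory the
# GRUB files go to /boot/grub, otherwise to <root-directory>/boot/grub.
grub_dir() {
    root=${1:-}
    if [ -z "$root" ]; then
        printf '/boot/grub\n'
    else
        # Drop one trailing slash so "/mnt/" and "/mnt" give the same result.
        printf '%s/boot/grub\n' "${root%/}"
    fi
}

grub_dir          # no --root-directory= given
grub_dir /mnt     # as if --root-directory=/mnt were given
```

The point of the sketch is that the device argument never appears in it: the device only decides where the boot sector code goes, not where the files are written.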

Tommy Trussell (tommy-trussell) wrote on 2009-12-15:

#43

@Felix Zielcke: Thank you for the explanation. Maybe the bug is not in grub itself; maybe the Debian installer is not specifying the right device, or something is getting confused about which device is the target... ? I was finally able to get a small Debian installation working on an SD card using http://wiki.debian.org/DebianEeePC/HowTo/InstallOnSDcardOrUsbStick (but of course that is grub, not grub2).

After Ubuntu's alternate installer (Debian Installer) failed I also tried to get grub2's update-grub to write its files using the procedure at http://www.ubuntu-inside.me/2009/06/howto-recover-grub2-after-windows.html -- The steps are: mount the target device (for me it was /dev/sdc), and then mount the booted system's /dev and /proc to the target directory, then chroot into the target directory and run grub-install /dev/sdc. From watching the LEDs it looked like grub wrote or read something from /dev/sdb (the booted USB stick) as well as /dev/sda (the SSD I was trying to leave alone).
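
For reference, the steps in the paragraph above look roughly like the following. This is a reconstruction, not the exact commands that were run: it assumes the target's root filesystem is the first partition of the target disk and that /mnt is free as a mount point, and the function name `chroot_grub_install` is made up. It prints the commands rather than executing them.

```shell
# Hypothetical helper: print the chroot procedure for a given target disk.
# Assumptions: root filesystem on partition 1 of the disk, /mnt as mount point.
chroot_grub_install() {
    disk=$1            # e.g. /dev/sdc, the disk that should get GRUB
    mnt=${2:-/mnt}     # assumed mount point for the target filesystem
    cat <<EOF
mount ${disk}1 $mnt
mount --bind /dev $mnt/dev
mount --bind /proc $mnt/proc
chroot $mnt grub-install $disk
umount $mnt/proc
umount $mnt/dev
umount $mnt
EOF
}

# Dry run for the device named in this comment.
chroot_grub_install /dev/sdc
```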

It sounds like you're saying that even though I mounted the /dev and /proc and chroot-ed into the mounted target I still must explicitly specify the device for update-grub's root, so the command would be (?? haven't tested):

grub-install /dev/sdc --root-directory=/dev/sdc/

Tommy Trussell (tommy-trussell) wrote on 2009-12-15:

#44

I have finally confirmed that the errors appear when I invoke /lib/udev/devkit-disks-probe-ata-smart as described in Comment #108 of Bug 445852. So I will declare this bug to be a duplicate of that one. (Plus I like the bug's dramatic new description "devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death") :-0

Felix Zielcke (fzielcke) wrote on 2009-12-15:

#45

On Tuesday, 15.12.2009 at 17:11 +0000, Tommy Trussell wrote:
> It sounds like you're saying that even though I mounted the /dev and
> /proc and chroot-ed into the mounted target I still must explicitly
> specify the device for update-grub's root, so the command would be (??
> haven't tested):
>
> grub-install /dev/sdc --root-directory=/dev/sdc/

If you mean grub-install then please say grub-install and not
update-grub.
update-grub is now just a stub for
`grub-mkconfig -o /boot/grub/grub.cfg'.
It does nothing more than generate grub.cfg.

If you use chroot, so that /boot/grub inside the chroot is on the
disk/partition that should get GRUB, you don't need --root-directory=.
If it is not, then you have to give it a path, not a device,
like grub-install --root-directory=/mnt /dev/sdc
and then it uses /mnt/boot/grub instead of /boot/grub.

--
Felix Zielcke
Proud Debian Maintainer and GNU GRUB developer
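
The two cases in this reply can be summarised in a small command builder. This is an illustrative sketch only: the name `grub_install_cmd` is made up, and treating anything under /dev/ as "a device, not a path" is a simplifying assumption rather than a check grub-install performs.

```shell
# Hypothetical helper: build the grub-install command line for the two cases
# above. The second argument, if given, must be a mounted path, not a device.
grub_install_cmd() {
    dev=$1
    root=${2:-}
    case $root in
        '')
            # Inside a chroot (or on the running system): no option needed.
            printf 'grub-install %s\n' "$dev" ;;
        /dev/*)
            # Assumed heuristic: a /dev/ path is a device, which is the mistake
            # being corrected in this reply.
            printf 'error: --root-directory= needs a path, not a device: %s\n' "$root" >&2
            return 1 ;;
        *)
            printf 'grub-install --root-directory=%s %s\n' "${root%/}" "$dev" ;;
    esac
}

grub_install_cmd /dev/sdc          # chroot case
grub_install_cmd /dev/sdc /mnt     # target mounted at /mnt, no chroot
```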

Tommy Trussell (tommy-trussell) wrote on 2009-12-17:

#46

@Felix: my deepest apologies for writing it wrongly. At some point I will try again and try to identify where I went wrong. Currently I am finding the filesystem corruption to be more of a problem than grub wrestling! If you can point me to the best place to discuss such grub issues I will take this discussion there when I can.

Tommy Trussell (tommy-trussell) wrote on 2009-12-17:

#47

Note the workaround in Bug 445852 involving editing udev is NOT workable for a new Ubuntu installation, because even if you update the udev rule before the installer reboots, the udev change gets reverted with the first set of software updates.

Affects	Status	Importance	Assigned to
Ubuntu Netbook Remix	Invalid	Undecided	Unassigned
ubiquity	Invalid	Undecided	Unassigned
grub2 (Ubuntu)	Invalid	High	Unassigned
linux (Ubuntu)	Confirmed	Undecided	Unassigned
