hald crashed with SIGSEGV in hotplug_event_begin_add_blockdev when assembling mdraid devices

Bug #361689 reported by Noumayos on 2009-04-15
This bug affects 9 people
Affects         Status         Importance   Assigned to     Milestone
HAL             Fix Released   Critical
hal (Ubuntu)    Fix Released   Medium       Chris Coulson
  Jaunty        Fix Released   Medium       Chris Coulson

Bug Description

Binary package hint: hal

dla@optimus:~$ lsb_release -rd
Description: Ubuntu 9.04
Release: 9.04

dla@optimus:~$ uname -a
Linux optimus 2.6.28-11-generic #41-Ubuntu SMP Wed Apr 8 04:39:23 UTC 2009 x86_64 GNU/Linux

dla@optimus:~$ apt-cache policy hal
hal:
  Installed: 0.5.12~rc1+git20090403-0ubuntu1
  Candidate: 0.5.12~rc1+git20090403-0ubuntu1
  Version table:
 *** 0.5.12~rc1+git20090403-0ubuntu1 0
        500 http://ftp.free.org jaunty/main Packages
        100 /var/lib/dpkg/status

I want to use my RAID 0 volume, and it works. After a reboot, however, I have no keyboard or mouse in X.

dla@optimus:~$ sudo mdadm --assemble /dev/md0
mdadm: /dev/md0 has been started with 2 drives.

dla@optimus:~$ dmesg
[ 123.903076] md: md0 still in use.
[ 124.029637] md: bind<sdc1>
[ 124.029816] md: bind<sdb1>
[ 124.031945] md: raid0 personality registered for level 0
[ 124.032052] md0: setting max_sectors to 128, segment boundary to 32767
[ 124.032056] raid0: looking at sdb1
[ 124.032058] raid0: comparing sdb1(488383936) with sdb1(488383936)
[ 124.032061] raid0: END
[ 124.032063] raid0: ==> UNIQUE
[ 124.032064] raid0: 1 zones
[ 124.032066] raid0: looking at sdc1
[ 124.032068] raid0: comparing sdc1(488383936) with sdb1(488383936)
[ 124.032070] raid0: EQUAL
[ 124.032072] raid0: FINAL 1 zones
[ 124.032075] raid0: done.
[ 124.032076] raid0 : md_size is 976767872 blocks.
[ 124.032078] raid0 : conf->hash_spacing is 976767872 blocks.
[ 124.032080] raid0 : nb_zone is 1.
[ 124.032082] raid0 : Allocating 8 bytes for hash.
[ 124.033695] md0: p1
[ 124.091416] hald[2821]: segfault at 0 ip 0000000000435b05 sp 00007fff9db69b30 error 4 in hald[400000+57000]

Chris Coulson (chrisccoulson) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please try to obtain a backtrace following the instructions at http://wiki.ubuntu.com/DebuggingProgramCrash and upload the backtrace (as an attachment) to the bug report. This will greatly help us in tracking down your problem.

Changed in hal (Ubuntu):
status: New → Incomplete
Noumayos (noumayos) wrote :

Thank you for your work.

StoatWblr (stoatwblr) wrote :

I am seeing the same segfault when hald probes my md raid1 devices.

Chris Coulson (chrisccoulson) wrote :

Thanks. It would also be useful if you could run "sudo hald --verbose=yes --daemon=no 2>&1 | tee ~/hald.log", then repeat the steps that trigger the crash and attach the log here.

Thanks

Chris Coulson (chrisccoulson) wrote :

Would you also mind trying the build of HAL from my PPA [1]?

[1] - https://launchpad.net/~chrisccoulson/+archive/ppa

Noumayos (noumayos) on 2009-04-17
summary: - hald segfault when using a raid 0 volume
+ hald segfault when using a raid volume

Please find attached the requested log

Noumayos (noumayos) wrote :

Your build seems to work.

Chris Coulson (chrisccoulson) wrote :

Thanks. I don't know if my patch is the right way to fix it. With my build, could you also please run "lshal > lshal.log" after assembling the raid volume, and attach "lshal.log" to the bug report? Once that is done, I will send all this upstream.

Thanks

Changed in hal (Ubuntu):
importance: Undecided → Medium
StoatWblr (stoatwblr) wrote :

Your patch is working for me.

I'll leave it to the original poster to post his lshal.log unless you'd like mine as well.

This bug only manifested on 2.6.28-11 - booting 2.6.27-11 on Jaunty beta was fine.

Chris Coulson (chrisccoulson) wrote :

If you can provide the output, then it would be appreciated (from both kernels).

Noumayos (noumayos) wrote :

Please find the log attached.

Chris Coulson (chrisccoulson) wrote :

Thanks Noumayos. That was before you assembled your raid array though wasn't it?

Noumayos (noumayos) wrote :

The result of lshal is the same before and after the raid array.

Chris Coulson (chrisccoulson) wrote :

Would you mind running "lshal -m", assembling your array and then posting any output?

Thanks

Noumayos (noumayos) wrote :

I have no output when assembling my array.

StoatWblr (stoatwblr) wrote :

Here's my lshal.log. As I said, your updated package is working for me (RAID1)

I hope this helps.

Works for me too. Good work, many thanks! :)

Hi,

I upgraded today from Intrepid to Jaunty, and had the same problem with no mouse/keyboard in X, and a message in system log about a segfault with HAL. I am using software raid.

I downloaded and installed the package supplied by Chris and it works. I hope this patch will be applied to the official package as well.

Thanks for the help!

Chris Morgan (chmorgan) wrote :

Also used the hald supplied by Chris and the keyboard and mouse work again.

BobMcD (mcbobbo) wrote :

One more: I also used the hald supplied by Chris and it fixed it. Same symptom - md's causing hal to crash.

Russell Davies (russelldavies) wrote :

I can confirm this when using a RAID 10 array. Chris's HAL builds also fixed the problem for me.

Sergey Nizovtsev (snizovtsev) wrote :

Chris's HAL builds helped for me too. I think that the bug status should be 'In progress' instead of 'Incomplete'.

mobrien118 (mobrien118) wrote :

How is it possible that this bug is only listed as "Medium" importance?!?!?

I discovered what I think is this bug when I upgraded my system using system update. Upon a later reboot I found that I had no access to the system whatsoever. After trying multiple things to get back in from the root recovery console, I re-formatted, losing a lot of configuration.

Fortunately, I am somewhat tech savvy and I didn't lose everything, but this cost me 2 days and will probably cost more. And for the average user, this could equate to complete data loss. I would think this would be a high importance bug since it disables a working system!

Does anyone agree with me, or am I missing something here?

--mobrien118

mobrien118 (mobrien118) wrote :

FYI, it looks like Chris's PPA packages fixed it, though.

Thank goodness! If I had been remote from this machine (which I will be for the next few months) it would have been a nightmare!

Chris Morgan (chmorgan) wrote :

Mobrien118, I agree that there should be some consideration of its importance. Maybe it isn't that big of a deal since few people have raid sets on their computers. It cost me several hours of rebooting and googling before I thought to boot into recovery console, install openssh and then log in with another machine to look at the dmesg output. I considered myself fortunate to have stumbled upon the solution here since the failure case is very confusing. If hald is so critical maybe there should be a better way of reporting these errors to the user.

Chris Coulson (chrisccoulson) wrote :

I've not had much time to look any further at this, and I haven't proposed my patch as a fix yet because I don't know if it is the right way to fix it. What I need to do really is send this bug report upstream, and also have a play around with a mdraid setup myself, but I don't have a clue how to set one of those up.

Perhaps someone here could help me set one up ;)

Changed in hal (Ubuntu):
status: Incomplete → Confirmed

When assembling certain MD raid devices, hald crashes:

#0 0x0000000000435b05 in hotplug_event_begin_add_blockdev (sysfs_path=0x26c0130 "/sys/devices/virtual/block/md0/md0p1", device_file=<value optimized out>, is_partition=<value optimized out>, parent=0x260bca0, end_token=0x26c0020) at blockdev.c:1501
 sysfs_path_len = <value optimized out>
 is_physical_partition = <value optimized out>
 volume_label = 0x2681390 ""
 buf = "Volume\000\0009\001l\002\000\000\000\0001\n\000\000\000\000\000\000Ù\202\\\005Ê\177\000\000\000Þ\202\005Ê\177\000\000\037^Y\005Ê\177\000\000\000\206`\002\000\000\000\000 \000l\002\000\000\000"
 major_minor = <value optimized out>
 d = (HalDevice *) 0x267de80
 major = 259
 minor = 0
 is_fakevolume = 0
 sysfs_path_real = 0x2693670 "/sys/devices/virtual/block/md0/md0p1"
 floppy_num = <value optimized out>
 is_device_mapper = 0
 is_md_device = 1
 is_cciss_device = 0
 md_number = 0
 __func__ = "hotplug_event_begin_add_blockdev"
#1 0x0000000000425d72 in hotplug_event_begin_sysfs (hotplug_event=0x26c0020) at hotplug.c:220
 parent = (HalDevice *) 0x0
 range = 1
 is_partition = 1
 d = (HalDevice *) 0x0
 subsystem = "0Rë\rÿ\177\000\000S£D\000\000\000\000\0000\001l\002\000\000\000\0008\227Á\004Ê\177\000\000\001\200­û\000\000\000\0000\001l\002\000\000\000\0000\001l\002\000\000\000\0000\001l\002\000\000\000\0000\001l\002\000\000\000\000T\001l\002\000\000\000\000/\003l\002\000\000\000\0000\001l\002\000\000\000\000/\003l\002", '\0' <repeats 44 times>, " \000\000\000\004\000\000\000 \020\000\000\000\000\000\000\000\000è\004Ê\177\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000ªè\004Ê\177\000\0000\020", '\0' <repeats 14 times>, "\b\000\000\000\000\000\000\000pªè\004Ê\177\000\000ÿÿÿÿ\000\000\000\000Tl\\\005Ê\177"...
 subsystem_target = <value optimized out>
 __func__ = "hotplug_event_begin_sysfs"
#2 0x00000000004261c8 in hotplug_event_process_queue () at hotplug.c:295
 hotplug_event = (HotplugEvent *) 0x26c0020
 lp = (GList *) 0x2683da0
 lp2 = (GList *) 0x0
 processing = 1
 __func__ = "hotplug_event_process_queue"
#3 0x0000000000424f82 in hald_udev_data (source=<value optimized out>, condition=<value optimized out>, user_data=<value optimized out>) at osspec.c:259
 fd = <value optimized out>
 smsg = {msg_name = 0x0, msg_namelen = 0, msg_iov = 0x7fff0deb53d0, msg_iovlen = 1, msg_control = 0x7fff0deb63e0, msg_controllen = 32, msg_flags = 0}
 cmsg = <value optimized out>
 iov = {iov_base = 0x7fff0deb53e0, iov_len = 4096}
 cred = <value optimized out>
 cred_msg = "\034\000\000\000\000\000\000\000\001\000\000\000\002\000\000\000Í\034", '\0' <repeats 13 times>
 buf = "add@/devices/virtual/block/md0/md0p1\000UDEV_LOG=3\000ACTION=add\000DEVPATH=/devices/virtual/block/md0/md0p1\000SUBSYSTEM=block\000DEVTYPE=partition\000SEQNUM=1723\000MAJOR=259\000MINOR=0\000DEVLINKS=/dev/block/259:0\000DEVNAME=/d"...
 bufpos = 209
 action = 0x7fff0deb5417 "add"
 __func__ = "hald_udev_data"
#4 0x00007fca055a420a in g_main_context_dispatch () from /usr/lib/libglib-2.0.so.0
No symbol table info available.
#5 0x00007fca055a78e0 in ?? () from /usr/lib/libglib-2.0.so.0
No symbol table info available.
#6 0x00007fca055a7dad in g_main_loop...


Created an attachment (id=25568)
Patch which fixes the issue (don't assume that the parent has storage.drive_type property)

I think I can get you started.

I mean, the first step to creating a RAID volume is to think it out. Where do you need redundancy? Where do you need speed? Then map out your partitions (especially if you have different sized disks).

Remember that RAID will cause a slight (in the case of RAID0 or RAID1) to slightly greater (RAID5 or RAID6) processor and I/O load. That is the trade-off for getting better overall disk performance.

Although it is supposedly not needed to RAID0 your swap partitions across disks (supposedly the swap daemon manages multiple disks very well) it doesn't hurt to do so and is an easy and safe way to get started with RAID. You might consider making this your test case.

This page lays out mdadm and Linux RAID pretty well: http://ubuntuforums.org/showthread.php?t=408461

Basically:
1. partition the disks you want to use, giving the partitions the "RAID" type
2. create a RAID array using "mdadm --create /dev/md0 --level=[level] --raid-devices=[number of devices] [device1] [device2]...[deviceN]"
3. assemble the array
4. format the array in the filesystem of your choice (like any other partition)
5. mount it like you would any other disk partition

Pretty simple, eh?

Chris Coulson (chrisccoulson) wrote :

Thanks for your help mobrien118 and Russell (who also contacted me privately with some help). I've managed to recreate the crash now.

Changed in hal (Ubuntu):
assignee: nobody → Chris Coulson (chrisccoulson)
status: Confirmed → In Progress
Changed in hal:
status: Unknown → Confirmed
Chris Coulson (chrisccoulson) wrote :

Now I understand it properly, I've re-written the patch, tested it and sent upstream.

Changed in hal (Ubuntu):
status: In Progress → Triaged
mobrien118 (mobrien118) wrote :

To the main Ubuntu repos or to your PPA?

Chris Coulson (chrisccoulson) wrote :

The patch is in my bzr branch, waiting to be merged in to the ubuntu-core-dev branch. I've also sent the patch to the upstream freedesktop bug tracker: https://bugs.freedesktop.org/show_bug.cgi?id=21603

summary: - hald segfault when using a raid volume
+ hald crashed with SIGSEGV in hotplug_event_begin_add_blockdev when
+ assembling mdraid devices

Also confirmed, problem appeared after creating mdadm raidset and reboot.

Chris' build fixed my problem as well
Thanks!

mgcsinc (mgcsinc) wrote :

Adding my voice to the chorus - same problem.

Haven't tried the patch yet (I'm away from the box right now), but will ASAP. I agree that there should be consideration of increasing importance if that's still appropriate.

Thank you! Committed in b35bf1f

Martin Pitt (pitti) wrote :

I committed the fix upstream, thanks Chris!

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package hal - 0.5.12+git20090512-0ubuntu2

---------------
hal (0.5.12+git20090512-0ubuntu2) karmic; urgency=low

  * debian/patches/50_no_crash_on_md_blockdev.patch:
    - When adding a block device, don't assume that the parent
      has storage capability. This fixes a crash where the device
      is re-parented to the root computer device object (such as
      with mdraid devices). LP: #361689.

 -- Chris Coulson <email address hidden> Fri, 15 May 2009 18:34:58 +0200
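
The changelog entry above describes a guard against a missing property on the parent device. Here is a minimal sketch of that kind of check. It is not the actual HAL patch: the `struct device` type, its `drive_type` field, the `parent_is_floppy` helper, and the `"floppy"` comparison are all hypothetical stand-ins chosen for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a HAL device object. A NULL drive_type
 * models a parent that lacks the storage.drive_type property, such as
 * the root computer object that md devices end up re-parented to. */
struct device {
    const char *drive_type;
};

/* Returns 1 when the parent reports drive type "floppy", 0 otherwise.
 * Both the parent and the property are checked before strcmp() is
 * called, because passing NULL to strcmp() is undefined behaviour and
 * typically shows up as a segfault at address 0. */
int parent_is_floppy(const struct device *parent)
{
    if (parent == NULL || parent->drive_type == NULL)
        return 0;
    return strcmp(parent->drive_type, "floppy") == 0;
}
```

With this guard, a parent that has no drive type is treated as "not a floppy" rather than being dereferenced, which is the behaviour the changelog describes for block devices whose parent has no storage capability.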

Changed in hal (Ubuntu):
status: Triaged → Fix Released
Changed in hal:
status: Confirmed → Fix Released

So... may I suggest the new hald build be uploaded to the jaunty repository?

I've spent the last 3 hours debugging what seemed to be a DBus problem, but turned out to be this. Then I needed to pull the fix from karmic because the fix is not available in the jaunty repo.

Eric D (ericdeshayes) wrote :

Excuse my ignorance, but do we have any idea when that fix will be available when I update my system?
Shouldn't the severity be high, as it breaks any installation that is using RAID, AFAIK? Shouldn't this issue be listed in the release notes? (I would not have upgraded if I had known.)

I updated on Saturday from 8.10 and now nothing is working. I am not too keen on applying a temporary fix, knowing that it has a few dependencies (libblkid1). The alternative would be to reinstall 8.10, unless I am told a fix will be available in the next few days.

many thanks for your work and for your answer.
eric

ded (ded-launchpad) wrote :

Same deal here, except that the fixed hal .deb depends on version 2.15 of libblkid1 which is not available, at least on amd64, in the repositories either.

My system is now working (after some painful googling with links) after applying the following:

http://launchpadlibrarian.net/26631965/libblkid1_2.15-1ubuntu2_amd64.deb

Then applying Chris's patch from above.

Gigthanks, Chris. And like everyone else, I think the raid-running world ought to be warned off of Jaunty until this is fixed in the distribution.

Regards,
ded

ded (ded-launchpad) wrote :

Spoke too soon. After installing both Chris's fixed hal and the libblkid1 update above, my AMD64 system failed to boot, pausing about 1/5 of the way through. If I hit Ctrl-Alt-Del, I could get the boot to resume, but with my / mounted read-only. I suspect something is wrong with the libblkid1 that is causing the problem.

Anyone else seeing this?

Eric D (ericdeshayes) wrote :

Yes, I have the same problem.
From my quick investigation, the boot hangs when the findfs binary is called: it gets stuck executing that binary.

ded (ded-launchpad) wrote :

Eric, thanks for confirming. Is your system AMD64 or i386?

Chris Coulson (chrisccoulson) wrote :

Please don't do silly things like install libblkid from karmic - that's totally unsupported and is likely to break your machine.

Everyone is at UDS at the moment but I'll see if this could be considered for a SRU when everyone gets back.

Eric D (ericdeshayes) wrote :

my system is AMD64.

ded (ded-launchpad) wrote :

It's not something I would have thought of on my own, but it appears to be a dependency in the amd64 .deb hal package you posted above:

root@saturn:/home/ded/Downloads# dpkg -i hal_0.5.12+git20090512-0ubuntu2_amd64.deb
(Reading database ... 94344 files and directories currently installed.)
Preparing to replace hal 0.5.12~rc1+git20090403-0ubuntu1 (using hal_0.5.12+git20090512-0ubuntu2_amd64.deb) ...
 * Stopping Hardware abstraction layer hald [ OK ]
Unpacking replacement hal ...
dpkg: dependency problems prevent configuration of hal:
 hal depends on libblkid1 (>= 2.15~rc2-1ubuntu1); however:
  Version of libblkid1 on system is 1.41.4-1ubuntu1.
dpkg: error processing hal (--install):
 dependency problems - leaving unconfigured
Processing triggers for man-db ...
Errors were encountered while processing:
 hal

Is karmic poison? Why the warning?

Thanks.

ded (ded-launchpad) wrote :

Chris et al.,

OK, I get it now. karmic is the next release---my bad, just an end user here.

Still, is there some repository that has version 2.15 of libblkid1 for jaunty? Would someone please post a link or the sources entry for such a repository? Looks like Chris's fix to hal needs it from somewhere at least on the 64-bit systems.

Thanks.

ded (ded-launchpad) wrote :

All,

Has everyone else been able to work around this issue, or is it just me? Since my jaunty update, I can't get mouse or keyboard with mdadm installed.

I was hoping someone would tell me what to do about the libblkid1 dependency in Chris's hal update---at least on amd64---but no traffic here for several days.

Chris, any help? Anyone else? Does it work on an i386?

Regards,

Martin Pitt (pitti) on 2009-06-04
tags: added: regression-release
Changed in hal (Ubuntu Jaunty):
assignee: nobody → Chris Coulson (chrisccoulson)
importance: Undecided → Medium
status: New → In Progress
Chris Coulson (chrisccoulson) wrote :

Here's a debdiff for the Jaunty update

Changed in hal (Ubuntu Jaunty):
status: In Progress → Triaged
Martin Pitt (pitti) wrote :

Accepted hal into jaunty-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in hal (Ubuntu Jaunty):
status: Triaged → Fix Committed
tags: added: verification-needed
Patryk Bajer (bayger) wrote :

The patch from jaunty-proposed WORKS for me! Thank you!

ded (ded-launchpad) wrote :

Patch also worked for me on AMD64. Thanks, Chris.

Martin Pitt (pitti) on 2009-06-09
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package hal - 0.5.12~rc1+git20090403-0ubuntu2

---------------
hal (0.5.12~rc1+git20090403-0ubuntu2) jaunty-proposed; urgency=low

  * debian/patches/50_no_crash_on_md_blockdev.patch:
    - When adding a block device, don't assume that the parent
      has storage capability. This fixes a crash where the device
      is re-parented to the root computer device object (such as
      with mdraid devices). LP: #361689.

 -- Chris Coulson <email address hidden> Fri, 05 Jun 2009 12:25:50 +0200

Changed in hal (Ubuntu Jaunty):
status: Fix Committed → Fix Released
Oli Wade (olithered) wrote :

After applying this update I am having trouble with my X server. There are the following errors in the log:

====
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
(EE) config/hal: NewInputDeviceRequest failed (8)
====

Do you think it could be a side effect?

Oli Wade [2009-06-17 9:41 -0000]:
> After applying this update I am having trouble with my X server. There
> are the following errors in the log:
>
> ====
> (EE) config/hal: NewInputDeviceRequest failed (8)
>
> Do you think it could be a side effect?

The hal update didn't change anything wrt. input devices. If you
downgrade to the previous hal again [1], does it work again?

Does that only happen right after the package upgrade, or also after a
restart of the machine?

[1] sudo apt-get install hal/jaunty-updates
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)

Oli Wade (olithered) wrote :

It happened after a reboot - I blamed this update due to the "hal" in the error message.

I've downgraded ("sudo apt-get install hal/jaunty libhal1/jaunty libhal-storage1/jaunty") but the problem remained through several reboots until I did a shutdown and then poweron.

Therefore I suspect some part(s) of the hardware might have been in a weird state and there is nothing wrong with the update.

Martin Pitt (pitti) wrote :

Oli Wade [2009-06-18 8:56 -0000]:
> Therefore I suspect some part(s) of the hardware might have been in a
> weird state and there is nothing wrong with the update.

OK, thanks for checking!

mobrien118 (mobrien118) wrote :

Ahhh! Is this bug back?

The server I was having a problem with is 1000 miles away from me now and I rebooted it and it didn't come back up. Thinking back a few hours, I remember that "update-manager" installed a HAL update.

Noooooooooooo! I need my server and this is going to force me into weeks of downtime! How is this not a "Critical" bug, and how did this update cause a regression?

This is absolutely horrible. The first time I experienced this bug, it cost me hours/possibly days of troubleshooting, now I have indefinite unscheduled downtime. Seriously CRITICAL!

Anyone have any suggestions?

mobrien118 (mobrien118) wrote :

Didn't mean to sound upset at anyone in my previous post. I know that Chris did an awesome job with the first patch. I'm just upset and looking for a support group :-)

The sooner we can get a permanent fix for this issue, the sooner I can get a good night's sleep.

Please, anyone who is capable, help out with this!

Also, can we change the status back to "confirmed" or "incomplete" so it will bubble back up and get noticed?

Chris Coulson (chrisccoulson) wrote :

This bug hasn't regressed, as the recent HAL update was completely unrelated, and didn't even touch any code AFAICT. If you're experiencing any issues, it's definitely not related to this bug, even if you are experiencing a HAL crash.

You should open a new bug report, preferably by submitting a crash report using Apport. You might need to enable apport in /etc/default/apport and restart though.

Changed in hal:
importance: Unknown → Critical
Changed in hal:
importance: Critical → Unknown
Changed in hal:
importance: Unknown → Critical