Degraded RAID boot fails: kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory

Bug #334994 reported by Dustin Kirkland  on 2009-02-26
This bug affects 1 person
Affects                  Importance   Assigned to
linux (Ubuntu)           High         Andy Whitcroft
linux (Ubuntu Jaunty)    High         Andy Whitcroft
mdadm (Ubuntu)           High         Canonical Server Team
mdadm (Ubuntu Jaunty)    High         Canonical Server Team

Bug Description

Booting degraded RAID has regressed in Jaunty.

When I try to boot a degraded RAID, I get the error messages in the attached screenshot.

There appears to be a nasty bug in the kernel md code that's causing this.
  kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory

This is on an up-to-date Jaunty server.

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Screenshot attached.

Changed in linux:
importance: Undecided → High
Dustin Kirkland  (kirkland) wrote :

Using the mainline builds, I have verified that this problem is not exhibited by the 2.6.29rc1 or 2.6.29rc6 kernels, as installed from:
 * http://kernel.ubuntu.com/~kernel-ppa/mainline/

Hopefully that's enough to help isolate the patch. I suspect it's in or around the md driver.

If you can build a test kernel, I'm happy to verify a fix.

:-Dustin

Changed in linux:
assignee: nobody → apw
milestone: none → jaunty-alpha-6
status: New → Triaged
Andy Whitcroft (apw) wrote :

@Dustin -- it appears there is a whole bunch of change in this area to allow the raids to be created on the fly. I suspect that this is occurring because mdadm made the raid with all devices, which then failed, but did not tear it down before rebuilding it in degraded mode, and as such a disk was already attached to md0.

We need to find out what command sequence is being applied here in both the normal case and the degraded case, ie. exactly which commands have been attempted.

Changed in linux:
status: Triaged → In Progress
Dustin Kirkland  (kirkland) wrote :

Are you thinking this is a kernel or userspace problem, Andy?

:-Dustin

On Tue, Mar 10, 2009 at 12:39:35PM -0000, Dustin Kirkland wrote:
> Are you thinking this is a kernel or userspace problem, Andy?

I do not know for sure right now. I would say that the kernel thinks it
is doing something sensible: the array is already present in some
sense and it is rejecting the later degraded rebuild. The changes in
this area look to point to a new way of handling these, where they go
away automatically. It is possible we have userspace/kernel skew rather
than any specific error on either side.

Changed in linux (Ubuntu):
milestone: jaunty-alpha-6 → ubuntu-9.04-beta
Changed in linux (Ubuntu Jaunty):
status: In Progress → Invalid
Andy Whitcroft (apw) wrote :

Ok, I have been unable to reproduce with the images I have. However, it seems that the issue is a userspace interaction and is fixed in the Debian mdadm; we need to sync that into Ubuntu. The kernel task is now closed, and this is targeted against mdadm.

Changed in mdadm:
importance: Undecided → High
Steve Beattie (sbeattie) on 2009-03-23
Changed in mdadm:
assignee: nobody → canonical-server
status: New → Triaged
Dustin Kirkland  (kirkland) wrote :

Actually, I have to re-open the kernel task.

This is a problem exhibited by the Ubuntu 2.6.28 kernel, which is not present in 2.6.29. We can hack our way around this perhaps with an updated userspace mdadm package, but the error message in this bug's description still appears, and we get kernel oops information in dmesg.

I'll upload a better image overnight.

:-Dustin

Changed in linux (Ubuntu Jaunty):
status: Invalid → Confirmed
Mathias Gug (mathiaz) wrote :

After some testing done on a system with a RAID1 array, here are my results:

mdadm 2.6.7.2: unable to boot from a degraded RAID1 array.

mdadm 2.6.8: able to boot from a degraded RAID1 array.
 - jaunty kernel 2.6.28-11-server: error messages are printed on the console (as shown in the attached Screenshot), 2 oopses reported in the kernel log. I've attached the relevant entries from /var/log/kern.log.
 - upstream kernel 2.6.29-020629rc8-generic: no error message, no kernel oops.

Andy Whitcroft (apw) on 2009-03-24
Changed in linux:
status: Confirmed → In Progress
Andy Whitcroft (apw) wrote :

Ok, I have managed to reproduce this in a KVM here. Console output is as below, dmesg output attached:

** WARNING: There appears to be one or more degraded RAID devices **

The system may have suffered a hardware fault, such as a disk drive
failure. The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : inactive sda5[0](S)
      1999936 blocks

md0 : inactive sda1[0](S)
      96256 blocks

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
[ 35.024024] kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory.
mdadm: failed to add /dev/sda1 to /dev/md0: File exists
mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument
mdadm: Not enough devices to start the array.
[ 35.074648] kobject_add_internal failed for dev-sda5 with -EEXIST, don't try to register things with the same name in the same directory.
mdadm: failed to add /dev/sda5 to /dev/md1: File exists
mdadm: failed to RUN_ARRAY /dev/md1: Invalid argument
mdadm: Not enough devices to start the array.
Could not start the RAID in degraded mode.
Gave up waiting for root device. Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/md1 does not exist. Dropping to a shell!

BusyBox v1.10.2 (Ubuntu 1:1.10.2-2ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)

Andy Whitcroft (apw) wrote :

I should also note that the kernel is not lying; these files are visibly present in sysfs:

    (initramfs) ls /sys/devices/virtual/block/md0/md
    dev-sda1 safe_mode_delay resync_start raid_disks
    reshape_position new_dev component_size layout
    array_state metadata_version chunk_size level
    (initramfs) ls /sys/devices/virtual/block/md1/md
    dev-sda5 safe_mode_delay resync_start raid_disks
    reshape_position new_dev component_size layout
    array_state metadata_version chunk_size level
    (initramfs)

Note the dev-sda1 in the md0/md directory in sysfs, and the dev-sda5 in the md1/md directory. These are the ones it complains about on insertion:

    [ 35.023792] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.023794] sysfs: duplicate filename 'dev-sda1' can not be created
    [...]
    [ 35.074528] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.074529] sysfs: duplicate filename 'dev-sda5' can not be created

Whatever registered this directory seems to have done it properly; it has appropriate links etc. internally:
    (initramfs) ls -l /sys/devices/virtual/block/md0/md/dev-sda1
    lrwxrwxrwx 1 0 0 0 block -> ../../../../../pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda1
    -rw-r--r-- 1 0 0 4096 size
    -rw-r--r-- 1 0 0 4096 offset
    -rw-r--r-- 1 0 0 4096 slot
    -rw-r--r-- 1 0 0 4096 errors
    -rw-r--r-- 1 0 0 4096 state

Ok, so where do these come from? They are made by bind_rdev_to_array() and undone by unbind_rdev_from_array(). From the logs we can see that the kernel is basically making, unmaking, and remaking the array to degrade it:

    [ 3.371474] md: bind<sda1>
    [ 3.381990] md: bind<sda5>
    [...]
    [ 35.003029] md: md0 stopped.
    [ 35.003043] md: unbind<sda1>
    [ 35.020198] md: export_rdev(sda1)
    [ 35.023745] md: bind<sda1>
    [ 35.023787] ------------[ cut here ]------------
    [ 35.023792] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.023794] sysfs: duplicate filename 'dev-sda1' can not be created

If we look at unbind_rdev_from_array(), we see that it uses delayed work to remove the actual entries:

    static void unbind_rdev_from_array(mdk_rdev_t *rdev)
    {
    [...]
        synchronize_rcu();
        INIT_WORK(&rdev->del_work, md_delayed_delete);
        kobject_get(&rdev->kobj);
        schedule_work(&rdev->del_work);
    }

And it appears that this is what finally removes the objects:

    static void md_delayed_delete(struct work_struct *ws)
    {
        mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
        kobject_del(&rdev->kobj);
        kobject_put(&rdev->kobj);
    }

So if this was not waited for appropriately, we might well sometimes manage to get back to binding the new one before this has been done. This being a race would also fit with the transient nature of...


Steve Langasek (vorlon) on 2009-03-26
Changed in linux:
milestone: ubuntu-9.04-beta → ubuntu-9.04
Dustin Kirkland  (kirkland) wrote :

I'm marking this invalid against mdadm. I have tried a variety of fixes from upstream source related to incremental md assembly, merged both 2.6.8 and 2.6.7.2 from Debian into Jaunty, and none of these solve the issue.

Meanwhile, Andy has confirmed that he can see the race condition in the kernel and he is currently working on a solution for that. Look for this to be solved by an updated linux-image build.

:-Dustin

Changed in mdadm (Ubuntu Jaunty):
status: Triaged → Invalid
Andy Whitcroft (apw) wrote :

Added some debugging to the teardown code and managed to reproduce this. What we find is that we unbind and then attempt, and fail, a bind on the array; only then do we see the deletes for the unbind complete. This leads to the bind failure:

    [ 3.476504] md: bind<sda1>
    [...]
    [ 35.097882] md: md0 stopped.
    [ 35.097897] md: unbind<sda1>
    [ 35.097907] APW: sysfs_remove_link ret<0>
    [ 35.110198] md: export_rdev(sda1)
    [ 35.113254] md: bind<sda1>
    [ 35.113297] ------------[ cut here ]------------
    [ 35.113300] WARNING: at /home/apw/build/jaunty/ubuntu-jaunty/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [...]
    [ 35.115126] APW: deleted something

Here, where we happened to mount successfully, note that the delete falls in
the expected place:

    [ 3.479917] md: bind<sda5>
    [...]
    [ 35.118235] md: md1 stopped.
    [ 35.118240] md: unbind<sda5>
    [ 35.118244] APW: sysfs_remove_link ret<0>
    [ 35.140164] md: export_rdev(sda5)
    [ 35.142276] APW: deleted something
    [ 35.143848] md: bind<sda1>
    [ 35.152288] md: bind<sda5>
    [ 35.158571] raid1: raid set md1 active with 1 out of 2 mirrors

If we look at the code for stopping the array we see the following:

    static int do_md_stop(mddev_t *mddev, int mode, int is_open)
    {
    [...]
        rdev_for_each(rdev, tmp, mddev)
            if (rdev->raid_disk >= 0) {
                char nm[20];
                sprintf(nm, "rd%d", rdev->raid_disk);
                sysfs_remove_link(&mddev->kobj, nm);
            }

        /* make sure all md_delayed_delete calls have finished */
        flush_scheduled_work();

        export_array(mddev);
    [...]

Note that we call flush_scheduled_work() to wait for any md_delayed_delete
calls and only then export the array. However, it is export_array() which
triggers these deletes:

    static void export_array(mddev_t *mddev)
    {
    [...]
        rdev_for_each(rdev, tmp, mddev) {
            if (!rdev->mddev) {
                MD_BUG();
                continue;
            }
            kick_rdev_from_array(rdev);
        }
    [...]
    }

It does this via unbind_rdev_from_array():

    static void kick_rdev_from_array(mdk_rdev_t *rdev)
    {
        unbind_rdev_from_array(rdev);
        export_rdev(rdev);
    }

Which triggers the delayed delete:

    static void unbind_rdev_from_array(mdk_rdev_t *rdev)
    {
    [...]
        rdev->sysfs_state = NULL;
        /* We need to delay this, otherwise we can deadlock when
         * writing to 'remove' to "dev/state". We also need
         * to delay it due to rcu usage.
         */
        synchronize_rcu();
        INIT_WORK(&rdev->del_work, md_delayed_delete);
        kobject_get(&rdev->kobj);
        schedule_work(&rdev->del_work);
    }

So in reality we want to wait for this not before the export_array() but
after it. Testing with a patch that does this seems to resolve the issue.

Dustin Kirkland  (kirkland) wrote :

I tested Andy's test kernels on my host, with my host CPU in each of the following configurations:

Pinned
  * 2.4GHz
  * 2.0GHz
  * 1.6GHz
  * 1.2GHz
  * 800MHz

Dynamic
 * Conservative
 * Ondemand
 * Performance
 * Powersave

I could no longer reproduce the issue; it appears to be fixed! Big thanks for the intense kernel debugging on Andy's part...

:-Dustin

Andy Whitcroft (apw) wrote :

@Dustin -- thanks for your testing, I could no longer reproduce the issue locally either with this patch in place. Will push it to the kernel-team.

Andy Whitcroft (apw) on 2009-04-03
Changed in linux (Ubuntu Jaunty):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.28-11.40

---------------
linux (2.6.28-11.40) jaunty; urgency=low

  [ Amit Kucheria ]

  * Disable DEVKMEM for all archs on Jaunty
    - LP: #354221

  [ Andy Whitcroft ]

  * SAUCE: md: wait for possible pending deletes after stopping an array
    - LP: #334994

  [ Brad Figg ]

  * ARM: Setting the bootloader for imx51 flavour.
    - LP: #348382
  * ARM: Add bootloader package Recomendation to iop32x and ixp4xx flavours
    - LP: #348382

  [ Tim Gardner ]

  * SAUCE: [i915] allocate MCHBAR space & enable if necessary
    - LP: #349314

  [ Upstream Kernel Changes ]

  * hpilo: open/close fix
    - LP: #353496

 -- Amit Kucheria <email address hidden> Thu, 02 Apr 2009 11:26:22 -0400

Changed in linux (Ubuntu Jaunty):
status: Fix Committed → Fix Released

Wouldn't #358054 be somehow related?
