Degraded RAID boot fails: kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory

Bug #334994 reported by Dustin Kirkland  on 2009-02-26
This bug affects 1 person
Affects                  Importance   Assigned to
linux (Ubuntu)           High         Andy Whitcroft
linux (Ubuntu Jaunty)    High         Andy Whitcroft
mdadm (Ubuntu)           High         Canonical Server Team
mdadm (Ubuntu Jaunty)    High         Canonical Server Team

Bug Description

Booting degraded RAID has regressed in Jaunty.

When I try to boot a degraded RAID, I get the error messages in the attached screenshot.

There appears to be a nasty bug in the kernel md code that's causing this.
  kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory

This is on an up-to-date Jaunty server.

:-Dustin

Dustin Kirkland  (kirkland) wrote :

Screenshot attached.

Changed in linux:
importance: Undecided → High
Dustin Kirkland  (kirkland) wrote :

Using the mainline builds, I have verified that this problem is not exhibited by the 2.6.29rc1 or 2.6.29rc6 kernels, as installed from:
 * http://kernel.ubuntu.com/~kernel-ppa/mainline/

Hopefully that's enough to help isolate the patch. I suspect it's in or around the md driver.

If you can build a test kernel, I'm happy to verify a fix.

:-Dustin

Changed in linux:
assignee: nobody → apw
milestone: none → jaunty-alpha-6
status: New → Triaged
Andy Whitcroft (apw) wrote :

@Dustin -- it appears there is a whole bunch of change in this area to allow the raids to be created on the fly. I suspect that this is occurring because mdadm made the raid with all devices, which then failed, but did not tear it down before rebuilding it in degraded mode, and as such a disk was already attached to md0.

We need to find out what command sequence is being applied here in both the normal case and the degraded case, ie. exactly which commands have been attempted.

Changed in linux:
status: Triaged → In Progress
Dustin Kirkland  (kirkland) wrote :

Are you thinking this is a kernel or userspace problem, Andy?

:-Dustin

On Tue, Mar 10, 2009 at 12:39:35PM -0000, Dustin Kirkland wrote:
> Are you thinking this is a kernel or userspace problem, Andy?

I do not know for sure right now. I would say that the kernel thinks it
is doing something sensible: the array is already present in some
sense and it is rejecting the later degraded rebuild. The changes in
this area look to point to a new way of handling these, where they go
away automatically. It is possible we have userspace/kernel skew rather
than any specific error on either side.

Changed in linux (Ubuntu):
milestone: jaunty-alpha-6 → ubuntu-9.04-beta
Changed in linux (Ubuntu Jaunty):
status: In Progress → Invalid
Andy Whitcroft (apw) wrote :

Ok, I have been unable to reproduce with the images I have. However, it seems that the issue is a userspace interaction and is fixed in the Debian mdadm; we need to sync that into Ubuntu. The kernel task is now closed, and this is targeted against mdadm.

Changed in mdadm:
importance: Undecided → High
Steve Beattie (sbeattie) on 2009-03-23
Changed in mdadm:
assignee: nobody → canonical-server
status: New → Triaged
Dustin Kirkland  (kirkland) wrote :

Actually, I have to re-open the kernel task.

This is a problem exhibited by the Ubuntu 2.6.28 kernel, which is not present in 2.6.29. We can hack our way around this perhaps with an updated userspace mdadm package, but the error message in this bug's description still appears, and we get kernel oops information in dmesg.

I'll upload a better image overnight.

:-Dustin

Changed in linux (Ubuntu Jaunty):
status: Invalid → Confirmed
Mathias Gug (mathiaz) wrote :

After some testing done on a system with a RAID1 array, here are my results:

mdadm 2.6.7.2: unable to boot from a degraded RAID1 array.

mdadm 2.6.8: able to boot from a degraded RAID1 array.
 - jaunty kernel 2.6.28-11-server: error messages are printed on the console (as shown in the attached Screenshot), 2 oopses reported in the kernel log. I've attached the relevant entries from /var/log/kern.log.
 - upstream kernel 2.6.29-020629rc8-generic: no error message, no kernel oops.

Andy Whitcroft (apw) on 2009-03-24
Changed in linux:
status: Confirmed → In Progress
Andy Whitcroft (apw) wrote :

Ok, I have managed to reproduce this in a KVM here. Console output is as below, dmesg output attached:

** WARNING: There appears to be one or more degraded RAID devices **

The system may have suffered a hardware fault, such as a disk drive
failure. The root device may depend on the RAID devices being online. One
or more of the following RAID devices are degraded:
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : inactive sda5[0](S)
      1999936 blocks

md0 : inactive sda1[0](S)
      96256 blocks

unused devices: <none>
Attempting to start the RAID in degraded mode...
mdadm: CREATE user root not found
mdadm: CREATE group disk not found
[ 35.024024] kobject_add_internal failed for dev-sda1 with -EEXIST, don't try to register things with the same name in the same directory.
mdadm: failed to add /dev/sda1 to /dev/md0: File exists
mdadm: failed to RUN_ARRAY /dev/md0: Invalid argument
mdadm: Not enough devices to start the array.
[ 35.074648] kobject_add_internal failed for dev-sda5 with -EEXIST, don't try to register things with the same name in the same directory.
mdadm: failed to add /dev/sda5 to /dev/md1: File exists
mdadm: failed to RUN_ARRAY /dev/md1: Invalid argument
mdadm: Not enough devices to start the array.
Could not start the RAID in degraded mode.
Gave up waiting for root device. Common problems:
 - Boot args (cat /proc/cmdline)
   - Check rootdelay= (did the system wait long enough?)
   - Check root= (did the system wait for the right device?)
 - Missing modules (cat /proc/modules; ls /dev)
ALERT! /dev/md1 does not exist. Dropping to a shell!

BusyBox v1.10.2 (Ubuntu 1:1.10.2-2ubuntu6) built-in shell (ash)
Enter 'help' for a list of built-in commands.

(initramfs)

Andy Whitcroft (apw) wrote :

I should also note that the kernel is not lying; these files are visibly present in sysfs:

    (initramfs) ls /sys/devices/virtual/block/md0/md
    dev-sda1 safe_mode_delay resync_start raid_disks
    reshape_position new_dev component_size layout
    array_state metadata_version chunk_size level
    (initramfs) ls /sys/devices/virtual/block/md1/md
    dev-sda5 safe_mode_delay resync_start raid_disks
    reshape_position new_dev component_size layout
    array_state metadata_version chunk_size level
    (initramfs)

Note the dev-sda1 in the md0/md directory in sysfs, and the dev-sda5 in the md1/md directory. These are the ones it complains about on insertion:

    [ 35.023792] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.023794] sysfs: duplicate filename 'dev-sda1' can not be created
    [...]
    [ 35.074528] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.074529] sysfs: duplicate filename 'dev-sda5' can not be created

Whatever registered this directory seems to have done it properly; it has appropriate links etc. internally:
    (initramfs) ls -l /sys/devices/virtual/block/md0/md/dev-sda1
    lrwxrwxrwx 1 0 0 0 block -> ../../../../../pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda1
    -rw-r--r-- 1 0 0 4096 size
    -rw-r--r-- 1 0 0 4096 offset
    -rw-r--r-- 1 0 0 4096 slot
    -rw-r--r-- 1 0 0 4096 errors
    -rw-r--r-- 1 0 0 4096 state

Ok, so where do these come from? They are made by bind_rdev_to_array() and undone by unbind_rdev_from_array(). From the logs we can see that the kernel is basically making, unmaking, and remaking the array to degrade it:

    [ 3.371474] md: bind<sda1>
    [ 3.381990] md: bind<sda5>
    [...]
    [ 35.003029] md: md0 stopped.
    [ 35.003043] md: unbind<sda1>
    [ 35.020198] md: export_rdev(sda1)
    [ 35.023745] md: bind<sda1>
    [ 35.023787] ------------[ cut here ]------------
    [ 35.023792] WARNING: at /build/buildd/linux-2.6.28/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [ 35.023794] sysfs: duplicate filename 'dev-sda1' can not be created

If we look at unbind_rdev_from_array(), we see that it uses delayed work to remove the actual entries:

    static void unbind_rdev_from_array(mdk_rdev_t *rdev)
    {
    [...]
        synchronize_rcu();
        INIT_WORK(&rdev->del_work, md_delayed_delete);
        kobject_get(&rdev->kobj);
        schedule_work(&rdev->del_work);
    }

And it appears that this is what finally removes the objects:

    static void md_delayed_delete(struct work_struct *ws)
    {
        mdk_rdev_t *rdev = container_of(ws, mdk_rdev_t, del_work);
        kobject_del(&rdev->kobj);
        kobject_put(&rdev->kobj);
    }

So if this was not waited for appropriately, we might well sometimes manage to get back to binding the new one before this has been done. This being a race would also fit with the transient nature of...


Steve Langasek (vorlon) on 2009-03-26
Changed in linux:
milestone: ubuntu-9.04-beta → ubuntu-9.04
Dustin Kirkland  (kirkland) wrote :

I'm marking this invalid against mdadm. I have tried a variety of fixes from upstream source related to incremental md assembly, merged both 2.6.8 and 2.6.7.2 from Debian into Jaunty, and none of these solve the issue.

Meanwhile, Andy has confirmed that he can see the race condition in the kernel and he is currently working on a solution for that. Look for this to be solved by an updated linux-image build.

:-Dustin

Changed in mdadm (Ubuntu Jaunty):
status: Triaged → Invalid
Andy Whitcroft (apw) wrote :

Added some debugging to the teardown code and managed to reproduce this. What we find is that we unbind and then attempt, and fail, a bind on the array; only then do we see the deletes for the unbind complete. This leads to the bind failure:

    [ 3.476504] md: bind<sda1>
    [...]
    [ 35.097882] md: md0 stopped.
    [ 35.097897] md: unbind<sda1>
    [ 35.097907] APW: sysfs_remove_link ret<0>
    [ 35.110198] md: export_rdev(sda1)
    [ 35.113254] md: bind<sda1>
    [ 35.113297] ------------[ cut here ]------------
    [ 35.113300] WARNING: at /home/apw/build/jaunty/ubuntu-jaunty/fs/sysfs/dir.c:462 sysfs_add_one+0x4c/0x50()
    [...]
    [ 35.115126] APW: deleted something

Here, where we happened to mount successfully, note that the delete falls in
the expected place:

    [ 3.479917] md: bind<sda5>
    [...]
    [ 35.118235] md: md1 stopped.
    [ 35.118240] md: unbind<sda5>
    [ 35.118244] APW: sysfs_remove_link ret<0>
    [ 35.140164] md: export_rdev(sda5)
    [ 35.142276] APW: deleted something
    [ 35.143848] md: bind<sda1>
    [ 35.152288] md: bind<sda5>
    [ 35.158571] raid1: raid set md1 active with 1 out of 2 mirrors

If we look at the code for stopping the array we see the following:

    static int do_md_stop(mddev_t *mddev, int mode, int is_open)
    {
    [...]
        rdev_for_each(rdev, tmp, mddev)
            if (rdev->raid_disk >= 0) {
                char nm[20];
                sprintf(nm, "rd%d", rdev->raid_disk);
                sysfs_remove_link(&mddev->kobj, nm);
            }

        /* make sure all md_delayed_delete calls have finished */
        flush_scheduled_work();

        export_array(mddev);
    [...]

Note that we call flush_scheduled_work() to wait for any md_delayed_delete
calls and only then export the array. However, it is export_array() which
triggers these deletes:

    static void export_array(mddev_t *mddev)
    {
    [...]
        rdev_for_each(rdev, tmp, mddev) {
            if (!rdev->mddev) {
                MD_BUG();
                continue;
            }
            kick_rdev_from_array(rdev);
        }
    [...]
    }

It does this via unbind_rdev_from_array():

    static void kick_rdev_from_array(mdk_rdev_t *rdev)
    {
        unbind_rdev_from_array(rdev);
        export_rdev(rdev);
    }

Which triggers the delayed delete:

    static void unbind_rdev_from_array(mdk_rdev_t *rdev)
    {
    [...]
        rdev->sysfs_state = NULL;
        /* We need to delay this, otherwise we can deadlock when
         * writing to 'remove' to "dev/state". We also need
         * to delay it due to rcu usage.
         */
        synchronize_rcu();
        INIT_WORK(&rdev->del_work, md_delayed_delete);
        kobject_get(&rdev->kobj);
        schedule_work(&rdev->del_work);
    }

So in reality we want to wait for this not before the export_array() but
after it. Testing with a patch that does this seems to resolve the issue.

Dustin Kirkland  (kirkland) wrote :

I tested Andy's test kernels on my host, with my host CPU in each of the following configurations:

Pinned
  * 2.4GHz
  * 2.0GHz
  * 1.6GHz
  * 1.2GHz
  * 800MHz

Dynamic
 * Conservative
 * Ondemand
 * Performance
 * Powersave

I could no longer reproduce the issue; it appears to be fixed! Big thanks for the intense kernel debugging on Andy's part...

:-Dustin

Andy Whitcroft (apw) wrote :

@Dustin -- thanks for your testing, I could no longer reproduce the issue locally either with this patch in place. Will push it to the kernel-team.

Andy Whitcroft (apw) on 2009-04-03
Changed in linux (Ubuntu Jaunty):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.28-11.40

---------------
linux (2.6.28-11.40) jaunty; urgency=low

  [ Amit Kucheria ]

  * Disable DEVKMEM for all archs on Jaunty
    - LP: #354221

  [ Andy Whitcroft ]

  * SAUCE: md: wait for possible pending deletes after stopping an array
    - LP: #334994

  [ Brad Figg ]

  * ARM: Setting the bootloader for imx51 flavour.
    - LP: #348382
  * ARM: Add bootloader package Recomendation to iop32x and ixp4xx flavours
    - LP: #348382

  [ Tim Gardner ]

  * SAUCE: [i915] allocate MCHBAR space & enable if necessary
    - LP: #349314

  [ Upstream Kernel Changes ]

  * hpilo: open/close fix
    - LP: #353496

 -- Amit Kucheria <email address hidden> Thu, 02 Apr 2009 11:26:22 -0400

Changed in linux (Ubuntu Jaunty):
status: Fix Committed → Fix Released

Wouldn't #358054 be somehow related?
