Fails to install because of brain damage in init script that should be converted to udev rules

Bug #98518 reported by Tore Anderson
Affects                   Status        Importance  Assigned to              Milestone
multipath-tools (Ubuntu)  Fix Released  Medium      Fabio Massimo Di Nitto
Dapper                    Fix Released  Medium      Fabio Massimo Di Nitto
Edgy                      Invalid       Medium      Fabio Massimo Di Nitto
Feisty                    Fix Released  Medium      Fabio Massimo Di Nitto

Bug Description

Binary package hint: multipath-tools

root@hiro:~# dpkg --configure -a
Setting up multipath-tools (0.4.7-1ubuntu7) ...
Starting multipath START STOP UNIT command error: SCSI status: Check Condition
 Fixed format, current; Sense key: Not Ready
 Additional sense: Logical unit not ready, manual intervention required
plus...: Driver_status=0x08 [DRIVER_SENSE, SUGGEST_OK]
START STOP UNIT command failed
invoke-rc.d: initscript multipath-tools-boot, action "start" failed.
dpkg: error processing multipath-tools (--configure):
 subprocess post-installation script returned error exit status 1
Errors were encountered while processing:
 multipath-tools

Looking at /etc/init.d/multipath-tools-boot I see the following function which is called unconditionally if sg3-utils is installed:

hsg80_init() {
        dummy_capa=2097152

        for i in $(grep -rl 2097152 /sys/block/sd*/size|awk -F/ '{print $4}')
        do
                sg_start -start /dev/$i
                sleep 1
                echo 1>/sys/block/$i/device/rescan
        done
}

This function really, really, REALLY needs to look at the contents of the "vendor" and/or "model" files in sysfs before doing anything. Right now it ends up unconditionally attempting to start any device that's exactly 10 GB large, which fails if you (like me) have this LUN on midrange active/passive EMC gear where START STOP UNIT will fail on all paths to the passive controller. I have some StorageTek gear I'm fairly certain will trigger the same behaviour too, although I didn't test that.
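A minimal sketch of the kind of guard being asked for here; it is not the fix that was eventually uploaded. It assumes that the sysfs "model" attribute of an HSG80 path begins with HSG80 (an unverified, illustrative assumption), and both is_hsg80_dummy and SYSFS_ROOT are hypothetical names. The size comparison is exact, whereas `grep -rl 2097152` also matches any size that merely contains those digits.

```shell
#!/bin/sh
# Sketch only: decide whether a block device looks like an HSG80 dummy LUN
# before anything is sent to it.
# SYSFS_ROOT is a hypothetical override so the check can be tried against
# a fake directory tree instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"
DUMMY_CAPA=2097152      # size value taken from the init script above

is_hsg80_dummy() {
        dev="$1"
        base="$SYSFS_ROOT/block/$dev"

        # If either attribute is missing, leave the device alone.
        [ -r "$base/size" ] || return 1
        [ -r "$base/device/model" ] || return 1

        size=$(cat "$base/size")
        model=$(cat "$base/device/model")

        # Exact match on the size, not a substring match.
        [ "$size" = "$DUMMY_CAPA" ] || return 1

        # ASSUMPTION: the model string starts with "HSG80"; sysfs pads the
        # value with trailing spaces, hence the pattern match.
        case "$model" in
                HSG80*) return 0 ;;
                *)      return 1 ;;
        esac
}
```

With this in place, the loop in the init script would only reach sg_start for devices where `is_hsg80_dummy "$i"` succeeds, so a same-sized LUN on other storage is skipped.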

This really should be updated in Dapper, in my opinion. It can easily break production systems:

1) Everything runs and is hunky-dory
2a) Admin already has a 10GB volume and installs sg3_utils, or
2b) Creates a 10GB volume on his midrange storage device, rescans the SCSI bus, adds new dm maps, etc.
3) Everything remains hunky-dory for several months, until
4) Admin boots the server and realises the multipaths didn't start, no services work, and everything is generally sad.

It's not like 10GB is an "exotic" size unlikely to be found in the wild either, so I'm really surprised if I'm the only one being bitten by this bug.

Tore

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

confirmed.. and you are right. it needs fixing.

Fabio

Changed in multipath-tools:
assignee: nobody → fabbione
status: Unconfirmed → Confirmed
Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

BTW.. since you have this kind of equipment.. is there anything else that needs to be tuned or added to the multipath documentation that would be of any help?

Thanks
Fabio

Revision history for this message
Tore Anderson (toreanderson) wrote : Re: [Bug 98518] Re: Fails to install because of brain damage in init script

* Fabio Massimo Di Nitto

> BTW.. since you have such kind of equipment.. is there anything else
> needs to be tuned or needs to be added to the multipath documentation
> that can be of any help?

   Hmm, can't think of anything specific really, they work fine if you
  configure multipath.conf correctly. You were thinking of example
  device sections for my arrays to put in /usr/share/doc maybe? I'll
  include them below:

         # Works for EMC AX100 and CX200 (probably all CLARiiON arrays)
         device {
                 vendor "DGC "
                 product "*"
                 hardware_handler "1 emc"
                 path_grouping_policy group_by_prio
                 prio_callout "/sbin/mpath_prio_emc /dev/%n"
                 path_checker emc_clariion
                 failback immediate
                 no_path_retry queue
         }

         # Works for the Sun StorageTek 6140 (rebranded Engenio 3994).
         # Probably other arrays from LSI Logic as well after adjusting
         # vendor/product. Note that it needs AVT mode which can be
         # selected on the 6140 by using OS type AIX_FO. No hardware
         # handler exists at the time of writing for RDAC mode yet
         # (OS type Linux on the 6140).
         device {
                 vendor "SUN "
                 product "CSM200_R "
                 path_grouping_policy group_by_serial
                 prio_callout "/sbin/mpath_prio_tpc /dev/%n"
                 path_checker tur
                 failback immediate
                 no_path_retry queue
         }

   If there's anything else you'd like to know just ask. :-)

Ciao
--
Tore Anderson

Changed in multipath-tools:
assignee: nobody → fabbione
importance: Undecided → Medium
status: Unconfirmed → Confirmed
Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi Tore,

thanks for the information. I am changing the bug information to make sure we do the right thing,
including getting rid of this brain-damaged init script and letting the kernel+udev handle everything properly.

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Here is the proposed patch for dapper.

Problems:

- init script does not perform proper check before attempting to initialize hsg80 multibus failover

- hsg80 failover is not handled properly, causing multipathd to spin at 100% CPU without re-adding the proper device to dm-multipath

Solutions:

- remove hsg80 init from init script.

- add a separate command /sbin/hsg80_start:
  - the script is more robust: it performs better error checking and resolves the race caused by multipathd attempting to re-add the passive leg
    of the hsg80 to dm-multipath *before* it has been started properly.

- change /etc/udev/rules.d/85-multipath.rules to invoke /sbin/hsg80_start only if we detect that we are really on an HSG80 and with the recognized passive path (size is set to 10GB by HSG firmware).

A similar solution will be applied to edgy and feisty. Debdiffs will follow with the same logic; the udev rules and the sg_start invocation will be different.

Fabio

Changed in multipath-tools:
status: Confirmed → In Progress
Revision history for this message
Tore Anderson (toreanderson) wrote : Re: [Bug 98518] Re: Fails to install because of brain damage in init script that should be converted to udev rules

* Fabio Massimo Di Nitto

> Here is the proposed patch for dapper.

   Looks good to me, it would certainly ensure that non-HSG80-using
  people like me won't run into trouble. Can't comment much on the
  correctness of the new HSG80-specific code though.

--
Tore Anderson

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Fabio Massimo Di Nitto

> - change /etc/udev/rules.d/85-multipath.rules to invoke
> /sbin/hsg80_start only if we detect that we are really on an HSG80
> and with the recognized passive path (size is set to 10GB by HSG
> firmware).

   Hm, wait. Isn't there a better way to detect this? This way you'll end up
  starting active paths to volumes that for some reason happen to be 10GB.
  Not sure if that's a problem at all, but wouldn't sg_turs maybe be a
  better test to see whether it is indeed a passive path?

   I also noticed that you don't check the vendor attribute in sysfs when
  matching the HSG80. Probably not necessary, but for the sake of
  completeness...

--
Tore Anderson

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

I did try to check for the vendor, but it's tricky because it depends on the HSG firmware release.

For instance, on mine it's still DEC (old firmware), but I know that more recent ones have been published
by Compaq and HP and the vendor has been changed.

I will see what sg_turs tells me about passive path..

Thanks for the suggestions..

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

OK, I tested sg_turs, and yes, it makes sense to add it as a check. Thanks for spotting it.

Fabio
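A minimal sketch of how an sg_turs test could gate the start command, assuming (as discussed in this report) that a path should only be started when TEST UNIT READY reports it as not ready. This is not the content of the actual debdiff. start_if_not_ready, SG_TURS and SG_START are hypothetical names; the default sg_start invocation is copied from the init script quoted in the bug description and may need different options on other sg3-utils releases.

```shell
#!/bin/sh
# Sketch only: send a start command to a path only when sg_turs says the
# unit is not ready.
# SG_TURS and SG_START are hypothetical overrides so the logic can be
# exercised without real hardware.
SG_TURS="${SG_TURS:-sg_turs}"
# Invocation as used by the init script in this report.
SG_START="${SG_START:-sg_start -start}"

start_if_not_ready() {
        dev="$1"

        # sg_turs exits with status 0 when the unit reports ready.
        if $SG_TURS "$dev" >/dev/null 2>&1; then
                echo "$dev: ready, leaving it alone"
                return 0
        fi

        # Any non-zero status is treated as "not ready" here. That is a
        # simplification: sg_turs can also fail for unrelated reasons.
        if $SG_START "$dev" >/dev/null 2>&1; then
                echo "$dev: started"
                return 0
        fi

        echo "$dev: start failed" >&2
        return 1
}
```

A caller would run, for example, `start_if_not_ready /dev/sdc`, and could choose to ignore a non-zero return so that one failing path does not abort the whole init script.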

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Updated debdiff for dapper to include sg_turs test.

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Here is an updated debdiff for dapper

We cannot remove the init script call, for several reasons:

- copying hsg80_start into the initramfs would pull in a lot of junk.
- running hsg80_start from the initramfs is subject to a race condition while the initramfs init is moving
  mountpoints from / to /root before running the real init.
- not installing hsg80_start in the initramfs would leave passive devices in a bad state when devices
  are discovered before the real root and the real udev rules are available.

So in this patch we re-add the hsg80_start call to the init script, but with proper checks.

Fabio

Revision history for this message
Martin Pitt (pitti) wrote :

Alright, sounds good. Can we get this fixed in Feisty soon? I'm not comfortable with a fix in stables that isn't already fixed in Feisty (that's also what the SRU rules prescribe).

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Martin, I am working on fixing it in feisty, but feisty has a totally different udev interaction, which exposes other issues. I also found another bug in dapper that we must address.

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi guys,

I managed to move all the crap into udev for feisty and it works fine here.

Martin, the way udev works in feisty allows me to remove the init script that's causing trouble.

Tore, sorry i didn't manage to get the documentation in for this upload due to time pressure.
I will fix that as soon as feisty+1 opens.

Fabio

Changed in multipath-tools:
status: Confirmed → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

Fabio, the Feisty version of debian/hsg80_start seems to be a bit more elaborate, and it has a different if logic than the dapper one (Feisty restarts the device if sg_turs is false, dapper's script doesn't). Does that need some updates? The rest of the diff looks good.

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Martin,

yes, the patch for dapper needs to be redone from scratch, but in order to test it I first need to fix another blocker, related to loading firmware from the initramfs,
that was fixed in the edgy SRU but not in dapper.

Until I have sorted out that blocker, my tests would be incomplete.

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi guys,

please scratch all the previous patches.

This should be final.

Tore it would be great if you can test it on your systems too and (if you can) add a couple of partitions to the
SAN export. This should trigger the dev/mapper/multipathname[partition] to be created too (known to be broken in dapper at the moment).

Fabio

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Fabio Massimo Di Nitto

> This should be final.
>
> Tore it would be great if you can test it on your systems too and (if
> you can) add a couple of partitions to the SAN export. This should
> trigger the dev/mapper/multipathname[partition] to be created too
> (known to be broken in dapper at the moment).

   I'm not sure I'll be able to do so in the near future, as the 10GB LUN
  that triggered the bug has since been resized, and I don't think I have
  time to get a Dapper test rig online before the weekend (and next week
  I'll be away).

   I trust you, though. ;-)

--
Tore Anderson

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi Tore,

I am OK if you can't test the 10GB thingy, but I would like to know whether the package works with the other changes too, on setups that are not just mine.

Even if i need to wait a bit longer i am good.

thanks
Fabio

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Fabio Massimo Di Nitto

> i am ok if you can't test the 10GB thingy, but i would like to know
> if the package works with the other changes too on setups that are
> not just mine.
>
> Even if i need to wait a bit longer i am good.

   Just got back from holiday so I need to take care of a few loose ends
  first, and tomorrow is Constitution Day here in Norway so nobody's
  going to work. I will try to test it early next week, hopefully that's
  okay with you.

--
Tore Anderson

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Tore Anderson

> Just got back from holiday so I need to take care of a few loose ends
> first, and tomorrow is Constitution Day here in Norway so nobody's
> going to work. I will try to test it early next week, hopefully that's
> okay with you.

   Built and installed OK. Can't reboot these machines now, sorry. :-/

--
Tore Anderson

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Tore, ok thanks. I will wait for when you can reboot one of them.

This bug is really annoying, and I understand it's not easy for you to do such an operation, but it's also really
important for me to know that I didn't break existing setups.

Note especially that we are moving all the operations into udev rules. That means that if there are issues,
they will show up only on boot or when you plug/unplug devices.

Thanks
Fabio

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Fabio Massimo Di Nitto

> Tore, ok thanks. I will wait for when you can reboot one of them.
>
> This bug is really annoying and i understand it's not easy for you to
> do such operation, but it's also really important for me to know that
> I didn't break existing setups.
>
> Specially note the fact that we are moving all the operations into
> udev rules. That means that if there are issues they will show up
> only on boot or when you plug/unplug devices.

   Hey. I finally got the chance to test it on one machine, and booted
  it (after manually having rebuilt the initramfs image, don't know if
  that was necessary though). No problems at all! :-)

--
Tore Anderson

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

we are not going to fix this one in edgy.

Changed in multipath-tools:
status: Confirmed → Rejected
Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Package uploaded to dapper-proposed after Tore's testing of the diff, as agreed with Martin & co.

Fabio

Changed in multipath-tools:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted into dapper-proposed, please go ahead with QA testing.

Changed in multipath-tools:
status: In Progress → Fix Committed
Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Hi Martin,

I am not sure how our SRU team can do QA on this bug without a SAN...

how should we handle it?

I did clearly test it as much as possible and, at your request, asked and waited for Tore
to test it too...

Any idea?

Fabio

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 98518] Re: Fails to install because of brain damage in init script that should be converted to udev rules

Hi,

Fabio Massimo Di Nitto [2007-06-13 11:30 -0000]:
> I am not sure how our SRU team can do QA on this bug without a SAN...
>
> how should we handle it?

If Tore could test the actual packages in -proposed again, and you do
too, that's fine for our purposes.

Revision history for this message
Tore Anderson (toreanderson) wrote :

* Martin Pitt

> If Tore could test the actual packages in -proposed again, and you do
> too, that's fine for our purposes.

   I'm going away on holidays soon, so I won't be able to do so before
  coming home in about three weeks.

--
Tore Anderson

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Martin, OK, I will do that ASAP.

Tore, I guess we will wait for you. This bug has been taking a long time; a week more or less won't make much of a difference :)

Better to do QA and get it done properly.

Fabio

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

Martin, I completed the tests with dapper-proposed. It is all good here. Please move the package to -updates.

Fabio

Revision history for this message
Martin Pitt (pitti) wrote :

This requires very special hardware and otherwise does not affect dapper, so I take Fabio's test as sufficient here.

Revision history for this message
Martin Pitt (pitti) wrote :

Copied to dapper-updates.

Changed in multipath-tools:
status: Fix Committed → Fix Released