multipathd changes the underlying disks of mpathX devices when filesystems are mounted

Bug #1445973 reported by Mauricio Faria de Oliveira
Affects                      Status        Importance  Assigned to              Milestone
partman-multipath (Ubuntu)   Fix Released  High        Mathieu Trudel-Lapierre
Trusty                       Fix Released  Undecided   Unassigned

Bug Description

This is a critical bug.
An inconsistency in multipath bindings configuration between initramfs time and init scripts time might crash the system and cause data corruption.

Background:
----------------

When multipath and multipathd (multipath daemon) run, they can assign an alias/user_friendly_name (mpathX) to each path group (group of underlying devices with the same WWID) that they discover.
The actual alias used (e.g., mpath0, mpath1) for a WWID can be configured in /etc/multipath/bindings (format: <alias> <WWID>).
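
As an illustration of that format, here is a minimal Python sketch of reading such a file. It is not multipath-tools code: the function name parse_bindings is hypothetical, and treating '#' as a comment marker is an assumption. Note that a WWID can itself contain spaces (the ones in the log further down do), so only the first run of whitespace separates the alias from the WWID.

```python
def parse_bindings(text):
    """Parse bindings-file text into an {alias: wwid} dict (illustrative sketch)."""
    bindings = {}
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Split on the first whitespace run only; the WWID may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip lines without both an alias and a WWID
        alias, wwid = parts
        bindings[alias] = wwid.strip()
    return bindings


# Hypothetical file content; the WWIDs are the ones shown in this report's log.
sample = """\
# <alias> <WWID>
mpath0 1IBM IPR-0 5EDA1E0000000080
mpath1 1IBM IPR-0 5EDA1E0000000060
"""
print(parse_bindings(sample))
```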

The problem is that multipathd can (and does) reconfigure any in-use alias to a different path group, even if the corresponding multipath device is mounted (including the root filesystem).
This is done with an "ACT_RELOAD <different major:minor numbers>" call to that map.

If multipathd finds that a path group should use a different alias than the one currently in use (detected as "path group topology change"), it will simply change the underlying devices of that alias.
The result is that on the surface the same map still exists, but its underlying disks (i.e., data) have changed *totally*.

If the changed map is that of the root filesystem, you immediately get a lot of filesystem errors (e.g., reads don't get the expected/correct data) and likely data corruption (e.g., active writes/delayed flushes to a certain mpathX disk happening when its underlying disks change).

The error is fixed by upstream commit "libmultipath: Use existing user friendly name if possible" (not present on Ubuntu currently).

Current problem:
-----------------------

Ubuntu 15.04 is affected (including at first boot) because, in addition to lacking that commit, there is no /etc/multipath/bindings file present (neither in the root filesystem nor, consequently, in the initramfs), which creates the scenario for the following to happen:

When initramfs runs multipath, it assigns some aliases to the path groups, but later, when init scripts run multipathd, it finds other aliases should be used, and re-assigns different path groups to the aliases (including that of the root filesystem).

Then, the system doesn't boot to the login prompt:

 [ OK ] Started LSB: multipath daemon.
 [ OK ] Started /etc/rc.local Compatibility.
 [ OK ] Started D-Bus System Message Bus.
 [ 18.321464] EXT4-fs error (device dm-19): ext4_iget:3898: inode #9178396: comm dbus-daemon: bad extra_isize (36146 != 256)
 [ 18.329670] Aborting journal on device dm-19-8.
 [ 18.330630] EXT4-fs (dm-19): Remounting filesystem read-only
 [ 18.331108] EXT4-fs error (device dm-19): ext4_iget:3898: inode #9178396: comm dbus-daemon: bad extra_isize (36146 != 256)
 [FAILED] Failed to start Login Service.

       (the fs errors differ on every boot).

This is a comparison of the multipath topology at initramfs time and after manually running multipathd once the system had booted (its init scripts were disabled). Notice that the fs errors occur shortly after multipathd changes the aliases/path groups.

On initramfs (multipath command):

 mpath0
 - 8:144
 - 8:48

 mpath1
 - 8:176
 - 8:80

 mpath2
 - 8:112
 - 8:16

 mpath3
 - 8:160
 - 8:64

 mpath4
 - 8:128
 - 8:32

 mpath5
 - 8:96
 - 8:0

On rootfs (multipathd command):

 mpath0
 - 8:96
 - 8:0

 mpath1
 - 8:112
 - 8:16

 mpath2
 - 8:128
 - 8:32

 mpath3
 - no change

 mpath4
 - 8:144
 - 8:48

 mpath5
 - 8:176
 - 8:80

 as a result of:

 ...
 Apr 18 17:00:04 | Found matching wwid [1IBM IPR-0 5EDA1E0000000080] in bindings file. Setting alias to mpath0
 Apr 18 17:00:04 | mpath0: set ACT_RELOAD (path group topology change)
 Apr 18 17:00:05 | mpath0: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:96 1000 round-robin 0 1 1 8:0 1000]
 ...
 Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000060] in bindings file. Setting alias to mpath1
 Apr 18 17:00:05 | mpath1: set ACT_RELOAD (path group topology change)
 ...
 Apr 18 17:00:05 | mpath1: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:112 1000 round-robin 0 1 1 8:16 1000]
 ...
 Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000040] in bindings file. Setting alias to mpath2
 Apr 18 17:00:05 | mpath2: set ACT_RELOAD (path group topology change)
 Apr 18 17:00:05 | mpath2: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:128 1000 round-robin 0 1 1 8:32 1000]
 ...
 Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000020] in bindings file. Setting alias to mpath4
 Apr 18 17:00:05 | mpath4: set ACT_RELOAD (path group topology change)
 Apr 18 17:00:05 | mpath4: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:144 1000 round-robin 0 1 1 8:48 1000]
 ...
 Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E00000000C0] in bindings file. Setting alias to mpath3
 Apr 18 17:00:05 | mpath3: set ACT_NOTHING (map unchanged)
 ...
 Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E00000000A0] in bindings file. Setting alias to mpath5
 Apr 18 17:00:05 | mpath5: set ACT_RELOAD (path group topology change)
 Apr 18 17:00:05 | mpath5: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:176 1000 round-robin 0 1 1 8:80 1000]
 ...
 [ 908.542771] EXT4-fs error (device dm-18): htree_dirblock_to_tree:896: inode #12977345: block 51912949: comm systemd-tmpfile: bad entry in directory: directory entry across range - offset=936(936), inode=823603223, rec_len=209428, name_len=23
 [ 908.542903] Aborting journal on device dm-18-8.
 [ 908.542958] EXT4-fs (dm-18): Remounting filesystem read-only
 [ 908.544519] EXT4-fs error (device dm-18): htree_dirblock_to_tree:896: inode #12978211: block 51912825: comm systemd-tmpfile: bad entry in directory: directory entry across range - offset=968(968), inode=4253809803, rec_len=261768, name_len=139
 Apr 18 17:00:25 | sda: get_state
 Apr 18 17:00:25 | sda: state = running
 ...
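
The comparison of the two topology listings above can be mechanized. The following Python sketch lists the aliases whose set of underlying major:minor devices differs between the two stages. The helper name changed_aliases is hypothetical, and the mpath3 paths in the second listing are assumed to equal the initramfs ones because that listing only says "no change".

```python
def changed_aliases(before, after):
    """Return the aliases present in both maps whose path sets differ."""
    changed = []
    for alias in sorted(before.keys() & after.keys()):
        # Order of paths within a map is irrelevant, so compare as sets.
        if set(before[alias]) != set(after[alias]):
            changed.append(alias)
    return changed


# Topology at initramfs time (multipath command), as listed in this report.
initramfs = {
    "mpath0": ["8:144", "8:48"],
    "mpath1": ["8:176", "8:80"],
    "mpath2": ["8:112", "8:16"],
    "mpath3": ["8:160", "8:64"],
    "mpath4": ["8:128", "8:32"],
    "mpath5": ["8:96", "8:0"],
}

# Topology after multipathd ran on the root filesystem.
rootfs = {
    "mpath0": ["8:96", "8:0"],
    "mpath1": ["8:112", "8:16"],
    "mpath2": ["8:128", "8:32"],
    "mpath3": ["8:160", "8:64"],  # assumed: reported only as "no change"
    "mpath4": ["8:144", "8:48"],
    "mpath5": ["8:176", "8:80"],
}

print(changed_aliases(initramfs, rootfs))
```

Every alias except mpath3 is reported, which matches the ACT_RELOAD / ACT_NOTHING decisions in the log.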

Solution proposals:
--------------------------

- Option #1: quick fix, small impact

Copying the installer's /etc/multipath/bindings file to /target will fix this, until new bindings that could confuse multipathd show up (i.e., the user adds new disks/path groups that were not present during installation and so are not in the installer's multipath bindings file).

It doesn't fix the issue, but prevents it in most cases (no multipath topology additions after install).

This works around the issue by making sure the same /etc/multipath/bindings file is present at the root filesystem *and* initramfs, so multipath and multipathd have the same configuration for the aliases they will use.
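
A minimal Python sketch of the consistency this option relies on: given the alias-to-WWID bindings seen at initramfs time and those seen by multipathd, report every alias that is bound to a different WWID in the two. The function names are hypothetical, and the sample bindings are an illustration derived from the topology listings and log in this report (the affected system had no bindings file at all, so these are the bindings each stage effectively used).

```python
def parse_bindings(text):
    """Return {alias: wwid}; the WWID is the rest of the line and may contain spaces."""
    pairs = (line.split(None, 1) for line in text.splitlines()
             if line.strip() and not line.lstrip().startswith("#"))
    return {p[0]: p[1].strip() for p in pairs if len(p) == 2}


def conflicting_aliases(initramfs_text, rootfs_text):
    """Return {alias: (initramfs_wwid, rootfs_wwid)} for aliases bound differently."""
    early = parse_bindings(initramfs_text)
    late = parse_bindings(rootfs_text)
    return {alias: (early[alias], late[alias])
            for alias in sorted(early.keys() & late.keys())
            if early[alias] != late[alias]}


# Illustrative data: at initramfs time mpath0 had the paths that multipathd
# later put under mpath4, so the two stages disagree about mpath0.
initramfs_view = """\
mpath0 1IBM IPR-0 5EDA1E0000000020
mpath3 1IBM IPR-0 5EDA1E00000000C0
"""
multipathd_view = """\
mpath0 1IBM IPR-0 5EDA1E0000000080
mpath3 1IBM IPR-0 5EDA1E00000000C0
"""
print(conflicting_aliases(initramfs_view, multipathd_view))
```

An empty result is what copying the same bindings file to both places is meant to guarantee.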

(patch attached)

- Option #2: intermediary fix, medium-large impact

Disable user_friendly_names (requires patch in LP #1432062).

This doesn't require many changes to the code (an upstream commit plus a trivial change to one udev rules file), but changes what users are used to (from /dev/mapper/mpathX to /dev/mapper/<WWID>).

It still doesn't fix the issue; it just works around it in most cases (as long as the user doesn't enable user_friendly_names).

- Option #3: actual fix, large impact.

This requires pulling the upstream commit with the fix, plus some dependencies, which changes a non-trivial amount of code right before GA.
Another option is backporting only that commit to the currently packaged code, which introduces fewer changes, but they would be non-upstream changes, again not really desirable right before GA.

Patch attached for option #1.

Mauricio Faria de Oliveira (mfo) wrote :
Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "partman-multipath_copy-bindings.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
description: updated
Changed in multipath-tools (Ubuntu):
importance: Undecided → High
status: New → In Progress
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
Mathieu Trudel-Lapierre (cyphermox) wrote :

We landed a fix for this in partman-multipath (4ubuntu2); the bug was not closed because I forgot to add a bug tag. Closing as Fix Released.

Changed in multipath-tools (Ubuntu):
status: In Progress → Fix Released
affects: multipath-tools (Ubuntu) → partman-multipath (Ubuntu)
Changed in partman-multipath (Ubuntu Trusty):
status: New → In Progress
Adam Conrad (adconrad) wrote : Please test proposed package

Hello Mauricio, or anyone else affected,

Accepted partman-multipath into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/partman-multipath/4ubuntu0.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in partman-multipath (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Mauricio Faria de Oliveira (mfo) wrote :

Hello,

Verification done on Trusty; tags changed.

The bindings file is copied from the installer to the /target system, and updated in its initramfs.

 ~ # cmp /etc/multipath/bindings /target/etc/multipath/bindings
 ~ # echo $?
 0

 ~ # chroot /target
 # mkdir /tmp/initramfs
 # cd /tmp/initramfs
 # gzip -dc /boot/initrd.img | cpio -imd --quiet
 # cmp /etc/multipath/bindings etc/multipath/bindings
 # echo $?
 0

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package partman-multipath - 4ubuntu0.1

---------------
partman-multipath (4ubuntu0.1) trusty; urgency=medium

  * post-base-installer.d/60multipath: copy multipath bindings to target.
    (LP: #1445973)
  * Fix multipath support device naming (LP: #1430074):
    - commit.d/partition_multipath:
      Use 'p' (not '-part') as multipath disk-partition separator.
    - (new file) finish.d/fstab_hd_entries_multipath:
      Use '-part' (not 'p') as multipath disk-partition separator in fstab
      everywhere, as it is how devices will be detected after boot.

 -- Mathieu Trudel-Lapierre <email address hidden> Fri, 17 Jul 2015 12:48:29 -0400

Changed in partman-multipath (Ubuntu Trusty):
status: Fix Committed → Fix Released
Adam Conrad (adconrad) wrote : Update Released

The verification of the Stable Release Update for partman-multipath has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
