multipathd changes the underlying disks of mpathX devices when filesystems are mounted
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
partman-multipath (Ubuntu) | Fix Released | High | Mathieu Trudel-Lapierre |
Trusty | Fix Released | Undecided | Unassigned |
Bug Description
This is a critical bug.
An inconsistency in multipath bindings configuration between initramfs time and init scripts time might crash the system and cause data corruption.
Background:
----------------
When multipath and multipathd (multipath daemon) run, they can assign an alias/user_
The actual alias used (e.g., mpath0, mpath1) for a WWID can be configured in /etc/multipath/
The problem is that multipathd can (and does) reconfigure any in-use alias to a different path group, even if the corresponding multipath device is mounted (including the root filesystem).
This is done with an "ACT_RELOAD <different major:minor numbers>" call to that map.
If multipathd finds that it should use a different alias than the one currently in use (detected as a "path group topology change"), it simply changes the underlying devices of that alias.
The result is that on the surface the same map still exists, but its underlying disks (i.e., data) have changed *completely*.
If the changed map is that of the root filesystem, you immediately get a lot of filesystem errors (e.g., reads don't get the expected/correct data) and likely data corruption (e.g., active writes/delayed flushes to a certain mpathX disk happening when its underlying disks change).
The error is fixed by upstream commit "libmultipath: Use existing user friendly name if possible" (not present on Ubuntu currently).
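The idea expressed by that commit's title can be illustrated with a small conceptual sketch. This is not the upstream libmultipath code: the function name `select_alias`, the two dictionaries, and the sample WWIDs are all hypothetical. The sketch only assumes that an alias already attached to an existing map for a given WWID is preferred over whatever the bindings file says.

```python
def select_alias(wwid, active_maps, bindings):
    """Pick an alias for `wwid` (conceptual sketch, not upstream code).

    active_maps: WWID -> alias of maps that already exist (hypothetical view).
    bindings:    WWID -> alias as read from a bindings file.
    """
    # Prefer the alias the map already has, so that a live (possibly
    # mounted) map is never re-pointed at different disks.
    if wwid in active_maps:
        return active_maps[wwid]
    # Otherwise fall back to the configured binding, if there is one.
    if wwid in bindings:
        return bindings[wwid]
    # Last resort: the first mpathN name that is not in use anywhere.
    used = set(active_maps.values()) | set(bindings.values())
    n = 0
    while f"mpath{n}" in used:
        n += 1
    return f"mpath{n}"


# Hypothetical WWIDs, for illustration only.
active = {"wwid-A": "mpath5"}                        # map created at initramfs time
bindings = {"wwid-A": "mpath0", "wwid-B": "mpath1"}  # read later by multipathd

if __name__ == "__main__":
    for wwid in ("wwid-A", "wwid-B", "wwid-C"):
        print(wwid, "->", select_alias(wwid, active, bindings))
```

With this ordering, the map created early for "wwid-A" keeps its alias even though the bindings seen later disagree, which is the situation described in this report.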
Current problem:
-------
Ubuntu 15.04 is affected (including at first boot), because (in addition to lacking that commit) there's no /etc/multipath/
When initramfs runs multipath, it assigns some aliases to the path groups, but later, when init scripts run multipathd, it finds other aliases should be used, and re-assigns different path groups to the aliases (including that of the root filesystem).
Then, the system doesn't boot to the login prompt:
[ OK ] Started LSB: multipath daemon.
[ OK ] Started /etc/rc.local Compatibility.
[ OK ] Started D-Bus System Message Bus.
[ 18.321464] EXT4-fs error (device dm-19): ext4_iget:3898: inode #9178396: comm dbus-daemon: bad extra_isize (36146 != 256)
[ 18.329670] Aborting journal on device dm-19-8.
[ 18.330630] EXT4-fs (dm-19): Remounting filesystem read-only
[ 18.331108] EXT4-fs error (device dm-19): ext4_iget:3898: inode #9178396: comm dbus-daemon: bad extra_isize (36146 != 256)
[FAILED] Failed to start Login Service.
(the fs errors differ on every boot).
The following compares the multipath topology at initramfs time with the topology after manually running multipathd on the booted system (its init scripts had been disabled). Notice that the fs errors occur shortly after multipathd changes the aliases/path groups.
On initramfs (multipath command):
mpath0
- 8:144
- 8:48
mpath1
- 8:176
- 8:80
mpath2
- 8:112
- 8:16
mpath3
- 8:160
- 8:64
mpath4
- 8:128
- 8:32
mpath5
- 8:96
- 8:0
On rootfs (multipathd command):
mpath0
- 8:96
- 8:0
mpath1
- 8:112
- 8:16
mpath2
- 8:128
- 8:32
mpath3
- no change
mpath4
- 8:144
- 8:48
mpath5
- 8:176
- 8:80
as a result of:
...
Apr 18 17:00:04 | Found matching wwid [1IBM IPR-0 5EDA1E0000000080] in bindings file. Setting alias to mpath0
Apr 18 17:00:04 | mpath0: set ACT_RELOAD (path group topology change)
Apr 18 17:00:05 | mpath0: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:96 1000 round-robin 0 1 1 8:0 1000]
...
Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000060] in bindings file. Setting alias to mpath1
Apr 18 17:00:05 | mpath1: set ACT_RELOAD (path group topology change)
...
Apr 18 17:00:05 | mpath1: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:112 1000 round-robin 0 1 1 8:16 1000]
...
Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000040] in bindings file. Setting alias to mpath2
Apr 18 17:00:05 | mpath2: set ACT_RELOAD (path group topology change)
Apr 18 17:00:05 | mpath2: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:128 1000 round-robin 0 1 1 8:32 1000]
...
Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E0000000020] in bindings file. Setting alias to mpath4
Apr 18 17:00:05 | mpath4: set ACT_RELOAD (path group topology change)
Apr 18 17:00:05 | mpath4: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:144 1000 round-robin 0 1 1 8:48 1000]
...
Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E00000000C0] in bindings file. Setting alias to mpath3
Apr 18 17:00:05 | mpath3: set ACT_NOTHING (map unchanged)
...
Apr 18 17:00:05 | Found matching wwid [1IBM IPR-0 5EDA1E00000000A0] in bindings file. Setting alias to mpath5
Apr 18 17:00:05 | mpath5: set ACT_RELOAD (path group topology change)
Apr 18 17:00:05 | mpath5: load table [0 554287104 multipath 1 queue_if_no_path 1 alua 2 1 round-robin 0 1 1 8:176 1000 round-robin 0 1 1 8:80 1000]
...
[ 908.542771] EXT4-fs error (device dm-18): htree_dirblock_
[ 908.542903] Aborting journal on device dm-18-8.
[ 908.542958] EXT4-fs (dm-18): Remounting filesystem read-only
[ 908.544519] EXT4-fs error (device dm-18): htree_dirblock_
Apr 18 17:00:25 | sda: get_state
Apr 18 17:00:25 | sda: state = running
...
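The two topology listings above can be compared mechanically. The sketch below copies the major:minor pairs from those listings into two dictionaries (mpath3 is filled in from its initramfs entry, since it is reported as unchanged) and reports which aliases ended up on different disks and where each initramfs path group went. The function names are illustrative only.

```python
# Topology at initramfs time (alias -> set of major:minor paths).
INITRAMFS = {
    "mpath0": {"8:144", "8:48"},
    "mpath1": {"8:176", "8:80"},
    "mpath2": {"8:112", "8:16"},
    "mpath3": {"8:160", "8:64"},
    "mpath4": {"8:128", "8:32"},
    "mpath5": {"8:96", "8:0"},
}

# Topology after multipathd ran on the root filesystem.
ROOTFS = {
    "mpath0": {"8:96", "8:0"},
    "mpath1": {"8:112", "8:16"},
    "mpath2": {"8:128", "8:32"},
    "mpath3": {"8:160", "8:64"},  # "no change"
    "mpath4": {"8:144", "8:48"},
    "mpath5": {"8:176", "8:80"},
}


def changed_aliases(before, after):
    """Aliases whose underlying paths differ between the two topologies."""
    return sorted(a for a in before if before[a] != after.get(a))


def moved_to(before, after):
    """Map each alias in `before` to the alias holding the same paths in `after`."""
    owner = {frozenset(paths): alias for alias, paths in after.items()}
    return {alias: owner.get(frozenset(paths)) for alias, paths in before.items()}


if __name__ == "__main__":
    print("changed:", changed_aliases(INITRAMFS, ROOTFS))
    for old, new in sorted(moved_to(INITRAMFS, ROOTFS).items()):
        print(f"{old} (initramfs) -> {new} (rootfs)")
```

Only mpath3 keeps its disks; every other alias, including whichever one backs the root filesystem, ends up pointing at the disks of a different path group.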
Solution proposals:
-------
- Option #1: quick fix, small impact
Copying the installer's /etc/multipath/
It doesn't fix the issue, but prevents it in most cases (no multipath topology additions after install).
This works around the issue by making sure the same /etc/multipath/
(patch attached)
- Option #2: intermediary fix, medium-large impact
Disable user_friendly_names (requires patch in LP #1432062).
This doesn't require many changes to the code (an upstream commit + a trivial change to one udev rules file), but changes what users are used to (from /dev/mapper/mpathX to /dev/mapper/WWIDs).
Still doesn't fix the issue, just works around it in most cases (user doesn't enable user_friendly_
- Option #3: actual fix, large impact.
This requires pulling the upstream commit with the fix, and some dependencies, which changes a non-trivial amount of code right before GA.
Another option is backporting only that commit to the currently packaged code, which adds fewer changes, but the changes are non-upstream ones, again not really desired right before GA.
Patch attached for option #1.
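Option #1 relies on both boot stages seeing the same alias-to-WWID bindings. A minimal sketch of how two sets of bindings could be checked for disagreement is shown below. It assumes a bindings format of one "alias WWID" pair per line with '#' starting a comment. The function names are hypothetical. The "rootfs" sample is taken from the WWIDs in the multipathd log above (whitespace as it appears there), while the "initramfs" sample is inferred by matching the path pairs in the two topology listings, not copied from a real file.

```python
def parse_bindings(text):
    """Return {alias: wwid} from bindings-style text (assumed format)."""
    result = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        parts = line.split(None, 1)           # alias, then the rest as WWID
        if len(parts) != 2:
            continue
        alias, wwid = parts
        result[alias] = wwid.strip()
    return result


def conflicting_aliases(a, b):
    """Aliases present in both mappings but bound to different WWIDs."""
    return sorted(alias for alias in a.keys() & b.keys() if a[alias] != b[alias])


# From the multipathd log ("Found matching wwid [...] Setting alias to ...").
ROOTFS_BINDINGS = """\
mpath0 1IBM IPR-0 5EDA1E0000000080
mpath1 1IBM IPR-0 5EDA1E0000000060
mpath2 1IBM IPR-0 5EDA1E0000000040
mpath3 1IBM IPR-0 5EDA1E00000000C0
mpath4 1IBM IPR-0 5EDA1E0000000020
mpath5 1IBM IPR-0 5EDA1E00000000A0
"""

# Inferred assignment at initramfs time (assumption, see lead-in).
INITRAMFS_BINDINGS = """\
# inferred, not a real file
mpath0 1IBM IPR-0 5EDA1E0000000020
mpath1 1IBM IPR-0 5EDA1E00000000A0
mpath2 1IBM IPR-0 5EDA1E0000000060
mpath3 1IBM IPR-0 5EDA1E00000000C0
mpath4 1IBM IPR-0 5EDA1E0000000040
mpath5 1IBM IPR-0 5EDA1E0000000080
"""

if __name__ == "__main__":
    early = parse_bindings(INITRAMFS_BINDINGS)
    late = parse_bindings(ROOTFS_BINDINGS)
    print("aliases bound to different WWIDs:", conflicting_aliases(early, late))
```

An empty result would mean both stages agree on every shared alias, which is the state Option #1 aims for by giving both stages the same file.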
description: | updated |
Changed in multipath-tools (Ubuntu): | |
importance: | Undecided → High |
status: | New → In Progress |
assignee: | nobody → Mathieu Trudel-Lapierre (mathieu-tl) |
affects: | multipath-tools (Ubuntu) → partman-multipath (Ubuntu) |
Changed in partman-multipath (Ubuntu Trusty): | |
status: | New → In Progress |
The attachment "partman-multipath_copy-bindings.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-sponsors, unsubscribe the team.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]