Trusty & Vivid multipath-tools (multipathd) seg-fault core dump

Bug #1535898 reported by Rafael David Tinoco on 2016-01-19
Affects                    Importance  Assigned to
multipath-tools (Ubuntu)   High        Mathieu Trudel-Lapierre
Precise                    High        Louis Bouchard
Trusty                     High        Louis Bouchard

Bug Description

[SRU justification]
Without this patch, multipathd may crash with SIGSEGV when trying to add a map that already exists.

[Impact]
multipathd crashes with SIGSEGV
A typical trace of such a situation is a message similar to this one in /var/log/syslog :

multipathd: 360060160164034004cd59cfdb22ce611: failed in domap for addition of new path sdr
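
To check whether a host has hit this condition, the syslog line above can be matched mechanically. The following is a minimal sketch, assuming the message wording stays exactly as quoted; `domap_failed_path()` and `DOMAP_FAIL_MARKER` are hypothetical names introduced here and are not part of multipath-tools.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Marker text taken from the syslog line quoted above. */
#define DOMAP_FAIL_MARKER "failed in domap for addition of new path "

/*
 * Hypothetical helper: returns a pointer to the path name (e.g. "sdr")
 * inside `line`, or NULL if the line is not a domap failure message.
 */
const char *domap_failed_path(const char *line)
{
    assert(line != NULL);
    const char *hit = strstr(line, DOMAP_FAIL_MARKER);
    return hit ? hit + strlen(DOMAP_FAIL_MARKER) : NULL;
}
```

The returned pointer aliases the input line, so it is only valid for as long as the line buffer is.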

[Fix]
Check if the map already exists and do a RELOAD in domap() instead of failing.
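
The idea described under [Fix] can be sketched roughly as follows. This is an illustration under assumptions only: `map_exists()`, `choose_action()` and the `ACTION_*` names are hypothetical stand-ins rather than the actual libmultipath symbols, and the array of known map names replaces the real device-mapper query.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical actions; the real code uses its own action constants. */
enum map_action { ACTION_CREATE, ACTION_RELOAD };

/* Stand-in for a device-mapper lookup: 1 if `alias` is among `known`. */
int map_exists(const char *const *known, size_t n, const char *alias)
{
    assert(alias != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(known[i], alias) == 0)
            return 1;
    return 0;
}

/* Instead of failing when the map is already present, pick a reload. */
enum map_action choose_action(const char *const *known, size_t n,
                              const char *alias)
{
    return map_exists(known, n, alias) ? ACTION_RELOAD : ACTION_CREATE;
}
```

With a known set of { "lun01", "lun02" }, asking for "lun01" yields ACTION_RELOAD, while a new name such as "lun03" yields ACTION_CREATE.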

[Test Case]
The problem was encountered in a complex OpenStack test environment running a test tool that does the following:
- First, it boots a number of virtual machines.
- Then it creates a number of threads; in each thread it creates volumes, takes snapshots of the volumes, and attaches the volumes to the initially booted virtual machines. After a short while the volumes are detached, and the snapshots and volumes are deleted.

Running this tool overnight normally results in the multipathd SEGV.

[Regression]
This is a straight backport of the code being used in 0.5.0. No regression is to be expected.

It is important to note that the reproducer in the original description did not lead to such a problem.

[Original description of the problem]

We have a problem on multipath-tools.

Usually after a path removal and a re-scan, the multipathd process dies.

I created 2 hosts:

iscsi-server
iscsi-client

With 4 NICs in between them and with a simple multibus multipath. With that I was able to check that there is a regression in multipath-tools.

It looks like the patches brought from upstream:

0017-multipath-get-right-sysfs-value-for-checker_timeout.patch
0018-multipath-handle-offlined-paths.patch
#
# from here
#
0019-multipath-fix-scsi-timeout-code.patch
0020-multipath-make-tgt_node_name-work-for-iscsi-devices.patch
0021-multipath-cleanup-dev_loss_tmo-issues.patch
0022-Fix-for-setting-0-to-fast_io_fail.patch
0023-Fix-fast_io_fail-capping.patch
0024-multipath-enable-getting-uevents-through-libudev.patch
0025-Use-devpath-as-argument-for-sysfs-functions.patch
0026-multipathd-remove-references-to-sysfs_device.patch
0027-multipathd-use-struct-path-as-argument-for-event-pro.patch
0028-Add-global-udev-reference-pointer-to-config.patch
0029-Use-udev-enumeration-during-discovery.patch
0030-use-struct-udev_device-during-discovery.patch
0031-More-debugging-output-when-synchronizing-path-states.patch
0032-Use-struct-udev_device-instead-of-sysdev.patch
0033-discovery-Fixup-cciss-discovery.patch
0035-Use-udev-devices-during-discovery.patch
0036-Remove-all-references-to-hand-craftes-sysfs-code.patch
#
# to here
#
# 0037-multipath-libudev-cleanup-and-bugfixes.patch
# 0038-multipath-check-if-a-device-belongs-to-multipath.patch
# 0039-multipath-and-wwids_file-multipath.conf-option.patch
# 0040-multipath-Check-blacklists-as-soon-as-possible.patch
# 0041-add-wwids-file-cleanup-options.patch
# 0042-add-find_multipaths-option.patch
# 0043-alloc-keywords.patch
# lp1503305_libmultipath_info_on_1st_path_down_dbd131e.patch

The patches in the range 19-36 (between the markers above) caused a regression.

Whenever I generate the package (for trusty) including those patches I'm able to generate a core dump indicating a possible double-free or null-dereference related to a path removal (that is why I can reproduce with the test case). Unfortunately it usually explodes inside malloc() or somewhere in glibc.

Using valgrind I was able to verify some free() errors:

==30415== Invalid free() / delete / delete[] / realloc()
==30415== at 0x4C2BDEC: free (vg_replace_malloc.c:473)
==30415== by 0x54E243C: vector_del_slot (vector.c:95)
==30415== by 0x550A516: _remove_map (structs_vec.c:139)
==30415== by 0x550A5C3: _remove_maps (structs_vec.c:170)
==30415== by 0x550A64B: remove_maps (structs_vec.c:181)
==30415== by 0x40713F: configure (main.c:1153)
==30415== by 0x407A74: child (main.c:1419)
==30415== by 0x40837D: main (main.c:1618)

These errors match a multipathd core dump I got from another user, where the wrong free was also coming from _remove_map.

This crash is from Trusty using my reproducer. It includes the dump.

I have generated multipath-tools & kpartx packages from upstream code around those commits:

0019-multipath-fix-scsi-timeout-code.patch
0020-multipath-make-tgt_node_name-work-for-iscsi-devices.patch
0021-multipath-cleanup-dev_loss_tmo-issues.patch
0022-Fix-for-setting-0-to-fast_io_fail.patch
0023-Fix-fast_io_fail-capping.patch
0024-multipath-enable-getting-uevents-through-libudev.patch
0025-Use-devpath-as-argument-for-sysfs-functions.patch
0026-multipathd-remove-references-to-sysfs_device.patch
0027-multipathd-use-struct-path-as-argument-for-event-pro.patch
0028-Add-global-udev-reference-pointer-to-config.patch
0029-Use-udev-enumeration-during-discovery.patch
0030-use-struct-udev_device-during-discovery.patch
0031-More-debugging-output-when-synchronizing-path-states.patch
0032-Use-struct-udev_device-instead-of-sysdev.patch
0033-discovery-Fixup-cciss-discovery.patch
0035-Use-udev-devices-during-discovery.patch
0036-Remove-all-references-to-hand-craftes-sysfs-code.patch

And I could not reproduce the same problem with upstream code (only with our package).

For me, it looks like this backport was unsuccessful.

For Trusty, I have suggested the following SRU:

https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1532789

And for Precise I have suggested the following SRU (bringing all patches also to precise):

https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1520192

With this test case only Wily is good.

Summarising:

Precise suffers from bug LP: #1532789 (needs update).

Trusty suffers from bug LP: #1520192 & from this bug LP: #1535898

Vivid suffers from this bug LP: #1535898

Wily looks good for this reproducer (since it was merged with 0.5.0 without so many backports).

description: updated
Changed in multipath-tools (Ubuntu):
status: New → In Progress

The crash you're hitting is something very different than what valgrind is finding (though I agree technically valgrind is correct in pointing this out, and it looks as though some of it is addressed upstream).

Program terminated with signal SIGSEGV, Segmentation fault.
#0 malloc_consolidate (av=av@entry=0x7f9aa4000020) at malloc.c:4151
[Current thread is 1 (LWP 3163)]
(gdb) bt full
#0 malloc_consolidate (av=av@entry=0x7f9aa4000020) at malloc.c:4151
        fb = <optimized out>
        maxfb = 0x7f9aa4000070
        p = 0x7f9aa4000078
        nextp = 0x7f9aa40009a0
        unsorted_bin = 0x7f9aa4000078
        first_unsorted = <optimized out>
        nextchunk = 0xff3548000a18
        size = 140302153157024
        nextsize = <optimized out>
        prevsize = <optimized out>
        nextinuse = <optimized out>
        bck = <optimized out>
        fwd = <optimized out>
        __func__ = "malloc_consolidate"
#1 0x00007f9acb099df8 in _int_malloc (av=0x7f9aa4000020, bytes=16384) at malloc.c:3423
        nb = 16400
        idx = 114
        bin = <optimized out>
        victim = <optimized out>
        size = <optimized out>
        victim_index = <optimized out>
        remainder = <optimized out>
        remainder_size = <optimized out>
        block = <optimized out>
        bit = <optimized out>
        map = <optimized out>
        fwd = <optimized out>
        bck = <optimized out>
        errstr = 0x0
        __func__ = "_int_malloc"
#2 0x00007f9acb09c7b0 in __GI___libc_malloc (bytes=16384) at malloc.c:2891
        ar_ptr = 0x7f9aa4000020
        victim = 0x511
        __func__ = "__libc_malloc"
#3 0x00007f9acbaa94d7 in dm_task_run () from /tmp/apport_sandbox_S4eo5o/lib/x86_64-linux-gnu/libdevmapper.so.1.02.1
No symbol table info available.
#4 0x00007f9acb3eed9a in dm_map_present (str=0x7f9aa4000ef0 "lun01") at devmapper.c:304
        r = 0
        dmt = 0x7f9aa40008e0
        info = {exists = -871807232, suspended = 32666, live_table = -888551504, inactive_table = 32666, open_count = -871809664, event_nr = 32666, major = 3423160064, minor = 32666,
          read_only = -871809840, target_count = 32666}
#5 0x0000000000404a77 in ev_add_map (dev=0x7f9ac40020fb "dm-3", alias=0x7f9aa4000ef0 "lun01", vecs=0xb6b6b0) at main.c:256
        refwwid = 0x600000000 <error: Cannot access memory at address 0x600000000>
        mpp = 0x7f9aa4000ef0
        map_present = 32666
        r = 1
#6 0x0000000000404a3c in uev_add_map (uev=0x7f9ac4002020, vecs=0xb6b6b0) at main.c:243
        alias = 0x7f9aa4000ef0 "lun01"
        major = -1
        minor = -1
        rc = 32666
#7 0x00000000004061ed in uev_trigger (uev=0x7f9ac4002020, trigger_data=0xb6b6b0) at main.c:755
        r = 0
        vecs = 0xb6b6b0
#8 0x00007f9acb40d29d in service_uevq (tmpq=0x7f9acc093de0) at uevent.c:118
        uev = 0x7f9ac4002020
        tmp = 0x7f9acc093de0
#9 0x00007f9acb40d4ac in uevent_dispatch (uev_trigger=0x406130 <uev_trigger>, trigger_data=0xb6b6b0) at uevent.c:167
        uevq_tmp = {next = 0x7f9acc093de0, prev = 0x7f9acc093de0}
#10 0x0000000000406436 in uevqloop (ap=0xb6b6b0) at main.c:814
No locals.
#11 0x00007f9acbcc3182 in start_thread (arg=0x7f9acc09470...

Changed in multipath-tools (Ubuntu):
status: In Progress → Incomplete
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
Changed in multipath-tools (Ubuntu):
importance: Undecided → High

And now that I did some more testing with Louis on this, we were able to "run into" a crash where an attempt to free mpp->alias fails, which isn't quite the same backtrace as I had pasted earlier. It does look like it might be similar to the issue reported by valgrind (this depends largely on the presence of the debug symbols).

Pending further testing, but I've prepared the attached debdiff, which should address the state of mpp->alias.

Louis Bouchard (louis) wrote :

To further add to Mathieu's comment, here is the backtrace of one of the recent cores that we got:

(gdb) bt
#0 0x00007f17a122a0d5 in __GI_raise (sig=<optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f17a122d83b in __GI_abort () at abort.c:91
#2 0x00007f17a126732e in __libc_message (do_abort=2, fmt=0x7f17a13715d8 "*** glibc detected *** %s: %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:201
#3 0x00007f17a1271b26 in malloc_printerr (action=3, str=0x7f17a13717c8 "double free or corruption (fasttop)",
    ptr=<optimized out>) at malloc.c:5051
#4 0x00007f17a15c8f27 in free_multipath (mpp=0x7f177c009160, free_paths=0) at structs.c:174
#5 0x00007f17a15ec09a in _remove_map (mpp=0x7f177c009160, vecs=0xbaea70, stop_waiter=1, purge_vec=1) at structs_vec.c:143
#6 0x00007f17a15ec0f8 in remove_map_and_stop_waiter (mpp=0x7f177c009160, vecs=0xbaea70, purge_vec=1) at structs_vec.c:156
#7 0x00000000004075f5 in mpvec_garbage_collector (vecs=0xbaea70) at main.c:949
#8 0x00007f177c007060 in ?? ()
#9 0x0000000000baea70 in ?? ()
#10 0x00007f177c009160 in ?? ()
#11 0x0000000200000003 in ?? ()
#12 0x00007f17a2274e20 in ?? ()
#13 0x00000000004080f0 in checkerloop (ap=0x7f17a13717c8) at main.c:1162
#14 0x0000000000000000 in ?? ()
(gdb) f 4
#4 0x00007f17a15c8f27 in free_multipath (mpp=0x7f177c009160, free_paths=0) at structs.c:174
warning: Source file is more recent than executable.
174 FREE(mpp->dmi);
(gdb) l
169 FREE(mpp->alias);
170 mpp->alias = NULL;
171 }
172
173 if (mpp->dmi) {
174 FREE(mpp->dmi);
175 mpp->dmi = NULL;
176 }
177
178 /*
(gdb)
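
The listing above shows the free-then-reset idiom around mpp->alias and mpp->dmi. As a minimal sketch of why that idiom matters, assume a simplified stand-in structure: `struct mp_sketch` and `release_fields()` are hypothetical, and plain free() replaces the FREE macro. Once each pointer is set to NULL after being freed, a second pass over the same structure is harmless, whereas a stale pointer would be handed to free() twice.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the two fields seen in the gdb listing. */
struct mp_sketch {
    char *alias;
    void *dmi;
};

/* Free each field once and clear it, so a repeated call is a no-op. */
void release_fields(struct mp_sketch *mpp)
{
    assert(mpp != NULL);
    if (mpp->alias) {
        free(mpp->alias);
        mpp->alias = NULL;
    }
    if (mpp->dmi) {
        free(mpp->dmi);
        mpp->dmi = NULL;
    }
}
```

This only guards against releasing the same structure twice. It does not help when two different structures hold the same pointer, since clearing one copy leaves the other one dangling.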

Louis Bouchard (louis) on 2016-01-21
Changed in multipath-tools (Ubuntu Precise):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Louis Bouchard (louis-bouchard)
tags: added: patch
Louis Bouchard (louis) on 2016-06-09
Changed in multipath-tools (Ubuntu Trusty):
status: New → In Progress
assignee: nobody → Louis Bouchard (louis-bouchard)
importance: Undecided → High

Hello Rafael, or anyone else affected,

Accepted multipath-tools into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/multipath-tools/0.4.9-3ubuntu7.14 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in multipath-tools (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Louis Bouchard (louis) on 2016-07-22
tags: added: verification-done
removed: verification-needed
Dragan S. (dragan-s) on 2016-07-25
Changed in multipath-tools (Ubuntu Precise):
assignee: Louis Bouchard (louis-bouchard) → Dragan S. (dragan-s)
Changed in multipath-tools (Ubuntu Trusty):
assignee: Louis Bouchard (louis-bouchard) → Dragan S. (dragan-s)
Changed in multipath-tools (Ubuntu):
assignee: Mathieu Trudel-Lapierre (cyphermox) → Dragan S. (dragan-s)
Steve Langasek (vorlon) wrote :

Louis, what is the verification that was done here? The bug description includes neither a test case, nor a description of the regression potential of this upload.

Changed in multipath-tools (Ubuntu Precise):
assignee: Dragan S. (dragan-s) → Louis Bouchard (louis-bouchard)
Changed in multipath-tools (Ubuntu Trusty):
assignee: Dragan S. (dragan-s) → Louis Bouchard (louis-bouchard)
Changed in multipath-tools (Ubuntu):
assignee: Dragan S. (dragan-s) → Mathieu Trudel-Lapierre (cyphermox)
Martin Pitt (pitti) wrote :

Resetting to v-needed as this is currently unclear.

tags: added: verification-neededd
removed: verification-done
tags: added: verification-needed
removed: verification-neededd
Dragan S. (dragan-s) wrote :

Steve-

Louis is out, and in his absence I am still looking into this; that's why I assigned it to myself.

Louis Bouchard (louis) on 2016-09-09
description: updated
Steve Langasek (vorlon) on 2016-09-09
tags: added: verification-done
removed: verification-needed
Martin Pitt (pitti) on 2016-09-12
tags: added: verification-needed
removed: verification-done
Louis Bouchard (louis) wrote :

A new set of tests was run with the .14 version in trusty-proposed; no regression and no new core dumps were found. Marking this verification-done.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package multipath-tools - 0.4.9-3ubuntu7.14

---------------
multipath-tools (0.4.9-3ubuntu7.14) trusty; urgency=medium

  * d/p/0045-fix-mpp_alias-freeing.patch, d/p/0046-revert-act_reload.patch:
    Fix double-free situation that generate segfaults with multipathd
    (LP: #1535898)

 -- Louis Bouchard <email address hidden> Tue, 28 Jun 2016 11:53:32 +0200

Changed in multipath-tools (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for multipath-tools has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Robie Basak (racb) wrote :

Bug 1629644 claims that this SRU regressed multipath-tools.

Nish Aravamudan (nacc) wrote :

Hello, Precise is EOL and we are no longer providing bug-fixes to it. It would appear this particular issue is fixed in Trusty (the only current release in which it is present) -- in Bug 1629644, it was determined this version did not regress Trusty (a different upload did), and that bug has since expired due to inactivity, unfortunately. I am unsubscribing the server team and marking the precise task as "Won't Fix". Thank you for your contributions to Ubuntu!

Changed in multipath-tools (Ubuntu Precise):
status: In Progress → Won't Fix