PCI RoCe IB perftest Aborted (core dumped)

Bug #1553185 reported by bugproxy
This bug affects 1 person
Affects                  Status        Importance  Assigned to  Milestone
Ubuntu on IBM z Systems  Fix Released  High        Unassigned
perftest (Ubuntu)        Fix Released  High        Unassigned

Bug Description

SRU:
====

[Impact]

 * the perftest tools (ib_*) included in the perftest package cannot be used at all; they always core dump, on all platforms
 * a backport is required to get a working perftest package / tool set again
 * the fix was officially provided by Mellanox and repairs the previously broken version comparison (which partly used int and partly string compares)
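
To illustrate why mixing int and string compares on version values goes wrong, here is a minimal Python sketch. It is not the Mellanox patch; the helper names `parse_version` and `version_at_least` are hypothetical and only show the general idea of comparing versions numerically, field by field.

```python
def parse_version(text):
    """Split a dotted version string such as '3.4' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def version_at_least(found, required):
    """Compare two dotted version strings numerically, field by field."""
    return parse_version(found) >= parse_version(required)


# A plain string comparison works character by character,
# so "10.0" is ordered before "9.0" because '1' < '9'.
print("10.0" >= "9.0")                  # False
print(version_at_least("10.0", "9.0"))  # True
```

Comparing tuples of ints keeps the ordering consistent regardless of how many digits each field has, which a string compare cannot guarantee.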

[Test Case]

 * the bug can easily be reproduced on two systems both with RoCE cards installed
   and by starting a perftest run like this:
   on one machine as 'server': sudo ./ib_read_bw -d mlx4_0 -a
   and on a second machine as 'client', pointing to the server's IP address: sudo ./ib_read_bw <server IP> -d mlx4_0

Detailed instructions on how to reproduce the bug:

 * install the perftest package including all dependencies
 * configure the RoCE devices as network devices using a private network range in /etc/network/interfaces like this
 # The 1st RoCE interface configuration
 auto enP1p0s0
 iface enP1p0s0 inet static
         address 192.168.1.141
         netmask 255.255.255.0
         network 192.168.1.0
         broadcast 192.168.1.255
 * test if the network is okay, with ping (or rping, udaddy rdma_client/rdma_server)
 * and run the Test Case above
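
The network and broadcast values in the interface stanza above follow from the address and netmask. A small Python sketch to double-check them before bringing the interface up; the helper name `derive_network` is hypothetical, only the standard `ipaddress` module is used.

```python
import ipaddress


def derive_network(address, netmask):
    """Return (network, broadcast) as strings for an IPv4 address and netmask."""
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    return (str(iface.network.network_address),
            str(iface.network.broadcast_address))


# Values taken from the enP1p0s0 example configuration.
network, broadcast = derive_network("192.168.1.141", "255.255.255.0")
print(network)    # 192.168.1.0
print(broadcast)  # 192.168.1.255
```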

[Regression Potential]

 * the regression potential is low because the current tools in the perftest package just segfault and are unusable
 * the target for this patched perftest package is Zesty (and higher), because only Zesty has up-to-date versions of the packages it depends on

 * people may ask to SRU this to Xenial as well, but that would require updating several other packages ...

--------%<----------------%<----------------%<----------------%<--------

== Comment: #0 - Helmut Grauer - 2016-03-04 06:46:50 ==
Hi
Configure IB for perftest
Ethernet Interface
enp0s0 Link encap:Ethernet HWaddr 82:01:14:32:f0:90
          inet addr:10.100.80.2 Bcast:10.100.255.255 Mask:255.255.0.0
          inet6 addr: fe80::8001:14ff:fe32:f090/64 Scope:Link
          inet6 addr: fd00:10:100::ff:80:2/80 Scope:Global
          inet6 addr: fd00:10:100:0:8001:14ff:fe32:f090/64 Scope:Global
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:25938 errors:0 dropped:0 overruns:0 frame:0
          TX packets:253 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8228251 (8.2 MB) TX bytes:21494 (21.4 KB)

Installing related packages for dapltest

librdmacm-dev install
librdmacm1 install
librdmacm1-dbg install
dapl2-utils install
libibumad3 install
libibverbs-dev install
libibverbs1 install
libmlx4-1 install
libmlx4-1-dbg install
libmlx4-dev install
libmlx5-1 install
libmlx5-1-dbg install
libmlx5-dev install
perftest install

++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
+++ PCI-Overview: +++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++

======================================
DEVICE_List:
-------------
0000:00:00.0
0001:00:00.0

======================================
FunctionID_List:
-----------------
0x0000003e
0x0000003f

======================================
PCHID_List:
------------
0x0108
0x013c

======================================
Interface_List:
----------------
/sys/bus/pci/devices/0000:00:00.0/net/:
enp0s0
enp0s0d1

/sys/bus/pci/devices/0001:00:00.0/net/:
enP1p0s0
enP1p0s0d1

======================================
Infiniband_List:
----------------
/sys/bus/pci/devices/0000:00:00.0/infiniband/:
mlx4_0

/sys/bus/pci/devices/0001:00:00.0/infiniband/:
mlx4_1

--------------------------------------------------------------------------

server

root@s83lp02:~# dpkg -S /etc/dat.conf
libdapl2: /etc/dat.conf
root@s83lp02:~# ib_read_bw -d mlx4_0 -a

************************************
* Waiting for client to connect... *
************************************
*** stack smashing detected ***: ib_read_bw terminated
Aborted (core dumped)

-----------------------------------------------------------------------------
root@s83lp18:~# ./xpci.sh

++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++
+++ PCI-Overview: +++
++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++

======================================
DEVICE_List:
-------------
0000:00:00.0
0001:00:00.0

======================================
FunctionID_List:
-----------------
0x00000040
0x00000041

======================================
PCHID_List:
------------
0x0108
0x013c

======================================
Interface_List:
----------------
/sys/bus/pci/devices/0000:00:00.0/net/:
ens40
ens40d1

/sys/bus/pci/devices/0001:00:00.0/net/:
enP1s41
enP1s41d1

======================================
Infiniband_List:
----------------
/sys/bus/pci/devices/0000:00:00.0/infiniband/:
mlx4_0

/sys/bus/pci/devices/0001:00:00.0/infiniband/:
mlx4_1

Client

root@s83lp18:~# ib_read_bw 10.100.80.2 -d mlx4_1
Couldn't connect to 10.100.80.2:18515
Unable to open file descriptor for socket connection Unable to init the socket connection
root@s83lp18:~# ib_read_bw 10.100.80.2 -d mlx4_1
*** stack smashing detected ***: ib_read_bw terminated
Aborted (core dumped)

I will add SOSReport and dbginfo.sh

Revision history for this message
bugproxy (bugproxy) wrote : Dbginfo File

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-138382 severity-high targetmilestone-inin1604
Revision history for this message
bugproxy (bugproxy) wrote : SOSReport

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : Crash Report perftest

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote : Re: PCI RoCe IB pertest Aborted (core dumped)

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1553185/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
dann frazier (dannf)
affects: ubuntu → perftest (Ubuntu)
summary: - PCI RoCe IB pertest Aborted (core dumped)
+ PCI RoCe IB perftest Aborted (core dumped)
Changed in perftest (Ubuntu):
assignee: Skipper Bug Screeners (skipper-screen-team) → Canonical Server Team (canonical-server)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

FYI - blocked on RoCE setup in RT 89459

Changed in perftest (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in perftest (Ubuntu):
milestone: none → ubuntu-16.04
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-03-09 05:37 EDT-------
Hi, the installed package where the problem resides is:

perftest 3.0+0.16.gb2 s390x Infiniband verbs performance test

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Lacking a system of our own, I visited IBM to take a look.
The setup looks sound, and it appears to

So far it seems to be what was reported, a stack smashing, so expect some memory overwrite.

Unfortunately further debug was blocked by lacking some tools and more time.
We wanted to check with memcheck and exp-sgcheck for potential sources with Valgrind.
But Valgrind seems to still have issues on Ubuntu for s390x.
IBM will report a bug for that soon after looking into it.

Other than that - as usual - it is hard to debug overwrites.
We might consider adding a ppa that has -fno-stack-protector set to have it break where it has the malicious access instead of getting a SIGABRT which is a bit more indirect.
But then we might just build that for us once our RoCE setup is ready.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

And thanks Helmut already for your support on debugging this.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

From our debug session I wrote a summary for reproduction

# On Both Systems
apt-get install perftest libibverb* libmlx* dapl* libmth*
* configure RoCE network cards to have IPs
* restart
ulimit -l 32000
modprobe mlx4_ib
modprobe ib_mthca
modprobe ib_mad
modprobe ib_core
./xpci.sh

# prove that something works over these cards
#SERVER
dapltest -T S -D ofa-v2-scm-roe-mlx4_0-1

#CLIENT
dapltest -T P -D ofa-v2-scm-roe-mlx4_0-1 -s 10.100.80.2 -i 100 RW 4096 2

# start one of the broken ib_* perftest
#SERVER
ib_read_bw -d mlx4_0

# CLIENT
ib_read_bw 10.100.80.2 -d mlx4_1

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :
Revision history for this message
bugproxy (bugproxy) wrote : SOSReport

Default Comment by Bridge

Revision history for this message
Dimitri John Ledkov (xnox) wrote :

@hws

Please postpone to the 16.04.1 milestone target.

Regards,

Dimitri.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Yeah, still waiting on a working RoCE setup for this, so that you or we can debug locally :-/
I didn't find 16.04.1 in milestones so I picked Xenial updates.

Changed in perftest (Ubuntu):
assignee: Canonical Server Team (canonical-server) → nobody
milestone: ubuntu-16.04 → xenial-updates
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-04-21 03:20 EDT-------
Postponed to 16.04.01.

tags: added: targetmilestone-inin16041
removed: targetmilestone-inin1604
dann frazier (dannf)
Changed in ubuntu-z-systems:
status: New → Triaged
Changed in perftest (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in perftest (Ubuntu Yakkety):
milestone: xenial-updates → ubuntu-16.10
Mathew Hodson (mhodson)
Changed in perftest (Ubuntu Xenial):
milestone: none → ubuntu-16.04.1
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
importance: Undecided → High
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The setup should now be working; could someone please give this a proper try and triage now?

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-11-09 07:14 EDT-------
Canonical, can someone comment on the status of this LP? Thx

Revision history for this message
Frank Heimes (fheimes) wrote :

RDMA works fine with RoCE.
IB is still a todo.
OFED driver stable version (3.18.2) info says:
http://www.openfabrics.org/downloads/OFED/release_notes/OFED_3.18-2_release_notes
1.2 Supported Platforms and Operating Systems
---------------------------------------------
  o CPU architectures:
 - x86_64
 - x86
 - ppc64
  o Linux Operating Systems:
 - RedHat EL6.5 2.6.32-431.el6
 - RedHat EL6.6 2.6.32-504.el6
 - RedHat EL6.7 2.6.32-573.el6
 - RedHat EL7.0 3.10.0-123.el7
 - RedHat EL7.1 3.10.0-229.el7
 - RedHat EL7.2 3.10.0-327.el7
 - SLES11 SP3 3.0.76-0.9.1
 - SLES11 SP4 3.0.101-63
 - SLES12 3.12.28-4
 - SLES12.1 3.12.49-11.1
 - kernel.org 3.18 *
"kernel.org 3.18

So even if I read it in a way that OFED driver got incl. into upstream kernel with 3.18, it looks like s390x support is missing?!
Needs clarification.

tags: added: roce
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Debugging Session

Debug symbols:
echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 5FDFF622

What do we need:
get https://wiki.ubuntu.com/DebuggingProgramCrash?action=AttachFile&do=view&target=list-symbols-packages-v2.sh

$ ldd /usr/bin/ib_read_bw
        libm.so.6 => /lib/s390x-linux-gnu/libm.so.6 (0x000003ff84380000)
        librdmacm.so.1 => /usr/lib/s390x-linux-gnu/librdmacm.so.1 (0x000003ff84300000)
        libibverbs.so.1 => /usr/lib/s390x-linux-gnu/libibverbs.so.1 (0x000003ff84280000)
        libc.so.6 => /lib/s390x-linux-gnu/libc.so.6 (0x000003ff84080000)
        libpthread.so.0 => /lib/s390x-linux-gnu/libpthread.so.0 (0x000003ff84000000)
        /lib/ld64.so.1 (0x000002aa10500000)
        libnl-route-3.so.200 => /usr/lib/s390x-linux-gnu/libnl-route-3.so.200 (0x000003ff83f80000)
        libnl-3.so.200 => /lib/s390x-linux-gnu/libnl-3.so.200 (0x000003ff83f00000)
        libdl.so.2 => /lib/s390x-linux-gnu/libdl.so.2 (0x000003ff83e80000)

Check with the script which debug package is needed:
./list-symbols-packages-v2.sh /usr/bin/ib_read_bw
=> perftest-dbgsym - debug symbols for package perftest
Do the same for all other linked libraries:
$ for lib in $(ldd /usr/bin/ib_read_bw | awk '{print $3}'); do ./list-symbols-packages-v2.sh "$lib"; done
=> librdmacm1-dbgsym
=> libibverbs1-dbgsym

$ sudo apt-get install perftest-dbgsym librdmacm1-dbgsym libibverbs1-dbgsym

Server (e.g. s1lp14):
ib_read_bw -d mlx4_0 -a

Client (e.g. s1lp15):
ib_read_bw 10.245.236.14 -d mlx4_0 (ib_read_bw -d mlx4_0 192.168.1.141)

(10.245.236.14 is address of OSA device)
(192.168.1.141 is address of RoCE device)

1. run in gdb (expect valgrind later)

$ gdb ib_read_bw
# check that symbols are loaded
(gdb) run 10.245.236.14 -d mlx4_0

#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:58
#1 0x000003fffdb3db66 in __GI_abort () at abort.c:89
#2 0x000003fffdb7e3fe in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x3fffdc595ee "*** %s ***: %s terminated\n")
    at ../sysdeps/posix/libc_fatal.c:175
#3 0x000003fffdc069d4 in __GI___fortify_fail (msg=msg@entry=0x3fffdc595d0 "stack smashing detected") at fortify_fail.c:37
#4 0x000003fffdc06978 in __stack_chk_fail () at stack_chk_fail.c:28
#5 0x000002aa000088ea in check_mtu (context=<optimized out>, user_param=0x3ffffffef70, user_comm=<optimized out>)
    at src/perftest_communication.c:1759
#6 0x000002aa000044ba in main (argc=<optimized out>, argv=0x3fffffff4c8) at src/read_bw.c:119

__stack_chk_fail seems to hit when returning from check_mtu, so it looks like the stack frame was clobbered inside that function.
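
As a reminder of why the abort only shows up at function return, here is a toy Python model of a stack protector. It is only a sketch: the frame layout (buffer directly followed by a guard value), the guard bytes and the function name `call_with_canary` are all assumptions for illustration, and a real s390x frame looks different.

```python
CANARY = b"\xde\xad\xbe\xef"  # stand-in for the random guard value


def call_with_canary(payload, buf_size=2):
    """Model a frame laid out as [buffer][canary] with an unchecked copy.

    Returns True if the canary is intact when the 'function' returns.
    """
    frame = bytearray(buf_size) + bytearray(CANARY)
    # Unchecked write into the buffer, like strcpy or sprintf would do.
    frame[0:len(payload)] = payload
    # The check happens only here, at 'return', not at the bad write.
    return bytes(frame[buf_size:buf_size + len(CANARY)]) == CANARY


print(call_with_canary(b"5\x00"))     # True: fits into the 2-byte buffer
print(call_with_canary(b"1024\x00"))  # False: the write ran into the canary
```

The overwrite itself goes unnoticed; only the guard check at the end fails, which matches a backtrace that points at the end of the function rather than at the offending write.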

Step 1 - read that function extra carefully, looking for overwrites

Stack vars:
1705     int curr_mtu=0, rem_mtu=0;
1706     char cur[2];
1707     char rem[2];
1708     int size_of_cur;

- curr_mtu is int, writes from enum ibv_mtu should be safe
- rem_mtu assignment casts to int

Warn:
spri...
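
The rest of this comment is truncated, so the following is only arithmetic about the 2-byte buffers listed above, not a statement about which call actually overflowed. Assuming a value is formatted into such a buffer as a decimal C string, this Python sketch (hypothetical helper names) shows how many bytes that needs, including the trailing NUL.

```python
def c_string_bytes(value):
    """Bytes a decimal C string of `value` occupies, including the NUL."""
    return len(str(value)) + 1


def fits(value, buf_size):
    """True if the decimal string of `value` fits into `buf_size` bytes."""
    return c_string_bytes(value) <= buf_size


# A char[2] buffer holds exactly one digit plus the terminating NUL.
for value in (5, 10, 4096):
    print(value, c_string_bytes(value), fits(value, 2))
```

So a char[2] is only safe as long as the formatted value is a single digit; anything larger writes past the end of the buffer.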


Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Triaged → Confirmed
Changed in perftest (Ubuntu):
status: Triaged → Confirmed
Changed in perftest (Ubuntu Xenial):
status: Triaged → Confirmed
Changed in perftest (Ubuntu Yakkety):
status: Triaged → Confirmed
Revision history for this message
Frank Heimes (fheimes) wrote :

A patch has been pushed by Mellanox to their GitHub repository.
It can be found here:
https://github.com/linux-rdma/perftest/commit/4dc033eb5ba51b4609ce20f3161a3471ce01f2e8

Changed in perftest (Ubuntu):
milestone: ubuntu-16.10 → ubuntu-17.06
Frank Heimes (fheimes)
description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package perftest - 3.4+0.6.gc3435c2-1ubuntu1

---------------
perftest (3.4+0.6.gc3435c2-1ubuntu1) artful; urgency=medium

  * Cherrypick upstream patch to fix version string comparisons. Resolves
    a crash LP: #1553185.

 -- Dimitri John Ledkov <email address hidden> Mon, 19 Jun 2017 09:55:14 +0100

Changed in perftest (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Frank Heimes (fheimes) wrote :

I tested the new and patched perftest package (3.4+0.6.gc3435c2-1ubuntu1) on Zesty - because the additional required packages for RoCE and perftest are the same on Zesty and Artful (today).

Everything worked fine - I successfully tested ib_atomic_bw, ib_read_bw, ib_send_bw (in different modes) and ib_send_lat.

See attached document for more details ...

Changed in perftest (Ubuntu Zesty):
status: New → In Progress
importance: Undecided → High
no longer affects: perftest (Ubuntu Yakkety)
no longer affects: perftest (Ubuntu Xenial)
Changed in perftest (Ubuntu Zesty):
assignee: nobody → Dimitri John Ledkov (xnox)
milestone: none → zesty-updates
status: In Progress → Triaged
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: Confirmed → Triaged
no longer affects: perftest (Ubuntu Zesty)
Changed in ubuntu-z-systems:
status: Triaged → Fix Released
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-16 08:03 EDT-------
IBM Bugzilla status -> closed; Fix Released by Canonical

Frank Heimes (fheimes)
tags: added: universe