hang in mlx5_create_map_eq in Ubuntu 15.04 due to not getting interrupts (mlx5) (Mellanox)

Bug #1419938 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Leann Ogasawara
Milestone: (none)

Bug Description

---Problem Description---
While installing Ubuntu 15.04 LE on a PowerNV system, I see the following error:

[ 242.141309] INFO: task systemd-udevd:623 blocked for more than 120 seconds.
[ 242.141408] Tainted: G E 3.18.0-12-generic #13-Ubuntu
[ 242.141463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.141529] systemd-udevd D 00003fff8ad27d24 0 623 603 0x00040000
[ 242.141597] Call Trace:
[ 242.141625] [c000002fe2dbad50] [0000000500000000] 0x500000000 (unreliable)

Mellanox Error from trace:

[ 2.609984] /build/buildd/linux-3.18.0/drivers/rtc/hctosys.c: unable to open rtc device (rtc0)
[ 2.611142] Freeing unused kernel memory: 5760K (c000000000d90000 - c000000001330000)
starting version 218
[ 2.664785] scsi_transport_fc: module verification failed: signature and/or required key missing - tainting kernel
[ 2.668055] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 2.668124] mlx4_core: Initializing 0000:01:00.0

---uname output---
Linux powerio-le21 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:55:08 UTC 2014 ppc64le ppc64le ppc64le GNU/Linux

---Additional Hardware Info---
Mellanox device which seems to be causing the error:
0000:01:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)

Machine Type = 8286-42A PowerNV

---Debugger---
---Steps to Reproduce---
1. Start install of Ubuntu 15.04 LE
2. The installer does not start at all; the error shown above appears instead.

Install ISO Information: Ubuntu 15.04 - vivid-server-ppc64el.iso

Install method: DVD

Install disk info: # ethtool -i eth17
driver: mlx4_en
version: 2.2-1 (Feb 2014)
firmware-version: 2.9.1326
bus-info: 0000:01:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

The issue is that we are not getting interrupts.
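
One way to confirm a "no interrupts" symptom like this is to look at the per-CPU counters in /proc/interrupts for the driver's vectors. The sketch below is illustrative only: the helper name `irq_counts` and the sample text (including the IRQ numbers and vector names) are made up for the example and are not taken from the affected machine.

```python
def irq_counts(interrupts_text, needle):
    """Sum the per-CPU counts of every interrupt line that mentions `needle`.

    `interrupts_text` is expected to look like /proc/interrupts: a header
    row naming the CPUs, then one row per IRQ with one count per CPU.
    """
    lines = interrupts_text.splitlines()
    if not lines:
        return {}
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = {}
    for line in lines[1:]:
        if needle not in line or ":" not in line:
            continue
        irq, rest = line.split(":", 1)
        fields = rest.split()
        # The first `ncpus` fields are the counters; the rest describe the IRQ.
        counts = [int(f) for f in fields[:ncpus] if f.isdigit()]
        totals[irq.strip()] = sum(counts)
    return totals


# Hypothetical sample, shaped like /proc/interrupts on a two-CPU system.
SAMPLE = """\
           CPU0       CPU1
 16:          0          0   XICS 4097 Edge      mlx4-async@pci:0000:01:00.0
 17:       1523        988   XICS 4098 Edge      eth0-rx-0
"""

print(irq_counts(SAMPLE, "mlx4"))
```

On a live system the same function could be fed the contents of /proc/interrupts. A vector whose total stays at zero while the driver is waiting on a command completion would be consistent with the hang described here.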

I forced an install of Mellanox OFED in my virtual guest running Ubuntu 15.04 and I do not see the issue with that code, so tonight I will look for differences between that code and upstream to see if I can spot the issue.

The problem is related to the following commit:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/patch/drivers/net/ethernet/mellanox/mlx5/core?id=c7a08ac7ee68b9af0d5af99c7b34b574cac4d144

They forgot to set the UAR page size in the adapter, which is why it is not working. Any kernel for Power that picks up that patch in mlx5 will see this issue.
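
To make the page-size mismatch concrete, here is a minimal sketch of how a log-encoded UAR page size could be derived from the system page size. It assumes, going by the field name `log_uar_page_sz` in the fix title quoted in this report, that the adapter expects log2 of the page size relative to 4 KiB, so that a field left at zero means 4 KiB pages. The helper name and the 4 KiB base are assumptions for illustration, not code from the driver.

```python
import os

ADAPTER_BASE_SHIFT = 12  # assumption: a field value of 0 means 4 KiB (2**12) pages


def log_uar_page_sz(page_size):
    """Return log2(page_size) relative to 4 KiB (hypothetical helper)."""
    if page_size <= 0 or page_size & (page_size - 1):
        raise ValueError("page size must be a positive power of two")
    shift = page_size.bit_length() - 1  # exact log2 for a power of two
    if shift < ADAPTER_BASE_SHIFT:
        raise ValueError("page size smaller than 4 KiB")
    return shift - ADAPTER_BASE_SHIFT


if __name__ == "__main__":
    # Page size of the running kernel, as reported by the OS.
    host_page_size = os.sysconf("SC_PAGE_SIZE")
    print(host_page_size, log_uar_page_sz(host_page_size))
```

Under that assumption, a system with 4 KiB pages needs the value 0, so leaving the field unset happens to work there, while a system with 64 KiB pages would need the value 4. That would be consistent with the problem showing up only on Power.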

bugproxy (bugproxy) wrote : kernal_trace

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-121428 severity-critical targetmilestone-inin1504
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1419938/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-02-11 15:08 EDT-------
Patch available at https://patchwork.ozlabs.org/patch/438793/
(net/mlx5_core: Fix configuration of log_uar_page_sz)

Chris J Arges (arges)
Changed in linux (Ubuntu):
importance: Undecided → Critical
importance: Critical → Medium
status: New → Confirmed
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-02-16 18:56 EDT-------
The patch referenced in comment #14 resolves this issue. Can we get this patch pulled into 15.04?

Chris J Arges (arges) wrote :

Once this patch gets merged into Linus' tree, we'll apply it in Vivid.

Breno Leitão (breno-leitao) wrote :

Chris, the patch was applied by DaveM already, and merged by Linus in merge f5af19d10d151c5a2afae3306578f485c244db25. The commit id in Linus' tree is de61390cb3e03186f85997fe08a11dcb9f7a01a3.

Changed in linux (Ubuntu):
assignee: nobody → Leann Ogasawara (leannogasawara)
status: Confirmed → In Progress
Changed in linux (Ubuntu):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-8.8

---------------
linux (3.19.0-8.8) vivid; urgency=low

  [ Andy Whitcroft ]

  * ubuntu: vbox -- elide the new symlinks and reconstruct on clean:
    - LP: #1426113
  * rebase to stable v3.19.1

  [ John Johansen ]

  * SAUCE: (no-up): apparmor: fix mediation of fs unix sockets
    - LP: #1408833

  [ Leann Ogasawara ]

  * Release Tracking Bug
    - LP: #1429940

  [ Upstream Kernel Changes ]

  * xen: correct bug in p2m list initialization
  * net/mlx5_core: Fix configuration of log_uar_page_sz
    - LP: #1419938
  * tpm/ibmvtpm: Additional LE support for tpm_ibmvtpm_send
    - LP: #1420575
  * net/mlx4_core: Maintain a persistent memory for mlx4 device
    - LP: #1422481
  * net/mlx4_core: Set device configuration data to be persistent across
    reset
    - LP: #1422481
  * net/mlx4_core: Refactor the catas flow to work per device
    - LP: #1422481
  * net/mlx4_core: Enhance the catas flow to support device reset
    - LP: #1422481
  * net/mlx4_core: Activate reset flow upon fatal command cases
    - LP: #1422481
  * net/mlx4_core: Manage interface state for Reset flow cases
    - LP: #1422481
  * net/mlx4_core: Handle AER flow properly
    - LP: #1422481
  * net/mlx4_core: Enable device recovery flow with SRIOV
    - LP: #1422481
  * net/mlx4_core: Reset flow activation upon SRIOV fatal command cases
    - LP: #1422481
  * tg3: Hold tp->lock before calling tg3_halt() from tg3_init_one()
    - LP: #1428111
  * rebase to v3.19.1
    - LP: #1410704
    - LP: #1411193
    - LP: #1400215
 -- Leann Ogasawara <email address hidden> Mon, 09 Mar 2015 10:08:29 -0700
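
For anyone checking whether an installed Vivid kernel already carries this fix, a rough sketch is to compare the running kernel's release string against the first fixed upload, 3.19.0-8.8. The helper names below are hypothetical, and the comparison only looks at the version and ABI number in an Ubuntu-style `uname -r` string; it ignores the upload number and does not account for backports to other series.

```python
import re

FIXED_ABI = (3, 19, 0, 8)  # from the changelog above: linux 3.19.0-8.8


def kernel_abi(release):
    """Parse an Ubuntu-style `uname -r` string such as '3.19.0-8-generic'."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(g) for g in m.groups())


def has_fix(release):
    """True if the release is at or after the first fixed Vivid kernel."""
    return kernel_abi(release) >= FIXED_ABI


# Release strings that appear elsewhere in this report.
for rel in ("3.16.0-23-generic", "3.18.0-12-generic", "3.19.0-8-generic"):
    print(rel, has_fix(rel))
```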

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-03-11 19:14 EDT-------
This looks fixed with 3.19.0-8-generic #8-Ubuntu
it was able to recover from EEH.

[ 2694.622586] EEH: Notify device drivers to shutdown
[ 2694.622587] mlx4_core 0004:01:00.0: device was reset successfully
[ 2694.622589] mlx4_core 0004:01:00.0: mlx4_pci_err_detected was called
[ 2694.622594] mlx4_en 0004:01:00.0: Internal error detected, restarting device
[ 2694.622786] mlx4_en: eth14: Close port called
[ 2694.846830] mlx4_en 0004:01:00.0: removed PHC
[ 2694.874036] EEH: Collect temporary log
[ 2694.879101] EEH: of node=/pciex@3fffe42000000/pci@0/ethernet@0
[ 2694.879465] EEH: PCI device/vendor: 100715b3
[ 2694.879478] EEH: PCI cmd/status register: 00100142
[ 2694.879479] EEH: PCI-E capabilities and status follow:
[ 2694.879544] EEH: PCI-E 00: 00020010 10008e02 0020204e 0843f483
[ 2694.879597] EEH: PCI-E 10: 10830040 00000000 00000000 00000000
[ 2694.879598] EEH: PCI-E 20: 00000000
[ 2694.879599] EEH: PCI-E AER capability register set follows:
[ 2694.879666] EEH: PCI-E AER 00: 18c20001 00000000 00000000 00062010
[ 2694.879719] EEH: PCI-E AER 10: 00000000 00002000 000001e0 00000000
[ 2694.879772] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 2694.879785] EEH: PCI-E AER 30: 00000000 00000000
[ 2694.879787] PHB3 PHB#4 Diag-data (Version: 1)
[ 2694.879789] brdgCtl: 00000002
[ 2694.879790] UtlSts: 00200000 00000000 00000000
[ 2694.879791] RootSts: 00000040 00400000 f0830048 00100147 00000000
[ 2694.879792] PhbSts: 0000001c00000000 0000001c00000000
[ 2694.879793] Lem: 0000000000100000 42498e327f502eae 0000000000000000
[ 2694.879795] InAErr: 8000000000000000 8000000000000000 0402008000000000 0000000000000000
[ 2694.879796] PE[ 1] A/B: 8480002b00000000 8000000000000000
[ 2694.879797] PE[ 2] A/B: 8000000000000000 8000000000000000
[ 2694.879798] PE[ 3] A/B: 8000000000000000 8000000000000000
[ 2694.879799] PE[ 4] A/B: 8000000000000000 8000000000000000
[ 2694.879800] PE[ 5] A/B: 8000000000000000 8000000000000000
[ 2694.879801] EEH: Reset without hotplug activity
[ 2698.898176] EEH: Notify device drivers the completion of reset
[ 2698.898181] mlx4_core 0004:01:00.0: mlx4_pci_slot_reset was called
[ 2698.898218] mlx4_core 0004:01:00.0: enabling device (0140 -> 0142)
[ 2705.396286] mlx4_core 0004:01:00.0: PCIe link speed is 8.0GT/s, device supports 8.0GT/s
[ 2705.396288] mlx4_core 0004:01:00.0: PCIe link width is x8, device supports x8
[ 2706.143789] mlx4_en 0004:01:00.0: registered PHC clock
[ 2706.143864] mlx4_en 0004:01:00.0: Activating port:1
[ 2706.159496] mlx4_en: eth11: Using 256 TX rings
[ 2706.159504] mlx4_en: eth11: Using 8 RX rings
[ 2706.159506] mlx4_en: eth11: frag:0 - size:1518 prefix:0 stride:1536
[ 2706.159722] mlx4_en: eth11: Initializing port
[ 2706.160022] mlx4_en 0004:01:00.0: Activating port:2
[ 2706.165214] mlx4_core 0004:01:00.0 eth14: renamed from eth11
[ 2706.188419] mlx4_en: eth11: Using 256 TX rings
[ 2706.188427] mlx4_en: eth11: Using 8 RX rings
[ 2706.188430] mlx4_en: eth11: frag:0 - size:1518 prefix:0 stride:1536
[ 2706.188660] mlx4_en: eth11: Initializing port
[ 2706.197316] EEH: Notify device driver to resume...


bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-03-12 18:35 EDT-------
OK, this is the correct update. The driver is getting interrupts and I can ping the ibX interface.
uname -a
Linux powerio-le21 3.19.0-8-generic #8-Ubuntu SMP Tue Mar 10 13:07:58 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
root@powerio-le21:/home/clsoto# ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-00-26-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:40.40.40.41 Bcast:40.40.40.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:2044 Metric:1
RX packets:5 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:256
RX bytes:364 (364.0 B) TX bytes:0 (0.0 B)

root@powerio-le21:/home/clsoto# ping -c 1 40.40.40.40
PING 40.40.40.40 (40.40.40.40) 56(84) bytes of data.
64 bytes from 40.40.40.40: icmp_seq=1 ttl=64 time=0.100 ms

--- 40.40.40.40 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.100/0.100/0.100/0.000 ms

Thanks so much.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-03-12 18:50 EDT-------
Closing this since it is working now.
