NVidia: Ubuntu: OS crashed into xmon Prompt; scsi_report_bus_reset

Bug #1483170 reported by bugproxy
This bug affects 1 person
Affects         Status        Importance  Assigned to  Milestone
linux (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

Problem Description:
====================
This system is running non-virtualized Ubuntu with one NVIDIA K80 GPU. During a hardbootme run the OS crashed and dropped into the xmon prompt. Here are the details from xmon:

0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c000003ffff8f3b0]
    pc: c00000000069ba80: scsi_report_bus_reset+0x60/0xb0
    lr: d00000001cae524c: ipr_erp_start+0x3bc/0x644 [ipr]
    sp: c000003ffff8f630
   msr: 9000000000009033
   dar: 100178
 dsisr: 40000000
  current = 0xc000000001359b10
  paca = 0xc00000000fb80000 softe: 0 irq_happened: 0x01
    pid = 0, comm = swapper/0
0:mon> r
R00 = d00000001cae524c R16 = 0000000000200000
R01 = c000003ffff8f630 R17 = 0000000000000000
R02 = c0000000013d8028 R18 = 00000000fffefa58
R03 = c000000fdcb00000 R19 = c000000000e4a000
R04 = 0000000000000000 R20 = c000000001412180
R05 = 0000000000000002 R21 = 0000000000000001
R06 = 0000000000000067 R22 = 0000000000000002
R07 = 0000000006290000 R23 = 00000000000001f0
R08 = 0000000000000001 R24 = c00000001010ea00
R09 = 00000000001000f0 R25 = c000000fdcb00730
R10 = 00000000000000ff R26 = 0000000000000001
R11 = d00000001cae6518 R27 = 0000000006290000
R12 = c00000000069ba20 R28 = c000000fdce40cf0
R13 = c00000000fb80000 R29 = c000000fa4c50300
R14 = c00000000135a120 R30 = 0000000000000000
R15 = 0000000000000000 R31 = c000000fdcb00000
pc = c00000000069ba80 scsi_report_bus_reset+0x60/0xb0
cfar= c000000000009368 slb_miss_realmode+0x50/0x78
lr = d00000001cae524c ipr_erp_start+0x3bc/0x644 [ipr]
msr = 9000000000009033 cr = 28044444
ctr = c00000000069ba20 xer = 0000000000000000 trap = 300
dar = 0000000000100178 dsisr = 40000000
0:mon> t
[c000003ffff8f660] d00000001cae524c ipr_erp_start+0x3bc/0x644 [ipr]
[c000003ffff8f6c0] d00000001caddb20 ipr_scsi_done+0x100/0x120 [ipr]
[c000003ffff8f700] d00000001cadc5bc ipr_isr_mhrrq+0x10c/0x250 [ipr]
[c000003ffff8f760] c00000000012ff90 handle_irq_event_percpu+0x90/0x2b0
[c000003ffff8f820] c000000000130218 handle_irq_event+0x68/0xd0
[c000003ffff8f850] c000000000135380 handle_fasteoi_irq+0xe0/0x250
[c000003ffff8f880] c00000000012f188 generic_handle_irq+0x58/0x90
[c000003ffff8f8b0] c0000000000119d0 __do_irq+0x80/0x190
[c000003ffff8f8e0] c000000000011bec do_IRQ+0x10c/0x120
[c000003ffff8f940] c000000000002794 hardware_interrupt_common+0x114/0x180
--- Exception: 501 (Hardware Interrupt) at c0000000006a45b4 scsi_io_completion+0x1e4/0x800
[c000003ffff8fd00] c00000000069662c scsi_finish_command+0x15c/0x1b0
[c000003ffff8fd80] c0000000006a41d8 scsi_softirq_done+0x198/0x200
[c000003ffff8fe00] c0000000004cbbd4 blk_done_softirq+0xb4/0xe0
[c000003ffff8fe40] c0000000000b5244 __do_softirq+0x174/0x3e0
[c000003ffff8ff30] c0000000000b5888 irq_exit+0xf8/0x140
[c000003ffff8ff60] c0000000000119dc __do_irq+0x8c/0x190
[c000003ffff8ff90] c000000000025320 call_do_irq+0x14/0x24
[c0000000013d7840] c000000000011b80 do_IRQ+0xa0/0x120
[c0000000013d78a0] c000000000002794 hardware_interrupt_common+0x114/0x180
--- Exception: 501 (Hardware Interrupt) at c0000000000110d4 arch_local_irq_restore+0x74/0x90
[c0000000013d7b90] c0000000000162f8 __switch_to+0x208/0x350 (unreliable)
[c0000000013d7bb0] c0000000000ef70c finish_task_switch+0x7c/0x1e0
[c0000000013d7bf0] c0000000009d6c40 __schedule+0x370/0x910
[c0000000013d7e10] c0000000009d7880 schedule_preempt_disabled+0x20/0x30
[c0000000013d7e30] c0000000001121e4 cpu_startup_entry+0x1c4/0x500
[c0000000013d7ee0] c00000000000ccd4 rest_init+0xa4/0xc0
[c0000000013d7f00] c000000000d53e4c start_kernel+0x520/0x53c
[c0000000013d7f90] c000000000009b6c start_here_common+0x20/0xa8
0:mon>
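
One possible reading of the register dump above, offered as a sketch rather than a confirmed diagnosis: R09 (0x1000f0) and dar (0x100178) look like small offsets from a list-poison value, which is what you would expect if the code walked onto a list entry that had already been deleted. The constants below are assumptions and are not stated anywhere in this report: LIST_POISON1 being 0x00100100 on this 3.19 ppc64le kernel, and the `siblings` list_head sitting 0x10 bytes into struct scsi_device. The helper names are hypothetical.

```c
#include <stdint.h>

/* Values copied from the xmon register dump in this report. */
#define XMON_R09 UINT64_C(0x1000f0)
#define XMON_DAR UINT64_C(0x100178)

/* ASSUMPTION: LIST_POISON1 as defined in include/linux/poison.h for a
 * 3.19 kernel with a zero POISON_POINTER_DELTA. Not taken from the report. */
#define ASSUMED_LIST_POISON1 UINT64_C(0x00100100)

/* ASSUMPTION: offset of the `siblings` list_head inside struct scsi_device
 * (two pointers precede it on a 64-bit build). Not taken from the report. */
#define ASSUMED_SIBLINGS_OFFSET UINT64_C(0x10)

/* Mimics container_of(): convert a list node address into the address of
 * the structure that embeds it. */
static uint64_t entry_from_node(uint64_t node, uint64_t member_offset)
{
    return node - member_offset;
}

/* Offset of the faulting access relative to a structure base address. */
static uint64_t fault_offset(uint64_t dar, uint64_t base)
{
    return dar - base;
}
```

Under those assumptions, entry_from_node(ASSUMED_LIST_POISON1, ASSUMED_SIBLINGS_OFFSET) equals the value in R09, and fault_offset(XMON_DAR, XMON_R09) is 0x88, i.e. a field read from a structure whose "address" was derived from a poisoned next pointer.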

== Comment: #1 - Brian J. King <email address hidden> - 2015-05-28 17:08:13 ==
Make sure we have the host lock held when calling scsi_report_bus_reset. Fixes a crash seen as the __devices list in the scsi host was changing as we were iterating through it.
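
The locking rule described in this comment can be illustrated with a small userspace analogy. This is a sketch only, not the ipr patch: it uses a pthread mutex in place of the SCSI host lock, and every type and function name below (struct host, struct device, report_bus_reset, remove_device) is hypothetical. The point it shows is that the walker and the code that unlinks devices take the same lock, so the list cannot change while it is being traversed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for a SCSI device and its host. */
struct device {
    int channel;
    int was_reset;
    struct device *next;
};

struct host {
    pthread_mutex_t lock;     /* plays the role of the host lock */
    struct device *devices;   /* plays the role of the host's device list */
};

/* Walks the device list. The caller must already hold h->lock. */
static int report_bus_reset_locked(struct host *h, int channel)
{
    int marked = 0;

    for (struct device *d = h->devices; d != NULL; d = d->next) {
        if (d->channel == channel) {
            d->was_reset = 1;
            marked++;
        }
    }
    return marked;
}

/* Takes the lock around the walk, which is the pattern the fix calls for. */
static int report_bus_reset(struct host *h, int channel)
{
    pthread_mutex_lock(&h->lock);
    int marked = report_bus_reset_locked(h, channel);
    pthread_mutex_unlock(&h->lock);
    return marked;
}

/* Unlinks a device under the same lock, so it cannot race with a walker. */
static void remove_device(struct host *h, struct device *victim)
{
    pthread_mutex_lock(&h->lock);
    for (struct device **pp = &h->devices; *pp != NULL; pp = &(*pp)->next) {
        if (*pp == victim) {
            *pp = victim->next;
            victim->next = NULL;
            break;
        }
    }
    pthread_mutex_unlock(&h->lock);
}
```

If report_bus_reset_locked() were called without the lock while another thread ran remove_device(), the walker could follow a pointer out of an entry that had just been unlinked, which is the kind of failure this comment attributes to the crash.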

== Comment: #8 - Wen Xiong <email address hidden> - 2015-08-06 11:09:25 ==
Release of bug changed to Ubuntu14.04.

He has tested the patch and reports "yes the patch worked". We upstreamed the patch last month. Here is the commit link:

https://git.kernel.org/cgit/linux/kernel/git/jejb/scsi.git/commit/drivers/scsi/ipr.c?h=misc&id=36b8e180e1e929e00b351c3b72aab3147fc14116

bugproxy (bugproxy) wrote : [PATCH] ipr: Fix locking for unit attention handling

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-125606 severity-critical targetmilestone-inin14042
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1483170/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy) wrote :

Default Comment by Bridge

tags: added: targetmilestone-inin14044
removed: targetmilestone-inin14042
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-11-17 16:12 EDT-------

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-26 11:27 EDT-------
The relevant patch is in the vivid kernel, which ships with 14.04.3. Moving to accepted, awaiting verification.

tags: added: targetmilestone-inin14043
removed: targetmilestone-inin14044
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-03-07 12:15 EDT-------
==== State: Verify by: panico on 07 March 2016 11:10:48 ====

Tested on the original gp6 system (PowerNV Ubuntu and NVIDIA K80) and verified. I ran a hardbootme on the system, shutting down and booting the system every four hours over a two-day period.

The system versions:
root@gp6p01:~# uname -a
Linux gp6p01 3.19.0-32-generic #37~14.04.1-Ubuntu SMP Thu Oct 22 10:11:54 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux
root@gp6p01:~# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.3 LTS
Release: 14.04
Codename: trusty

The system has this version of firmware:
Current Side Driver:.....fips811/b1105a_1540.811

Here is the hardbootme.log file from the script running on the lcb:

Bootme started at :
Fri Mar 4 16:38:39 CST 2016
==========================================
System gp6 powered off at :
Fri Mar 4 17:12:41 CST 2016
System gp6 powering on at :
Fri Mar 4 17:22:41 CST 2016
==========================================
==========================================
System gp6 powered off at :
Fri Mar 4 20:01:48 CST 2016
System gp6 powering on at :
Fri Mar 4 20:11:48 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 00:00:59 CST 2016
System gp6 powering on at :
Sat Mar 5 00:10:59 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 04:01:09 CST 2016
System gp6 powering on at :
Sat Mar 5 04:11:09 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 08:01:20 CST 2016
System gp6 powering on at :
Sat Mar 5 08:11:20 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 12:01:31 CST 2016
System gp6 powering on at :
Sat Mar 5 12:11:31 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 16:01:41 CST 2016
System gp6 powering on at :
Sat Mar 5 16:11:41 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sat Mar 5 20:00:52 CST 2016
System gp6 powering on at :
Sat Mar 5 20:10:52 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sun Mar 6 00:01:03 CST 2016
System gp6 powering on at :
Sun Mar 6 00:11:03 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sun Mar 6 04:01:13 CST 2016
System gp6 powering on at :
Sun Mar 6 04:11:13 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sun Mar 6 08:01:24 CST 2016
System gp6 powering on at :
Sun Mar 6 08:11:24 CST 2016
==========================================
==========================================
System gp6 powered off at :
Sun Mar 6 12:01:35 CST 2016
System gp6 powering on at :
Sun Mar 6 12:11:35 CST 2016...


Luciano Chavez (lnx1138)
Changed in linux (Ubuntu):
status: New → Fix Released