Server Crash while running IO and switch port bounce test with 2K login session

Bug #1971193 reported by Laurie Barry
This bug affects 1 person

Affects         Status      Importance  Assigned to  Milestone
linux (Ubuntu)  Incomplete  Undecided   Unassigned
Jammy           New         Undecided   Unassigned

Bug Description

[Impact]
A server crash and call trace were reported on one of the servers running the
IO and switch port bounce test in the 2K login session configuration.

Call Trace:
[56048.470488] Call Trace:
[56048.470489] _raw_spin_lock_irqsave+0x32/0x40
[56048.470489] lpfc_dmp_dbg.part.32+0x28/0x220 [lpfc]
[56048.470490] lpfc_cmpl_els_fdisc+0x145/0x460 [lpfc]
[56048.470490] lpfc_sli_cancel_jobs+0x92/0xd0 [lpfc]
[56048.470490] lpfc_els_flush_cmd+0x43c/0x670 [lpfc]
[56048.470491] lpfc_els_flush_all_cmd+0x37/0x60 [lpfc]
[56048.470491] lpfc_sli4_async_event_proc+0x956/0x1720 [lpfc]
[56048.470492] lpfc_do_work+0x1485/0x1d70 [lpfc]
[56048.470492] ? __schedule+0x280/0x700
[56048.470492] ? finish_wait+0x80/0x80
[56048.470493] ? lpfc_unregister_unused_fcf+0x80/0x80 [lpfc]
[56048.470493] kthread+0x112/0x130
[56048.470493] ? kthread_flush_work_fn+0x10/0x10
[56048.470494] ret_from_fork+0x1f/0x40
[56048.470494] Kernel panic - not syncing: Hard LOCKUP
[56048.470495] CPU: 0 PID: 682 Comm: lpfc_worker_0 Kdump: loaded Tainted: G IOE --------- - - 4.18.0-240.el8.x86_64 #1
[56048.470496] Hardware name: Dell Inc. PowerEdge R740/0DY2X0, BIOS 2.11.2 004/21/2021
[56048.470496] Call Trace:
[56048.470496] <NMI>
[56048.470496] dump_stack+0x5c/0x80
[56048.470497] panic+0xe7/0x2a9
[56048.470497] ? __switch_to_asm+0x51/0x70
[56048.470497] nmi_panic.cold.9+0xc/0xc
[56048.470498] watchdog_overflow_callback.cold.7+0x5c/0x70
[56048.470498] __perf_event_overflow+0x52/0xf0
[56048.470499] handle_pmi_common+0x1db/0x270
[56048.470499] ? __set_pte_vaddr+0x32/0x50
[56048.470499] ? __native_set_fixmap+0x24/0x30
[56048.470500] ? ghes_copy_tofrom_phys+0xd3/0x1c0
[56048.470500] ? __ghes_peek_estatus.isra.12+0x49/0xa0
[56048.470500] intel_pmu_handle_irq+0xbf/0x160
[56048.470501] perf_event_nmi_handler+0x2d/0x50
[56048.470501] nmi_handle+0x63/0x110
[56048.470501] default_do_nmi+0x4e/0x100
[56048.470502] do_nmi+0x128/0x190
[56048.470502] end_repeat_nmi+0x16/0x6a
[56048.470503] RIP: 0010:native_queued_spin_lock_slowpath+0x5d/0x1d0
[56048.470504] Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 47 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 b8 01 00 00 00 66 89 07 c3 8b 37 81 fe 00 01 00 00 75
[56048.470504] RSP: 0018:ffffacebc7877ca8 EFLAGS: 00000002
[56048.470505] RAX: 0000000000000101 RBX: 0000000000000246 RCX: 000000000000001f
[56048.470505] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff94dcf5341dc0
[56048.470506] RBP: ffff94dcf5340000 R08: 0000000000000002 R09: 0000000000029600
[56048.470506] R10: 000060d29656a45c R11: ffff94dcf534fd12 R12: ffff94dcf5341db0
[56048.470507] R13: ffff94dcf5341dc0 R14: ffff94dcc4ae8a00 R15: 0000000000000003
[56048.470507] ? native_queued_spin_lock_slowpath+0x5d/0x1d0
[56048.470507] ? native_queued_spin_lock_slowpath+0x5d/0x1d0
[56048.470508] </NMI>
[56048.470508] _raw_spin_lock_irqsave+0x32/0x40
[56048.470509] lpfc_dmp_dbg.part.32+0x28/0x220 [lpfc]
[56048.470509] lpfc_cmpl_els_fdisc+0x145/0x460 [lpfc]
[56048.470509] lpfc_sli_cancel_jobs+0x92/0xd0 [lpfc]
[56048.470510] lpfc_els_flush_cmd+0x43c/0x670 [lpfc]
[56048.470510] lpfc_els_flush_all_cmd+0x37/0x60 [lpfc]
[56048.470510] lpfc_sli4_async_event_proc+0x956/0x1720 [lpfc]
[56048.470511] lpfc_do_work+0x1485/0x1d70 [lpfc]
[56048.470511] ? __schedule+0x280/0x700
[56048.470511] ? finish_wait+0x80/0x80
[56048.470512] ? lpfc_unregister_unused_fcf+0x80/0x80 [lpfc]
[56048.470512] kthread+0x112/0x130
[56048.470513] ? kthread_flush_work_fn+0x10/0x10
[56048.470513] ret_from_fork+0x1f/0x40
[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]#

[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.3 (Ootpa)

[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]# cat /sys/module/lpfc/version
0:14.0.390.2

[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]# cat /sys/class/scsi_host/host*/modeldesc
Emulex LightPulse LPe32002-M2 2-Port 32Gb Fibre Channel Adapter
Emulex LightPulse LPe32002-M2 2-Port 32Gb Fibre Channel Adapter

[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]# cat /sys/class/scsi_host/host*/fwrev
14.0.390.1, sli-4:2:c
14.0.390.1, sli-4:2:c

[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]# cat /sys/class/fc_host/host*/port_name
0x10000090faf09459
0x10000090faf0945a
[root@ms-svr3-10-231-131-160 127.0.0.1-2021-11-20-05:14:30]#

HBA Attributes for 10:00:00:90:fa:f0:94:59

Host Name : ms-svr3-10-231-131-160
Manufacturer : Emulex Corporation
Serial Number : FC70793283
Model : LPe32002-M2
Model Desc : Emulex LightPulse LPe32002-M2 2-Port 32Gb Fibre Channel Adapter
Node WWN : 20 00 00 90 fa f0 94 59
Node Symname :
HW Version : 0000000c 00000001 00000000
FW Version : 14.0.390.1
Vendor Spec ID : 10DF
Number of Ports : 1
Driver Name : lpfc
Driver Version : 14.0.390.2; HBAAPI(I) v2.3.d, 07-12-10
Device ID : E300
HBA Type : LPe32002-M2
Operational FW : 14.0.390.1
IEEE Address : 00 90 fa f0 94 59
Boot Code : Enabled
Boot Version : 14.0.390.1
Board Temperature : Normal
Function Type : FC
Sub Device ID : E300
PCI Bus Number : 94
PCI Func Number : 0
Sub Vendor ID : 10DF
IPL Filename : H62LEX1
Service Processor FW Name : 14.0.390.1
ULP FW Name : 14.0.390.1
FC Universal BIOS Version : 14.0.390.1
FC x86 BIOS Version : 14.0.390.1
FC EFI BIOS Version : 14.0.388.0
FC FCODE Version : 14.0.386.0
Flash Firmware Version : 14.0.390.1
Secure Firmware : Enabled

[root@ms-svr3-10-231-131-160 log]# hbacmd portattrib 10:00:00:90:fa:f0:94:59

Port Attributes for 10:00:00:90:fa:f0:94:59

Node WWN : 20 00 00 90 fa f0 94 59
Port WWN : 10 00 00 90 fa f0 94 59
Port Symname :
Port FCID : 0000
Port Type : Unknown
Port State : Link Down
Port Service Type : 8
Port Supported FC4 : 00 00 01 00 00 00 00 01
                            00 00 00 00 00 00 00 00
                            00 00 00 00 00 00 00 00
                            00 00 00 00 00 00 00 00
Port Active FC4 : 00 00 01 00 00 00 00 01
                            00 00 00 00 00 00 00 00
                            00 00 00 00 00 00 00 00
                            00 00 00 00 00 00 00 00
Port Supported Speed : 8 16 32 Gbit/sec
Configured Port Speed : Auto Detect
Port Speed : Not Available
Max Frame Size : 2048
OS Device Name : /sys/class/scsi_host/host15
Num Discovered Ports : 0
Fabric Name : 00 00 00 00 00 00 00 00
Function Type : FC
FEC : Enabled

[Fixes]
The following patch will resolve the issue:
scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg()
While attempting to log message 0126 with LOG_TRACE_EVENT, the driver hits the
hard lockup shown in the call trace above, which hangs the system.

[Testcase]

Comment 3, James Smart, 2022-04-13 09:12:37 PDT:
Patches pushed upstream 4/12/22:

https://<email address hidden>/T/#t

Tags: servcert-345
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1971193

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jeff Lane  (bladernr)
tags: added: servcert-345
Mauricio Faria de Oliveira (mfo) wrote :

Note for other readers / observation:

The kernel/OS in the bug description is RHEL 8, not Ubuntu.
[4.18.0-240.el8.x86_64]

But per the comment/link at the end [1], it seems this bug
will be used for an lpfc driver update (fixing that error).

"""
Patches pushed upstream 4/12/22:

https://<email address hidden>/T/#t
"""

[1] [PATCH 00/26] lpfc: Update lpfc to revision 14.2.0.2

Jeff Lane (bladernr) wrote :

Thanks @mfo! That is correct: the crash was seen there, but they determined it was generic and are pushing the fix to all the Linux OSVs.

Also of note, that patch set is a general driver update and not all of its patches are relevant to this bug. I've asked them to pinpoint the patches that resolve this issue specifically, with the intent of pulling just those.

description: updated
Laurie Barry (laurie-barry-4) wrote :

The driver team has highlighted that this patch is required to address this issue:

author James Smart <email address hidden> 2022-04-12 15:19:44 -0700
committer Martin K. Petersen <email address hidden> 2022-04-18 22:48:43 -0400
commit e294647b1aed4247fe52851f3a3b2b19ae906228 (patch)
tree fd7e11a3c6f680d5aabd468d523d08ffcd66b59f /drivers/scsi/lpfc
parent b83a8c21f3fe874e12eb2b6e6c5cfb220d35c446 (diff)
download scsi-e294647b1aed4247fe52851f3a3b2b19ae906228.tar.gz
scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg()
In an attempt to log message 0126 with LOG_TRACE_EVENT, the following hard
lockup call trace hangs the system.

Call Trace:
 _raw_spin_lock_irqsave+0x32/0x40
 lpfc_dmp_dbg.part.32+0x28/0x220 [lpfc]
 lpfc_cmpl_els_fdisc+0x145/0x460 [lpfc]
 lpfc_sli_cancel_jobs+0x92/0xd0 [lpfc]
 lpfc_els_flush_cmd+0x43c/0x670 [lpfc]
 lpfc_els_flush_all_cmd+0x37/0x60 [lpfc]
 lpfc_sli4_async_event_proc+0x956/0x1720 [lpfc]
 lpfc_do_work+0x1485/0x1d70 [lpfc]
 kthread+0x112/0x130
 ret_from_fork+0x1f/0x40
Kernel panic - not syncing: Hard LOCKUP

The same CPU tries to claim the phba->port_list_lock twice.

Move the cfg_log_verbose checks as part of the lpfc_printf_vlog() and
lpfc_printf_log() macros before calling lpfc_dmp_dbg(). There is no need
to take the phba->port_list_lock within lpfc_dmp_dbg().

Link: https://<email address hidden>
Co-developed-by: Justin Tee <email address hidden>
Signed-off-by: Justin Tee <email address hidden>
Signed-off-by: James Smart <email address hidden>
Signed-off-by: Martin K. Petersen <email address hidden>
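
For illustration only (not taken from the patch): the commit message's point that the same CPU tries to claim phba->port_list_lock twice can be modeled in user space. The C sketch below makes several assumptions. port_list_lock is an error-checking pthread mutex standing in for the driver's spinlock, and init_port_list_lock(), dump_debug_log(), and flush_all_with_lock_held() are hypothetical names rather than lpfc functions. It also assumes an outer caller already holds the lock when the logging path runs, as the commit message implies. An error-checking mutex reports a same-thread relock as EDEADLK; a kernel spinlock would instead spin until the hard-lockup watchdog fires.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for phba->port_list_lock. */
static pthread_mutex_t port_list_lock;

/* Set up the lock as an error-checking mutex so that a relock by the
 * owning thread fails with EDEADLK instead of hanging the demo. */
static void init_port_list_lock(void)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    assert(rc == 0);
    rc = pthread_mutex_init(&port_list_lock, &attr);
    assert(rc == 0);
    (void)rc;
    pthread_mutexattr_destroy(&attr);
}

/* Models the pre-fix dump routine: it takes the list lock itself. */
static int dump_debug_log(void)
{
    int rc = pthread_mutex_lock(&port_list_lock);

    if (rc != 0)
        return rc;      /* EDEADLK: this thread already owns the lock */
    /* ... the port list would be walked here ... */
    pthread_mutex_unlock(&port_list_lock);
    return 0;
}

/* Models a caller that holds the list lock while a completion handler
 * logs a message, which re-enters dump_debug_log() on the same thread. */
static int flush_all_with_lock_held(void)
{
    int rc;

    pthread_mutex_lock(&port_list_lock);
    rc = dump_debug_log();
    pthread_mutex_unlock(&port_list_lock);
    return rc;
}
```

After a single init_port_list_lock() call, flush_all_with_lock_held() returns EDEADLK, while dump_debug_log() called with the lock free returns 0.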
Diffstat (limited to 'drivers/scsi/lpfc')
-rw-r--r-- drivers/scsi/lpfc/lpfc_init.c 29
-rw-r--r-- drivers/scsi/lpfc/lpfc_logmsg.h 6
2 files changed, 4 insertions, 31 deletions
diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 461d333b1b3a8..f9cd4b72d949a 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -15700,34 +15700,7 @@ void lpfc_dmp_dbg(struct lpfc_hba *phba)
  unsigned int temp_idx;
  int i;
  int j = 0;
- unsigned long rem_nsec, iflags;
- bool log_verbose = false;
- struct lpfc_vport *port_iterator;
-
- /* Don't dump messages if we explicitly set log_verbose for the
- * physical port or any vport.
- */
- if (phba->cfg_log_verbose)
- return;
-
- spin_lock_irqsave(&phba->port_list_lock, iflags);
- list_for_each_entry(port_iterator, &phba->port_list, listentry) {
- if (port_iterator->load_flag & FC_UNLOADING)
- continue;
- if (scsi_host_get(lpfc_shost_from_vport(port_iterator))) {
- if (port_iterator->cfg_log_verbose)
- log_verbose = true;
-
- scsi_host_put(lpfc_shost_from_vport(port_iterator));
-
- if (log_verbose) {
- spin_unlock_irqrestore(&phba->port_list_lock,
- iflags);
- return;
- }
- }
- }
- spin_unlock_irqrestore(&phba->port_list_lock, iflags);
+ unsigned long rem_nsec;

  if (atomic_cmpxchg(&phba->dbg_log_dmping, 0, 1) != 0)
   return;
diff --git a/drivers/scsi/lpfc/lpfc_logmsg.h b/drivers/scsi/lpfc/lpfc_logmsg.h
index 7d480c7987942..a5aafe230c74f 100644
--- a/drivers/scsi/lpfc/lpfc_logmsg.h
+++ b/drivers/scsi/lpfc/lpfc_logmsg.h
@@ -73,7 +7...
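
The lpfc_logmsg.h hunk is truncated above, so the real macro change is not visible here. As a rough illustration of the idea in the commit message (decide in the logging macro whether a dump is wanted, then call a dump routine that takes no lock), here is a user-space C sketch. Everything in it is an assumption made for illustration: struct fake_hba, dump_debug_ring(), hba_log(), and the LOG_TRACE_EVENT value are hypothetical stand-ins and do not reproduce the driver's actual macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define LOG_TRACE_EVENT 0x80000000u   /* illustrative mask value */

/* Hypothetical, simplified stand-in for struct lpfc_hba. */
struct fake_hba {
    unsigned int cfg_log_verbose;     /* non-zero: verbose logging is on */
    int dumps;                        /* number of debug-ring dumps so far */
};

/* Models the post-fix dump routine: no lock and no verbose check. */
static void dump_debug_ring(struct fake_hba *hba)
{
    assert(hba != NULL);
    hba->dumps++;
}

/*
 * The caller-side macro checks the verbose setting before dumping, so
 * the dump routine never has to lock anything to inspect it.
 */
#define hba_log(hba, mask, ...)                                      \
do {                                                                 \
    if (((mask) & LOG_TRACE_EVENT) && !(hba)->cfg_log_verbose)       \
        dump_debug_ring(hba);                                        \
    fprintf(stderr, __VA_ARGS__);                                    \
} while (0)
```

With this shape, a trace event logged on a non-verbose adapter triggers one dump, and the same event on a verbose adapter triggers none, without any lock being taken on the logging path.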


Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Jeff Lane  (bladernr)
Changed in linux (Ubuntu):
status: Expired → New
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote :

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1971193

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jeff Lane  (bladernr) wrote :

So this is actually in 5.19, and has been pulled into 22.04 via our 5.15 kernel. Is there anything more to do here?

commit eb2f403f098fedb4c58283e9532df2ff0d2a36e9
Author: James Smart <email address hidden>
Date: Tue Apr 12 15:19:44 2022 -0700

    scsi: lpfc: Move cfg_log_verbose check before calling lpfc_dmp_dbg()

    BugLink: https://bugs.launchpad.net/bugs/1981864

    [ Upstream commit e294647b1aed4247fe52851f3a3b2b19ae906228 ]

    In an attempt to log message 0126 with LOG_TRACE_EVENT, the following hard
    lockup call trace hangs the system.

    Call Trace:
     _raw_spin_lock_irqsave+0x32/0x40
     lpfc_dmp_dbg.part.32+0x28/0x220 [lpfc]
     lpfc_cmpl_els_fdisc+0x145/0x460 [lpfc]
     lpfc_sli_cancel_jobs+0x92/0xd0 [lpfc]
     lpfc_els_flush_cmd+0x43c/0x670 [lpfc]
     lpfc_els_flush_all_cmd+0x37/0x60 [lpfc]
     lpfc_sli4_async_event_proc+0x956/0x1720 [lpfc]
     lpfc_do_work+0x1485/0x1d70 [lpfc]
     kthread+0x112/0x130
     ret_from_fork+0x1f/0x40
    Kernel panic - not syncing: Hard LOCKUP

    The same CPU tries to claim the phba->port_list_lock twice.

    Move the cfg_log_verbose checks as part of the lpfc_printf_vlog() and
    lpfc_printf_log() macros before calling lpfc_dmp_dbg(). There is no need
    to take the phba->port_list_lock within lpfc_dmp_dbg().

    Link: https://<email address hidden>
    Co-developed-by: Justin Tee <email address hidden>
    Signed-off-by: Justin Tee <email address hidden>
    Signed-off-by: James Smart <email address hidden>
    Signed-off-by: Martin K. Petersen <email address hidden>
    Signed-off-by: Sasha Levin <email address hidden>
    Signed-off-by: Kamal Mostafa <email address hidden>
    Signed-off-by: Stefan Bader <email address hidden>

Laurie Barry (laurie-barry-4) wrote : Re: [Bug 1971193] Re: Server Crash while running IO and switch port bounce test with 2K login session

Checking

On Wed, Aug 24, 2022 at 7:50 PM Jeff Lane  <email address hidden>
wrote:

> So this is actually in 5.19, and has been pulled into 22.04 via our 5.15
> kernel. Is there anything more to do here?
> [...

Jeff Lane  (bladernr) wrote :

Hi Laurie,

Just wanted to confirm that it's sufficient for this to be in 22.04 GA (5.15) before I close it out.

Laurie Barry (laurie-barry-4) wrote :

Yes, this is verified complete from our perspective, based on Development's verification.

Was the test kernel you gave us from your formal build process, or was it your personal sandbox build?

If what we've already received is a formal build, then we will forward it to our DVT test organization and let you know if we find any other issues, but we don't anticipate any.
