[UBUNTU 20.04] zfcp: Fix panic on ERP timeout for previously dismissed ERP

Bug #1887774 reported by bugproxy
Affects                    Status         Importance   Assigned to             Milestone
Ubuntu on IBM z Systems    Fix Released   High         Skipper Bug Screeners
linux (Ubuntu)             Fix Released   Undecided    Skipper Bug Screeners
  Bionic                   Fix Released   Undecided    Unassigned
  Focal                    Fix Released   Undecided    Unassigned
  Groovy                   Fix Released   Undecided    Skipper Bug Screeners

Bug Description

SRU Justification:
==================

[Impact]

* Linux kernel panics due to kernel page fault in IRQ context when running zfcp_erp_timeout_handler() calling zfcp_erp_notify().

[Fix]

* 936e6b85da0476dd2edac7c51c68072da9fb4ba2 936e6b85da04 "scsi: zfcp: Fix panic on ERP timeout for previously dismissed ERP action"

[Test Case]

* Requires an IBM z13/z13s or LinuxONE Rockhopper/Emperor system (or newer) connected to a zfcp-capable storage subsystem.

* Trigger an ERP timeout (for example by fault injection or by otherwise causing a slow recovery).

* Monitor the system log for any kernel panics.

[Regression Potential]

* The regression potential can be considered medium, since the modification is platform-specific (limited to s390x) and further limited to the zfcp layer.

* Within zfcp, it is further limited to the error recovery procedure (ERP) of FCP and only touches zfcp_erp.c, which means the code path is mainly active under error conditions.

[Other]

* The above fix was accepted upstream with v5.8-rc3, hence it will make its way into groovy with kernel 5.8.

* Therefore this SRU request was submitted for bionic and focal only and not for groovy.

__________

Description: zfcp: Fix panic on ERP timeout for previously dismissed ERP
Symptom: Linux kernel panic due to kernel page fault in IRQ context
               when running zfcp_erp_timeout_handler() calling
               zfcp_erp_notify().
Problem: Suppose that, for unrelated reasons, FSF requests on behalf
               of recovery are very slow and can run into the ERP timeout.
               In the case at hand, we did adapter recovery to a large
               degree. However due to the slowness a LUN open is pending so
               the corresponding fc_rport remains blocked. After
               fast_io_fail_tmo we trigger close physical port recovery for
               the port under which the LUN should have been opened. The
               new higher order port recovery dismisses the pending LUN
               open ERP action and dismisses the pending LUN open FSF
               request. Such dismissal decouples the ERP action from the
               pending corresponding FSF request by setting
               zfcp_fsf_req->erp_action to NULL (among other things)
               [zfcp_erp_strategy_check_fsfreq()].
               If now the ERP timeout for the pending open LUN request runs
               out, we must not use zfcp_fsf_req->erp_action in the ERP
               timeout handler. This is a problem since v4.15 commit
               75492a51568b ("s390/scsi: Convert timers to use
               timer_setup()"). Before that we intentionally only passed
               zfcp_erp_action as context argument to
               zfcp_erp_timeout_handler().
               Note: The lifetime of the corresponding zfcp_fsf_req object
               continues until a (late) response or an (unrelated) adapter
               recovery.
Solution: Just like the regular response path ignores dismissed
               requests [zfcp_fsf_req_complete() =>
               zfcp_fsf_protstatus_eval() => return early] the ERP timeout
               handler now needs to ignore dismissed requests. So simply
               return early in the ERP timeout handler if the FSF request
               is marked as dismissed in its status flags. To protect
               against the race where zfcp_erp_strategy_check_fsfreq()
               dismisses and sets zfcp_fsf_req->erp_action to NULL after
               our previous status flag check, return early if
               zfcp_fsf_req->erp_action is NULL. After all, the former ERP
               action does not need to be woken up as that was already done
               as part of the dismissal above [zfcp_erp_action_dismiss()].
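
The two early returns described under "Solution" can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel patch itself: the `model_*` names, the flag values, and the `int` return value (added so the outcome can be checked) are all hypothetical. The real handler lives in zfcp_erp.c, takes a timer argument, and returns nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag values for this model only; not the kernel's definitions. */
#define MODEL_FSFREQ_DISMISSED 0x1u
#define MODEL_ERP_TIMEDOUT     0x2u

/* Hypothetical stand-in for zfcp_erp_action. */
struct model_erp_action {
	unsigned int status;
	int notified;            /* counts how often the action was notified */
};

/* Hypothetical stand-in for zfcp_fsf_req. */
struct model_fsf_req {
	unsigned int status;
	struct model_erp_action *erp_action; /* set to NULL on dismissal */
};

/* Stand-in for zfcp_erp_notify(): must never be called with NULL. */
static void model_erp_notify(struct model_erp_action *act, unsigned int set_mask)
{
	assert(act != NULL);
	act->status |= set_mask;
	act->notified++;
}

/*
 * Model of the timeout handler after the fix.
 * Returns 1 if the ERP action was notified, 0 if the timeout was ignored.
 */
static int model_erp_timeout_handler(struct model_fsf_req *req)
{
	struct model_erp_action *act;

	/* First check: ignore requests already marked as dismissed. */
	if (req->status & MODEL_FSFREQ_DISMISSED)
		return 0;

	/*
	 * Second check: dismissal may have raced with the flag test above,
	 * so read the pointer once and bail out if it is already NULL.
	 * (Assumption: in kernel code this lock-free read would need a
	 * single, untorn load, e.g. via READ_ONCE().)
	 */
	act = req->erp_action;
	if (!act)
		return 0;

	model_erp_notify(act, MODEL_ERP_TIMEDOUT);
	return 1;
}
```

In this model, a request that is still coupled to its ERP action gets notified, while a dismissed request (flag set) or a request whose `erp_action` is already NULL is ignored instead of dereferencing a NULL pointer.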

Upstream-ID: 936e6b85da0476dd2edac7c51c68072da9fb4ba2 -> kernel 5.8

Will be integrated into groovy with kernel 5.8.

Please check that this is also integrated into 20.04.

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-186883 severity-high targetmilestone-inin20041
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes) wrote :

According to linux / linux-next
$ git log --oneline --grep "scsi: zfcp: Fix panic on ERP timeout for previously dismissed ERP action"
3cd1c5d582f4 Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
936e6b85da04 scsi: zfcp: Fix panic on ERP timeout for previously dismissed ERP action
$ git tag --contains 936e6b85da04 | grep ^v
v5.8-rc3
v5.8-rc4
v5.8-rc5
This is upstream as of v5.8-rc3 and is therefore not yet in focal.

Changed in ubuntu-z-systems:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Frank Heimes (fheimes) wrote :

In the bug description I'm reading:
>>> This is a problem since v4.15 commit 75492a51568b ("s390/scsi: Convert timers to use timer_setup()"). <<<
This makes me think that Ubuntu 18.04 / bionic is affected as well, is that correct?

Frank Heimes (fheimes) wrote :

A patched kernel (that I created during the SRU preparation and test compile) is available for further testing here:
https://people.canonical.com/~fheimes/lp1887774/

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-07-16 09:33 EDT-------
Because this problem was introduced with kernel 4.15, integration into 18.04 is also required.

Frank Heimes (fheimes) wrote :

Kernel SRU request submitted:
https://lists.ubuntu.com/archives/kernel-team/2020-July/thread.html#112154
Updating status to 'In Progress'.

Changed in linux (Ubuntu Focal):
status: New → In Progress
Changed in ubuntu-z-systems:
status: Triaged → In Progress
description: updated
Changed in linux (Ubuntu Bionic):
status: New → In Progress
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-07-24 12:54 EDT-------
zfcp regression tested the private build from https://people.canonical.com/~fheimes/lp1887774/.

(The private build seems to have the same kernel release as the latest official update kernel. I removed the latter after switching to the previous official update kernel (5.4.0-40-generic) and before installing the private build. I hope I ran the correct private build:
Linux hostname 5.4.0-42-generic #46 SMP Thu Jul 16 12:06:43 UTC 2020 s390x s390x s390x GNU/Linux)

Changed in linux (Ubuntu Focal):
status: In Progress → Fix Released
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Frank Heimes (fheimes) wrote :

Since we have kernel 5.8 in groovy proposed, I'm updating the 'affecting groovy' entry to Fix Committed.

Changed in linux (Ubuntu Groovy):
status: New → Fix Committed
Changed in ubuntu-z-systems:
status: In Progress → Fix Committed
Frank Heimes (fheimes) wrote :

Meanwhile Kernel 5.8 migrated from groovy proposed to main,
hence updating the status of the groovy entry to Fix Released.

Changed in linux (Ubuntu Groovy):
status: Fix Committed → Fix Released
Frank Heimes (fheimes) wrote :

The patch got picked up by bionic's kernel Ubuntu-4.15.0-113.114
and 4.15.0.134.121 is currently in bionic-updates,
hence I'm updating the bionic entry to Fix Released
which closes the entire bug.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in ubuntu-z-systems:
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2021-01-22 05:55 EDT-------
IBM Bugzilla status->closed, Fix Released for all requested distros
