[UBUNTU 20.04] mke2fs dasd(fba),Failing CCW,default ERP has run out of retries and failed

Bug #1879707 reported by bugproxy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Ubuntu on IBM z Systems
Fix Released
Medium
Skipper Bug Screeners
linux (Ubuntu)
Fix Released
Undecided
Skipper Bug Screeners

Bug Description

mke2fs,dasd(fba) guest edevices FBA,default ERP has run out of retries and failed,Failing CCW

---uname output---
xxxxxx - 5.4.0-29-generic #33-Ubuntu SMP Wed Apr 29 14:27:18 UTC 2020 s390x s390x s390x GNU/Linux

Machine Type = IBM 3906

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 mke2fs to dasd(fba) devices

Stack trace output:
 no

Oops output:
 no

System Dump Info:
  The system is not configured to capture a system dump.

-Post a private note with access information to the machine that the bug is occuring on.
-Attach sysctl -a output output to the bug.

dasd(fba),Failing CCW,default ERP has run out of retries and failed between the following syslog events,
mke2fs running, before mounting and starting IO to dasd(fba) devices

May 14 14:33:32 ilabg13 root: ILAB_IO_FROM_MSDI_START
May 14 14:48:34 ilabg13 root: ILAB_IO_FROM_MSDI_RUNNING

bugproxy (bugproxy)
tags: added: architecture-s39064 bugnameltc-185933 severity-medium targetmilestone-inin2004
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
Frank Heimes (fheimes) wrote :

Does the mke2fs of the DASD FBA issue happen during an installation or post install on a running system?
Please share the steps you performed that allow to reproduce and also the log (syslog).

Changed in ubuntu-z-systems:
importance: Undecided → Medium
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-05-20 21:17 EDT-------
(In reply to comment #10)
> Does the mke2fs of the DASD FBA issue happen during an installation or post
> install on a running system?

post install when setting up to start IO to mounted filesystems located on dev dasd

> Please share the steps you performed that allow to reproduce and also the
> log (syslog).

fdisk -l /dev/dasdax1
Disk /dev/dasdax1: 9.102 GiB, 10737417216 bytes, 20971518 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

mkfs.ext3 -F /dev/dasdax1

/var/log/syslog
May 14 14:33:32 ilabg13 kernel: [116232.148325] dasd.5c0c98: 0.0.520e: default ERP has run out of retries and failed
May 14 14:33:32 ilabg13 kernel: [116232.148331] dasd(fba): I/O status report for device 0.0.520e:
May 14 14:33:32 ilabg13 kernel: [116232.148331] dasd(fba): in req: 000000001821d4df CS: 0x20 DS: 0x0C
May 14 14:33:32 ilabg13 kernel: [116232.148331] dasd(fba): device 0.0.520e: Failing CCW: 00000000116dbb5a
May 14 14:33:32 ilabg13 kernel: [116232.148331] dasd(fba): SORRY - NO VALID SENSE AVAILABLE
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): Related CP in req: 000000001821d4df
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 0000000025cde72e: 63400010 57C6BFC0 DAT: 00000200 00000002 00000000 00007FFF
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000b79c223f: 43400008 57C6BFD0 DAT: 05000006 00000000
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000db6e9304: 41600C00 F4100000 DAT: 00000000 00029708 00000000 00000001 00000000 5EBDB4CA 00000000 0008AB6C
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000116dbb5a: 43400008 57C6BFD8 DAT: 05007FF8 00000006
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000c490cf73: 41600000 00000000 DAT:
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000f65dac93: 43400008 57C6BFE0 DAT: 05000002 00007FFE
May 14 14:33:32 ilabg13 kernel: [116232.148339] dasd(fba): CCW 00000000744f9570: 41200400 F4100000 DAT: 00000000 00029708 00000000 00000001 00000000 5EBDB4CA 00000000 0008AB6C
May 14 14:33:32 ilabg13 kernel: [116232.148364] blk_update_request: I/O error, dev dasdax, sector 2 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 0

Revision history for this message
bugproxy (bugproxy) wrote : syslog.3

------- Comment (attachment only) From <email address hidden> 2020-05-20 21:20 EDT-------

Revision history for this message
Frank Heimes (fheimes) wrote :

Involved IBM LoZ dev, looks like FBA driver expertise is needed.

Revision history for this message
bugproxy (bugproxy) wrote : 0001-s390-dasd-Fix-zero-write-for-EDEVICEs.patch

------- Comment on attachment From <email address hidden> 2020-07-20 11:15 EDT-------

This is a patch that potentially fixes the issue. Can you please apply the patch and run your tests with it? Please let me know if this fixes the issues for you.

Please note that this patch is for testing _only_ and not a final or even upstream patch.

Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → In Progress
Revision history for this message
Frank Heimes (fheimes) wrote :

A test build got created and is available from here: https://people.canonical.com/~fheimes/lp1879707/

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2020-07-22 03:20 EDT-------
@Canonical: Thanks for providing the test build. Does that fix can also be tested in your environment, where other BZ relates to.. Many thx

Revision history for this message
Frank Heimes (fheimes) wrote :

I just wanted to try the kernel that I've build yesterday (see comment #6) and as a first step I wanted to recreate re-create the described error situation on an up to date 20.04 system:
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04 LTS
Release: 20.04
Codename: focal
$ uname -a
Linux zlin42 5.4.0-40-generic #44-Ubuntu SMP Mon Jun 22 23:57:33 UTC 2020 s390x s390x s390x GNU/Linux
$ apt-cache policy linux-generic
linux-generic:
  Installed: 5.4.0.40.43
  Candidate: 5.4.0.40.43
  Version table:
 *** 5.4.0.40.43 500
        500 http://ports.ubuntu.com/ubuntu-ports focal-updates/main s390x Packages
        500 http://ports.ubuntu.com/ubuntu-ports focal-security/main s390x Packages
        100 /var/lib/dpkg/status
     5.4.0.26.32 500
        500 http://ports.ubuntu.com/ubuntu-ports focal/main s390x Packages
$

But by surprise I didn't ran into any problems!

I was able to flawlessly make use of FBA devices in the following three different way:
1) (re-)used a FBA device that was previously in use, without wiping out any data
2) used a FBA device that I wiped out using wipefs
3) used a FBA device that I entirely wiped out (and zeroed) using dd
(see the attached doc for more details)

Looks like a fix for this problem came in with the kernels that were rolled out in between 5.4.0-29-generic and 5.4.0-40-generic.
(Again I didn't installed the patched kernel, I just used the latest official one.)

Message to the initial bug reporter:
Please can you retry on an up-to-date 20.04 system (or after having updated your existing one, like for example with "sudo apt -y -q update && sudo apt -y -q full-upgrade" plus reboot in case the kernel got update, what I assume will happen)?

Changed in ubuntu-z-systems:
status: In Progress → Incomplete
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2020-07-24 06:10 EDT-------
(In reply to comment #29)
> I just wanted to try the kernel that I've build yesterday (see comment #6)
> and as a first step I wanted to recreate re-create the described error
> situation on an up to date 20.04 system:
> $ lsb_release -a
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description: Ubuntu 20.04 LTS
> Release: 20.04
> Codename: focal
> $ uname -a
> Linux zlin42 5.4.0-40-generic #44-Ubuntu SMP Mon Jun 22 23:57:33 UTC 2020
> s390x s390x s390x GNU/Linux
> $ apt-cache policy linux-generic
> linux-generic:
> Installed: 5.4.0.40.43
> Candidate: 5.4.0.40.43
> Version table:
> *** 5.4.0.40.43 500
> 500 http://ports.ubuntu.com/ubuntu-ports focal-updates/main s390x Packages
> 500 http://ports.ubuntu.com/ubuntu-ports focal-security/main s390x Packages
> 100 /var/lib/dpkg/status
> 5.4.0.26.32 500
> 500 http://ports.ubuntu.com/ubuntu-ports focal/main s390x Packages
> $
>
> But by surprise I didn't ran into any problems!
>
> I was able to flawlessly make use of FBA devices in the following three
> different way:
> 1) (re-)used a FBA device that was previously in use, without wiping out any
> data
> 2) used a FBA device that I wiped out using wipefs
> 3) used a FBA device that I entirely wiped out (and zeroed) using dd
> (see the attached doc for more details)
>
> Looks like a fix for this problem came in with the kernels that were rolled
> out in between 5.4.0-29-generic and 5.4.0-40-generic.
> (Again I didn't installed the patched kernel, I just used the latest
> official one.)
>
> Message to the initial bug reporter:
> Please can you retry on an up-to-date 20.04 system (or after having updated
> your existing one, like for example with "sudo apt -y -q update && sudo apt
> -y -q full-upgrade" plus reboot in case the kernel got update, what I assume
> will happen)?

The issue is more likely to occur when the system has > 2GB of memory. I can't see in your attachment what system configuration you're running, but maybe that's the reason you're not running into the problem.

This was also my mistake when I tried to reproduce the problem. Once I had my configuration changed to 8GB of memory I ran into the error immediately.
The problem really is that the ZERO_PAGE allocation for the discard I/O can't be used on systems with > 2GB of memory.

The fix I proposed actually solves the problem and is currently in review.

I'll post the commit id once the fix went upstream.

Jan

Revision history for this message
Frank Heimes (fheimes) wrote :
Download full text (5.0 KiB)

So I took the time to re-test this again.
My z/VM guest has 4 CPUs (but SMT on), and 4 DASD FBA devices that equally split a 64GB zFCP/SCSI LUN in 4 16GB FBA chunks.

I've tested (in comment #8) with 2GB RAM where things worked and I wasn't able to recreate the error situation.
I then moved to 6GB RAM and things still worked for me.
Then 8GB - where everything was still fine.
And finally 10GB - still don't see the issue.

$ grep -i 'error\|crash\|crit\|panic\|I\/O\|erp\|sense\|fba' /var/log/syslog
ul 28 10:05:23 hwe0005 systemd[1]: Stopping LSB: automatic crash report generation...
Jul 28 10:05:23 hwe0005 systemd[1]: Stopping Configure dump on panic for System z...
Jul 28 10:07:36 hwe0005 systemd-udevd[514]: dasd-fba: /etc/udev/rules.d/41-generic-ccw-0.0.0009.rules:7 Failed to write ATTR{/sys/devices/css0/0.0.0007/0.0.0009/online}, ignoring: Invalid argument
Jul 28 10:07:36 hwe0005 systemd-udevd[511]: 0.0.0102: /etc/udev/rules.d/41-dasd-fba-0.0.0102.rules:7 Failed to write ATTR{/sys/devices/css0/0.0.0001/0.0.0102/online}, ignoring: Invalid argument
Jul 28 10:07:36 hwe0005 systemd-udevd[522]: 0.0.0101: /etc/udev/rules.d/41-dasd-fba-0.0.0101.rules:7 Failed to write ATTR{/sys/devices/css0/0.0.0000/0.0.0101/online}, ignoring: Invalid argument
Jul 28 10:07:36 hwe0005 systemd-udevd[522]: 0.0.0103: /etc/udev/rules.d/41-dasd-fba-0.0.0103.rules:7 Failed to write ATTR{/sys/devices/css0/0.0.0002/0.0.0103/online}, ignoring: Invalid argument
Jul 28 10:07:36 hwe0005 systemd-udevd[505]: 0.0.0104: /etc/udev/rules.d/41-dasd-fba-0.0.0104.rules:7 Failed to write ATTR{/sys/devices/css0/0.0.0003/0.0.0104/online}, ignoring: Invalid argument
Jul 28 10:07:36 hwe0005 kernel: [ 4.983272] dasd-fba.f36f2f: 0.0.0101: New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
Jul 28 10:07:36 hwe0005 kernel: [ 4.988020] dasd-fba.f36f2f: 0.0.0102: New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
Jul 28 10:07:36 hwe0005 kernel: [ 4.990317] dasd-fba.f36f2f: 0.0.0103: New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
Jul 28 10:07:36 hwe0005 kernel: [ 4.992370] dasd-fba.f36f2f: 0.0.0104: New FBA DASD 9336/10 (CU 6310/80) with 16384 MB and 512 B/blk
Jul 28 10:07:36 hwe0005 systemd[1]: Condition check resulted in Process error reports when automatic reporting is enabled (file watch) being skipped.
Jul 28 10:07:36 hwe0005 systemd[1]: Condition check resulted in Unix socket for apport crash forwarding being skipped.
Jul 28 10:07:36 hwe0005 systemd[1]: Starting LSB: automatic crash report generation...
Jul 28 10:07:36 hwe0005 systemd[1]: Starting Configure dump on panic for System z...
Jul 28 10:07:36 hwe0005 apport[764]: * Starting automatic crash report generation: apport
Jul 28 10:07:36 hwe0005 dumpconf[770]: stop on panic configured.
Jul 28 10:07:36 hwe0005 systemd[1]: Finished Configure dump on panic for System z.
Jul 28 10:07:36 hwe0005 systemd[1]: Started LSB: automatic crash report generation.

I'm wondering a bit about the systemd msgs and the sysfs device tree. But other than that no ERP, sense, or panics so far ...

$ dmesg | grep -i 'error\|fail\|crash\|warn\|crit\|panic\|erp\|fba'
[ 4.983272] dasd-fba.f36f2f: 0.0.0101: New FBA...

Read more...

Revision history for this message
Frank Heimes (fheimes) wrote :

And btw. @Jan which Ubuntu release and kernel version did you had in use?

Revision history for this message
bugproxy (bugproxy) wrote :
Download full text (7.2 KiB)

------- Comment From <email address hidden> 2020-10-15 13:47 EDT-------
(In reply to comment #34)
> So I took the time to re-test this again.
> My z/VM guest has 4 CPUs (but SMT on), and 4 DASD FBA devices that equally
> split a 64GB zFCP/SCSI LUN in 4 16GB FBA chunks.
>
> I've tested (in comment #8) with 2GB RAM where things worked and I wasn't
> able to recreate the error situation.
> I then moved to 6GB RAM and things still worked for me.
> Then 8GB - where everything was still fine.
> And finally 10GB - still don't see the issue.
>
> $ grep -i 'error\|crash\|crit\|panic\|I\/O\|erp\|sense\|fba' /var/log/syslog
> ul 28 10:05:23 hwe0005 systemd[1]: Stopping LSB: automatic crash report
> generation...
> Jul 28 10:05:23 hwe0005 systemd[1]: Stopping Configure dump on panic for
> System z...
> Jul 28 10:07:36 hwe0005 systemd-udevd[514]: dasd-fba:
> /etc/udev/rules.d/41-generic-ccw-0.0.0009.rules:7 Failed to write
> ATTR{/sys/devices/css0/0.0.0007/0.0.0009/online}, ignoring: Invalid argument
> Jul 28 10:07:36 hwe0005 systemd-udevd[511]: 0.0.0102:
> /etc/udev/rules.d/41-dasd-fba-0.0.0102.rules:7 Failed to write
> ATTR{/sys/devices/css0/0.0.0001/0.0.0102/online}, ignoring: Invalid argument
> Jul 28 10:07:36 hwe0005 systemd-udevd[522]: 0.0.0101:
> /etc/udev/rules.d/41-dasd-fba-0.0.0101.rules:7 Failed to write
> ATTR{/sys/devices/css0/0.0.0000/0.0.0101/online}, ignoring: Invalid argument
> Jul 28 10:07:36 hwe0005 systemd-udevd[522]: 0.0.0103:
> /etc/udev/rules.d/41-dasd-fba-0.0.0103.rules:7 Failed to write
> ATTR{/sys/devices/css0/0.0.0002/0.0.0103/online}, ignoring: Invalid argument
> Jul 28 10:07:36 hwe0005 systemd-udevd[505]: 0.0.0104:
> /etc/udev/rules.d/41-dasd-fba-0.0.0104.rules:7 Failed to write
> ATTR{/sys/devices/css0/0.0.0003/0.0.0104/online}, ignoring: Invalid argument
> Jul 28 10:07:36 hwe0005 kernel: [ 4.983272] dasd-fba.f36f2f: 0.0.0101:
> New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
> Jul 28 10:07:36 hwe0005 kernel: [ 4.988020] dasd-fba.f36f2f: 0.0.0102:
> New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
> Jul 28 10:07:36 hwe0005 kernel: [ 4.990317] dasd-fba.f36f2f: 0.0.0103:
> New FBA DASD 9336/10 (CU 6310/80) with 16383 MB and 512 B/blk
> Jul 28 10:07:36 hwe0005 kernel: [ 4.992370] dasd-fba.f36f2f: 0.0.0104:
> New FBA DASD 9336/10 (CU 6310/80) with 16384 MB and 512 B/blk
> Jul 28 10:07:36 hwe0005 systemd[1]: Condition check resulted in Process
> error reports when automatic reporting is enabled (file watch) being skipped.
> Jul 28 10:07:36 hwe0005 systemd[1]: Condition check resulted in Unix socket
> for apport crash forwarding being skipped.
> Jul 28 10:07:36 hwe0005 systemd[1]: Starting LSB: automatic crash report
> generation...
> Jul 28 10:07:36 hwe0005 systemd[1]: Starting Configure dump on panic for
> System z...
> Jul 28 10:07:36 hwe0005 apport[764]: * Starting automatic crash report
> generation: apport
> Jul 28 10:07:36 hwe0005 dumpconf[770]: stop on panic configured.
> Jul 28 10:07:36 hwe0005 systemd[1]: Finished Configure dump on panic for
> System z.
> Jul 28 10:07:36 hwe0005 systemd[1]: Started LSB: automatic crash report
> generation.
>
> I'm wondering a bit about the systemd ...

Read more...

Revision history for this message
Frank Heimes (fheimes) wrote :

Checking the focal git tree this is included in the Ubuntu kernels:
Ubuntu-5.4.0-55.61
Ubuntu-5.4.0-56.62
Ubuntu-5.4.0-57.63
And Ubuntu-5.4.0-56 just migrated to updates:
 linux-generic | 5.4.0.56.59 | focal-updates | s390x
hence I'm updating this (focal) to Fix Released.

Changed in linux (Ubuntu):
status: New → Fix Released
Changed in ubuntu-z-systems:
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.