Comment 6 for bug 1911999

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Bonus for my theory that these timeouts are about the paths going down and not the disks.
If I unplug the adapter (= 2 paths) it goes through this:

1. it immediately detects that something is wrong. I/O might still be queued.

Jan 25 09:27:52 s1lp5 multipathd[782]: checker failed path 65:32 in map mpatha
Jan 25 09:27:52 s1lp5 multipathd[782]: mpatha: remaining active paths: 3
Jan 25 09:27:52 s1lp5 kernel: device-mapper: multipath: Failing path 65:32.
Jan 25 09:27:53 s1lp5 multipathd[782]: checker failed path 8:176 in map mpathc
Jan 25 09:27:53 s1lp5 multipathd[782]: mpathc: remaining active paths: 3
Jan 25 09:27:53 s1lp5 multipathd[782]: checker failed path 8:160 in map mpathd
Jan 25 09:27:53 s1lp5 multipathd[782]: mpathd: remaining active paths: 3
Jan 25 09:27:53 s1lp5 kernel: device-mapper: multipath: Failing path 8:176.
Jan 25 09:27:53 s1lp5 kernel: device-mapper: multipath: Failing path 8:160.
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:208 in map mpatha
Jan 25 09:27:54 s1lp5 multipathd[782]: mpatha: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 65:48 in map mpathe
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathe: remaining active paths: 3
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:240 in map mpathd
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathd: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 8:224 in map mpathe
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathe: remaining active paths: 2
Jan 25 09:27:54 s1lp5 multipathd[782]: checker failed path 65:0 in map mpathc
Jan 25 09:27:54 s1lp5 multipathd[782]: mpathc: remaining active paths: 2
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:208.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 65:48.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:240.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 8:224.
Jan 25 09:27:54 s1lp5 kernel: device-mapper: multipath: Failing path 65:0.

state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32 active ready running
  |- 0:0:1:1074413604 sdh 8:112 active ready running
  |- 1:0:1:1074413604 sdr 65:16 active i/o pending running
  `- 1:0:0:1074413604 sdm 8:192 active i/o pending running

2. after the timeouts for failing I/O (fast_io_fail_tmo) all queued is cancelled

state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32 active ready running
  |- 0:0:1:1074413604 sdh 8:112 active ready running
  |- 1:0:1:1074413604 sdr 65:16 failed faulty running
  `- 1:0:0:1074413604 sdm 8:192 failed faulty running

3. finally once I hit my 60 second limit on the paths (dev_loss_tmo) they are considered dead

Jan 25 09:28:52 s1lp5 kernel: rport-1:0-0: blocked FC remote port time out: removing target and saving binding
Jan 25 09:28:52 s1lp5 kernel: rport-1:0-1: blocked FC remote port time out: removing target and saving binding
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073889316: [sdk] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073889316: [sdk] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073954852: [sdl] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:0:1073954852: [sdl] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 multipathd[782]: mpathd: load table [0 20971520 multipath 1 queue_if_no_path 1 alua 1 1 service-time 0 3 1 8:0 1 8:80 1 8:240 1]
Jan 25 09:28:52 s1lp5 multipathd[782]: sync_map_state: failing sdp state 2 dmstate 2
Jan 25 09:28:52 s1lp5 multipathd[782]: sdk [8:160]: path removed from map mpathd
Jan 25 09:28:52 s1lp5 kernel: device-mapper: multipath: Failing path 8:240.
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:1:1073889316: [sdp] Synchronizing SCSI cache
Jan 25 09:28:52 s1lp5 kernel: sd 1:0:1:1073889316: [sdp] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Jan 25 09:28:52 s1lp5 kernel: scsi 1:0:0:1073889316: alua: Detached
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: cannot find block device
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: Empty device name
Jan 25 09:28:52 s1lp5 multipath[994777]: 8:240: Empty device name
...

state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32 active ready running
  `- 0:0:1:1074413604 sdh 8:112 active ready running

^^ nothing will be enqueued or checked anymore the paths are gone.
IIUC this is what dev_loss_tmo is for.

4. If I later bring back the paths they are newly detected and re-added

Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: Direct-Access IBM 2107900 2700 PQ: 0 ANSI: 5
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: alua: supports implicit TPGS
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076052004: alua: device naa.6005076306ffd6b60000000000002423 port group 0 rel port 330
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076052004: Attached scsi generic sg18 type 0
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076052004: Power-on or device reset occurred
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: Direct-Access IBM 2107900 2700 PQ: 0 ANSI: 5
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: alua: supports implicit TPGS
Jan 25 09:30:24 s1lp5 kernel: scsi 1:0:0:1076117540: alua: device naa.6005076306ffd6b60000000000002424 port group 0 rel port 330
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076117540: Attached scsi generic sg19 type 0
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:0:1076117540: Power-on or device reset occurred
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] 134217728 512-byte logical blocks: (68.7 GB/64.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Mode Sense: ed 00 00 08
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1074413604: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1073954852: [sdl] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076117540: [sdo] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: alua: port group 00 state A preferred supports tolusnA
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] 134217728 512-byte logical blocks: (68.7 GB/64.0 GiB)
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] Write Protect is off
Jan 25 09:30:24 s1lp5 kernel: sd 1:0:1:1076052004: [sdn] Mode Sense: ed 00 00 08

state now:
mpathb (36005076306ffd6b6000000000000240a) dm-3 IBM,2107900
size=10G features='1 queue_if_no_path' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 0:0:0:1074413604 sdc 8:32 active ready running
  |- 0:0:1:1074413604 sdh 8:112 active ready running
  |- 1:0:1:1074413604 sdm 8:192 active ready running
  `- 1:0:0:1074413604 sdr 65:16 active ready running