Comment 5 for bug 1815599

Christian Ehrhardt (paelzer) wrote:

@JFH - As just discussed, you mentioned that you already went through the existing tunables with Heinz Werner. Thanks for linking these here again to make sure that Shixm knows about them.

Somewhat reminds me of bug 1540407 - but those changes have been in Ubuntu since 16.04.
Same for the even older bug 1374999.
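For reference, those zfcp tunables can be listed and set via lszdev/chzdev. A sketch - the attribute/value pair is just an example, not a recommendation:
  lszdev -t zfcp-host                        # list active/persistent zfcp driver attributes
  chzdev zfcp --type no_auto_port_rescan=1   # example only: set one driver attribute (see chzdev(8))
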
Your kernel and open-iscsi versions indicate that you are on Bionic - is that correct?
Multipath-tools should then be at the Bionic version as well.
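To confirm, the exact versions could be grabbed like this (a sketch):
  lsb_release -cs                            # should say 'bionic' if so
  uname -r                                   # running kernel
  dpkg-query -W multipath-tools open-iscsi   # installed package versions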

Your repro of:
  > Run storage side error inject 'node reset' for SVC
isn't clear to me; I neither have a storage server on which I'm allowed to do error injection, nor the tools/UI to control it.

Instead I have tried the repros that are available to me, as in:
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/7
https://bugs.launchpad.net/ubuntu/+source/multipath-tools/+bug/1540407/comments/8

But all of them worked, see below.
Note that I never reached the loss of the path info in the faulty state ('#:#:#:#') - even in the faulty state it still knew the path info.
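FWIW, to watch the map state while the paths bounce I just looped multipath -ll (the WWID is the one from my outputs below):
  watch -n 2 'multipath -ll 36005076306ffd6b60000000000002403'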

@Shixm - since you have a setup that can reproduce this, can you check whether any of the later releases (Cosmic/Disco) already resolves the issue you are seeing? Then we could try to hunt down which changes might have resolved it for you, instead of assuming this needs a totally new change.

@Shixm - Any way to reproduce this without 'node reset' for SVC?

Finally, this might also need subject matter expertise - can we make sure that IBM's zfcp experts (devs and maybe Thorsten, who drove the old bugs) are subscribed to the mirrored bug 175431?
@JFH - do you think you can check that with the IBM team?

------------
Test results when retrying to trigger the issue:

Approach #1 gives me this (which isn't exactly the same state):
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:0:1073954852 sdb 8:16 active faulty offline
  `- 0:0:1:1073954852 sdj 8:144 active ready running
If I add it back after this, it works just fine again:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 active ready running
  `- 0:0:0:1073954852 sdb 8:16 active ready running
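
For reference, approach #1 boils down to failing one SCSI path and re-adding it. A sketch of the idea - sdb stands in for the path to fail, and the 'state' file is the standard SCSI sysfs attribute:
  echo offline > /sys/block/sdb/device/state        # fail the path
  multipath -ll 36005076306ffd6b60000000000002403   # now shows it as faulty/offline
  echo running > /sys/block/sdb/device/state        # re-add the path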

The second approach (disable, sleep, enable of the adapter) - first a check of the zfcp config:
lszdev -t zfcp-host
DEVICE TYPE zfcp
  Description : SCSI-over-Fibre Channel (FCP) devices and SCSI devices
  Modules : zfcp
  Active : yes
  Persistent : yes

  ATTRIBUTE ACTIVE PERSISTENT
  allow_lun_scan "1" "1"
  datarouter "1" -
  dbflevel "3" -
  dbfsize "4" -
  dif "0" -
  no_auto_port_rescan "0" -
  port_scan_backoff "500" -
  port_scan_ratelimit "60000" -
  queue_depth "32" -
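
The disable/sleep/enable cycle itself was roughly the following - a sketch, with 0.0.e000 standing in for the FCP device bus-ID (yours will differ):
  chccwdev --offline 0.0.e000   # take the FCP adapter offline
  sleep 60                      # long enough for the paths to go faulty
  chccwdev --online 0.0.e000    # bring the adapter back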

Initially I get this (as expected):
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 active i/o pending running
  `- 0:0:0:1073954852 sdb 8:16 active i/o pending running

Then after a while it reaches the final fault state:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 failed faulty running
  `- 0:0:0:1073954852 sdb 8:16 failed faulty running
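
How long "a while" is should be governed by the FC transport timers on the remote ports; to compare setups they can be read from sysfs (a sketch):
  grep . /sys/class/fc_remote_ports/rport-*/fast_io_fail_tmo
  grep . /sys/class/fc_remote_ports/rport-*/dev_loss_tmo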

After getting the paths back it immediately switches to:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=37 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 failed ready running
  `- 0:0:0:1073954852 sdb 8:16 failed ready running

And after less than 20 seconds it fully recovers to:
36005076306ffd6b60000000000002403 dm-1 IBM,2107900
size=10G features='3 queue_if_no_path queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=50 status=enabled
  |- 1:0:0:1073954852 sdg 8:96 active ready running
  |- 1:0:1:1073954852 sdn 8:208 active ready running
  |- 0:0:1:1073954852 sdj 8:144 active ready running
  `- 0:0:0:1073954852 sdb 8:16 active ready running
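
The <20 seconds for that last step matches the order of multipathd's path checker interval; the effective value can be checked with (a sketch):
  multipathd show config | grep polling_interval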