Failed to iscsi login to 3PAR after controller restart
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Cinder | Incomplete | Undecided | Unassigned | |
Bug Description
Env:
Helion OpenStack 4.0 (Mitaka)
3PAR
<snip>
3.0.10 - Remove metadata that tracks the instance ID. bug #1572665
3.99.10.1 - _create_
3.99.10.2 - Added entry point tracing
3.99.10.3 - Handling HTTP conflict 409, host WWN/iSCSI name already
bug #1642945
3.99.10.4 - Fix snapCPG error during backup of attached volume.
Also recently applied:
- Update CHAP on host record when volume is migrated to new compute host. bug # 1737181
https:/
(fixed for Newton, Ocata and Pike)
- 3PAR: Get host from os-brick bug #1690244
https:/
(fixed for Newton, Ocata)
In the following scenarios we found that the iSCSI session intermittently fails to log back in due to an authentication failure:
"iscsiadm: Could not login to [iface:default, target: <target>, portal: <portal>].
iscsiadm: initiator reported error (24 - iSCSI login failed due to authorization failure)"
On the 3PAR (showevents) we can see:
"CHAP initiation auth.: authentication of <initiator> failed (wrong secret?)"
Scenario 1.- 3PAR controller port restarted
Scenario 2.- compute node restart (i.e. after maintenance or evacuate)
This issue was traced to the CHAP secret for the compute node having been updated on the 3PAR, so that it no longer matches the one stored on the compute node itself (/etc/iscsi/
Workaround:
1.- Connect to the compute node and get the CHAP secret it uses for the 3PAR node
$ grep pass /etc/iscsi/
2.- Connect to the 3PAR and update the CHAP secret for the compute node using the secret from step 1
$ sethost initchap -chapname ha-volume-manager <secret> <compute-node>
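The two workaround steps can be sketched in Python as follows. This is only an illustration: it assumes the node record under /etc/iscsi/ uses the usual open-iscsi `node.session.auth.*` keys, the sample record text and the host name `compute-01` are made up, and the helper names are hypothetical. The `sethost` line is only built as a string in the form shown in step 2; nothing is executed.

```python
def parse_node_auth(record_text):
    """Extract node.session.auth.* settings from open-iscsi node record text."""
    auth = {}
    for line in record_text.splitlines():
        line = line.strip()
        # Skip blanks, comments and anything that is not "key = value".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("node.session.auth."):
            # Keep only the last component, e.g. "password".
            auth[key.rsplit(".", 1)[1]] = value.strip()
    return auth


def build_sethost_command(chap_name, secret, host):
    """Build the 3PAR CLI line from step 2 (returned as text, never run)."""
    return "sethost initchap -chapname {} {} {}".format(chap_name, secret, host)


# Hypothetical record contents; real values come from the compute node.
sample = """\
node.session.auth.authmethod = CHAP
node.session.auth.username = ha-volume-manager
node.session.auth.password = example-secret
"""

auth = parse_node_auth(sample)
print(build_sethost_command(auth["username"], auth["password"], "compute-01"))
```

Reading the record text rather than grepping for "pass" also returns the CHAP user name, which should match the value given to -chapname.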
I couldn't find a way to reproduce the issue, but it seems it could be related to some driver operation updating the CHAP secret from the cinder DB.
While checking the DB I found some active volumes that keep a "wrong" CHAP secret (one not matching the secret on the compute node) in their information (provider_auth).
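A check like the one described above could be scripted roughly as below. It assumes that provider_auth is stored as the text "CHAP <username> <secret>", which is the usual Cinder convention but should be verified against the actual deployment. The rows, volume IDs and secrets are invented for illustration, and how the rows are fetched from the DB is left out.

```python
def parse_provider_auth(provider_auth):
    """Split a provider_auth value into (method, username, secret), or None."""
    if not provider_auth:
        return None
    parts = provider_auth.split()
    if len(parts) != 3:
        return None
    return tuple(parts)


def find_mismatched_volumes(rows, node_secret):
    """Return IDs of volumes whose stored CHAP secret differs from node_secret."""
    mismatched = []
    for volume_id, provider_auth in rows:
        parsed = parse_provider_auth(provider_auth)
        if parsed is None:
            continue  # no CHAP information stored for this volume
        method, _username, secret = parsed
        if method == "CHAP" and secret != node_secret:
            mismatched.append(volume_id)
    return mismatched


# Hypothetical (id, provider_auth) pairs as they might come out of the DB.
rows = [
    ("vol-a", "CHAP ha-volume-manager secret-one"),
    ("vol-b", "CHAP ha-volume-manager secret-two"),
    ("vol-c", None),
]
print(find_mismatched_volumes(rows, "secret-one"))
```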
I wonder if this is related to one of the issues already fixed (bug # 1737181) and whether, beyond the fix, some extra cleanup of the DB is required to keep out-of-sync CHAP secrets from being re-used.
I also know that one of the operations that sets CHAP for a host is the creation of the host definition, so that could explain Scenario 2. When a compute node has no volumes presented (i.e. host evacuate, with all instances migrated), the host definition is removed on the 3PAR. Once the node is back up and tries to create new iSCSI sessions, it fails because the host definition was re-created using a wrong secret.
Regarding Scenario 1, there must be some operation that updates CHAP without affecting the current sessions until a re-login is required for any reason (i.e. controller port restart).
If I find a way to reproduce it I will include more details.
Should you need anything else just let me know.
tags: added: 3par drivers hpe
Changed in cinder:
status: New → Incomplete
Hi Pedro,
Please find the response to your query below:
>>> The problem here is that I couldn't reproduce it yet, but to me it's important to know if somehow invalid CHAPs in the DB can be used to update the host definition and secret on the 3PAR. Could you answer this question? If this is the case, do we need to perform some update on the cinder DB to avoid this issue?
The HPE 3PAR driver does not directly update the cinder.volumes table with the CHAP secret; it is the Cinder layer sitting on top of the HPE 3PAR driver that updates the Cinder DB, and this is the recommended approach.
On the driver side, create_export() is the call responsible for the CHAP secret that ends up in the cinder.volumes table.
Since you already have VLUNs created on the 3PAR with a CHAP secret key, the hpe3par driver's create_export() will fetch that same secret from the existing VLUN and set it for newly created ones.
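The reuse behaviour described above can be pictured with a toy model like the one below. It is not the driver's code: the function name is hypothetical, and the "generate a new secret when no VLUN carries one" branch is an assumption added only to show how a host with no remaining VLUNs could end up with a secret that differs from the one cached on the compute node.

```python
import secrets


def choose_chap_secret(existing_vlun_secrets):
    """Reuse the secret found on an existing VLUN, otherwise generate one."""
    for secret in existing_vlun_secrets:
        if secret:
            return secret, "reused"
    # Assumed fallback: no VLUN carries a secret, so a new one is created.
    return secrets.token_hex(8), "generated"


# Host still has VLUNs: the old secret is carried over to the new export.
print(choose_chap_secret(["old-secret"]))

# Host has no VLUNs left: a fresh secret is generated, which would not
# match a secret still stored on the compute node.
secret, origin = choose_chap_secret([])
print(origin, secret != "old-secret")
```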
Some operation may have been performed that triggered bug #1737181 and bug #1690244, leaving the CHAP secret unset or incorrectly set on the 3PAR and therefore out of sync with create_export().
You have already tried applying the patches https://review.openstack.org/#/c/531669/4 and https://review.openstack.org/#/c/482103/, but that did not work.
I suggest removing all the VLUNs from the 3PAR for any ONE specific host, then applying both patches, restarting the services, and performing the same operation again using that host.