test_volume_swap_with_multiattach intermittently fails during cleanup when run in a multi-node job

Bug #1807723 reported by Matt Riedemann on 2018-12-10
This bug affects 1 person

Affects: tempest
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

This is found from debugging the tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap failures in the (multinode) tempest-slow job here:

https://review.openstack.org/#/c/606981/

The failure ultimately occurs during teardown of the test class, when it fails to delete the "old" multiattach volume:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz?level=TRACE#_Dec_08_01_45_41_380161

Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:41.380475 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.380783 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Exit code: 22
Dec 08 01:45:41.381094 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stdout: u'Command:\n\ttgtadm -C 0 --mode target --op delete --tid=1\nexited with code: 22.\n'
Dec 08 01:45:41.381386 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stderr: u'tgtadm: this target is still active\n': ProcessExecutionError: Unexpected error while running command.
Dec 08 01:45:49.587081 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [None req-9dddc5b9-7e06-4ab9-83a6-ef211e0d4c48 tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:49.587355 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8

After digging through the logs, I can see at least one set of issues in the ComputeManager.swap_volume code here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806

That code updates the "old" BDM with values from the new volume, including changing bdm.volume_id from the old volume's ID to the new one, which likely orphans the old record or otherwise breaks the cleanup.
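A minimal sketch of the epilogue described above (paraphrased, not the actual Nova source; field names follow the log output later in this report, and the IDs are placeholders): instead of deleting the old BDM and creating a new one, the single BDM row is repointed at the new volume.

```python
# Hypothetical sketch of the swap_volume BDM update: after this runs,
# the old volume's ID no longer appears anywhere in the record.

def repoint_bdm(bdm, new_volume_id, new_attachment_id, new_connection_info):
    """Update the old volume's BDM record in place to describe the new volume."""
    bdm.update({
        'volume_id': new_volume_id,            # previously the old volume's ID
        'attachment_id': new_attachment_id,    # Cinder attachment for the new volume
        'connection_info': new_connection_info,
        'source_type': 'volume',
        'destination_type': 'volume',
        'no_device': None,
    })
    return bdm

# Placeholder IDs, not the real UUIDs from the logs:
old_bdm = {'volume_id': 'vol1-id', 'attachment_id': 'vol1-attachment'}
updated = repoint_bdm(old_bdm, 'vol2-id', 'vol2-attachment',
                      '{"serial": "vol2-id"}')
```

After the update there is no BDM row left that mentions the old volume, which is the suspected source of the orphaned iSCSI target at cleanup time.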

My detailed notes follow:

server1:

{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:11.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-1", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

server1.hostId = f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19
server1 builds on the subnode (ubuntu-xenial-ovh-gra1-0001066278)

server2:

{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:10.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-2", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

server2.hostId = 5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df
server2 builds on the primary node (ubuntu-xenial-ovh-gra1-0001066161)

So clearly the two servers are on different hosts. Also, volume1 is on the primary node (same as server2).

attach volume1 to server1:

Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437

attach volume1 to server2:

Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437

swap volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) to volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51) on server1 (d46fba31-9469-4799-b2bf-1fbad4369a9a):

2018-12-08 01:43:28.336 8815 INFO tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8 0.618s
2018-12-08 01:43:28.337 8815 DEBUG tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'X-OpenStack-Nova-API-Version': '2.60'}
        Body: {"volumeAttachment": {"volumeId": "fa56e173-3bd4-4701-bc71-cb4810c53c51"}}
    Response - Headers: {u'x-openstack-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-length': '0', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache/2.4.18 (Ubuntu)', u'x-compute-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-type': 'application/json', 'content-location': 'https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8', u'date': 'Sat, 08 Dec 2018 01:43:27 GMT', 'status': '202', u'openstack-api-version': 'compute 2.60', u'connection': 'close', u'x-openstack-nova-api-version': '2.60'}

now the test waits for the swap to complete and the attachments are all recorded correctly;
once the swap is done, volume1 should be attached to server2 and volume2 should be attached
to server1

This is the connection_info for the old/new volume during the swap:

Dec 08 01:43:31.321263 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Calling driver volume swap with connection infos:

new: {'status': u'reserved', 'multiattach': True, 'detached_at': u'', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'attach_mode': u'null', 'driver_volume_type': u'iscsi', 'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', 'attached_at': u'', 'serial': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'data': {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_portal': u'213.32.74.238:3260', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'oAJ38A88Rom7xrwQ5A4Y', u'auth_method': u'CHAP'}};

old: {u'status': u'reserved', u'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', u'multiattach': True, u'attached_at': u'', u'attach_mode': u'null', u'driver_volume_type': u'iscsi', u'detached_at': u'', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'serial': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'data': {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}} {{(pid=20206) _swap_volume /opt/stack/nova/nova/compute/manager.py:5646}}

When the test cleans up and tries to delete volume1, it fails trying to cleanup iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8 from the subnode (which is the old connection to server1).

volume1 connection is supposed to be detached from server1 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_36_699753

Dec 08 01:45:36.699753 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] ==> disconnect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7fe1aa096d50>, {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}, None), 'kwargs': {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

It says it logged out:

Dec 08 01:45:37.336156 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: Logout of [sid: 1, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

This definitely looks wrong (post-swap, updating the old BDM record): note that it is updating the BDM for volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) with the connection_info for volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51). For the new volume attach flow the connection_info might not matter, since we likely don't use it (we just work with the attachment records), but if the attachment_id (769b75cc-4283-4b46-a975-243faae6e263) is for the wrong volume, that could be a problem:

Dec 08 01:45:37.699221 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Updating volume 2a48ae08-2833-488c-8771-eea86f41c7b8 BDM record with {'attachment_id': u'769b75cc-4283-4b46-a975-243faae6e263', 'no_device': None, 'connection_info': '{"status": "reserved", "multiattach": true, "detached_at": "", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "attach_mode": "null", "driver_volume_type": "iscsi", "instance": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "attached_at": "", "serial": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_portal": "213.32.74.238:3260", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_lun": 1, "device_path": "/dev/sdb", "auth_password": "Ko6DLxWNfrbqWQ7E", "auth_username": "oAJ38A88Rom7xrwQ5A4Y", "auth_method": "CHAP"}}', 'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'source_type': 'volume', 'snapshot_id': None, 'destination_type': 'volume'} {{(pid=20206) swap_volume /opt/stack/nova/nova/compute/manager.py:5819}}

Probably the bigger problem is that the BDM for volume1 is updated to have volume_id fa56e173-3bd4-4701-bc71-cb4810c53c51. Changing the volume_id on volume1's BDM might have something to do with orphaning volume1's connections when we go to delete, since deleting server2 (which still has volume1 attached) likely doesn't clean up volume1 properly.

The other things I'm wondering about are:

1. Is nova-compute deleting the old volume1/server1 attachment record here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5691

because I didn't see anything about attachment delete in the logs.

2. Is there a periodic task in the cinder-volume service that is recreating the target connection for volume1 and server1 in between the time we swap the volume and nova-compute deletes the connection but before nova-compute deletes the old attachment record?

The first and most obvious thing to do is fix that bdm update code at the end of the ComputeManager.swap_volume method. I'm not exactly sure yet how the multi-host nature of this comes into play, except possibly in the libvirt driver volume disconnect code here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/virt/libvirt/driver.py#L1221

Matt Riedemann (mriedem) wrote :

I'm not sure if multiple hosts is really the main culprit because the test passed on this tempest-slow job:

http://logs.openstack.org/78/606978/4/check/tempest-slow/63b3e0e/

And in that case, the two servers involved in the swap from the same multiattach volume are on different hosts. So maybe there is some timing issue involved when this fails.

In that passing case, these are the two servers:

Body: {"servers": [{"id": "988b38e6-1556-45b8-94ba-76d5045ef5e9", "links": [{"href": "https://10.210.68.80/compute/v2.1/servers/988b38e6-1556-45b8-94ba-76d5045ef5e9", "rel": "self"}, {"href": "https://10.210.68.80/compute/servers/988b38e6-1556-45b8-94ba-76d5045ef5e9", "rel": "bookmark"}], "name": "tempest-TestMultiAttachVolumeSwap-server-1095754502-2"}, {"id": "2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "links": [{"href": "https://10.210.68.80/compute/v2.1/servers/2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "rel": "self"}, {"href": "https://10.210.68.80/compute/servers/2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "rel": "bookmark"}], "name": "tempest-TestMultiAttachVolumeSwap-server-1095754502-1"}]} _log_request_full tempest/lib/common/rest_client.py:437

988b38e6-1556-45b8-94ba-76d5045ef5e9 (server2) builds on the primary node and 2f4078c8-4911-4154-b24b-ea2ca37fcdd3 (server1) builds on the subnode.

Maybe it depends on where volume1 builds? In that test, volume1 is 2fe7ad08-b928-4139-9f9d-a4ee54ac630d which builds on the primary node, and if server1 is on the subnode, then that is one difference from the fail scenario where volume1 and server1 are on the same host.

description: updated
Matt Riedemann (mriedem) wrote :

Ignore comment 1, the original description of the bug was wrong because in the fail scenario, volume1 is on the primary node and server1 is on the subnode.

Matt Riedemann (mriedem) wrote :

For the fail scenario, it looks like this is where we delete the old attachment record:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_40_765428

Dec 08 01:45:40.765428 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG cinder.volume.manager [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Deleting attachment d7bfeddf-c8c6-4072-8d34-91aaba0ab343. {{(pid=1884) _do_attachment_delete /opt/stack/cinder/cinder/volume/manager.py:4525}}

And removing the connection fails:

Dec 08 01:45:41.058514 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: INFO cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Removing iscsi_target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.060048 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8 {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:372}}
Dec 08 01:45:41.074707 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.076726 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 08 01:45:41.373922 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: pam_unix(sudo:session): session closed for user root
Dec 08 01:45:41.376988 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8" returned: 22 in 0.316s {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:409}}
Dec 08 01:45:41.378167 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] u'sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8' failed. Not Retrying. {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:457}}
Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9...


Matt Riedemann (mriedem) wrote :

I think this is where that cinder attachment delete fails:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4528

Notice the FIXME in there:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4530

# FIXME(jdg): Obviously our volume object is going to need some
# changes to deal with multi-attach and figuring out how to
# represent a single failed attach out of multiple attachments

Because of that exception we don't call this:

            self.db.volume_detached(context.elevated(), vref.id,
                                    attachment.get('id'))

Which would delete the volume attachment record in the Cinder DB, but I'm not sure how much that matters since on the nova side we update the volume1 BDM to point at a new attachment_id (769b75cc-4283-4b46-a975-243faae6e263).
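The failure path can be paraphrased as follows (a hypothetical sketch, not the actual Cinder code; remove_iscsi_target and do_attachment_delete stand in for the calls at the linked lines): when removing the iSCSI export raises, the DB call that would delete the attachment row is never reached.

```python
# Sketch of the attachment-delete flow described above.

class TargetStillActive(Exception):
    pass

def remove_iscsi_target(attachment):
    # Stands in for the 'tgt-admin --force --delete' call that fails in the logs.
    if attachment.get('target_in_use'):
        raise TargetStillActive('tgtadm: this target is still active')

def do_attachment_delete(attachment, detached_rows):
    try:
        remove_iscsi_target(attachment)
    except TargetStillActive:
        return False  # the FIXME path: the attachment row is left in the DB
    detached_rows.append(attachment['id'])  # stands in for db.volume_detached()
    return True

detached = []
ok = do_attachment_delete({'id': 'attach-1', 'target_in_use': True}, detached)
```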

Matt Riedemann (mriedem) wrote :

Looking at the swap volume flow in nova again, I think

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806

is likely intentional, since for volume1/server1 there is only a single BDM record. It starts out with the old volume_id and attachment_id, and once the swap is complete, rather than deleting the old BDM for vol1/server1 and creating a new BDM for vol2/server1, the compute manager simply updates the BDM for vol1/server1 to point at vol2 (it updates the connection_info, attachment_id and volume_id to those of vol2).

So when we delete server1, it should cleanup volume2 from the bdm that points at volume2. And when we delete server2, it should cleanup volume1 from the bdm that points at volume1.

The problem likely goes back to the old vol1/server1 attachment delete failing here:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4528

Maybe we need to retry in that case if we're hitting a race? Maybe we need to detect "tgtadm: this target is still active" here:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/targets/tgt.py#L283

And retry?
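One hedged way to express the suggested retry (illustrative only; a real fix would presumably reuse Cinder's existing retry utilities and inspect the real ProcessExecutionError, and the helper names here are made up):

```python
import time

class ProcessError(Exception):
    """Stand-in for oslo's ProcessExecutionError."""
    def __init__(self, stderr):
        super().__init__(stderr)
        self.stderr = stderr

def delete_target_with_retry(run_tgt_admin, attempts=3, delay=0.0):
    """Retry the target delete while tgtadm reports it still active."""
    for attempt in range(1, attempts + 1):
        try:
            run_tgt_admin()
            return attempt
        except ProcessError as exc:
            if ('this target is still active' not in exc.stderr
                    or attempt == attempts):
                raise
            time.sleep(delay)  # give the initiator time to finish logging out

# Simulate a command that fails once with the error from the logs, then succeeds:
state = {'calls': 0}
def fake_tgt_admin():
    state['calls'] += 1
    if state['calls'] < 2:
        raise ProcessError('tgtadm: this target is still active\n')

attempts_used = delete_target_with_retry(fake_tgt_admin)
```

As the next comment notes, a fixed retry window only helps if the race actually resolves within it.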

Changed in cinder:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

I'm not sure that a retry in cinder will fix the problem since the attachment delete for server1/volume1 fails here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_41_380161

and the volume delete fails for the same reason 8 seconds later here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_49_587081

so if it is a timing race, does it really take longer than 8 seconds for that target to no longer be active? Or is there something wrong where it thinks that target is the one being used between volume1 and server2 and we haven't yet deleted server2?

Matt Riedemann (mriedem) wrote :

Here are the n-cpu details from when volume1 is attached to server2:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_20_805619

Dec 08 01:43:20.805619 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: INFO nova.compute.manager [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: e0b74695-f9a0-4f4c-8a08-a0de3420cc6c] Attaching volume 2a48ae08-2833-488c-8771-eea86f41c7b8 to /dev/vdb

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_24_073493

Dec 08 01:43:24.073493 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: DEBUG os_brick.initiator.connectors.iscsi [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] ==> connect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7f3217c0de50>, {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}), 'kwargs': {}}" {{(pid=32186) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

Dec 08 01:43:25.437141 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: Login to [iface: default, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

Dec 08 01:43:25.455421 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: DEBUG os_brick.initiator.connectors.iscsi [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] Connected to sda using {u'auth_password': u'***', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'} {{(pid=32186) _connect_vol /usr/local/lib/python2.7/dist-packages/os_brick/initiator/connectors/iscsi.py:663}}

Matt Riedemann (mriedem) wrote :

Details from attaching volume1 to server1:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_12_388771

Dec 08 01:43:12.388771 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: INFO nova.compute.manager [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] Attaching volume 2a48ae08-2833-488c-8771-eea86f41c7b8 to /dev/vdb

Dec 08 01:43:16.269897 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] ==> connect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7fe1aa096d50>, {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}), 'kwargs': {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

Should target_lun be the same in both cases?

Dec 08 01:43:17.800452 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: Login to [iface: default, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

Dec 08 01:43:17.825508 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] Connected to sda using {u'auth_password': u'***', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'} {{(pid=20206) _connect_vol /usr/local/lib/python2.7/dist-packages/os_brick/initiator/connectors/iscsi.py:663}}

---

We start deleting server1 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_745301

Dec 08 01:45:42.745301 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: INFO nova.compute.manager [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] Terminating instance

Disconnects from volume2 here:

Dec 08 01:45:43.151301 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-Test...


Matt Riedemann (mriedem) wrote :

We start deleting server2 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_924650

Dec 08 01:45:42.924650 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: INFO nova.compute.manager [None req-a0211268-269b-4227-ba65-a680ec13b6e3 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: e0b74695-f9a0-4f4c-8a08-a0de3420cc6c] Terminating instance

Interestingly, I don't see anything in there about disconnecting from volume1, which likely means the BDM that represents volume1 and server2 is not found. That is likely why deleting volume1 fails: when deleting server2, which is the only thing still supposed to be attached to volume1 after the swap (server1 swaps from vol1 to vol2), we should be disconnecting from volume1 but are not, and the target connection is left intact.
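A hypothetical sketch of the delete path implied above (the function names are made up): if the BDM query for the instance returns nothing for volume1, the disconnect step is silently skipped and the iSCSI target stays logged in on the host.

```python
# Sketch: server delete only disconnects volumes it can find BDMs for.

def volume_bdms_for_instance(bdms, instance_uuid):
    """Stand-in for the BDM query used during server delete."""
    return [b for b in bdms if b['instance_uuid'] == instance_uuid]

def disconnect_volumes(bdms, instance_uuid):
    disconnected = []
    for bdm in volume_bdms_for_instance(bdms, instance_uuid):
        # Would call the virt driver / os-brick disconnect here.
        disconnected.append(bdm['volume_id'])
    return disconnected

# If volume1's BDM has gone missing (or was already marked deleted and is
# filtered out by the query), nothing gets disconnected:
result = disconnect_volumes([], 'server2-uuid')
```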

Matt Riedemann (mriedem) on 2018-12-10
no longer affects: cinder

With the debug patch I see that we're definitely having some weird DB issue with BDMs during server delete:

http://logs.openstack.org/78/606978/5/check/tempest-slow/5a90cad/controller/logs/screen-n-api.txt.gz#_Dec_10_20_29_39_767102

Dec 10 20:29:39.767102 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: ERROR nova.compute.api [None req-04f71fae-15dd-4cc0-b211-4a4f53a7cbc8 tempest-TestMultiAttachVolumeSwap-1722594678 tempest-TestMultiAttachVolumeSwap-1722594678] [instance: c3c9407c-e2af-4d04-94ed-f334844ea6bf] No volume BDMs found for server.
Dec 10 20:29:39.775519 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: ERROR nova.compute.api [None req-04f71fae-15dd-4cc0-b211-4a4f53a7cbc8 tempest-TestMultiAttachVolumeSwap-1722594678 tempest-TestMultiAttachVolumeSwap-1722594678] [instance: c3c9407c-e2af-4d04-94ed-f334844ea6bf] BDMs were already deleted: [BlockDeviceMapping(attachment_id=None,boot_index=0,connection_info=None,created_at=2018-12-10T20:28:24Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=1,image_id='863afc54-1096-4382-b8f2-6103641a65c1',instance=<?>,instance_uuid=c3c9407c-e2af-4d04-94ed-f334844ea6bf,no_device=False,snapshot_id=None,source_type='image',tag=None,updated_at=2018-12-10T20:28:25Z,uuid=be69a559-05ee-43bd-8170-2c65cc2a518c,volume_id=None,volume_size=None,volume_type=None), BlockDeviceMapping(attachment_id=be11fe1f-5c65-4a64-a5c6-caa934f564c9,boot_index=None,connection_info='{"status": "reserved", "multiattach": true, "detached_at": "", "volume_id": "26af085c-f977-4508-8bb1-46a57c8f34ed", "attach_mode": "null", "driver_volume_type": "iscsi", "instance": "c3c9407c-e2af-4d04-94ed-f334844ea6bf", "attached_at": "", "serial": "26af085c-f977-4508-8bb1-46a57c8f34ed", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-26af085c-f977-4508-8bb1-46a57c8f34ed", "target_portal": "10.210.4.21:3260", "volume_id": "26af085c-f977-4508-8bb1-46a57c8f34ed", "target_lun": 1, "device_path": "/dev/sda", "auth_password": "***", "auth_username": "P2uFcW9nEpz3GcuKwqHv", "auth_method": 
"CHAP"}}',created_at=2018-12-10T20:28:37Z,delete_on_termination=False,deleted=True,deleted_at=2018-12-10T20:29:39Z,destination_type='volume',device_name='/dev/vdb',device_type=None,disk_bus=None,guest_format=None,id=4,image_id=None,instance=<?>,instance_uuid=c3c9407c-e2af-4d04-94ed-f334844ea6bf,no_device=False,snapshot_id=None,source_type='volume',tag=None,updated_at=2018-12-10T20:28:44Z,uuid=15aba0c7-c
Dec 10 20:29:39.776294 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: 612-40a4-9445-dfe964764b02,volume_id='26af085c-f977-4508-8bb1-46a57c8f34ed',volume_size=1,volume_type=None)]

That's one of the TestMultiAttachVolumeSwap servers being deleted, and the volume BDM representing the multiattach volume shows up as already deleted, so we don't detach it. It's not yet clear to me what is deleting that BDM.

I have a suspicion that in the original scenario:

a) volume1 attached to server1 and server2
b) swap volume1 to volume2 on ...


Matt Riedemann (mriedem) wrote :

In the error log in comment 11, 26af085c-f977-4508-8bb1-46a57c8f34ed is volume1 (the volume that is swapped *from* on server1).

Maybe the failure depends on the order in which the servers are deleted by tempest?

Matt Riedemann (mriedem) wrote :

Oh, I guess tempest will detach the volume from each server before deleting each server, because of its cleanup steps.

So since we do this to set up:

1. create volume1
2. create volume2
3. create server1 and server2 (multi-create request)
4. attach volume1 to server1
5. attach volume1 to server2
6. swap volume1 to volume2 on server1

When the test tears down, it will cleanup in this order:

1. detach volume1 from server2
2. detach volume1 from server1 (this should 404 since volume1 is not attached to server1 after the swap)
3. delete server1 and server2
4. delete volume2
5. delete volume1
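The reason teardown runs in exactly the reverse order of setup is that tempest builds on unittest's addCleanup, which executes cleanup callbacks LIFO. A minimal sketch (resource names mirror the scenario above; this is not tempest's actual test code):

```python
import unittest

executed = []  # records the order cleanups actually run in


class CleanupOrderSketch(unittest.TestCase):
    """Illustrates unittest's LIFO cleanup ordering, which tempest relies on."""

    def test_swap_scenario(self):
        # Cleanups are registered in setup order; unittest runs them in reverse.
        self.addCleanup(executed.append, "delete volume1")              # 1. create volume1
        self.addCleanup(executed.append, "delete volume2")              # 2. create volume2
        self.addCleanup(executed.append, "delete server1 and server2")  # 3. create servers
        self.addCleanup(executed.append, "detach volume1 from server1") # 4. attach v1 -> s1
        self.addCleanup(executed.append, "detach volume1 from server2") # 5. attach v1 -> s2
        # Note: nothing registers "detach volume2 from server1" after the swap,
        # which is the gap described below.


unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(CleanupOrderSketch))
print(executed)
```

Running this prints the cleanups in the same order as the five-step teardown list above, with the deletes of volume1/volume2 last.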

One thing that appears to be missing is that tempest does not explicitly detach volume2 from server1 on teardown, so that detach would have to be done by nova-compute when deleting server1, which is confirmed by:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_745301

in comment 8:

Dec 08 01:45:43.084933 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.virt.libvirt.volume.iscsi [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] calling os-brick to detach iSCSI Volume {{(pid=20206) disconnect_volume /opt/stack/nova/nova/virt/libvirt/volume/iscsi.py:72}}

So maybe what is happening is this: tempest doesn't know volume2 is attached to server1, so it deletes server1 (which detaches volume2), but since tempest isn't waiting for volume2 to be detached before trying to delete it, it goes straight to deleting volume2, which fails because the backend target is not yet disconnected.
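If that's the race, the shape of the fix is to wait for the volume to report a detached status before issuing the delete. A hedged sketch of such a waiter, assuming a hypothetical show_volume(volume_id) -> dict client call (this is an illustration, not tempest's actual waiter):

```python
import time


def wait_for_volume_status(show_volume, volume_id, status="available",
                           timeout=60, interval=2):
    """Poll a hypothetical show_volume(volume_id) -> dict until the volume
    reaches the given status. Sketches the kind of wait that would need to
    happen between deleting server1 and deleting volume2."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        vol = show_volume(volume_id)
        if vol["status"] == status:
            return vol
        if vol["status"] == "error":
            raise RuntimeError("volume %s went to error state" % volume_id)
        time.sleep(interval)
    raise TimeoutError("volume %s did not reach %s within %ss"
                       % (volume_id, status, timeout))
```

Until the backend target is actually disconnected, cinder keeps the volume in a detaching/in-use state, so a delete issued immediately can fail the way the tgtadm "this target is still active" trace shows.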

Matt Riedemann (mriedem) wrote :

I have a fix for tempest here: https://review.openstack.org/#/c/624491/

Changed in tempest:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Matt Riedemann (mriedem)
no longer affects: nova
summary: - swap multiattach volume intermittently fails when servers are on
- different hosts
+ test_volume_swap_with_multiattach intermittently fails during cleanup

My proposed tempest fix apparently didn't do the trick since the tempest change to run slow multiattach tests on top of it still failed.

Changed in tempest:
importance: High → Medium
summary: test_volume_swap_with_multiattach intermittently fails during cleanup
+ when run in a multi-node job