test_volume_swap_with_multiattach intermittently fails during cleanup when run in a multi-node job

Bug #1807723 reported by Matt Riedemann on 2018-12-10
This bug affects 1 person

Affects: tempest
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

This is found from debugging the tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap failures in the (multinode) tempest-slow job here:

https://review.openstack.org/#/c/606981/

The failure ultimately occurs during teardown of the test class, when it fails to delete the "old" multiattach volume:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz?level=TRACE#_Dec_08_01_45_41_380161

Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:41.380475 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.380783 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Exit code: 22
Dec 08 01:45:41.381094 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stdout: u'Command:\n\ttgtadm -C 0 --mode target --op delete --tid=1\nexited with code: 22.\n'
Dec 08 01:45:41.381386 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Stderr: u'tgtadm: this target is still active\n': ProcessExecutionError: Unexpected error while running command.
Dec 08 01:45:49.587081 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [None req-9dddc5b9-7e06-4ab9-83a6-ef211e0d4c48 tempest-TestMultiAttachVolumeSwap-1803326020 None] Failed to remove iscsi target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8: Unexpected error while running command.
Dec 08 01:45:49.587355 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: Command: sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8

After digging through the logs, I can see at least one set of issues in the ComputeManager.swap_volume code here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806

That code updates the "old" BDM with values from the new volume, including changing bdm.volume_id from the old volume's ID to the new one, which likely orphans the old record or otherwise breaks the cleanup.
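A minimal sketch of the epilogue described above (paraphrased, not the actual Nova source; field names follow the log output later in this report, and the IDs are placeholders): instead of deleting the old BDM and creating a new one, the single BDM row is repointed at the new volume.

```python
# Hypothetical sketch of the swap_volume BDM update: after this runs,
# the old volume's ID no longer appears anywhere in the record.

def repoint_bdm(bdm, new_volume_id, new_attachment_id, new_connection_info):
    """Update the old volume's BDM record in place to describe the new volume."""
    bdm.update({
        'volume_id': new_volume_id,            # previously the old volume's ID
        'attachment_id': new_attachment_id,    # Cinder attachment for the new volume
        'connection_info': new_connection_info,
        'source_type': 'volume',
        'destination_type': 'volume',
        'no_device': None,
    })
    return bdm

# Placeholder IDs, not the real UUIDs from the logs:
old_bdm = {'volume_id': 'vol1-id', 'attachment_id': 'vol1-attachment'}
updated = repoint_bdm(old_bdm, 'vol2-id', 'vol2-attachment',
                      '{"serial": "vol2-id"}')
```

After the update there is no BDM row left that mentions the old volume, which is the suspected source of the orphaned iSCSI target at cleanup time.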

My detailed notes follow:

server1:

{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:11.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-1", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

server1.hostId = f4600b0fc614ce95d5be30fc1fc82bdb150c47ff8c84b22262be6d19
server1 builds on the subnode (ubuntu-xenial-ovh-gra1-0001066278)

server2:

{"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "https://149.202.185.218/compute/v2.1/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "self"}, {"href": "https://149.202.185.218/compute/servers/e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "rel": "bookmark"}], "image": {"id": "f55b4d12-f159-453f-86cb-8965684121c8", "links": [{"href": "https://149.202.185.218/compute/images/f55b4d12-f159-453f-86cb-8965684121c8", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-SRV-USG:launched_at": "2018-12-08T01:43:10.000000", "flavor": {"ephemeral": 0, "ram": 64, "original_name": "m1.nano", "vcpus": 1, "extra_specs": {}, "swap": 0, "disk": 0}, "id": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "description": null, "user_id": "9bc61597b6794ca49c79ce599c26b223", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "status": "ACTIVE", "updated": "2018-12-08T01:43:11Z", "hostId": "5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df", "OS-SRV-USG:terminated_at": null, "tags": [], "key_name": null, "locked": false, "name": "tempest-TestMultiAttachVolumeSwap-server-226416218-2", "created": "2018-12-08T01:43:04Z", "tenant_id": "5db98765c5d94a73bae8aba2f13c7762", "os-extended-volumes:volumes_attached": [], "metadata": {}}}

server2.hostId = 5f1a8abdf427d6cbda8003bd178ef40c605f24f626ec88c3a02816df
server2 builds on the primary node (ubuntu-xenial-ovh-gra1-0001066161)

So clearly the two servers are on different hosts. Also, volume1 is on the primary node (same as server2).

attach volume1 to server1:

Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437

attach volume1 to server2:

Body: {"volumeAttachment": {"device": "/dev/vdb", "serverId": "e0b74695-f9a0-4f4c-8a08-a0de3420cc6c", "id": "2a48ae08-2833-488c-8771-eea86f41c7b8", "volumeId": "2a48ae08-2833-488c-8771-eea86f41c7b8"}} _log_request_full tempest/lib/common/rest_client.py:437

swap volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) to volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51) on server1 (d46fba31-9469-4799-b2bf-1fbad4369a9a):

2018-12-08 01:43:28.336 8815 INFO tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8 0.618s
2018-12-08 01:43:28.337 8815 DEBUG tempest.lib.common.rest_client [req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>', 'X-OpenStack-Nova-API-Version': '2.60'}
        Body: {"volumeAttachment": {"volumeId": "fa56e173-3bd4-4701-bc71-cb4810c53c51"}}
    Response - Headers: {u'x-openstack-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-length': '0', u'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', u'server': 'Apache/2.4.18 (Ubuntu)', u'x-compute-request-id': 'req-b6cb2f89-ebf6-47b5-9757-8692dd91c812', u'content-type': 'application/json', 'content-location': 'https://149.202.185.218/compute/v2.1/servers/d46fba31-9469-4799-b2bf-1fbad4369a9a/os-volume_attachments/2a48ae08-2833-488c-8771-eea86f41c7b8', u'date': 'Sat, 08 Dec 2018 01:43:27 GMT', 'status': '202', u'openstack-api-version': 'compute 2.60', u'connection': 'close', u'x-openstack-nova-api-version': '2.60'}

now the test waits for the swap to complete and the attachments are all recorded correctly;
once the swap is done, volume1 should be attached to server2 and volume2 should be attached
to server1

This is the connection_info for the old/new volume during the swap:

Dec 08 01:43:31.321263 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Calling driver volume swap with connection infos:

new: {'status': u'reserved', 'multiattach': True, 'detached_at': u'', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'attach_mode': u'null', 'driver_volume_type': u'iscsi', 'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', 'attached_at': u'', 'serial': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'data': {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_portal': u'213.32.74.238:3260', u'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'oAJ38A88Rom7xrwQ5A4Y', u'auth_method': u'CHAP'}};

old: {u'status': u'reserved', u'instance': u'd46fba31-9469-4799-b2bf-1fbad4369a9a', u'multiattach': True, u'attached_at': u'', u'attach_mode': u'null', u'driver_volume_type': u'iscsi', u'detached_at': u'', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'serial': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'data': {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}} {{(pid=20206) _swap_volume /opt/stack/nova/nova/compute/manager.py:5646}}

When the test cleans up and tries to delete volume1, it fails trying to cleanup iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8 from the subnode (which is the old connection to server1).

volume1 connection is supposed to be detached from server1 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_36_699753

Dec 08 01:45:36.699753 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] ==> disconnect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7fe1aa096d50>, {u'device_path': u'/dev/sda', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'auth_password': u'***', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}, None), 'kwargs': {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

It says it logged out:

Dec 08 01:45:37.336156 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: Logout of [sid: 1, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

This definitely looks wrong (post-swap, updating the old BDM record): note that it is updating the BDM for volume1 (2a48ae08-2833-488c-8771-eea86f41c7b8) with the connection_info for volume2 (fa56e173-3bd4-4701-bc71-cb4810c53c51). For the new volume attach flow the connection_info might not matter, since we likely don't use it (we just work with the attachment records), but if the attachment_id (769b75cc-4283-4b46-a975-243faae6e263) is for the wrong volume, that could be a problem:

Dec 08 01:45:37.699221 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.compute.manager [None req-b6cb2f89-ebf6-47b5-9757-8692dd91c812 tempest-TestMultiAttachVolumeSwap-477435682 tempest-TestMultiAttachVolumeSwap-477435682] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] swap_volume: Updating volume 2a48ae08-2833-488c-8771-eea86f41c7b8 BDM record with {'attachment_id': u'769b75cc-4283-4b46-a975-243faae6e263', 'no_device': None, 'connection_info': '{"status": "reserved", "multiattach": true, "detached_at": "", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "attach_mode": "null", "driver_volume_type": "iscsi", "instance": "d46fba31-9469-4799-b2bf-1fbad4369a9a", "attached_at": "", "serial": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_portal": "213.32.74.238:3260", "volume_id": "fa56e173-3bd4-4701-bc71-cb4810c53c51", "target_lun": 1, "device_path": "/dev/sdb", "auth_password": "Ko6DLxWNfrbqWQ7E", "auth_username": "oAJ38A88Rom7xrwQ5A4Y", "auth_method": "CHAP"}}', 'volume_id': u'fa56e173-3bd4-4701-bc71-cb4810c53c51', 'source_type': 'volume', 'snapshot_id': None, 'destination_type': 'volume'} {{(pid=20206) swap_volume /opt/stack/nova/nova/compute/manager.py:5819}}

Probably the bigger problem is that the BDM for volume1 is updated to have volume_id fa56e173-3bd4-4701-bc71-cb4810c53c51. Changing the volume_id on volume1's BDM might have something to do with orphaning volume1's connections when we go to delete, since deleting server2 (which still has volume1 attached) likely doesn't clean up volume1 properly.

The other things I'm wondering about are:

1. Is nova-compute deleting the old volume1/server1 attachment record here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5691

because I didn't see anything about attachment delete in the logs.

2. Is there a periodic task in the cinder-volume service that is recreating the target connection for volume1 and server1 in between the time we swap the volume and nova-compute deletes the connection but before nova-compute deletes the old attachment record?

The first and most obvious thing to do is fix that bdm update code at the end of the ComputeManager.swap_volume method. I'm not exactly sure yet how the multi-host nature of this comes into play, except possibly in the libvirt driver volume disconnect code here:

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/virt/libvirt/driver.py#L1221

Matt Riedemann (mriedem) wrote :

I'm not sure if multiple hosts is really the main culprit because the test passed on this tempest-slow job:

http://logs.openstack.org/78/606978/4/check/tempest-slow/63b3e0e/

And in that case, the two servers involved in the swap from the same multiattach volume are on different hosts. So maybe there is some timing issue involved when this fails.

In that passing case, these are the two servers:

Body: {"servers": [{"id": "988b38e6-1556-45b8-94ba-76d5045ef5e9", "links": [{"href": "https://10.210.68.80/compute/v2.1/servers/988b38e6-1556-45b8-94ba-76d5045ef5e9", "rel": "self"}, {"href": "https://10.210.68.80/compute/servers/988b38e6-1556-45b8-94ba-76d5045ef5e9", "rel": "bookmark"}], "name": "tempest-TestMultiAttachVolumeSwap-server-1095754502-2"}, {"id": "2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "links": [{"href": "https://10.210.68.80/compute/v2.1/servers/2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "rel": "self"}, {"href": "https://10.210.68.80/compute/servers/2f4078c8-4911-4154-b24b-ea2ca37fcdd3", "rel": "bookmark"}], "name": "tempest-TestMultiAttachVolumeSwap-server-1095754502-1"}]} _log_request_full tempest/lib/common/rest_client.py:437

988b38e6-1556-45b8-94ba-76d5045ef5e9 (server2) builds on the primary node and 2f4078c8-4911-4154-b24b-ea2ca37fcdd3 (server1) builds on the subnode.

Maybe it depends on where volume1 builds? In that test, volume1 is 2fe7ad08-b928-4139-9f9d-a4ee54ac630d which builds on the primary node, and if server1 is on the subnode, then that is one difference from the fail scenario where volume1 and server1 are on the same host.

description: updated
Matt Riedemann (mriedem) wrote :

Ignore comment 1, the original description of the bug was wrong because in the fail scenario, volume1 is on the primary node and server1 is on the subnode.

Matt Riedemann (mriedem) wrote :

For the fail scenario, it looks like this is where we delete the old attachment record:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_40_765428

Dec 08 01:45:40.765428 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG cinder.volume.manager [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Deleting attachment d7bfeddf-c8c6-4072-8d34-91aaba0ab343. {{(pid=1884) _do_attachment_delete /opt/stack/cinder/cinder/volume/manager.py:4525}}

And removing the connection fails:

Dec 08 01:45:41.058514 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: INFO cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Removing iscsi_target for Volume ID: 2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.060048 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] Running cmd (subprocess): sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8 {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:372}}
Dec 08 01:45:41.074707 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: stack : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/local/bin/cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8
Dec 08 01:45:41.076726 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: pam_unix(sudo:session): session opened for user root by (uid=0)
Dec 08 01:45:41.373922 ubuntu-xenial-ovh-gra1-0001066161 sudo[9566]: pam_unix(sudo:session): session closed for user root
Dec 08 01:45:41.376988 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8" returned: 22 in 0.316s {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:409}}
Dec 08 01:45:41.378167 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: DEBUG oslo_concurrency.processutils [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9-7e80-4481-bce1-7a1060c6728b tempest-TestMultiAttachVolumeSwap-1803326020 None] u'sudo cinder-rootwrap /etc/cinder/rootwrap.conf tgt-admin --force --delete iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8' failed. Not Retrying. {{(pid=1884) execute /usr/local/lib/python2.7/dist-packages/oslo_concurrency/processutils.py:457}}
Dec 08 01:45:41.380161 ubuntu-xenial-ovh-gra1-0001066161 cinder-volume[1776]: ERROR cinder.volume.targets.tgt [req-4aa23d64-309a-44a4-a7c9-381a6fbef330 req-f47e0ea9...


Matt Riedemann (mriedem) wrote :

I think this is where that cinder attachment delete fails:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4528

Notice the FIXME in there:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4530

# FIXME(jdg): Obviously our volume object is going to need some
# changes to deal with multi-attach and figuring out how to
# represent a single failed attach out of multiple attachments

Because of that exception we don't call this:

            self.db.volume_detached(context.elevated(), vref.id,
                                    attachment.get('id'))

Which would delete the volume attachment record in the Cinder DB, but I'm not sure how much that matters since on the nova side we update the volume1 BDM to point at a new attachment_id (769b75cc-4283-4b46-a975-243faae6e263).
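The failure path can be paraphrased as follows (a hypothetical sketch, not the actual Cinder code; remove_iscsi_target and do_attachment_delete stand in for the calls at the linked lines): when removing the iSCSI export raises, the DB call that would delete the attachment row is never reached.

```python
# Sketch of the attachment-delete flow described above.

class TargetStillActive(Exception):
    pass

def remove_iscsi_target(attachment):
    # Stands in for the 'tgt-admin --force --delete' call that fails in the logs.
    if attachment.get('target_in_use'):
        raise TargetStillActive('tgtadm: this target is still active')

def do_attachment_delete(attachment, detached_rows):
    try:
        remove_iscsi_target(attachment)
    except TargetStillActive:
        return False  # the FIXME path: the attachment row is left in the DB
    detached_rows.append(attachment['id'])  # stands in for db.volume_detached()
    return True

detached = []
ok = do_attachment_delete({'id': 'attach-1', 'target_in_use': True}, detached)
```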

Matt Riedemann (mriedem) wrote :

Looking at the swap volume flow in nova again, I think

https://github.com/openstack/nova/blob/ae3064b7a820ea02f7fc8a1aa4a41f35a06534f1/nova/compute/manager.py#L5798-L5806

is likely intentional, since for volume1/server1 there is only a single BDM record. It starts out with the old volume_id and attachment_id, and once the swap is complete, rather than deleting the old BDM for vol1/server1 and creating a new BDM for vol2/server1, the compute manager simply updates the BDM for vol1/server1 to point at vol2 (it updates the connection_info, attachment_id and volume_id to those of vol2).

So when we delete server1, it should cleanup volume2 from the bdm that points at volume2. And when we delete server2, it should cleanup volume1 from the bdm that points at volume1.

The problem likely goes back to the old vol1/server1 attachment delete failing here:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/manager.py#L4528

Maybe we need to retry in that case if we're hitting a race? Maybe we need to detect "tgtadm: this target is still active" here:

https://github.com/openstack/cinder/blob/9d4fd16bcd8eca930910798cc519cb5bc5846c59/cinder/volume/targets/tgt.py#L283

And retry?
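One hedged way to express the suggested retry (illustrative only; a real fix would presumably reuse Cinder's existing retry utilities and inspect the real ProcessExecutionError, and the helper names here are made up):

```python
import time

class ProcessError(Exception):
    """Stand-in for oslo's ProcessExecutionError."""
    def __init__(self, stderr):
        super().__init__(stderr)
        self.stderr = stderr

def delete_target_with_retry(run_tgt_admin, attempts=3, delay=0.0):
    """Retry the target delete while tgtadm reports it still active."""
    for attempt in range(1, attempts + 1):
        try:
            run_tgt_admin()
            return attempt
        except ProcessError as exc:
            if ('this target is still active' not in exc.stderr
                    or attempt == attempts):
                raise
            time.sleep(delay)  # give the initiator time to finish logging out

# Simulate a command that fails once with the error from the logs, then succeeds:
state = {'calls': 0}
def fake_tgt_admin():
    state['calls'] += 1
    if state['calls'] < 2:
        raise ProcessError('tgtadm: this target is still active\n')

attempts_used = delete_target_with_retry(fake_tgt_admin)
```

As the next comment notes, a fixed retry window only helps if the race actually resolves within it.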

Changed in cinder:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

I'm not sure that a retry in cinder will fix the problem since the attachment delete for server1/volume1 fails here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_41_380161

and the volume delete fails for the same reason 8 seconds later here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-c-vol.txt.gz#_Dec_08_01_45_49_587081

so if it is a timing race, does it really take longer than 8 seconds for that target to no longer be active? Or is there something wrong where it thinks that target is the one being used between volume1 and server2 and we haven't yet deleted server2?

Matt Riedemann (mriedem) wrote :

Here are the n-cpu details from when volume1 is attached to server2:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_20_805619

Dec 08 01:43:20.805619 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: INFO nova.compute.manager [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: e0b74695-f9a0-4f4c-8a08-a0de3420cc6c] Attaching volume 2a48ae08-2833-488c-8771-eea86f41c7b8 to /dev/vdb

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_24_073493

Dec 08 01:43:24.073493 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: DEBUG os_brick.initiator.connectors.iscsi [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] ==> connect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7f3217c0de50>, {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}), 'kwargs': {}}" {{(pid=32186) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

Dec 08 01:43:25.437141 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: Login to [iface: default, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

Dec 08 01:43:25.455421 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: DEBUG os_brick.initiator.connectors.iscsi [None req-bda55e50-9756-4d10-8f1b-b4ebde6a58ee tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] Connected to sda using {u'auth_password': u'***', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'} {{(pid=32186) _connect_vol /usr/local/lib/python2.7/dist-packages/os_brick/initiator/connectors/iscsi.py:663}}

Matt Riedemann (mriedem) wrote :

Details from attaching volume1 to server1:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_43_12_388771

Dec 08 01:43:12.388771 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: INFO nova.compute.manager [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] Attaching volume 2a48ae08-2833-488c-8771-eea86f41c7b8 to /dev/vdb

Dec 08 01:43:16.269897 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] ==> connect_volume: call u"{'args': (<os_brick.initiator.connectors.iscsi.ISCSIConnector object at 0x7fe1aa096d50>, {u'access_mode': u'rw', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'auth_password': u'***', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'}), 'kwargs': {}}" {{(pid=20206) trace_logging_wrapper /usr/local/lib/python2.7/dist-packages/os_brick/utils.py:146}}

Should target_lun be the same in both cases?

Dec 08 01:43:17.800452 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: Login to [iface: default, target: iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8, portal: 149.202.185.218,3260] successful.

Dec 08 01:43:17.825508 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-15ea8e06-99b2-439b-bad8-d0f0f6bee781 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] Connected to sda using {u'auth_password': u'***', u'target_discovered': False, u'encrypted': False, u'qos_specs': None, u'target_iqn': u'iqn.2010-10.org.openstack:volume-2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_portal': u'149.202.185.218:3260', u'volume_id': u'2a48ae08-2833-488c-8771-eea86f41c7b8', u'target_lun': 1, u'access_mode': u'rw', u'auth_username': u'HMHytc239biDCT6NTgMa', u'auth_method': u'CHAP'} {{(pid=20206) _connect_vol /usr/local/lib/python2.7/dist-packages/os_brick/initiator/connectors/iscsi.py:663}}

---

We start deleting server1 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_745301

Dec 08 01:45:42.745301 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: INFO nova.compute.manager [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] Terminating instance

Disconnects from volume2 here:

Dec 08 01:45:43.151301 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG os_brick.initiator.connectors.iscsi [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-Test...


Matt Riedemann (mriedem) wrote :

We start deleting server2 here:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/controller/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_924650

Dec 08 01:45:42.924650 ubuntu-xenial-ovh-gra1-0001066161 nova-compute[32186]: INFO nova.compute.manager [None req-a0211268-269b-4227-ba65-a680ec13b6e3 tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: e0b74695-f9a0-4f4c-8a08-a0de3420cc6c] Terminating instance

Interestingly, I don't see anything in there about disconnecting from volume1, which likely means the BDM that represents volume1 and server2 is not found. That is likely why deleting volume1 fails: when deleting server2, which is the only thing still supposed to be attached to volume1 after the swap (server1 swaps from vol1 to vol2), we should be disconnecting from volume1 but are not, and the target connection is left intact.
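A hypothetical sketch of the delete path implied above (the function names are made up): if the BDM query for the instance returns nothing for volume1, the disconnect step is silently skipped and the iSCSI target stays logged in on the host.

```python
# Sketch: server delete only disconnects volumes it can find BDMs for.

def volume_bdms_for_instance(bdms, instance_uuid):
    """Stand-in for the BDM query used during server delete."""
    return [b for b in bdms if b['instance_uuid'] == instance_uuid]

def disconnect_volumes(bdms, instance_uuid):
    disconnected = []
    for bdm in volume_bdms_for_instance(bdms, instance_uuid):
        # Would call the virt driver / os-brick disconnect here.
        disconnected.append(bdm['volume_id'])
    return disconnected

# If volume1's BDM has gone missing (or was already marked deleted and is
# filtered out by the query), nothing gets disconnected:
result = disconnect_volumes([], 'server2-uuid')
```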

Matt Riedemann (mriedem) on 2018-12-10
no longer affects: cinder

With the debug patch I see that we're definitely having some weird DB issue with BDMs during server delete:

http://logs.openstack.org/78/606978/5/check/tempest-slow/5a90cad/controller/logs/screen-n-api.txt.gz#_Dec_10_20_29_39_767102

Dec 10 20:29:39.767102 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: ERROR nova.compute.api [None req-04f71fae-15dd-4cc0-b211-4a4f53a7cbc8 tempest-TestMultiAttachVolumeSwap-1722594678 tempest-TestMultiAttachVolumeSwap-1722594678] [instance: c3c9407c-e2af-4d04-94ed-f334844ea6bf] No volume BDMs found for server.
Dec 10 20:29:39.775519 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: ERROR nova.compute.api [None req-04f71fae-15dd-4cc0-b211-4a4f53a7cbc8 tempest-TestMultiAttachVolumeSwap-1722594678 tempest-TestMultiAttachVolumeSwap-1722594678] [instance: c3c9407c-e2af-4d04-94ed-f334844ea6bf] BDMs were already deleted: [BlockDeviceMapping(attachment_id=None,boot_index=0,connection_info=None,created_at=2018-12-10T20:28:24Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=1,image_id='863afc54-1096-4382-b8f2-6103641a65c1',instance=<?>,instance_uuid=c3c9407c-e2af-4d04-94ed-f334844ea6bf,no_device=False,snapshot_id=None,source_type='image',tag=None,updated_at=2018-12-10T20:28:25Z,uuid=be69a559-05ee-43bd-8170-2c65cc2a518c,volume_id=None,volume_size=None,volume_type=None), BlockDeviceMapping(attachment_id=be11fe1f-5c65-4a64-a5c6-caa934f564c9,boot_index=None,connection_info='{"status": "reserved", "multiattach": true, "detached_at": "", "volume_id": "26af085c-f977-4508-8bb1-46a57c8f34ed", "attach_mode": "null", "driver_volume_type": "iscsi", "instance": "c3c9407c-e2af-4d04-94ed-f334844ea6bf", "attached_at": "", "serial": "26af085c-f977-4508-8bb1-46a57c8f34ed", "data": {"access_mode": "rw", "target_discovered": false, "encrypted": false, "qos_specs": null, "target_iqn": "iqn.2010-10.org.openstack:volume-26af085c-f977-4508-8bb1-46a57c8f34ed", "target_portal": "10.210.4.21:3260", "volume_id": "26af085c-f977-4508-8bb1-46a57c8f34ed", "target_lun": 1, "device_path": "/dev/sda", "auth_password": "***", "auth_username": "P2uFcW9nEpz3GcuKwqHv", "auth_method": 
"CHAP"}}',created_at=2018-12-10T20:28:37Z,delete_on_termination=False,deleted=True,deleted_at=2018-12-10T20:29:39Z,destination_type='volume',device_name='/dev/vdb',device_type=None,disk_bus=None,guest_format=None,id=4,image_id=None,instance=<?>,instance_uuid=c3c9407c-e2af-4d04-94ed-f334844ea6bf,no_device=False,snapshot_id=None,source_type='volume',tag=None,updated_at=2018-12-10T20:28:44Z,uuid=15aba0c7-c
Dec 10 20:29:39.776294 ubuntu-xenial-rax-ord-0001105586 <email address hidden>[23323]: 612-40a4-9445-dfe964764b02,volume_id='26af085c-f977-4508-8bb1-46a57c8f34ed',volume_size=1,volume_type=None)]

That's one of the TestMultiAttachVolumeSwap servers being deleted, and the volume BDM representing the multiattach volume shows up as already deleted, so we don't detach it. It's not yet clear to me what is deleting that BDM.

I have a suspicion that in the original scenario:

a) volume1 attached to server1 and server2
b) swap volume1 to volume2 on ...


Matt Riedemann (mriedem) wrote :

In the error log in comment 11, 26af085c-f977-4508-8bb1-46a57c8f34ed is volume1 (the volume that is swapped *from* on server1).

Maybe the failure depends on the order in which the servers are deleted by tempest?

Matt Riedemann (mriedem) wrote :

Oh, I guess tempest will detach the volume from each server before deleting each server, because of its cleanup steps.

So since we do this to set up:

1. create volume1
2. create volume2
3. create server1 and server2 (multi-create request)
4. attach volume1 to server1
5. attach volume1 to server2
6. swap volume1 to volume2 on server1

When the test tears down, it will cleanup in this order:

1. detach volume1 from server2
2. detach volume1 from server1 (this should 404 since volume1 is not attached to server1 after the swap)
3. delete server1 and server2
4. delete volume2
5. delete volume1
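The reason teardown runs in exactly the reverse order of setup is that tempest builds on unittest's addCleanup, which executes cleanup callbacks LIFO. A minimal sketch (resource names mirror the scenario above; this is not tempest's actual test code):

```python
import unittest

executed = []  # records the order cleanups actually run in


class CleanupOrderSketch(unittest.TestCase):
    """Illustrates unittest's LIFO cleanup ordering, which tempest relies on."""

    def test_swap_scenario(self):
        # Cleanups are registered in setup order; unittest runs them in reverse.
        self.addCleanup(executed.append, "delete volume1")              # 1. create volume1
        self.addCleanup(executed.append, "delete volume2")              # 2. create volume2
        self.addCleanup(executed.append, "delete server1 and server2")  # 3. create servers
        self.addCleanup(executed.append, "detach volume1 from server1") # 4. attach v1 -> s1
        self.addCleanup(executed.append, "detach volume1 from server2") # 5. attach v1 -> s2
        # Note: nothing registers "detach volume2 from server1" after the swap,
        # which is the gap described below.


unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(CleanupOrderSketch))
print(executed)
```

Running this prints the cleanups in the same order as the five-step teardown list above, with the deletes of volume1/volume2 last.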

One thing that appears to be missing is that tempest does not explicitly detach volume2 from server1 on teardown, so that detach would have to be done by nova-compute when deleting server1, which is confirmed by:

http://logs.openstack.org/81/606981/4/check/tempest-slow/fafde23/compute1/logs/screen-n-cpu.txt.gz#_Dec_08_01_45_42_745301

in comment 8:

Dec 08 01:45:43.084933 ubuntu-xenial-ovh-gra1-0001066278 nova-compute[20206]: DEBUG nova.virt.libvirt.volume.iscsi [None req-69aba1e7-c922-40f2-8136-6e6e0a8c924e tempest-TestMultiAttachVolumeSwap-1803326020 tempest-TestMultiAttachVolumeSwap-1803326020] [instance: d46fba31-9469-4799-b2bf-1fbad4369a9a] calling os-brick to detach iSCSI Volume {{(pid=20206) disconnect_volume /opt/stack/nova/nova/virt/libvirt/volume/iscsi.py:72}}

So maybe what is happening is this: tempest doesn't know volume2 is attached to server1, so it deletes server1 (which detaches volume2), but since tempest isn't waiting for volume2 to be detached before trying to delete it, it goes straight to deleting volume2, which fails because the backend target is not yet disconnected.
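If that's the race, the shape of the fix is to wait for the volume to report a detached status before issuing the delete. A hedged sketch of such a waiter, assuming a hypothetical show_volume(volume_id) -> dict client call (this is an illustration, not tempest's actual waiter):

```python
import time


def wait_for_volume_status(show_volume, volume_id, status="available",
                           timeout=60, interval=2):
    """Poll a hypothetical show_volume(volume_id) -> dict until the volume
    reaches the given status. Sketches the kind of wait that would need to
    happen between deleting server1 and deleting volume2."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        vol = show_volume(volume_id)
        if vol["status"] == status:
            return vol
        if vol["status"] == "error":
            raise RuntimeError("volume %s went to error state" % volume_id)
        time.sleep(interval)
    raise TimeoutError("volume %s did not reach %s within %ss"
                       % (volume_id, status, timeout))
```

Until the backend target is actually disconnected, cinder keeps the volume in a detaching/in-use state, so a delete issued immediately can fail the way the tgtadm "this target is still active" trace shows.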

Matt Riedemann (mriedem) wrote :

I have a fix for tempest here: https://review.openstack.org/#/c/624491/

Changed in tempest:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Matt Riedemann (mriedem)
no longer affects: nova
summary: - swap multiattach volume intermittently fails when servers are on
- different hosts
+ test_volume_swap_with_multiattach intermittently fails during cleanup

My proposed tempest fix apparently didn't do the trick since the tempest change to run slow multiattach tests on top of it still failed.

Changed in tempest:
importance: High → Medium
summary: test_volume_swap_with_multiattach intermittently fails during cleanup
+ when run in a multi-node job