HA test fails because an instance cannot be created due to an invalid block device mapping (failed to get volume)

Bug #1774653 reported by Raoul Scarazzini
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

While testing HA in RDOPhase2 we stop all the master/slave resources (galera and redis) and check that everything is fine. After the resources recover (that is, after verifying that every service is healthy on the cluster side), we launch an openstack-heat template [1] that spawns an instance using a previously created volume.
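For context, a minimal, hedged sketch of the kind of boot-from-volume request the template [1] ends up making through novaclient (the same code path as the traceback below). All auth values, the flavor, and the network are placeholders; only the volume ID is taken from the logs.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from novaclient import client as nova_client

    # Placeholder credentials -- not taken from this environment.
    auth = v3.Password(auth_url='http://keystone:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    nova = nova_client.Client('2.1', session=session.Session(auth=auth))

    # Boot from an existing volume via block_device_mapping_v2. If Nova cannot
    # look the volume up in Cinder, the request is rejected with the HTTP 400
    # "Block Device Mapping is Invalid: failed to get volume <id>" seen below.
    server = nova.servers.create(
        name='instance',
        image=None,                # no image: boot from the volume
        flavor='<flavor-id>',      # placeholder
        block_device_mapping_v2=[{
            'boot_index': 0,
            'uuid': '0ac38f46-0b34-4af8-97cf-dd8ea520df09',  # volume ID from the logs
            'source_type': 'volume',
            'destination_type': 'volume',
            'delete_on_termination': False,
        }],
        nics=[{'net-id': '<network-uuid>'}],  # placeholder
    )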
At this stage (after the resources have recovered and the template is launched) the openstack-heat deployment can racily fail with this error:

2018-05-31 08:18:37.614 7 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 919, in _action_recorder
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource yield
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1027, in _do_action
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource step = next(subtask)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 978, in action_handler_task
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource done = check(handler_data)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 872, in check_create_complete
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 242, in _check_active
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: Build of instance 0a4a29ff-9b75-4359-9007-b7fa4b962a4e aborted: Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found., Code: 500"
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource
2018-05-31 08:18:42.727 7 INFO heat.engine.scheduler [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] Task pause timed out
2018-05-31 08:18:43.114 7 INFO heat.engine.resource [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] CREATE: Server "instance" [0a4a29ff-9b75-4359-9007-b7fa4b962a4e] Stack "stack_test_master-slave" [887c743c-261f-4aaf-b5e1-c0f324ce1373]
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 919, in _action_recorder
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource yield
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1027, in _do_action
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 329, in wrapper
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource step = next(subtask)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 972, in action_handler_task
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource handler_data = handler(*args)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 862, in handle_create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource admin_pass=admin_pass)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1403, in create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource **boot_kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 802, in _boot
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource return_raw=return_raw, **kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 361, in _create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource resp, body = self.api.client.post(url, body=body)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 310, in post
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource return self.request(url, 'POST', **kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 83, in request
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource raise exceptions.from_response(resp, body, url, method)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource BadRequest: Block Device Mapping is Invalid: failed to get volume 0ac38f46-0b34-4af8-97cf-dd8ea520df09. (HTTP 400) (Request-ID: req-3e9d6368-c52c-44b4-a94e-210384eea77e)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource
2018-05-31 08:18:43.128 7 INFO heat.engine.stack [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] Stack CREATE FAILED (stack_test_master-slave): Resource CREATE failed: BadRequest: resources.instance: Block Device Mapping is Invalid: failed to get volume 0ac38f46-0b34-4af8-97cf-dd8ea520df09. (HTTP 400) (Request-ID: req-3e9d6368-c52c-44b4-a94e-210384eea77e)
2018-05-31 08:18:43.137 7 ERROR oslo.messaging._drivers.impl_rabbit [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] [68ff2277-db1e-4744-a4d3-6d7446f751e2] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds. Client port: 52310: error: [Errno 32] Broken pipe

Looking at the Cinder log, particularly at the point where the volume resource becomes active [2], everything seems fine on the Cinder side, but Nova [3] tells a different story:

2018-05-31 08:18:37.752 18 WARNING nova.compute.api [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] [instance: 0a4a29ff-9b75-4359-9007-b7fa4b962a4e] instance's host None is down, deleting from database
2018-05-31 08:18:37.821 18 ERROR nova.volume.cinder [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] Delete attachment failed for attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. Error: Volume attachment could not be found with filter: attachment_id = 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. (HTTP 404) (Request-ID: req-efcb2e5c-d86a-437c-9521-ecf34f88ae8e) Code: 404: NotFound: Volume attachment could not be found with filter: attachment_id = 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. (HTTP 404) (Request-ID: req-efcb2e5c-d86a-437c-9521-ecf34f88ae8e)
2018-05-31 08:18:37.822 18 WARNING nova.compute.api [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] [instance: 0a4a29ff-9b75-4359-9007-b7fa4b962a4e] Ignoring volume cleanup failure due to Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found.: VolumeAttachmentNotFound: Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found.

The entire log collection for this environment is here [4], but this feels a lot like a problem on the Cinder side, which appears unable to survive the interruption of the core resources. A hedged sketch of the attachment-delete call Nova is issuing is included after the links below.

[1] https://github.com/openstack/tripleo-ha-utils/blob/master/roles/validate-ha/templates/validate-ha-heat-template.yaml.j2
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-0/var/log/containers/cinder/cinder-volume.log.txt.gz
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-1/var/log/containers/nova/nova-api.log.txt.gz
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/
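As mentioned above, here is a minimal sketch of the attachment cleanup Nova attempts in the log from [3], reproduced directly with python-cinderclient. The attachments API is assumed to require microversion 3.27 or later; all auth values are placeholders, and only the attachment ID comes from the log.

    from keystoneauth1 import session
    from keystoneauth1.identity import v3
    from cinderclient import client as cinder_client
    from cinderclient import exceptions as cinder_exc

    # Placeholder credentials -- not taken from this environment.
    auth = v3.Password(auth_url='http://keystone:5000/v3',
                       username='admin', password='secret',
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    # Microversion 3.27+ is assumed for the attachments API.
    cinder = cinder_client.Client('3.27', session=session.Session(auth=auth))

    attachment_id = '32d409c8-e9ac-40cb-a0db-ff8c294ec0a8'  # from the Nova log
    try:
        cinder.attachments.delete(attachment_id)
    except cinder_exc.NotFound:
        # This is the 404 Nova hits: from Cinder's point of view the
        # attachment is already gone.
        print('attachment %s not found (already deleted?)' % attachment_id)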

Tags: ci
Raoul Scarazzini (rasca)
Changed in tripleo:
importance: Undecided → Medium
milestone: none → rocky-2
status: New → Triaged
Changed in tripleo:
milestone: rocky-2 → rocky-3
Raoul Scarazzini (rasca)
Changed in tripleo:
importance: Medium → Critical
tags: added: promotion-blocker
tags: added: ci
Revision history for this message
Alan Bishop (alan-bishop) wrote:

The failure seems to be due to a rabbitmq glitch at the time when Nova first
requests that Cinder delete the attachment. Cinder appears to handle the
request properly at [1], but note the glitch at [2] before it responds 200 OK at [3].

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_27_525
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_27_574
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_31_025

So, from Cinder's perspective the attachment has been deleted. This explains
why Cinder returns 404 for all subsequent requests to delete the attachment.
Note that the timestamps in [4], [5], and [6] follow the successful deletion at [3].

[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_34_053
[5] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_36_893
[6] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-0/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_37_816

Lastly, Nova also reports a rabbitmq issue with controller-2 (which handled
the initial request to delete the attachment). Nova's log is at [7].

[7] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-1/var/log/containers/nova/nova-api.log.txt.gz#_2018-05-31_08_18_25_174
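For illustration only (not Nova's actual code), the race described above can be absorbed by treating a 404 on a retried attachment delete as "already done", since the first request may have been processed even though its response was lost to the rabbitmq/connection glitch. A minimal sketch, assuming python-cinderclient's attachments API and exception classes:

    from cinderclient import exceptions as cinder_exc

    def delete_attachment_idempotent(cinder, attachment_id, attempts=3):
        """Delete a volume attachment, tolerating a lost response plus retry."""
        for attempt in range(attempts):
            try:
                cinder.attachments.delete(attachment_id)
                return True
            except cinder_exc.NotFound:
                # The attachment no longer exists: either it was never there,
                # or an earlier attempt succeeded but its response was lost.
                return True
            except cinder_exc.ClientException:
                # Transient failure (e.g. the AMQP glitch in the logs); retry,
                # knowing the NotFound branch above absorbs a duplicate delete.
                if attempt == attempts - 1:
                    raise
        return False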

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → rocky-rc2
Revision history for this message
Raoul Scarazzini (rasca) wrote:

Removed the promotion-blocker tag until we see this happening consistently again.

tags: removed: promotion-blocker
Revision history for this message
Alex Schultz (alex-schultz) wrote:

Marking as Incomplete, as this doesn't seem to be an issue anymore. Feel free to reopen if it occurs again.

Changed in tripleo:
milestone: rocky-rc2 → stein-1
status: Triaged → Incomplete
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Incomplete → Invalid