While testing HA in RDOPhase2, we stop all the master/slave resources (galera and redis) and check that everything is fine. After the resources recover (that is, after verifying that every service is healthy on the cluster side), we launch an openstack-heat template [1] that spawns an instance using a previously created volume.
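The create-and-wait step of this validation boils down to polling the stack status until it reaches a terminal state. A minimal sketch of that polling loop (the function and its parameters are illustrative, not the actual tripleo-ha-utils code; with openstacksdk the status getter could be backed by the orchestration API):

```python
import time

def wait_for_stack(get_status, ok="CREATE_COMPLETE", bad="CREATE_FAILED",
                   timeout=600, interval=5, sleep=time.sleep):
    """Poll get_status() until the stack reaches a terminal state.

    Returns True on success, False on failure; raises TimeoutError
    if neither state is reached within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == ok:
            return True
        if status == bad:
            return False
        sleep(interval)
    raise TimeoutError("stack did not reach a terminal state in time")
```

In the failing runs below, this kind of loop ends on CREATE_FAILED rather than timing out, because Heat reports a terminal error on the Server resource.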
At this stage the openstack-heat deployment can occasionally (racily) fail with this error:
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 919, in _action_recorder
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource yield
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1027, in _do_action
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 346, in wrapper
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource step = next(subtask)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 978, in action_handler_task
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource done = check(handler_data)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 872, in check_create_complete
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource check = self.client_plugin()._check_active(server_id)
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/clients/os/nova.py", line 242, in _check_active
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource 'code': fault.get('code', _('Unknown'))
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource ResourceInError: Went to status ERROR due to "Message: Build of instance 0a4a29ff-9b75-4359-9007-b7fa4b962a4e aborted: Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found., Code: 500"
2018-05-31 08:18:37.614 7 ERROR heat.engine.resource
2018-05-31 08:18:42.727 7 INFO heat.engine.scheduler [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] Task pause timed out
2018-05-31 08:18:43.114 7 INFO heat.engine.resource [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] CREATE: Server "instance" [0a4a29ff-9b75-4359-9007-b7fa4b962a4e] Stack "stack_test_master-slave" [887c743c-261f-4aaf-b5e1-c0f324ce1373]
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource Traceback (most recent call last):
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 919, in _action_recorder
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource yield
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1027, in _do_action
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 329, in wrapper
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource step = next(subtask)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 972, in action_handler_task
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource handler_data = handler(*args)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/nova/server.py", line 862, in handle_create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource admin_pass=admin_pass)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 1403, in create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource **boot_kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/v2/servers.py", line 802, in _boot
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource return_raw=return_raw, **kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/base.py", line 361, in _create
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource resp, body = self.api.client.post(url, body=body)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 310, in post
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource return self.request(url, 'POST', **kwargs)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/novaclient/client.py", line 83, in request
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource raise exceptions.from_response(resp, body, url, method)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource BadRequest: Block Device Mapping is Invalid: failed to get volume 0ac38f46-0b34-4af8-97cf-dd8ea520df09. (HTTP 400) (Request-ID: req-3e9d6368-c52c-44b4-a94e-210384eea77e)
2018-05-31 08:18:43.114 7 ERROR heat.engine.resource
2018-05-31 08:18:43.128 7 INFO heat.engine.stack [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] Stack CREATE FAILED (stack_test_master-slave): Resource CREATE failed: BadRequest: resources.instance: Block Device Mapping is Invalid: failed to get volume 0ac38f46-0b34-4af8-97cf-dd8ea520df09. (HTTP 400) (Request-ID: req-3e9d6368-c52c-44b4-a94e-210384eea77e)
2018-05-31 08:18:43.137 7 ERROR oslo.messaging._drivers.impl_rabbit [req-b5219f38-3754-41e3-a371-b822f219c054 - admin - default default] [68ff2277-db1e-4744-a4d3-6d7446f751e2] AMQP server on overcloud-controller-0.internalapi.localdomain:5672 is unreachable: [Errno 32] Broken pipe. Trying again in 1 seconds. Client port: 52310: error: [Errno 32] Broken pipe
Looking at the Cinder log, particularly where the volume resource becomes active [2], everything seems fine on the Cinder side, but Nova [3] tells a different story:
2018-05-31 08:18:37.752 18 WARNING nova.compute.api [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] [instance: 0a4a29ff-9b75-4359-9007-b7fa4b962a4e] instance's host None is down, deleting from database
2018-05-31 08:18:37.821 18 ERROR nova.volume.cinder [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] Delete attachment failed for attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. Error: Volume attachment could not be found with filter: attachment_id = 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. (HTTP 404) (Request-ID: req-efcb2e5c-d86a-437c-9521-ecf34f88ae8e) Code: 404: NotFound: Volume attachment could not be found with filter: attachment_id = 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8. (HTTP 404) (Request-ID: req-efcb2e5c-d86a-437c-9521-ecf34f88ae8e)
2018-05-31 08:18:37.822 18 WARNING nova.compute.api [req-3e6f3a69-9ad2-4a5f-8478-51dcf289570a 356063444c1046b5a82b2e5e2bab9ca4 3dd8317c620e4aeebf4a899344e3fafe - default default] [instance: 0a4a29ff-9b75-4359-9007-b7fa4b962a4e] Ignoring volume cleanup failure due to Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found.: VolumeAttachmentNotFound: Volume attachment 32d409c8-e9ac-40cb-a0db-ff8c294ec0a8 could not be found.
The entire log collection for this environment is here [4], but this looks a lot like a problem on the Cinder side, which does not seem to survive the interruption of the core resources.
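Note that Nova's "Ignoring volume cleanup failure" line above shows it already treats the 404 as non-fatal during teardown; in effect the attachment delete is idempotent. A minimal sketch of that pattern (all names here are hypothetical, not the actual nova.volume.cinder code):

```python
class AttachmentNotFound(Exception):
    """Stand-in for the HTTP 404 the Cinder client raises."""

def delete_attachment_tolerant(cinder, attachment_id):
    """Delete a volume attachment, treating 'already gone' as success.

    Mirrors the behaviour in the nova-api log above, where cleanup
    swallows VolumeAttachmentNotFound instead of failing the teardown.
    """
    try:
        cinder.delete_attachment(attachment_id)
        return True    # deleted by this call
    except AttachmentNotFound:
        return False   # already deleted, e.g. by an earlier retried request
```

The question in this bug is therefore not the cleanup itself but why the attachment was already gone when Nova first tried to delete it.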
[1] https://github.com/openstack/tripleo-ha-utils/blob/master/roles/validate-ha/templates/validate-ha-heat-template.yaml.j2
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-0/var/log/containers/cinder/cinder-volume.log.txt.gz
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-1/var/log/containers/nova/nova-api.log.txt.gz
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/
The failure seems to be due to a rabbitmq glitch at the time Nova first
requests that Cinder delete the attachment. Cinder appears to handle the
request properly at [1], but note the glitch at [2] before it responds 200 OK at [3].
[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_27_525
[2] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_27_574
[3] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_31_025
So, from Cinder's perspective the attachment has been deleted. This explains
why Cinder returns 404 for all subsequent requests to delete the attachment.
Note that the timestamps at [4], [5] and [6] follow the successful deletion at [3].
[4] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_34_053
[5] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-2/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_36_893
[6] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-0/var/log/containers/cinder/cinder-api.log.txt.gz#_2018-05-31_08_18_37_816
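The sequence described above — the delete succeeds, the 200 reply is lost to the broken pipe, and the retried delete then sees 404 — can be illustrated with a toy attachment store (purely a simulation; all names are made up):

```python
class AttachmentStore:
    """Toy model of Cinder's attachment table."""

    def __init__(self, attachment_ids):
        self._ids = set(attachment_ids)

    def delete(self, att_id):
        """Return 200 for a real delete, 404 if the row is already gone."""
        if att_id not in self._ids:
            return 404
        self._ids.remove(att_id)
        return 200

store = AttachmentStore({"32d409c8"})
first = store.delete("32d409c8")   # Cinder's 200 OK; the reply never reaches Nova
retry = store.delete("32d409c8")   # the retried request now gets 404
```

This matches the logs: Cinder's state is consistent, and the 404s are the expected answer to a duplicate of an already-serviced request.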
Lastly, Nova also reports a rabbitmq issue with controller-2 (which handled
the initial request to delete the attachment). Nova's log is [7].
[7] https://thirdparty.logs.rdoproject.org/jenkins-oooq-master-rdo_trunk-bmu-haa16-lab-float_nic_with_vlans-170/overcloud-controller-1/var/log/containers/nova/nova-api.log.txt.gz#_2018-05-31_08_18_25_174