OVB metalsmith deployment fails: Failed to attach VIF ... to bare metal node, Node ... is locked by host undercloud

Bug #1884518 reported by Sagi (Sergey) Shnaidman
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Unassigned

Bug Description

OVB node deployment fails with error:

2020-06-22 06:01:21 | t):
  File "<stdin>", line 102, in <module>
  File "<stdin>", line 94, in _ansiballz_main
  File "<stdin>", line 40, in invoke_module
  File "/usr/lib64/python3.6/runpy.py", line 205, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 363, in <module>
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 333, in main
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 274, in provision
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 296, in _provision_instance
  File "/usr/lib/python3.6/site-packages/metalsmith/_provisioner.py", line 442, in provision_node
    self._clean_up(node, nics=nics)
  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/metalsmith/_utils.py", line 144, in reraise_os_exc
    raise reraise_as(str(exc_info[1]))
metalsmith.exceptions.DeploymentFailed: Failed to attach VIF 9338d18c-6703-499d-9e61-7ec21fdbd062 to bare metal node 1830fa59-b9ff-4d00-aa48-f6924620090e: Client Error for url: https://192.168.24.2:13385/v1/nodes/1830fa59-b9ff-4d00-aa48-f6924620090e/vifs, Node 1830fa59-b9ff-4d00-aa48-f6924620090e is locked by host undercloud.localdomain, 2020-06-22 06:01:21.175889 | fa163e50-ce08-3f87-79de-000000000016 | FATAL | Provision instances | localhost | error={"attempts": 3, "changed": false, "module_stderr": "Deploy attempt failed on node baremetal-736944-3-35546-3 (UUID 70d8b8f0-ef4c-4db4-863a-1c4d5abc0600), cleaning up
Deploy attempt failed on node baremetal-736944-3-35546-1 (UUID 3103f988-3576-499d-9918-fd5d007498d4), cleaning up
Deploy attempt failed on node baremetal-736944-3-35546-0 (UUID f6c99aed-2e47-4a46-814d-a0b3bc2c9021), cleaning up
Deploy attempt failed on node baremetal-736944-3-35546-2 (UUID 1830fa59-b9ff-4d00-aa48-f6924620090e), cleaning up
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/metalsmith/_provisioner.py", line 402, in provision_node
    nics.create_and_attach_ports()
  File "/usr/lib/python3.6/site-packages/metalsmith/_nics.py", line 86, in create_and_attach_ports
    port.id)
  File "/usr/lib/python3.6/site-packages/openstack/baremetal/v1/_proxy.py", line 805, in attach_vif_to_node
    res.attach_vif(self, vif_id, retry_on_conflict=retry_on_conflict)
  File "/usr/lib/python3.6/site-packages/openstack/baremetal/v1/node.py", line 607, in attach_vif
    exceptions.raise_from_response(response, error_message=msg)
  File "/usr/lib/python3.6/site-packages/openstack/exceptions.py", line 236, in raise_from_response
    http_status=http_status, request_id=request_id
openstack.exceptions.ConflictException: Failed to attach VIF 9338d18c-6703-499d-9e61-7ec21fdbd062 to bare metal node 1830fa59-b9ff-4d00-aa48-f6924620090e: Client Error for url: https://192.168.24.2:13385/v1/nodes/1830fa59-b9ff-4d00-aa48-f6924620090e/vifs, Node 1830fa59-b9ff-4d00-aa48-f6924620090e is locked by host undercloud.localdomain, please retry after the current operation is completed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 102, in <module>
  File "<stdin>", line 94, in _ansiballz_main
  File "<stdin>", line 40, in invoke_module
  File "/usr/lib64/python3.6/runpy.py", line 205, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib64/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 363, in <module>
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 333, in main
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 274, in provision
  File "/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/tmp/ansible_metalsmith_instances_payload__j2el0cr/ansible_metalsmith_instances_payload.zip/ansible/modules/metalsmith_instances.py", line 296, in _provision_instance
  File "/usr/lib/python3.6/site-packages/metalsmith/_provisioner.py", line 442, in provision_node
    self._clean_up(node, nics=nics)
  File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/usr/lib/python3.6/site-packages/metalsmith/_utils.py", line 144, in reraise_os_exc
    raise reraise_as(str(exc_info[1]))
metalsmith.exceptions.DeploymentFailed: Failed to attach VIF 9338d18c-6703-499d-9e61-7ec21fdbd062 to bare metal node 1830fa59-b9ff-4d00-aa48-f6924620090e: Client Error for url: https://192.168.24.2:13385/v1/nodes/1830fa59-b9ff-4d00-aa48-f6924620090e/vifs, Node 1830fa59-b9ff-4d00-aa48-f6924620090e is locked by host undercloud.localdomain, please retry after the current operation is completed.
", "module_stdout": "", "msg": "MODULE FAILURE
See stdout/stderr for the exact error", "rc": 1}

https://logserver.rdoproject.org/65/724365/2/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/e5a2763/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Revision history for this message
Harald Jensås (harald-jensas) wrote :

I think the problem here is iscsiadm discovery getting stuck. There was a bug opened against Ironic to do timeout/retries: https://storyboard.openstack.org/#!/story/2007836. BUT, in my experience a retry won't actually help.

2020-06-22 05:06:08.584 9 DEBUG oslo_concurrency.processutils [req-d51941a9-7bc4-47be-b2b7-61addb3af437 - - - - -] Running cmd (subprocess): sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.30:3260 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:371

2020-06-22 05:06:20.218 9 DEBUG oslo_concurrency.processutils [req-706507ca-d29e-4645-b59f-ca5635a3db46 - - - - -] Running cmd (subprocess): sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.15:3260 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:371

2020-06-22 05:06:22.466 9 DEBUG oslo_concurrency.processutils [req-d3d22e16-4345-43f0-9c98-9afcbcfdd4e1 - - - - -] Running cmd (subprocess): sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.14:3260 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:371

2020-06-22 05:07:53.305 9 DEBUG oslo_concurrency.processutils [req-a5c47c07-dd77-4770-941e-b846ed33f8e8 - - - - -] Running cmd (subprocess): sudo ironic-rootwrap /etc/ironic/rootwrap.conf iscsiadm -m discovery -t st -p 192.168.24.29:3260 execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:371

After these we keep seeing the """is locked by host undercloud.localdomain""" messages:
2020-06-22 05:29:11.192 9 WARNING ironic.drivers.modules.agent_base [-] Node 6b471817-d69d-477a-8636-bbae44889ee5 is currently locked, skipping heartbeat processing (will retry on the next heartbeat): ironic.common.exception.NodeLocked: Node 6b471817-d69d-477a-8636-bbae44889ee5 is locked by host undercloud.localdomain, please retry after the current operation is completed.

In log extra top and pstree output, iscsiadm is still running:
https://logserver.rdoproject.org/65/724365/2/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/e5a2763/logs/undercloud/var/log/extra/top.txt.gz

https://logserver.rdoproject.org/65/724365/2/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/e5a2763/logs/undercloud/var/log/extra/pstree.txt.gz

iscsiadm(15352+
iscsiadm(15381+
iscsiadm(15407+
iscsiadm(15632+

Revision history for this message
John Fulton (jfulton-org) wrote :
Revision history for this message
Harald Jensås (harald-jensas) wrote :

I have had this issue during for a week. I scratched my lab and re-deployed it today, and to my surprise it no longer fails. Is the problem still present in upstream CI?

Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.