OVB overcloud deploy fails sporadically with "not enough free physical ports" error

Bug #1879472 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned

Bug Description

There are cases when OVB overcloud deploy fails on vexxhost with an error like:

metalsmith.exceptions.DeploymentFailed: Failed to attach VIF bc2010f0-3aa1-4939-9cbd-d8917cf8bc08 to bare metal node 9224cbbf-eb10-4895-a09b-92f06550f01b: Client Error for url: https://192.168.24.2:13385/v1/nodes/9224cbbf-eb10-4895-a09b-92f06550f01b/vifs, Unable to attach VIF bc2010f0-3aa1-4939-9cbd-d8917cf8bc08, not enough free physical ports.

https://logserver.rdoproject.org/88/726488/2/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/0a25363/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

It happens on the vexxhost cloud only, about 10 times a day. Since we run promotions on the vexxhost cloud only, it affects them.

Revision history for this message
Javier Peña (jpena-c) wrote :

I have been following the life cycle of one of the ports mentioned in the log (ec3ddedf-3255-4f2f-ae8e-dd5b407322a7) across all different logs: http://paste.openstack.org/show/793977/

If I understood it correctly, the error appears in the undercloud log, so it is emitted by the undercloud neutron server. This rules out any issue with available ports in the Vexxhost tenant, but we still need to find out why it is only happening on Vexxhost.

We need someone from the Ironic team to have a look at the logs (and potentially check the OVB configuration at Vexxhost) to move forward.

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Revision history for this message
Alan Pevec (apevec) wrote :

Vexxhost ticket for tracking this on cloudops side is PBK-646217

Revision history for this message
Alan Pevec (apevec) wrote :

I traced this in the code; it ends up in the Ironic driver function _get_free_portgroups_and_ports
https://opendev.org/openstack/ironic/src/branch/master/ironic/drivers/modules/network/common.py#L86
which masks the reason why a free port could not be found.
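The masking Alan describes can be sketched in miniature: every candidate port that is already bound to a VIF, or whose physical network does not match, is silently filtered out, so the caller only ever sees the final "not enough free physical ports" error. The class and function names below are illustrative, not ironic's actual implementation:

```python
# Simplified sketch (hypothetical names) of ironic's free-port selection:
# rejected candidates are dropped silently, so the per-port reasons for
# rejection never reach the error message.

class NoFreePhysicalPorts(Exception):
    def __init__(self, vif):
        super().__init__(
            "Unable to attach VIF %s, not enough free physical ports." % vif)

def get_free_port(ports, vif_id, physnets=None):
    candidates = [
        p for p in ports
        if p.get("vif") is None                        # skip attached ports
        and (not physnets or p.get("physnet") in physnets)
    ]
    if not candidates:
        # At this point the reasons individual ports were rejected are
        # lost, which is the masking described above.
        raise NoFreePhysicalPorts(vif=vif_id)
    return candidates[0]
```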

Revision history for this message
Javier Peña (jpena-c) wrote :
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
summary: - OVB overcloud deploy on vexxhost fails sporadically with "not enough
- free physical ports" error
+ OVB overcloud deploy fails sporadically with "not enough free physical
+ ports" error
Revision history for this message
Bob Fournier (bfournie) wrote :

Looks like we're getting these errors from "qemu-img convert" in ironic-conductor.log, resulting in deployment failures:

Stderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
Command: /usr/bin/python3 -m oslo_concurrency.prlimit --as=1073741824 -- qemu-img convert -O raw /var/lib/ironic/master_images/tmp5mezc69f/0d60c774-0f0e-4bd8-9220-412fcd5e7d7b.part /var/lib/ironic/master_images/tmp5mezc69f/0d60c774-0f0e-4bd8-9220-412fcd5e7d7b.converted
Exit code: -6
Stdout: ''
Stderr: 'qemu: qemu_thread_create: Resource temporarily unavailable\n'
2020-05-27 10:56:14.048 7 ERROR ironic.conductor.utils Traceback (most recent call last):

2020-05-27 10:56:14.276 7 ERROR ironic.conductor.task_manager [req-1b0a24f9-466e-45c8-8f81-a85849e2e28b 7d81b8a640f043f7b786b5739fe1e0db 4139c3b2701f4140b6749c792800b423 - default default] Node 4c7404d3-6cb2-456a-a1bb-8c3ea1e8b549 moved to provision state "deploy failed" from state "deploying"; target provision state is "active": oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-master/7fe2a05/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz
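The failure mode can be reproduced in miniature: ironic runs qemu-img under oslo.concurrency's prlimit wrapper with --as=1073741824 (a 1 GiB address-space cap, visible in the command line above), and allocations beyond the cap fail with "Resource temporarily unavailable". A hedged sketch using Python's resource module (Linux/POSIX only; the 256 MiB cap and 512 MiB allocation are arbitrary figures chosen for illustration):

```python
# Illustration of a process dying under an RLIMIT_AS cap, analogous to
# qemu_thread_create failing under "prlimit --as=1073741824".
import resource
import subprocess
import sys

def run_capped(code, limit_bytes):
    """Run a Python snippet in a child process with RLIMIT_AS capped."""
    def set_limit():
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
    return subprocess.run(
        [sys.executable, "-c", code],
        preexec_fn=set_limit,
        capture_output=True,
    )

# Allocating ~512 MiB under a 256 MiB address-space cap fails, so the
# child exits non-zero -- just as qemu-img does under its 1 GiB cap.
capped = run_capped("b = bytearray(512 * 1024 * 1024)",
                    256 * 1024 * 1024)
```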

Revision history for this message
Bob Fournier (bfournie) wrote :

Dmitry has an upstream patch which may help, although currently failing a CI test - https://review.opendev.org/#/c/727677/

Revision history for this message
Bob Fournier (bfournie) wrote :

Note that the patch has merged to Train (https://review.opendev.org/#/c/727678/) and master but not to Ussuri.

Revision history for this message
Rafael Folco (rafaelfolco) wrote :

Updates on the "not enough free physical ports" issue:

- total of 97 hits since 06-03-2020
- 75% in master, 25% in ussuri
- 70% on vexxhost, 30% on rdocloud

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
wes hayutin (weshayutin) wrote :

@Sagi, ironic-lib needs a new ussuri release; the fix is not in the build yet.

Revision history for this message
wes hayutin (weshayutin) wrote :

https://trunk.rdoproject.org/centos8-ussuri/component/baremetal/consistent/
has python-ironic-lib-4.2.0-0.20200427095718.f6cdde2.el8.src.rpm

which does not include Dmitry's fix
https://opendev.org/openstack/ironic-lib/commits/branch/stable/ussuri

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://trunk.rdoproject.org/centos8-ussuri/component/baremetal/consistent/ now has:
python-ironic-lib-4.2.1-0.20200618094945.3a01335.el8.src.rpm 2020-06-18

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
Bob Fournier (bfournie) wrote :

We would need to change image_convert_memory_limit in ironic.conf to 2048 (the default is 1024). However, this isn't currently exposed in THT.
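As an illustration, the change would be a one-line edit in ironic.conf (the section placement below is an assumption; the option name and values are the ones stated above):

```ini
[DEFAULT]
# Memory limit (in MiB) applied via prlimit when converting images
# with qemu-img. Default is 1024; bumping it avoids the
# qemu_thread_create failures seen under the 1 GiB cap.
image_convert_memory_limit = 2048
```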

Revision history for this message
Bob Fournier (bfournie) wrote :

I'm not sure those latest failures are the same qemu-img problem, however, as those commands seem to complete fine in ironic-conductor.log.

2020-07-07 03:28:54.738 8 DEBUG ironic_lib.utils [req-a3a9eb4a-a789-424e-8b3d-d1c381dfa2a1 12fb25726d1f41d29954a74312965d21 3ad930a5b97c44e0bba810ee50c9c3f9 - default default] Execution completed, command line is "qemu-img convert -O raw /var/lib/ironic/master_images/tmpolqm10fk/aad967db-429b-43c2-ab33-f10ff8bf8129.part /var/lib/ironic/master_images/tmpolqm10fk/aad967db-429b-43c2-ab33-f10ff8bf8129.converted" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:101
2020-07-07 03:28:54.738 8 DEBUG ironic_lib.utils [req-a3a9eb4a-a789-424e-8b3d-d1c381dfa2a1 12fb25726d1f41d29954a74312965d21 3ad930a5b97c44e0bba810ee50c9c3f9 - default default] Command stdout is: "" execute /usr/lib/python3.6/site-packages/ironic_lib/utils.py:103

We are no longer seeing "qemu: qemu_thread_create: Resource temporarily unavailable", so the previous comment about image_convert_memory_limit is probably not relevant.

Revision history for this message
Bob Fournier (bfournie) wrote :

Including Steve Baker to take a look, as this seems related to metalsmith provisioning:
2020-07-07 03:27:26 | + openstack overcloud node provision --stack overcloud --output /home/zuul/overcloud-baremetal-deployed.yaml /home/zuul/overcloud_baremetal_deploy.yaml

2020-07-07 04:00:20 | pt failed on node baremetal-739135-5-20561-0 (UUID a1324cd4-8bb6-4b10-8cab-91c1b4132409), cleaning up\nDeploy attempt failed on node baremetal-739135-5-20561-3 (UUID 94b5e54b-8122-45d8-87e1-4e5866a98cae), cleaning up\nTraceback (most recent call last):\n File \"/usr/lib/python3.6/site-packages/metalsmith/_provisioner.py\", line 402, in provision_node\n nics.create_and_attach_ports()\n File \"/usr/lib/python3.6/site-packages/metalsmith/_nics.py\", line 86,

Again, there are no qemu errors, so the problem identified in comment #6 has been resolved.

Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/739970

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Error seen with the attached fix:
https://logserver.rdoproject.org/70/739970/1/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/7610b87/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

Time: 2020-07-09 22:39:48
 Unable to attach VIF b96be173-94c1-4153-8e72-ddb1b5fda1fa, not enough free physical ports

From: undercloud/var/log/containers/ironic/app.log.1.gz:

2020-07-09 22:39:45.558 17 DEBUG ironic.api.expose [req-66c07696-27b4-453b-8aa9-64f81fe53287 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] Client-side error: Unable to attach VIF eb6aa628-1380-43e3-bb4a-77376ec80d5d, not enough free physical ports.
Traceback (most recent call last):

  File "/usr/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 241, in inner
    return func(*args, **kwargs)

  File "/usr/lib/python3.6/site-packages/ironic/conductor/manager.py", line 3188, in vif_attach
    task.driver.network.vif_attach(task, vif_info)

  File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/network/common.py", line 596, in vif_attach
    task, vif_id, physnets, vif_info)

  File "/usr/lib/python3.6/site-packages/ironic/drivers/modules/network/common.py", line 190, in get_free_port_like_object
    raise exception.NoFreePhysicalPorts(vif=vif_id)

2020-07-09 22:39:42.838 7 DEBUG ironic.conductor.manager [req-66c07696-27b4-453b-8aa9-64f81fe53287 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC vif_attach called for the node fcff55bb-3772-473c-bf3b-567d094d07db with vif_info {'id': 'eb6aa628-1380-43e3-bb4a-77376ec80d5d'}
vif_attach /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3184

Afaict the node 'fcff55bb-3772-473c-bf3b-567d094d07db' was already provisioned earlier.

2020-07-09 22:25:10.128 7 DEBUG ironic.conductor.manager [req-a4276eb0-e7a0-4b5b-881f-50a511ec6ba2 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC vif_attach called for the node 61ecfc9b-97ba-4348-9fd8-8cc976b9abe2 with vif_info {'id': 'e7c6e07d-9bc0-4955-88fb-a8dd44bb20a0'} vif_attach /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3184

2020-07-09 22:25:27.413 7 DEBUG ironic.conductor.task_manager [req-3b50002d-8f35-43de-9f37-b7bbd71b28e9 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] Attempting to get exclusive lock on node 61ecfc9b-97ba-4348-9fd8-8cc976b9abe2 (for node deployment) __init__ /usr/lib/python3.6/
site-packages/ironic/conductor/task_manager.py:222
2020-07-09 22:25:27.453 7 DEBUG ironic.conductor.task_manager [req-3b50002d-8f35-43de-9f37-b7bbd71b28e9 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] Node 61ecfc9b-97ba-4348-9fd8-8cc976b9abe2 successfully reserved for node deployment (took 0.04 seconds) reserve_node /usr/lib/py
thon3.6/site-packages/ironic/conductor/task_manager.py:277
2020-07-09 22:25:30.808 7 DEBUG ironic.drivers.modules.agent_base [req-3b50002d-8f35-43de-9f37-b7bbd71b28e9 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] Deploy steps are not yet available for node 61ecfc9b-97ba-4348-9f...


Revision history for this message
Harald Jensås (harald-jensas) wrote :

Follow up in job: https://logserver.rdoproject.org/70/739970/1/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-vexxhost/7610b87/

6 allocations are made:
2020-07-09 22:24:56.209 7 DEBUG ironic.conductor.manager [req-391228b0-d03d-49d3-be58-e18c14766fd8 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation 03243c82-4805-43e1-8645-5879e2dc283b. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403
2020-07-09 22:24:58.470 7 DEBUG ironic.conductor.manager [req-d2eba6e3-0048-425d-a08d-ce30f457d85e fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation 5b255423-8a17-402d-8947-f61150d0f110. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403
2020-07-09 22:25:00.735 7 DEBUG ironic.conductor.manager [req-be562696-c55c-48b0-a97b-a0aa0bba07be fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation ad5ec9e0-88ef-44d2-a543-b770c9e2c287. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403
2020-07-09 22:25:02.944 7 DEBUG ironic.conductor.manager [req-e09a0e62-1c22-4e39-9a21-c2435329b958 fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation 08597023-9450-4fdd-887c-588fb31ad19e. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403

2020-07-09 22:39:28.793 7 DEBUG ironic.conductor.manager [req-eff482a5-5ba5-4913-a0ae-58c5642c547e fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation 39cd3561-a99a-485c-bf48-6e5ca74c50ed. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403
2020-07-09 22:39:41.341 7 DEBUG ironic.conductor.manager [req-5324118e-b86f-47f0-ba2e-ce87c087cdbb fbd025eb8ae9484ab270e384d842699e 0d0744fff36c4ebba0b7f58b04d8fc57 - default default] RPC create_allocation called for allocation dd8516fe-8474-4d4e-89db-8060fe99752a. create_allocation /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:3403

The job is supposed to deploy 4 nodes. The first 4 allocations deploy successfully; then the 5th and 6th allocations fail.

Is something triggering retries despite successful node deployments?

Revision history for this message
Harald Jensås (harald-jensas) wrote :

AFAICT the retries here:
https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/playbooks/cli-overcloud-node-provision.yaml#L117-L128

... are what is happening:

2020-07-09 22:39:09 | 2020-07-09 22:39:09.510322 | fa163efe-f656-c4ad-c2a0-000000000016 | RETRY | Provision instances | localhost | 3 retries left
2020-07-09 22:39:25 | 2020-07-09 22:39:25.168629 | fa163efe-f656-c4ad-c2a0-000000000016 | RETRY | Provision instances | localhost | 2 retries left
2020-07-09 22:39:37 | 2020-07-09 22:39:37.054855 | fa163efe-f656-c4ad-c2a0-000000000016 | RETRY | Provision instances | localhost | 1 retries left
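The effect of whole-task retries can be illustrated with a toy model: retrying the entire provisioning step re-runs every instance, including the ones that already succeeded, producing the extra allocations seen above; per-instance retries do not. All names here are illustrative, not tripleo-ansible's actual code:

```python
# Toy comparison of whole-batch retry vs. per-instance retry. In the
# batch variant a single transient failure re-allocates every instance
# from scratch, eventually exhausting free ports.

def provision_batch(instances, allocate, retries=3):
    """Naive whole-batch retry: re-allocates already-provisioned nodes."""
    for _attempt in range(retries):
        try:
            return [allocate(i) for i in instances]  # restarts from scratch
        except RuntimeError:
            continue
    raise RuntimeError("provisioning failed")

def provision_per_instance(instances, allocate, retries=3):
    """Retry only the instances that actually failed."""
    done = {}
    for i in instances:
        for _attempt in range(retries):
            try:
                done[i] = allocate(i)
                break
            except RuntimeError:
                continue
    return done
```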

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/740600

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

It's not clear what caused the original provisioning failure, but the retries are caused by this change [1], for which I've proposed a revert [2].

If we get a reproducer on the revert it might show the root cause.

This issue also shows a gap in our logging strategy. There is useful python logging throughout metalsmith[3] but there is no obvious mechanism for writing this logging to a file.

If there was a pattern for a module having a logging file path option then the module could set up python logging to that file. This would be useful for many modules which internally make many API calls.

[1] https://review.opendev.org/#/c/714687/
[2] https://review.opendev.org/#/c/740600/
[3] https://opendev.org/openstack/metalsmith/src/branch/master/metalsmith/_provisioner.py
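One way to implement the pattern Steve describes is to attach a StringIO handler to the library's logger for the duration of the call and return the captured text in the module's result dict. This is a hedged sketch; the function name and wiring are illustrative, not the actual module code:

```python
# Capture a library's Python logging into a string and attach it to a
# module-style result dict (illustrative names).
import io
import logging

def run_with_captured_logging(func, logger_name="metalsmith"):
    buf = io.StringIO()
    handler = logging.StreamHandler(buf)
    handler.setFormatter(
        logging.Formatter("%(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger(logger_name)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        value = func()
    finally:
        logger.removeHandler(handler)  # don't leak the handler
    return {"result": value, "logging": buf.getvalue()}

def fake_provision():
    # Stand-in for the many API calls a real provisioner would log.
    logging.getLogger("metalsmith").debug("attaching port to node")
    return "ok"

out = run_with_captured_logging(fake_provision)
```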

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/740606

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/740600
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=ddde8e573e626e6a0dd9ed77c6ffd97f86c6a924
Submitter: Zuul
Branch: master

commit ddde8e573e626e6a0dd9ed77c6ffd97f86c6a924
Author: Steve Baker <email address hidden>
Date: Sun Jul 12 21:49:04 2020 +0000

    Revert "Make the metalsmith module interactions more resilient"

    This reverts commit 476d0e3bef4ce15d586a7e8fa21e1e0643f87f4c.

    It looks like this change is masking the root cause of bug #1879472, it
    is probably not appropriate to do a retry on a whole
    metalsmith_instances call, it would be better to do retries on
    individual API calls and handle failures of instance provisioning on
    an instance-by-instance basis inside the module.

    Change-Id: If6814de77c110b622bd4e5b6430d65fce7dd19ff
    Related-Bug: #1879472

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/739970
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=d7de30471577c479563d83c4fdb937d72dae8a19
Submitter: Zuul
Branch: master

commit d7de30471577c479563d83c4fdb937d72dae8a19
Author: Harald Jensås <email address hidden>
Date: Wed Jul 8 14:31:16 2020 +0200

    Set image_convert_memory_limit to 2048 in CI UC

    Try to bump the memory limit to see if it solved the errors
    related to image convert.

    qemu-img: error while reading sector 2365184: Input/output error

    Change-Id: I70bf007a434a1ca3a13278e69bbd54c06d29d490
    Related-Bug: 1879472

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ansible (master)

Reviewed: https://review.opendev.org/740606
Committed: https://git.openstack.org/cgit/openstack/tripleo-ansible/commit/?id=3ba91a22a4167790b9c723d21fdf5f5d94dd6443
Submitter: Zuul
Branch: master

commit 3ba91a22a4167790b9c723d21fdf5f5d94dd6443
Author: Steve Baker <email address hidden>
Date: Mon Jul 13 13:01:26 2020 +1200

    Capture metalsmith python logging

    The metalsmith_instances module calls into metalsmith and there is
    currently no logging mechanism to see what is happening. This change
    is an attempt to improve this by configuring logging and adding the
    output to the result dict as 'logging'.

    Change-Id: I3da2e72383787f96bf4d930e416b8fb5c0a4ff72
    Related-Bug: #1879472

Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Changed in tripleo:
milestone: wallaby-2 → wallaby-3
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by "Julia Kreger <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/769201
