CS8 Wallaby - ovb-3ctlr_1comp-featureset035 and featureset001 failing on node provision with: "conductor take over"

Bug #1970484 reported by Douglas Viroel
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby has been failing for a while now [1], but the latest results always show the same issue, in the node provision step [2][3]:

2022-04-26 12:09:19.270488 | fa163e51-ef20-2adc-b71f-00000000001a | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port overcloud-controller-0-ctlplane (UUID f28a6197-2ce2-4edb-87d3-b28c9e81e802) for node baremetal-39960-29-50167-2 (UUID 0356b750-b99a-4597-a296-c9ab37e5b72e) with {'network_id': '48407598-91cf-4eec-9ef8-0d59dea95df3', 'name': 'overcloud-controller-0-ctlplane'}\nCreated port overcloud-controller-2-ctlplane (UUID 28294131-ccaa-46ed-9f5b-067a526ae9de) for node baremetal-39960-29-50167-3 (UUID 96671d3b-4467-45a2-bf04-99f128d928a0) with {'network_id': '48407598-91cf-4eec-9ef8-0d59dea95df3', 'name': 'overcloud-controller-2-ctlplane'}\nCreated port overcloud-controller-1-ctlplane (UUID 65be5d00-6f12-40db-8b63-ca9404ccfd03) for node baremetal-39960-29-50167-0 (UUID 917f5589-d7c8-408e-8e39-6d90d4d6d6db) with {'network_id': '48407598-91cf-4eec-9ef8-0d59dea95df3', 'name': 'overcloud-controller-1-ctlplane'}\nCreated port overcloud-novacompute-0-ctlplane (UUID c071437e-5a58-490f-8a83-bf9bab91e419) for node baremetal-39960-29-50167-1 (UUID 78616239-f41d-474e-9877-4a0f3adaaaac) with {'network_id': '48407598-91cf-4eec-9ef8-0d59dea95df3', 'name': 'overcloud-novacompute-0-ctlplane'}\nAttached port overcloud-controller-0-ctlplane (UUID f28a6197-2ce2-4edb-87d3-b28c9e81e802) to node baremetal-39960-29-50167-2 (UUID 0356b750-b99a-4597-a296-c9ab37e5b72e)\nAttached port overcloud-controller-1-ctlplane (UUID 65be5d00-6f12-40db-8b63-ca9404ccfd03) to node baremetal-39960-29-50167-0 (UUID 917f5589-d7c8-408e-8e39-6d90d4d6d6db)\nAttached port overcloud-controller-2-ctlplane (UUID 28294131-ccaa-46ed-9f5b-067a526ae9de) to node baremetal-39960-29-50167-3 (UUID 96671d3b-4467-45a2-bf04-99f128d928a0)\nProvisioning started on node baremetal-39960-29-50167-2 (UUID 0356b750-b99a-4597-a296-c9ab37e5b72e)\nAttached port overcloud-novacompute-0-ctlplane (UUID c071437e-5a58-490f-8a83-bf9bab91e419) to node baremetal-39960-29-50167-1 (UUID 
78616239-f41d-474e-9877-4a0f3adaaaac)\nProvisioning started on node baremetal-39960-29-50167-0 (UUID 917f5589-d7c8-408e-8e39-6d90d4d6d6db)\nProvisioning started on node baremetal-39960-29-50167-1 (UUID 78616239-f41d-474e-9877-4a0f3adaaaac)\nProvisioning started on node baremetal-39960-29-50167-3 (UUID 96671d3b-4467-45a2-bf04-99f128d928a0)\n", "msg": "Node 78616239-f41d-474e-9877-4a0f3adaaaac reached failure state \"deploy failed\"; the last error is Operation was aborted due to conductor take over"}

The full error from ironic-conductor is the following[4]:
ERROR ironic.conductor.task_manager [req-d28f7173-3be8-43bc-a610-1064171b7abe - - - - -] Node 78616239-f41d-474e-9877-4a0f3adaaaac moved to provision state "deploy failed" from state "deploying"; target provision state is "active"
WARNING ironic.conductor.utils [req-d28f7173-3be8-43bc-a610-1064171b7abe - - - - -] Aborted the current operation on node 78616239-f41d-474e-9877-4a0f3adaaaac due to conductor take over

[1] https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby
[2] https://logserver.rdoproject.org/60/39960/29/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/5a93650/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[3] https://logserver.rdoproject.org/openstack-periodic-integration-stable1-cs8/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/10dbf10/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz
[4] https://logserver.rdoproject.org/60/39960/29/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/5a93650/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

Revision history for this message
Ronelle Landy (rlandy) wrote :

https://logserver.rdoproject.org/15/15761b77d91ab3e398f6fa1d10d2f5da267b7931/openstack-periodic-integration-stable1-cs8/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/2f5dc02/logs/undercloud/home/zuul/overcloud_node_provision.log.txt.gz

This is still happening on wallaby c8:

be8241777191)\nProvisioning started on node baremetal-26955-0 (UUID 9187de8c-59e1-46d1-8fac-d6cb28fca0b4)\nProvisioning started on node baremetal-26955-3 (UUID ae5c76e4-f754-43f1-a33e-0cd0bb4382e7)\n", "msg": "Node ae5c76e4-f754-43f1-a33e-0cd0bb4382e7 reached failure state \"deploy failed\"; the last error is Operation was aborted due to conductor take over"}

summary: - CS8 Wallaby - ovb-3ctlr_1comp-featureset035 failing on node provision
- with: "conductor take over"
+ CS8 Wallaby - ovb-3ctlr_1comp-featureset035 and featureset001 failing
+ on node provision with: "conductor take over"
Revision history for this message
Harald Jensås (harald-jensas) wrote :

Ironic did not receive heartbeats from the conductor in time, and as a result the conductor was considered offline. By default, conductor.heartbeat_timeout = 60 and conductor.heartbeat_interval = 10. This heartbeat is simply a write of an updated_at timestamp to the database.
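To make the timing concrete, here is a minimal sketch of the kind of check described above. It is not Ironic's actual code: the helper name `is_offline` and the strict `>` comparison are assumptions for illustration, and only the 10/60 second defaults come from this comment.

```python
from datetime import datetime, timedelta, timezone

# Defaults quoted in this thread, in seconds.
HEARTBEAT_INTERVAL = 10
HEARTBEAT_TIMEOUT = 60


def is_offline(updated_at, now, timeout=HEARTBEAT_TIMEOUT):
    """Return True if the last heartbeat write is older than the timeout.

    Hypothetical helper for illustration, not part of Ironic.
    """
    return (now - updated_at) > timedelta(seconds=timeout)


now = datetime(2022, 4, 26, 12, 9, 0, tzinfo=timezone.utc)

# Last updated_at written 30 s ago: well inside the 60 s window.
fresh = is_offline(now - timedelta(seconds=30), now)

# DB write stalled for 75 s (for example under heavy load): treated as offline.
stalled = is_offline(now - timedelta(seconds=75), now)

print(fresh, stalled)
```

With the defaults, roughly six consecutive heartbeat writes have to be missed or delayed before the timeout is exceeded.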

Looking at dstat, I can see load peaking around the time the issue occurred.
https://logserver.rdoproject.org/15/15761b77d91ab3e398f6fa1d10d2f5da267b7931/openstack-periodic-integration-stable1-cs8/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001-wallaby/2f5dc02/logs/undercloud/var/log/extra/dstat.html.gz#

Ideally, Ironic should exclude the current conductor when checking for offline conductors. It does not, so the failure can occur whenever the conductor is unable to write its heartbeat to the DB for a long period.
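A rough sketch of the exclusion idea, under stated assumptions: the function `offline_conductors`, its `exclude` parameter, and the hostnames are all hypothetical and only illustrate the suggestion; they do not reflect how Ironic queries its database.

```python
from datetime import datetime, timedelta, timezone


def offline_conductors(heartbeats, now, timeout=60, exclude=None):
    """Return hostnames whose last heartbeat is older than `timeout` seconds.

    `heartbeats` maps hostname -> last updated_at timestamp.
    `exclude` is the hostname of the conductor running the check; it is
    skipped so that a conductor never declares itself offline.
    Hypothetical helper, not Ironic's API.
    """
    limit = now - timedelta(seconds=timeout)
    return sorted(
        host
        for host, updated_at in heartbeats.items()
        if updated_at < limit and host != exclude
    )


now = datetime(2022, 4, 26, 12, 9, 0, tzinfo=timezone.utc)
heartbeats = {
    "undercloud": now - timedelta(seconds=90),   # local conductor, DB write stalled
    "other-host": now - timedelta(seconds=200),  # genuinely gone
}

without_exclusion = offline_conductors(heartbeats, now)
with_exclusion = offline_conductors(heartbeats, now, exclude="undercloud")
print(without_exclusion, with_exclusion)
```

In the single-conductor undercloud case, the excluded variant would leave no conductor to take over from, so an in-flight deploy would not be aborted by a late heartbeat alone.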

Initially, can we try to bump the conductor.heartbeat_interval and conductor.heartbeat_timeout settings in CI? Maybe use 20/120, i.e. 2x the default values?
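For reference, a small sketch that renders the proposed values as an ironic.conf-style `[conductor]` section and checks the arithmetic. The option names and defaults are the ones quoted above; how the values are actually applied in the CI jobs is not shown in this thread, so this only illustrates the resulting settings.

```python
import configparser
import io

# Defaults quoted in this thread (seconds) and the proposed 2x bump.
DEFAULTS = {"heartbeat_interval": 10, "heartbeat_timeout": 60}
FACTOR = 2

config = configparser.ConfigParser()
config["conductor"] = {key: str(value * FACTOR) for key, value in DEFAULTS.items()}

# Render the section the way it would appear in an INI-style config file.
buf = io.StringIO()
config.write(buf)
rendered = buf.getvalue()
print(rendered)

interval = config["conductor"].getint("heartbeat_interval")
timeout = config["conductor"].getint("heartbeat_timeout")

# Number of heartbeat intervals that fit in one timeout window.
intervals_per_timeout = timeout // interval
print(intervals_per_timeout)
```

Doubling both values keeps the same ratio of six intervals per timeout, but the absolute window in which a stalled DB write is tolerated grows from 60 to 120 seconds.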

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/839526
Committed: https://opendev.org/openstack/tripleo-quickstart-extras/commit/adb8f8a12fa8d782f1cee7851e1a05382fc15a69
Submitter: "Zuul (22348)"
Branch: master

commit adb8f8a12fa8d782f1cee7851e1a05382fc15a69
Author: Douglas Viroel <email address hidden>
Date: Wed Apr 27 10:19:13 2022 -0300

    Increase ironic conductor heartbeat times

    This patch increases ironic conductor heartbeat times
    to avoid conductor being considered offline, on high load
    systems. This patch is a workaround to avoid node provision
    failures on OVB jobs.

    Related-Bug: #1970484

    Change-Id: I9d1e8d0d6b50c0a5524bba8588c06af573cff780

Revision history for this message
Marios Andreou (marios-b) wrote :

The bug is no longer blocking us with the heartbeat time bumps in https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/839526

There may still be more investigation needed to see if something further is required (i.e. the root cause of why the bump became necessary now), but moving to Fix Released for now. Please move it back if you are investigating here.

Changed in tripleo:
status: Triaged → Fix Released
status: Fix Released → Won't Fix
status: Won't Fix → Fix Released