Node provisioning fails first, then works on second attempt

Bug #1948760 reported by Cédric Jeanneret
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Using tripleo-lab[1], deploying master, I hit the following failure:

2021-10-25 15:21:39.565924 | 244201df-9f46-ff13-7f36-000000000018 | FATAL | Provision instances | localhost | error={"changed": false, "logging": "Created port oc0-controller-0-ctlplane (UUID e5345e08-3cf3-428d-9d95-f94763f52c29) for node oc0-controller-0 (UUID a0a95dde-1f4c-4eb6-b980-48651612e1e0) with {'network_id': '912e3738-9bda-4575-9f68-d9105be6d69e', 'name': 'oc0-controller-0-ctlplane'}\nCreated port oc0-compute-0-ctlplane (UUID 29aeb588-fd14-481d-b887-0b69013b6dd3) for node oc0-compute-0 (UUID 1650b2da-13ea-465f-a71e-774c8d69068e) with {'network_id': '912e3738-9bda-4575-9f68-d9105be6d69e', 'name': 'oc0-compute-0-ctlplane'}\nAttached port oc0-controller-0-ctlplane (UUID e5345e08-3cf3-428d-9d95-f94763f52c29) to node oc0-controller-0 (UUID a0a95dde-1f4c-4eb6-b980-48651612e1e0)\nAttached port oc0-compute-0-ctlplane (UUID 29aeb588-fd14-481d-b887-0b69013b6dd3) to node oc0-compute-0 (UUID 1650b2da-13ea-465f-a71e-774c8d69068e)\nProvisioning started on node oc0-controller-0 (UUID a0a95dde-1f4c-4eb6-b980-48651612e1e0)\nProvisioning started on node oc0-compute-0 (UUID 1650b2da-13ea-465f-a71e-774c8d69068e)\n", "msg": "Node 1650b2da-13ea-465f-a71e-774c8d69068e reached failure state \"deploy failed\"; the last error is Deploy step deploy.deploy failed: IPMI call failed: power on.."}

It fails every time on the first run. If I connect to the undercloud, run `openstack overcloud node unprovision --all --yes --stack overcloud-0 ~/metalsmith-0.yaml` and re-run the node provision script, it works without any issue.
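
For reference, the clean-up/retry cycle looks roughly like this. Only a sketch: the deployment file (~/metalsmith-0.yaml) comes from my tripleo-lab setup, and the output file name is just an example, adjust to your environment.

# tear down whatever the failed run left behind
openstack overcloud node unprovision --all --yes --stack overcloud-0 ~/metalsmith-0.yaml
# ...then provision again; this second pass goes through without any issue
openstack overcloud node provision --stack overcloud-0 --output ~/overcloud-baremetal-deployed-0.yaml ~/metalsmith-0.yaml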

It smells like some kind of race condition (see Ironic logs below).

When looking at the Ironic logs (specifically ironic-conductor.log), I can see IPMI-related errors:

2021-10-25 15:20:28.763 8 DEBUG oslo_concurrency.processutils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] CMD "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmpkj7z92mp power status" returned: 0 in 0.040s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2021-10-25 15:20:28.764 8 DEBUG ironic.common.utils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] Execution completed, command line is "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmpkj7z92mp power status" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:87
2021-10-25 15:20:28.765 8 DEBUG ironic.common.utils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] Command stdout is: "Chassis Power is off
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:88
2021-10-25 15:20:28.765 8 DEBUG ironic.common.utils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:89
2021-10-25 15:20:28.797 8 INFO ironic.conductor.deployments [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Deploying on node a0a95dde-1f4c-4eb6-b980-48651612e1e0, remaining steps: [{'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'write_image', 'priority': 80, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'prepare_instance_boot', 'priority': 60, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'tear_down_agent', 'priority': 40, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'switch_to_tenant_network', 'priority': 30, 'argsinfo': None, 'interface': 'deploy'}, {'step': 'boot_instance', 'priority': 20, 'argsinfo': None, 'interface': 'deploy'}]
2021-10-25 15:20:28.815 8 INFO ironic.conductor.deployments [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Executing {'step': 'deploy', 'priority': 100, 'argsinfo': None, 'interface': 'deploy'} on node a0a95dde-1f4c-4eb6-b980-48651612e1e0
2021-10-25 15:20:33.772 8 DEBUG oslo_concurrency.processutils [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 15566 -U ADMIN -R 1 -N 5 -f /tmp/tmp1r7rk8cd power status execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384
2021-10-25 15:20:33.787 8 DEBUG oslo_concurrency.processutils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] Running cmd (subprocess): ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmphqntdu5e power on execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:384
2021-10-25 15:20:33.817 8 DEBUG oslo_concurrency.processutils [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] CMD "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 15566 -U ADMIN -R 1 -N 5 -f /tmp/tmp1r7rk8cd power status" returned: 0 in 0.045s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2021-10-25 15:20:33.818 8 DEBUG ironic.common.utils [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Execution completed, command line is "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 15566 -U ADMIN -R 1 -N 5 -f /tmp/tmp1r7rk8cd power status" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:87
2021-10-25 15:20:33.818 8 DEBUG ironic.common.utils [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Command stdout is: "Chassis Power is off
" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:88
2021-10-25 15:20:33.819 8 DEBUG ironic.common.utils [req-0251ce4b-48d6-4721-a64d-c4dffa25af72 admin - - - -] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:89
2021-10-25 15:20:34.292 8 DEBUG oslo_concurrency.processutils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] CMD "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmphqntdu5e power on" returned: 1 in 0.506s execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:423
2021-10-25 15:20:34.292 8 DEBUG oslo_concurrency.processutils [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] 'ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmphqntdu5e power on' failed. Not Retrying. execute /usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py:474
2021-10-25 15:20:34.293 8 WARNING ironic.drivers.modules.ipmitool [req-1937e072-5fdd-4a77-83c9-b8f6e412841f admin - - - -] IPMI Error encountered, retrying "ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmphqntdu5e power on" for node 1650b2da-13ea-465f-a71e-774c8d69068e. Error: Unexpected error while running command.
Command: ipmitool -I lanplus -H 192.168.122.1 -L ADMINISTRATOR -p 16784 -U ADMIN -R 1 -N 5 -f /tmp/tmphqntdu5e power on
Exit code: 1
Stdout: ''
Stderr: 'Set Chassis Power Control to Up/On failed: Node busy\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.

As we can see, Ironic is able to detect that the node is off ("Chassis Power is off"), but is then unable to actually power it on.
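
To take Ironic itself out of the equation, the failing call can be replayed by hand against the vbmc endpoint. Only a sketch: host, port and privilege level are copied from the log above, while the password file path is whatever your vbmc credentials live in.

# same options Ironic passes to ipmitool; if the BMC side is the problem,
# "power on" should fail the same way while "power status" keeps working
ipmitool -I lanplus -H 192.168.122.1 -p 16784 -L ADMINISTRATOR -U ADMIN -f /path/to/ipmi_password_file power status
ipmitool -I lanplus -H 192.168.122.1 -p 16784 -L ADMINISTRATOR -U ADMIN -f /path/to/ipmi_password_file power on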

When I clean things up and re-run the provisioning, it just works; no more issue.

Cédric Jeanneret (cjeanner) wrote:

Side note: my environment is deployed with TLS-everywhere (FreeIPA and all). Maybe that has an impact; things appear to work as expected without this option.

Cédric Jeanneret (cjeanner) wrote:

Setting "use_ipmitool_retries=True" in ironic.conf doesn't help. It retries, but fails anyway.

Cédric Jeanneret (cjeanner) wrote:

After some more digging, the issue turned out to be caused by RAM over-provisioning on the hypervisor: vbmc was returning an error when trying to start the VM.

Reducing the memory footprint of the overcloud nodes leads to a clean success.
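
A rough way to spot that kind of overcommit on the hypervisor (only a sketch; the libvirt/vbmc layout is the one tripleo-lab sets up, adjust to your environment):

# compare available RAM with what the defined VMs would need once powered on
free -h
for vm in $(virsh list --all --name); do virsh dominfo "$vm" | grep -E '^(Name|Max memory)'; done
# vbmc lists the domains it manages; its log carries the libvirt error returned on "power on"
vbmc list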

Good to know!

Changed in tripleo:
status: Triaged → Invalid