tripleo

Bug #1947403
Comment #22

Comment 22 for bug 1947403

Revision history for this message

Harald Jensås (harald-jensas) wrote on 2021-10-28:

#22

Looking at the BMC log[1] the instances took very long to shutdown:

Instances power off time:
cf56643a-932d-468a-95f5-e46cacd743a0 :: 71.645662 seconds
3098f5ef-ceec-4b30-b54e-169a999444f7 :: 65.899861 seconds
d9ececd0-50c5-4910-8619-df004e7c68c5 :: 64.710398 seconds
5196ab60-6fb3-433b-a2c9-20ed85578577 :: 46.557931 seconds

62.203463 seconds/avg time to power off

If we compare that to a master job[2]:
32153366-01d4-44fe-af19-b3c8513241bc :: 68.145963 seconds
6301e44a-092c-4d90-956f-cbe6da0120da :: 120.570997 seconds
a2c99a5b-3b8f-427a-bd05-c1c6f7def36f :: 120.436390 seconds
36029b10-d981-4d02-83fc-8f5f8df8a351 :: 70.800235 seconds

94.98839625 seconds/avg time to power off

Now, another difference between train[1] and master[2] is that on master the BMC events occur in a different order.

In the train job we can see it is setting boot device to network, then comes the power off:

[ 3156.139144] openstackbmc[8149]: Set boot device to network
[ 3176.987111] openstackbmc[8192]: Powered off 5196ab60-6fb3-433b-a2c9-20ed85578577

In the master job[2] the "Set boot device to network" happens after the nodes has been powered off:

[ 2322.935851] openstackbmc[8281]: Powered off 32153366-01d4-44fe-af19-b3c8513241bc
[ 2391.081814] openstackbmc[8281]: Reporting power state "False" for instance 32153366-01d4-44fe-af19-b3c8513241bc
[ 2403.399431] openstackbmc[8281]: Set boot device to network

So, why does it work on master despite OVB nodes actually taking even longer to power off?

The workflow is different, the master job uses os_tripleo_baremetal_node_introspection which power off nodes prior to starting inspection in prepare_for_attempt[3] which has a very long timeout[4].

1200 seconds !!!!!! Thats 20 minutes!!

Another interesting observation is that the initial power off in OVB seem to be very slow, but subsequent power offs are much quicker at ~5 seconds!!

INITIAL for node 36029b10-d981-4d02-83fc-8f5f8df8a351: (2397.319949-2326.519714 = 70.800235)
[ 2326.519714] openstackbmc[8186]: Powered off 36029b10-d981-4d02-83fc-8f5f8df8a351
[ 2397.319949] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-d981-4d02-83fc-8f5f8df8a351
SECOND for node 36029b10-d981-4d02-83fc-8f5f8df8a351: (2553.212485−2547.906511 = 5.305974)
[ 2547.906511] openstackbmc[8186]: Powered off 36029b10-d981-4d02-83fc-8f5f8df8a351
[ 2553.212485] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-d981-4d02-83fc-8f5f8df8a351

INITIAL for node a2c99a5b-3b8f-427a-bd05-c1c6f7def36f: (2443.568209-2323.131819 = 120.43639)
[ 2323.131819] openstackbmc[8146]: Powered off a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
[ 2443.568209] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
SECOND for node a2c99a5b-3b8f-427a-bd05-c1c6f7def36f: (2609.103883-2603.517865 = 5.586018)
[ 2603.517865] openstackbmc[8146]: Powered off a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
[ 2609.103883] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-3b8f-427a-bd05-c1c6f7def36f

Conclusion:
- bump power_state_change_timeout to 180 seconds in train?
- Since initial/secondary poweroffs have such a big timedifference. It may be an idea to implement a change in openstack-virtual-baremetal that will power off the nodes when openstackbmc is initially started.

[1] https://logserver.rdoproject.org/56/36356/3/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/13d5d98/logs/bmc_3_25330-console.log
[2] https://logserver.rdoproject.org/50/815650/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/08e0d06/logs/bmc_3_37470-console.log
[3] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/ansible_plugins/modules/os_tripleo_baremetal_node_introspection.py#L255-L277
[4] https://opendev.org/openstack/tripleo-ansible/src/branch/master/tripleo_ansible/ansible_plugins/modules/os_tripleo_baremetal_node_introspection.py#L60-L64

Looking at the BMC log[1] the instances took very long to shutdown:

Instances power off time:
 cf56643a-932d-468a-95f5-e46cacd743a0 :: 71.645662 seconds
 3098f5ef-ceec-4b30-b54e-169a999444f7 :: 65.899861 seconds
 d9ececd0-50c5-4910-8619-df004e7c68c5 :: 64.710398 seconds
 5196ab60-6fb3-433b-a2c9-20ed85578577 :: 46.557931 seconds

62.203463 seconds/avg time to power off

If we compare that to a master job[2]:
 32153366-01d4-44fe-af19-b3c8513241bc ::  68.145963 seconds 
 6301e44a-092c-4d90-956f-cbe6da0120da :: 120.570997 seconds
 a2c99a5b-3b8f-427a-bd05-c1c6f7def36f :: 120.436390 seconds
 36029b10-d981-4d02-83fc-8f5f8df8a351 ::  70.800235 seconds

94.98839625 seconds/avg time to power off

Now, another difference between train[1] and master[2] is that on master the BMC events occur in a different order.

In the train job we can see it is setting boot device to network, then comes the power off:

[ 3156.139144] openstackbmc[8149]: Set boot device to network
[ 3176.987111] openstackbmc[8192]: Powered off 5196ab60-6fb3-433b-a2c9-20ed85578577

In the master job[2] the "Set boot device to network" happens after the nodes has been powered off:

So, why does it work on master despite OVB nodes actually taking even longer to power off?

The workflow is different, the master job uses os_tripleo_baremetal_node_introspection which power off nodes prior to starting inspection in prepare_for_attempt[3] which has a very long timeout[4].

1200 seconds !!!!!! Thats 20 minutes!!

Another interesting observation is that the initial power off in OVB seem to be very slow, but subsequent power offs are much quicker at ~5 seconds!!