Source: https://bugzilla.redhat.com/show_bug.cgi?id=1849975
Description of problem:
All the OSP service containers go down after resizing the vCPU and memory resources of the virtual undercloud.
Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.0 Beta (Train)
Red Hat Enterprise Linux release 8.2 (Ootpa)
How reproducible: 100% reproducible in a scale lab environment
Steps to Reproduce:
1. Deployed the OSP 16.1 Beta undercloud in a VM with the following resources.
# virsh dumpxml osp_16_Director|grep -E "vcpu|memory"
<memory unit='KiB'>209715200</memory>
<vcpu placement='static'>54</vcpu>
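The <memory> value printed by virsh dumpxml is in KiB. A small helper (illustrative only, not part of any tooling) converts it to GiB, confirming the VM was sized at 200 GiB / 54 vCPUs at deployment:

```shell
# Convert a KiB value (as printed by `virsh dumpxml`) to whole GiB.
kib_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

kib_to_gib 209715200   # pre-resize memory -> 200 (GiB)
```

The same helper reads the post-resize value below at a glance (67108864 KiB -> 64 GiB).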
2. OSP service container count before the resize.
# podman ps --format "{{.ID}}" | wc -l
45
3. Resized the undercloud from 54 to 24 vCPUs (and memory from 200 GiB to 64 GiB, per the dumpxml output below); after a reboot, all the containers went down.
# virsh dumpxml osp_16_Director|grep -E "vcpu|memory"
<memory unit='KiB'>67108864</memory>
<vcpu placement='static'>24</vcpu>
# podman ps --format "{{.ID}}" | wc -l
0
# systemctl list-units| grep -i failed
_ tripleo_glance_api.service loaded failed failed glance_api container
_ tripleo_haproxy.service loaded failed failed haproxy container
_ tripleo_heat_api.service loaded failed failed heat_api container
_ tripleo_heat_api_cron.service loaded failed failed heat_api_cron container
_ tripleo_heat_engine.service loaded failed failed heat_engine container
_ tripleo_ironic_api.service loaded failed failed ironic_api container
_ tripleo_ironic_conductor.service loaded failed failed ironic_conductor container
_ tripleo_ironic_inspector.service loaded failed failed ironic_inspector container
_ tripleo_ironic_inspector_dnsmasq.service loaded failed failed ironic_inspector_dnsmasq container
_ tripleo_ironic_neutron_agent.service loaded failed failed ironic_neutron_agent container
_ tripleo_ironic_pxe_http.service loaded failed failed ironic_pxe_http container
_ tripleo_ironic_pxe_tftp.service loaded failed failed ironic_pxe_tftp container
_ tripleo_iscsid.service loaded failed failed iscsid container
_ tripleo_keepalived.service loaded failed failed keepalived container
_ tripleo_keystone.service loaded failed failed keystone container
_ tripleo_logrotate_crond.service loaded failed failed logrotate_crond container
_ tripleo_memcached.service loaded failed failed memcached container
_ tripleo_mistral_api.service loaded failed failed mistral_api container
_ tripleo_mistral_engine.service loaded failed failed mistral_engine container
_ tripleo_mistral_event_engine.service loaded failed failed mistral_event_engine container
_ tripleo_mistral_executor.service loaded failed failed mistral_executor container
_ tripleo_mysql.service loaded failed failed mysql container
_ tripleo_neutron_api.service loaded failed failed neutron_api container
_ tripleo_neutron_dhcp.service loaded failed failed neutron_dhcp container
_ tripleo_neutron_l3_agent.service loaded failed failed neutron_l3_agent container
_ tripleo_neutron_ovs_agent.service loaded failed failed neutron_ovs_agent container
_ tripleo_nova_api.service loaded failed failed nova_api container
_ tripleo_nova_api_cron.service loaded failed failed nova_api_cron container
_ tripleo_nova_compute.service loaded failed failed nova_compute container
_ tripleo_nova_conductor.service loaded failed failed nova_conductor container
_ tripleo_nova_scheduler.service loaded failed failed nova_scheduler container
_ tripleo_placement_api.service loaded failed failed placement_api container
_ tripleo_rabbitmq.service loaded failed failed rabbitmq container
_ tripleo_swift_account_reaper.service loaded failed failed swift_account_reaper container
_ tripleo_swift_account_server.service loaded failed failed swift_account_server container
_ tripleo_swift_container_server.service loaded failed failed swift_container_server container
_ tripleo_swift_container_updater.service loaded failed failed swift_container_updater container
_ tripleo_swift_object_expirer.service loaded failed failed swift_object_expirer container
_ tripleo_swift_object_server.service loaded failed failed swift_object_server container
_ tripleo_swift_object_updater.service loaded failed failed swift_object_updater container
_ tripleo_swift_proxy.service loaded failed failed swift_proxy container
_ tripleo_swift_rsync.service loaded failed failed swift_rsync container
_ tripleo_zaqar.service loaded failed failed zaqar container
_ tripleo_zaqar_websocket.service loaded failed failed zaqar_websocket container
_ tripleo_heat_api_healthcheck.timer loaded failed failed heat_api container healthcheck
_ tripleo_mistral_executor_healthcheck.timer loaded failed failed mistral_executor container healthcheck
_ tripleo_nova_conductor_healthcheck.timer loaded failed failed nova_conductor container healthcheck
_ tripleo_swift_proxy_healthcheck.timer loaded failed failed swift_proxy container healthcheck
4. It looks like all the containers are pinned to the entire set of cores that was available in the VM at deployment time:
http://paste.openstack.org/show/795089/
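The pinning from point 4 can be checked directly: `podman inspect --format '{{.HostConfig.CpusetCpus}}' <container>` prints each container's cpuset spec (e.g. "0-53"). The helper below is a sketch (not part of any TripleO tooling) that expands such a spec and counts its CPUs, so the result can be compared against `nproc` after a resize:

```shell
# Count the CPUs in a cpuset spec such as "0-53" or "0-3,8-11".
cpuset_count() {
    local spec=$1 part lo hi total=0
    IFS=',' read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        if [[ $part == *-* ]]; then
            lo=${part%-*}
            hi=${part#*-}
            total=$(( total + hi - lo + 1 ))
        else
            total=$(( total + 1 ))
        fi
    done
    echo "$total"
}

cpuset_count "0-53"      # -> 54: more CPUs than the resized VM has online
cpuset_count "0-3,8-11"  # -> 8
```

A loop like `for c in $(podman ps -aq); do podman inspect --format '{{.HostConfig.CpusetCpus}}' "$c"; done` (hypothetical; requires podman on the undercloud) would show whether every container still carries the pre-resize "0-53" pinning.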
5. After restoring the vCPU count to 54 and rebooting, all containers came back up and running as expected.
It looks like the vCPU mapping in the containers' "CpusetCpus" property is a limitation for a virtual undercloud when the VM's resources are resized at runtime.
We need a solution that keeps the containers up and running if the undercloud CPU count gets resized.
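As an illustration of the kind of fix needed (the actual change is in the review linked below; this helper is purely hypothetical), a corrected cpuset spec could be derived from the CPU count that is actually online after the resize, instead of the count recorded at deployment:

```shell
# Hypothetical helper: build a cpuset spec matching a given online CPU count,
# e.g. 24 CPUs -> "0-23". Containers recreated with such a spec would fit the
# resized VM instead of referencing CPUs that no longer exist.
clamp_cpuset() {
    local ncpu=$1
    if [ "$ncpu" -le 1 ]; then
        echo "0"
    else
        echo "0-$(( ncpu - 1 ))"
    fi
}

clamp_cpuset 24   # -> "0-23"
```

In practice one would feed it the live count, e.g. `clamp_cpuset "$(nproc)"`; the recovery used in this report (restoring the original 54 vCPUs) avoids the problem rather than fixing it.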
Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/737527