Source: https://bugzilla.redhat.com/show_bug.cgi?id=1849975
Description of problem:
All the OSP service containers go down after resizing the vCPU and memory resources of the virtual undercloud.
Version-Release number of selected component (if applicable):
Red Hat OpenStack Platform release 16.1.0 Beta (Train)
Red Hat Enterprise Linux release 8.2 (Ootpa)
How reproducible: 100% reproducible in a scale lab environment
Steps to Reproduce:
1. Deployed the OSP 16.1 Beta undercloud in a VM with the following resources.
# virsh dumpxml osp_16_Director|grep -E "vcpu|memory"
<memory unit='KiB'>209715200</memory>
<vcpu placement='static'>54</vcpu>
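The <memory> value printed by virsh dumpxml is in KiB. A small helper (illustrative only, not part of any tooling) converts it to GiB, confirming the VM was sized at 200 GiB / 54 vCPUs at deployment:

```shell
# Convert a KiB value (as printed by `virsh dumpxml`) to whole GiB.
kib_to_gib() {
    echo $(( $1 / 1024 / 1024 ))
}

kib_to_gib 209715200   # pre-resize memory -> 200 (GiB)
```

The same helper reads the post-resize value below at a glance (67108864 KiB -> 64 GiB).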
2. OSP service container count before the resize.
# podman ps --format "{{.ID}}" | wc -l
45
3. Resized the undercloud from 54 to 24 vCPUs (and memory from 200 GiB to 64 GiB, per the dumpxml output below); after a reboot, all the containers went down.
# virsh dumpxml osp_16_Director|grep -E "vcpu|memory"
<memory unit='KiB'>67108864</memory>
<vcpu placement='static'>24</vcpu>
# podman ps --format "{{.ID}}" | wc -l
0
# systemctl list-units| grep -i failed
_ tripleo_glance_api.service loaded failed failed glance_api container
_ tripleo_haproxy.service loaded failed failed haproxy container
_ tripleo_heat_api.service loaded failed failed heat_api container
_ tripleo_heat_api_cron.service loaded failed failed heat_api_cron container
_ tripleo_heat_engine.service loaded failed failed heat_engine container
_ tripleo_ironic_api.service loaded failed failed ironic_api container
_ tripleo_ironic_conductor.service loaded failed failed ironic_conductor container
_ tripleo_ironic_inspector.service loaded failed failed ironic_inspector container
_ tripleo_ironic_inspector_dnsmasq.service loaded failed failed ironic_inspector_dnsmasq container
_ tripleo_ironic_neutron_agent.service loaded failed failed ironic_neutron_agent container
_ tripleo_ironic_pxe_http.service loaded failed failed ironic_pxe_http container
_ tripleo_ironic_pxe_tftp.service loaded failed failed ironic_pxe_tftp container
_ tripleo_iscsid.service loaded failed failed iscsid container
_ tripleo_keepalived.service loaded failed failed keepalived container
_ tripleo_keystone.service loaded failed failed keystone container
_ tripleo_logrotate_crond.service loaded failed failed logrotate_crond container
_ tripleo_memcached.service loaded failed failed memcached container
_ tripleo_mistral_api.service loaded failed failed mistral_api container
_ tripleo_mistral_engine.service loaded failed failed mistral_engine container
_ tripleo_mistral_event_engine.service loaded failed failed mistral_event_engine container
_ tripleo_mistral_executor.service loaded failed failed mistral_executor container
_ tripleo_mysql.service loaded failed failed mysql container
_ tripleo_neutron_api.service loaded failed failed neutron_api container
_ tripleo_neutron_dhcp.service loaded failed failed neutron_dhcp container
_ tripleo_neutron_l3_agent.service loaded failed failed neutron_l3_agent container
_ tripleo_neutron_ovs_agent.service loaded failed failed neutron_ovs_agent container
_ tripleo_nova_api.service loaded failed failed nova_api container
_ tripleo_nova_api_cron.service loaded failed failed nova_api_cron container
_ tripleo_nova_compute.service loaded failed failed nova_compute container
_ tripleo_nova_conductor.service loaded failed failed nova_conductor container
_ tripleo_nova_scheduler.service loaded failed failed nova_scheduler container
_ tripleo_placement_api.service loaded failed failed placement_api container
_ tripleo_rabbitmq.service loaded failed failed rabbitmq container
_ tripleo_swift_account_reaper.service loaded failed failed swift_account_reaper container
_ tripleo_swift_account_server.service loaded failed failed swift_account_server container
_ tripleo_swift_container_server.service loaded failed failed swift_container_server container
_ tripleo_swift_container_updater.service loaded failed failed swift_container_updater container
_ tripleo_swift_object_expirer.service loaded failed failed swift_object_expirer container
_ tripleo_swift_object_server.service loaded failed failed swift_object_server container
_ tripleo_swift_object_updater.service loaded failed failed swift_object_updater container
_ tripleo_swift_proxy.service loaded failed failed swift_proxy container
_ tripleo_swift_rsync.service loaded failed failed swift_rsync container
_ tripleo_zaqar.service loaded failed failed zaqar container
_ tripleo_zaqar_websocket.service loaded failed failed zaqar_websocket container
_ tripleo_heat_api_healthcheck.timer loaded failed failed heat_api container healthcheck
_ tripleo_mistral_executor_healthcheck.timer loaded failed failed mistral_executor container healthcheck
_ tripleo_nova_conductor_healthcheck.timer loaded failed failed nova_conductor container healthcheck
_ tripleo_swift_proxy_healthcheck.timer loaded failed failed swift_proxy container healthcheck
4. It looks like all the containers are pinned to the entire set of cores that was available in the VM at deployment time:
http://paste.openstack.org/show/795089/
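The pinning from point 4 can be checked directly: `podman inspect --format '{{.HostConfig.CpusetCpus}}' <container>` prints each container's cpuset spec (e.g. "0-53"). The helper below is a sketch (not part of any TripleO tooling) that expands such a spec and counts its CPUs, so the result can be compared against `nproc` after a resize:

```shell
# Count the CPUs in a cpuset spec such as "0-53" or "0-3,8-11".
cpuset_count() {
    local spec=$1 part lo hi total=0
    IFS=',' read -r -a parts <<< "$spec"
    for part in "${parts[@]}"; do
        if [[ $part == *-* ]]; then
            lo=${part%-*}
            hi=${part#*-}
            total=$(( total + hi - lo + 1 ))
        else
            total=$(( total + 1 ))
        fi
    done
    echo "$total"
}

cpuset_count "0-53"      # -> 54: more CPUs than the resized VM has online
cpuset_count "0-3,8-11"  # -> 8
```

A loop like `for c in $(podman ps -aq); do podman inspect --format '{{.HostConfig.CpusetCpus}}' "$c"; done` (hypothetical; requires podman on the undercloud) would show whether every container still carries the pre-resize "0-53" pinning.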
5. After restoring the vCPU count to 54 and rebooting, all containers came back up and running as expected.
It looks like the vCPU mapping in the containers' "CpusetCpus" property is a limitation for a virtual undercloud when the VM's resources are resized at runtime.
We need a solution that keeps the containers up and running if the undercloud CPU count gets resized.
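As an illustration of the kind of fix needed (the actual change is in the review linked below; this helper is purely hypothetical), a corrected cpuset spec could be derived from the CPU count that is actually online after the resize, instead of the count recorded at deployment:

```shell
# Hypothetical helper: build a cpuset spec matching a given online CPU count,
# e.g. 24 CPUs -> "0-23". Containers recreated with such a spec would fit the
# resized VM instead of referencing CPUs that no longer exist.
clamp_cpuset() {
    local ncpu=$1
    if [ "$ncpu" -le 1 ]; then
        echo "0"
    else
        echo "0-$(( ncpu - 1 ))"
    fi
}

clamp_cpuset 24   # -> "0-23"
```

In practice one would feed it the live count, e.g. `clamp_cpuset "$(nproc)"`; the recovery used in this report (restoring the original 54 vCPUs) avoids the problem rather than fixing it.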
Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/737527