cpu_shared_set and cpu_dedicated_set values are wrongly set

Bug #1928683 reported by Thiago Paiva Brito
This bug affects 1 person
Affects    Status        Importance  Assigned to         Milestone
StarlingX  Fix Released  High        Thiago Paiva Brito

Bug Description

Brief Description
-----------------

After installing OpenStack, I launched a VM as part of a test. The VM was created, but it could not be pinged, and afterwards the host could not be accessed without rebooting. Alarms are raised because platform CPU usage has reached 98%, and "top -c" shows the kvm-qemu processes being scheduled on platform cores.

Severity
--------

Critical: It is not possible to launch VMs

Steps to Reproduce
------------------

Install stx-openstack
Launch VMs

Expected Behavior
-----------------

Create VM and make it available/reachable

Actual Behavior
---------------

After creating a VM, it is not possible to ping it, and the controller stops responding because CPU usage reaches 98%.

Reproducibility
---------------

Reproducible. It might take a few attempts, since the VM can be scheduled on any other core.

System Configuration
--------------------

AIO-SX

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Unknown

Timestamp/Logs
--------------

openstack version if necessary
[sysadmin@controller-0 scratch(keystone_admin)]$ openstack --version
openstack 4.0.0

Alarm ID  Reason Text                                   Entity ID                                      Severity  Time Stamp
--------  --------------------------------------------  ---------------------------------------------  --------  --------------------------
400.002   Service group web-services has no active      service_domain=controller.                     critical  2021-05-17T02:25:33.105148
          members available; expected 1 active member   service_group=web-services
800.010   Potential data loss. No available OSDs in     cluster=615b863d-b996-4962-805f-cc389fdc39e6.  critical  2021-05-17T02:25:32.306562
          storage replication group group-0: OSDs are   peergroup=group-0.host=controller-0
          down
800.001   Storage Alarm Condition: HEALTH_WARN. Please  cluster=615b863d-b996-4962-805f-cc389fdc39e6   warning   2021-05-17T02:25:32.118213
          check 'ceph -s' for more details.
100.101   Platform CPU threshold exceeded ; threshold   host=controller-0                              critical  2021-05-17T01:42:32.217693
          95.00%, actual 99.96%

controller-0:~$ grep "platform cpu usage" /var/log/daemon.log |tail -1
2021-05-17T01:58:02.156 controller-0 collectd[109364]: info platform cpu usage plugin Usage: 98.6% (avg per cpu); cpus: 2, Platform: 139.6% (Base: 136.2, k8s-system: 3.3), k8s-addon: 13.9

controller-0:~$ lscpu | grep -e 'CPU(s)'
CPU(s): 28
On-line CPU(s) list: 0-27
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
controller-0:~$

* nova.conf on nova-compute-0 container
[root@controller-0 /]# head -40 /etc/nova/nova.conf
[DEFAULT]
allow_resize_to_same_host = true
block_device_allocate_retries = 2400
block_device_allocate_retries_interval = 3
compute_driver = libvirt.LibvirtDriver
compute_monitors = cpu.virt_driver
cpu_allocation_ratio = 16
cpu_dedicated_set = "4-27"
cpu_shared_set = "4-27"
default_ephemeral_format = ext4
default_mempages_size = 2048
disk_allocation_ratio = 1
enable_new_services = false
firewall_driver = nova.virt.firewall.NoopFirewallDriver
instance_usage_audit = true
instance_usage_audit_period = hour
linuxnet_interface_driver = openvswitch
log_config_append = /etc/nova/logging.conf
long_rpc_timeout = 400
map_new_hosts = false
metadata_host = ::
metadata_listen = ::
metadata_port = 80
metadata_workers = 1
mkisofs_cmd = /usr/bin/genisoimage
my_ip = 192.168.206.2
network_allocate_retries = 2
notify_on_state_change = vm_and_task_state
osapi_compute_listen = ::
osapi_compute_listen_port = 8774
osapi_compute_workers = 1
ram_allocation_ratio = 1
remove_unused_original_minimum_age_seconds = 3600
reserved_host_memory_mb = 11048
reserved_huge_pages = node:0,size:4,count:2048000
reserved_huge_pages = node:0,size:1048576,count:1
reserved_huge_pages = node:1,size:4,count:256000
reserved_huge_pages = node:1,size:1048576,count:1
resume_guests_state_on_host_boot = true
running_deleted_instance_poll_interval = 60
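
To make the CPU layout in the excerpt above easier to read, here is a minimal Python sketch that expands the "4-27" values and compares them with the 0-27 online CPU list reported by lscpu. The helper name `parse_cpu_set` is hypothetical, and it only understands comma-separated ids and simple ranges, which is an assumption that covers the values in this report but not necessarily every syntax nova accepts.

```python
def parse_cpu_set(spec):
    """Expand a CPU list such as '"4-27"' or '0,2-3' into a set of ints.

    Hypothetical helper: handles only plain ids and 'lo-hi' ranges.
    """
    cpus = set()
    for part in spec.strip().strip('"').split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


# Values copied from the lscpu output and the nova.conf excerpt in this report.
online = parse_cpu_set("0-27")
dedicated = parse_cpu_set('"4-27"')
shared = parse_cpu_set('"4-27"')

# CPUs that neither set hands to nova.
outside_nova_sets = sorted(online - (dedicated | shared))

print("CPUs outside both sets:", outside_nova_sets)
print("dedicated and shared sets identical:", dedicated == shared)
```

With the values from this report, CPUs 0-3 are the only ones outside both sets (presumably the platform cores, although the report does not list them), and the dedicated and shared sets are identical.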

Test Activity
-------------

Feature Testing

Workaround
----------

No workaround

Changed in starlingx:
status: New → In Progress
Thiago Paiva Brito (outbrito) wrote :

I figured out that the change that introduced cpu_shared_set and cpu_dedicated_set put them in the wrong section of nova.conf. They should be in the [compute] section, not in [DEFAULT]: https://github.com/openstack/nova/blob/stable/ussuri/nova/conf/compute.py#L317

Opened a review to put those configs in the right config section, remove `shared_pcpu_map`, a legacy config that is no longer in use, and fix the value of `cpu_dedicated_set`, which was using the wrong variable: https://review.opendev.org/c/starlingx/openstack-armada-app/+/791526

Already tested with a custom build and the VMs are now being scheduled on the right cores.
This probably will need to be cherry-picked to stx.5.0.
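
A quick way to spot this misplacement in a rendered nova.conf could look like the following Python sketch. It is not part of the fix under review; the function name `misplaced_cpu_options` and the two sample configs are hypothetical, and the values in the "fixed" sample are placeholders, since this report does not show the corrected ones.

```python
import configparser

# Options that, per the comment above, belong in [compute].
CPU_OPTIONS = ("cpu_dedicated_set", "cpu_shared_set")


def misplaced_cpu_options(conf_text):
    """Return CPU set options present in [DEFAULT] but absent from [compute]."""
    parser = configparser.ConfigParser(
        strict=False,                  # nova.conf repeats keys (reserved_huge_pages)
        interpolation=None,            # keep values raw
        delimiters=("=",),             # values may contain ':'
        default_section="__unused__",  # treat [DEFAULT] as an ordinary section
    )
    parser.read_string(conf_text)
    return [
        opt
        for opt in CPU_OPTIONS
        if parser.has_option("DEFAULT", opt)
        and not parser.has_option("compute", opt)
    ]


# Shaped like the excerpt in the bug description.
BROKEN = """[DEFAULT]
cpu_dedicated_set = "4-27"
cpu_shared_set = "4-27"
reserved_huge_pages = node:0,size:4,count:2048000
reserved_huge_pages = node:0,size:1048576,count:1
"""

# Hypothetical corrected layout; the values are placeholders.
FIXED = """[DEFAULT]
compute_driver = libvirt.LibvirtDriver

[compute]
cpu_dedicated_set = "6-27"
cpu_shared_set = "4-5"
"""

print(misplaced_cpu_options(BROKEN))
print(misplaced_cpu_options(FIXED))
```

The parser is told to use a dummy default section because configparser otherwise makes every [DEFAULT] option visible in all other sections, which would hide exactly the mistake being checked for.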

Changed in starlingx:
assignee: nobody → Thiago Paiva Brito (outbrito)
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.5.0 stx.distro.openstack
Ghada Khalil (gkhalil) wrote :

screening: This was introduced by the code changes for LP: https://bugs.launchpad.net/starlingx/+bug/1904729 which was merged on 2021-04-09 and included in the stx.5.0 release, so the fix will need to be cherrypicked to the r/stx.5.0 release branch.

tags: added: stx.6.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/791526
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/963e63cd55d5be4f5ddfc148ae00b6a46e071295
Submitter: "Zuul (22348)"
Branch: master

commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

    Fix cpu_shared/dedicated_set config location

    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.

    Closes-Bug: #1928683

    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I541760619f4c79c66a2bf22715afdc873b8343ce

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

@Thiago, please cherrypick the fix to the r/stx.5.0 release branch

tags: added: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.5.0)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/centos8)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792185
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/61fc9116940ddfb9e477eedb7e8acce04338242f
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 61fc9116940ddfb9e477eedb7e8acce04338242f
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

    Fix cpu_shared/dedicated_set config location

    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.

    Closes-Bug: #1928683

    Signed-off-by: Thiago Brito <email address hidden>
    (cherry picked from commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295)
    Change-Id: I6bb7dee74e18b2889d683757adb8bb91987f45db

Ghada Khalil (gkhalil)
tags: added: in-r-stx50
removed: stx.cherrypickneeded
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792235
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/1bf694661282a019bf79f253fc148baede65db64
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

    Fix cpu_shared/dedicated_set config location

    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.

    Closes-Bug: #1928683

    Signed-off-by: Thiago Brito <email address hidden>
    Change-Id: I541760619f4c79c66a2bf22715afdc873b8343ce

commit 58f4d9ffcaf47fe969267149135201aec01624a8
Author: Gustavo Santos <email address hidden>
Date: Mon Mar 8 14:56:55 2021 -0300

    Add k8s proxy-body-size to horizon overrides

    The current network.dashboard.ingress.annotations in horizon's
    values.yaml helm charts do not include the kubernetes property
    'proxy-body-size'. This makes the resulting nginx.conf file in ingress
    add the default rule 'max_body_size 1m' to the horizon servers,
    which limits all http requests' size inside horizon to 1MiB, making it
    impossible to upload images larger than that to glance using the
    horizon GUI, for example.

    This change adds said property to the horizon overrides, making
    horizon's servers in nginx.conf include a 'max_body_size' of 2500MiB,
    which makes uploading images up to that size possible again.

    Story: 2008692
    Task: 41996
    Change-Id: I91888ce238d5304c08eb1e97918989b8f93ee34f

commit b5c1f62088778287e4b50aeac1f17d166a7a177a
Author: Dan Voiculeasa <email address hidden>
Date: Wed Feb 3 16:00:47 2021 +0200

    Introduce metadata for app behavior control

    Keep existing behavior when evaluating app reapplies.

    Story: 2007960
    Task: 41755
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: Ie02743cdf056dda3feb66911c74f9dabe69d98dd

commit eab750b7ff03808002acf35deebdf762e687b332
Author: Martin, Chen <email address hidden>
Date: Sat May 30 09:11:05 2020 +0800

    Add override setting in openstack helm plugin for rook-ceph

    Deploy with rook-ceph, without "system storage-backend-add ceph"
    there is no object storage-ceph in database. As current openstack
    helm plugin fixed on object storage-ceph, in rook-ceph case
    use a fixed override setting

    Story: 2005527
    Task: 39914

    Depends-On: https://review.opendev.org/#/c/713084/

    Change-Id: Ied852d60e8b15d55865747e0b6f4b54f2392d6df
    Signed-off-by: Martin, Chen <email address hidden>

commit 852d8d61dbfc4f9f29afe8da10924731a58028ea
Author: Dan Voiculeasa <email address hidden>
Date: Mon Nov 16 12:41:55 2020 +0200

    Introduce lifecycle operator to openstack app

    A big chunk...


tags: added: in-f-centos8