StarlingX

cpu_shared_set and cpu_dedicated_set values are wrongly set

Bug #1928683 reported by Thiago Paiva Brito on 2021-05-17

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	Thiago Paiva Brito

Bug Description

Brief Description
-----------------

After installing OpenStack, I tried to launch a VM in a test. The VM was created, but it was impossible to ping it, and after that the host could not be accessed without rebooting. Alarms are raised for CPU usage reaching 98%, and "top -c" shows that the kvm-qemu processes are being scheduled on platform cores.
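
As a rough cross-check of what "top -c" shows, the CPU affinity of the qemu processes can be read from /proc. The sketch below is only an illustration: the helper names are hypothetical, it assumes a Linux host where /proc/<pid>/status exposes the Name and Cpus_allowed_list fields, and it matches any process whose name contains "qemu".

```python
from pathlib import Path


def status_field(status_text, field):
    """Return the value of one field from /proc/<pid>/status content, or None."""
    prefix = field + ":"
    for line in status_text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):].strip()
    return None


def qemu_affinities(proc_root="/proc"):
    """Yield (pid, name, allowed CPU list) for processes whose name contains 'qemu'."""
    for status in Path(proc_root).glob("[0-9]*/status"):
        try:
            text = status.read_text()
        except OSError:
            # The process may have exited while we were scanning.
            continue
        name = status_field(text, "Name") or ""
        if "qemu" in name:
            yield status.parent.name, name, status_field(text, "Cpus_allowed_list")


if __name__ == "__main__":
    for pid, name, cpus in qemu_affinities():
        print(pid, name, cpus)
```

If the list printed for a qemu process includes the cores reserved for the platform, the VM is not confined to the application cores.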

Severity
--------

Critical: It is not possible to launch VMs

Steps to Reproduce
------------------

Install stx-openstack
Launch VMs

Expected Behavior
-----------------

Create VM and make it available/reachable

Actual Behavior
---------------

After creating a VM, it is not possible to ping it, and the controller stops responding because CPU usage reaches 98%.

Reproducibility
---------------

Reproducible. It might take a few attempts, since the VM can be scheduled on any other core.

System Configuration
--------------------

AIO-SX

Branch/Pull Time/Commit
-----------------------

Last Pass
---------

Unknown

Timestamp/Logs
--------------

openstack version if necessary
[sysadmin@controller-0 scratch(keystone_admin)]$ openstack --version
openstack 4.0.0

Active alarms:

Alarm ID:    400.002
Reason Text: Service group web-services has no active members available; expected 1 active member
Entity ID:   service_domain=controller.service_group=web-services
Severity:    critical
Time Stamp:  2021-05-17T02:25:33.105148

Alarm ID:    800.010
Reason Text: Potential data loss. No available OSDs in storage replication group group-0: OSDs are down
Entity ID:   cluster=615b863d-b996-4962-805f-cc389fdc39e6.peergroup=group-0.host=controller-0
Severity:    critical
Time Stamp:  2021-05-17T02:25:32.306562

Alarm ID:    800.001
Reason Text: Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details.
Entity ID:   cluster=615b863d-b996-4962-805f-cc389fdc39e6
Severity:    warning
Time Stamp:  2021-05-17T02:25:32.118213

Alarm ID:    100.101
Reason Text: Platform CPU threshold exceeded ; threshold 95.00%, actual 99.96%
Entity ID:   host=controller-0
Severity:    critical
Time Stamp:  2021-05-17T01:42:32.217693

controller-0:~$ grep "platform cpu usage" /var/log/daemon.log |tail -1
2021-05-17T01:58:02.156 controller-0 collectd[109364]: info platform cpu usage plugin Usage: 98.6% (avg per cpu); cpus: 2, Platform: 139.6% (Base: 136.2, k8s-system: 3.3), k8s-addon: 13.9

controller-0:~$ lscpu | grep -e 'CPU(s)'
CPU(s): 28
On-line CPU(s) list: 0-27
NUMA node0 CPU(s): 0-13
NUMA node1 CPU(s): 14-27
controller-0:~$

* nova.conf on nova-compute-0 container
[root@controller-0 /]# head -40 /etc/nova/nova.conf
[DEFAULT]
allow_resize_to_same_host = true
block_device_allocate_retries = 2400
block_device_allocate_retries_interval = 3
compute_driver = libvirt.LibvirtDriver
compute_monitors = cpu.virt_driver
cpu_allocation_ratio = 16
cpu_dedicated_set = "4-27"
cpu_shared_set = "4-27"
default_ephemeral_format = ext4
default_mempages_size = 2048
disk_allocation_ratio = 1
enable_new_services = false
firewall_driver = nova.virt.firewall.NoopFirewallDriver
instance_usage_audit = true
instance_usage_audit_period = hour
linuxnet_interface_driver = openvswitch
log_config_append = /etc/nova/logging.conf
long_rpc_timeout = 400
map_new_hosts = false
metadata_host = ::
metadata_listen = ::
metadata_port = 80
metadata_workers = 1
mkisofs_cmd = /usr/bin/genisoimage
my_ip = 192.168.206.2
network_allocate_retries = 2
notify_on_state_change = vm_and_task_state
osapi_compute_listen = ::
osapi_compute_listen_port = 8774
osapi_compute_workers = 1
ram_allocation_ratio = 1
remove_unused_original_minimum_age_seconds = 3600
reserved_host_memory_mb = 11048
reserved_huge_pages = node:0,size:4,count:2048000
reserved_huge_pages = node:0,size:1048576,count:1
reserved_huge_pages = node:1,size:4,count:256000
reserved_huge_pages = node:1,size:1048576,count:1
resume_guests_state_on_host_boot = true
running_deleted_instance_poll_interval = 60
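
Note that in the dump above cpu_dedicated_set and cpu_shared_set carry the same value. The following is a minimal sketch of how the two values can be compared. The parser is a simplified stand-in written for this illustration (it is not Nova's own parsing code) and assumes the usual CPU list syntax of comma-separated ids, ranges, and "^" exclusions.

```python
def parse_cpu_set(spec):
    """Expand a CPU list such as "0-3,8,^2" into a set of CPU ids."""
    included, excluded = set(), set()
    # Tolerate the surrounding double quotes seen in the nova.conf dump.
    for item in spec.strip().strip('"').split(","):
        item = item.strip()
        if not item:
            continue
        target = excluded if item.startswith("^") else included
        first, _, last = item.lstrip("^").partition("-")
        target.update(range(int(first), int(last or first) + 1))
    return included - excluded


# Values copied from the [DEFAULT] section shown in this report.
dedicated = parse_cpu_set('"4-27"')
shared = parse_cpu_set('"4-27"')

overlap = dedicated & shared
print(len(overlap))  # 24
```

A non-empty overlap means the same cores are being offered both for pinned and for floating vCPUs, which is a sign that one of the two values was generated from the wrong source.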

Test Activity
-------------

Feature Testing

Workaround
----------

No workaround

Tags:

OpenStack Infra (hudson-openstack) on 2021-05-17

Changed in starlingx:
status:	New → In Progress

Thiago Paiva Brito (outbrito) wrote on 2021-05-17:

I figured out that the change that introduced cpu_shared_set and cpu_dedicated_set put them in the wrong section of nova.conf. They should be in the [compute] section, not in [DEFAULT]: https://github.com/openstack/nova/blob/stable/ussuri/nova/conf/compute.py#L317

Opened a review to put those configs in the right config section, remove `shared_pcpu_map`, a legacy config that is no longer in use, and fix the value of `cpu_dedicated_set`, which was using the wrong variable: https://review.opendev.org/c/starlingx/openstack-armada-app/+/791526

Already tested with a custom build and the VMs are now being scheduled on the right cores.
This probably will need to be cherry-picked to stx.5.0.
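
For anyone wanting to check a deployed nova.conf for the same misplacement, here is a small sketch using Python's standard configparser. The function name is hypothetical and the approach is an assumption of this illustration, not part of the fix under review. [DEFAULT] is deliberately treated as an ordinary section so that its options are not inherited by [compute], and strict mode is disabled because nova.conf repeats keys such as reserved_huge_pages.

```python
import configparser

CPU_OPTIONS = ("cpu_shared_set", "cpu_dedicated_set")


def misplaced_cpu_options(conf_text):
    """Return CPU set options present in [DEFAULT] but missing from [compute]."""
    parser = configparser.ConfigParser(
        strict=False,                  # allow repeated keys
        interpolation=None,            # do not treat '%' specially
        default_section="__unused__",  # parse [DEFAULT] like any other section
    )
    parser.read_string(conf_text)
    misplaced = []
    for option in CPU_OPTIONS:
        in_default = parser.has_option("DEFAULT", option)
        in_compute = parser.has_option("compute", option)
        if in_default and not in_compute:
            misplaced.append(option)
    return misplaced


if __name__ == "__main__":
    with open("/etc/nova/nova.conf") as conf_file:
        print(misplaced_cpu_options(conf_file.read()))
```

An empty list suggests the options are either absent or already under [compute]; any name in the list is still sitting in [DEFAULT], where it is not read as a [compute] option.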

Changed in starlingx:
assignee:	nobody → Thiago Paiva Brito (outbrito)

Ghada Khalil (gkhalil) on 2021-05-17

Changed in starlingx:
importance:	Undecided → High
tags:	added: stx.5.0 stx.distro.openstack

Ghada Khalil (gkhalil) wrote on 2021-05-17:

screening: This was introduced by the code changes for LP: https://bugs.launchpad.net/starlingx/+bug/1904729 which was merged on 2021-04-09 and included in the stx.5.0 release, so the fix will need to be cherrypicked to the r/stx.5.0 release branch.

tags:

added: stx.6.0

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/791526
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/963e63cd55d5be4f5ddfc148ae00b6a46e071295
Submitter: "Zuul (22348)"
Branch: master

commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

Fix cpu_shared/dedicated_set config location

    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.

Closes-Bug: #1928683

Signed-off-by: Thiago Brito <email address hidden>
Change-Id: I541760619f4c79c66a2bf22715afdc873b8343ce

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) wrote on 2021-05-19:

@Thiago, please cherrypick the fix to the r/stx.5.0 release branch

tags:

added: stx.cherrypickneeded

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to openstack-armada-app (r/stx.5.0)

Fix proposed to branch: r/stx.5.0
Review: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792185

OpenStack Infra (hudson-openstack) wrote on 2021-05-19: Fix proposed to openstack-armada-app (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792235

OpenStack Infra (hudson-openstack) wrote on 2021-05-20: Fix merged to openstack-armada-app (r/stx.5.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/792185
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/61fc9116940ddfb9e477eedb7e8acce04338242f
Submitter: "Zuul (22348)"
Branch: r/stx.5.0

commit 61fc9116940ddfb9e477eedb7e8acce04338242f
Author: Thiago Brito <email address hidden>
Date: Fri May 14 15:36:07 2021 -0300

Fix cpu_shared/dedicated_set config location

Closes-Bug: #1928683

    Signed-off-by: Thiago Brito <email address hidden>
    (cherry picked from commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295)
    Change-Id: I6bb7dee74e18b2889d683757adb8bb91987f45db

Ghada Khalil (gkhalil) on 2021-05-20

tags:

added: in-r-stx50
removed: stx.cherrypickneeded

OpenStack Infra (hudson-openstack) wrote on 2021-06-01: Fix merged to openstack-armada-app (f/centos8)

Reviewed:  https://review.opendev.org/c/starlingx/openstack-armada-app/+/792235
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/1bf694661282a019bf79f253fc148baede65db64
Submitter: "Zuul (22348)"
Branch:    f/centos8

commit 963e63cd55d5be4f5ddfc148ae00b6a46e071295
Author: Thiago Brito <thiago.brito@windriver.com>
Date:   Fri May 14 15:36:07 2021 -0300

Fix cpu_shared/dedicated_set config location
    
    Change I61514389b616db754b0d2f35deb0101f90dbdd02 removed the deprecated
    property vcpu_pin_set in favor of the newer cpu_shared_set and
    cpu_dedicated_set, but those new configs are placed under the [compute]
    section of nova.conf instead of [DEFAULT]. This is causing VMs to be
    scheduled on platform reserved cores. This commit will fix it.
    
    Closes-Bug: #1928683
    
    Signed-off-by: Thiago Brito <thiago.brito@windriver.com>
    Change-Id: I541760619f4c79c66a2bf22715afdc873b8343ce

commit 58f4d9ffcaf47fe969267149135201aec01624a8
Author: Gustavo Santos <gustavofaganello.santos@windriver.com>
Date:   Mon Mar 8 14:56:55 2021 -0300

Add k8s proxy-body-size to horizon overrides
    
    The current network.dashboard.ingress.annotations in horizon's
    values.yaml helm charts do not include the kubernetes property
    'proxy-body-size'. This makes the resulting nginx.conf file in ingress
    add the default rule 'max_body_size 1m' to the horizon servers,
    which limits all http requests' size inside horizon to 1MiB, making it
    impossible to upload images larger than that to glance using the
    horizon GUI, for example.
    
    This change adds said property to the horizon overrides, making
    horizon's servers in nginx.conf include a 'max_body_size' of 2500MiB,
    which makes uploading images up to that size possible again.
    
    Story: 2008692
    Task: 41996
    Change-Id: I91888ce238d5304c08eb1e97918989b8f93ee34f

commit b5c1f62088778287e4b50aeac1f17d166a7a177a
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Wed Feb 3 16:00:47 2021 +0200

Introduce metadata for app behavior control
    
    Keep existing behavior when evaluating app reapplies.
    
    Story: 2007960
    Task: 41755
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>
    Change-Id: Ie02743cdf056dda3feb66911c74f9dabe69d98dd

commit eab750b7ff03808002acf35deebdf762e687b332
Author: Martin, Chen <haochuan.z.chen@intel.com>
Date:   Sat May 30 09:11:05 2020 +0800

Add override setting in openstack helm plugin for rook-ceph
    
    Deploy with rook-ceph, without "system storage-backend-add ceph"
    there is no object storage-ceph in database. As current openstack
    helm plugin fixed on object storage-ceph, in rook-ceph case
    use a fixed override setting
    
    Story: 2005527
    Task: 39914
    
    Depends-On: https://review.opendev.org/#/c/713084/
    
    Change-Id: Ied852d60e8b15d55865747e0b6f4b54f2392d6df
    Signed-off-by: Martin, Chen <haochuan.z.chen@intel.com>

commit 852d8d61dbfc4f9f29afe8da10924731a58028ea
Author: Dan Voiculeasa <dan.voiculeasa@windriver.com>
Date:   Mon Nov 16 12:41:55 2020 +0200

Introduce lifecycle operator to openstack app
    
    A big chunk of logic is moved from sysinv conductor to application
    itself.
    
    Following hooks were necessary:
    pre-apply, post-apply, pre-manifest-apply, pre-apply-rbd,
    pre-apply-resource, post-remove-rbd, post-remove-resource, post-remove
    
    Change-Id: I41858c831a4af564dbdf38934d51d34489bf8a9a
    Story: 2007960
    Task: 41293
    Signed-off-by: Dan Voiculeasa <dan.voiculeasa@windriver.com>

commit b64b0204465dc51c57b45bf9925b0fbb88749a45
Author: hbrito <hugo.brito@windriver.com>
Date:   Fri Dec 18 13:52:51 2020 +0000

Increase proxy-connect-timeout to avoid nginx timeout errors
    
    This patch increases the proxy-connect-timeout from 5 to 30 seconds,
    avoiding the Bad Gateway 502 error when CLI commands are executed.
    
    Closes-bug: 1908720
    Change-Id: I557456e9d0550a906b6d849d682de7ea3f0f42ad
    Signed-off-by: hbrito <hugo.brito@windriver.com>

commit ca527c227653a13e617122051c18591aa1212f98
Author: Don Penney <don.penney@windriver.com>
Date:   Wed Jan 6 14:35:03 2021 -0500

Remove empty package from python-k8sapp-openstack
    
    Packages defined in a spec with no files do not result in an RPM
    produced by the build. On a rebuild, the build tools scan the spec and
    sees the package defined but does not find a corresponding RPM, and so
    flags the package for a rebuild as a result.
    
    This commit removes the empty package definition from the spec.
    
    Partial-Bug: 1910439
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: Ie1f18b1592f8187900624d993434ba04b23cbcff

commit cb9854c701ab631628902ae1e9d9e76f0e2785b0
Author: Zhipeng Liu <zhipengs.liu@intel.com>
Date:   Thu Dec 24 22:21:41 2020 +0800

Update cpu_shared_set and cpu_dedicated_set in nova config
    
    Starting from Ussuri, OpenStack is deprecating vcpu_pin_set
    in favor of cpu_dedicated_set and cpu_shared_set. These
    overriders must be supported to be generated via Starlingx
    system commands.
    
    Closes-Bug: 1904729
    Change-Id: I61514389b616db754b0d2f35deb0101f90dbdd02
    Signed-off-by: Zhipeng Liu <zhipengs.liu@intel.com>

commit 0c30ffc410d9ca720ce80cb5fb08ae81adf05d2a
Author: Don Penney <don.penney@windriver.com>
Date:   Thu Dec 17 13:23:11 2020 -0500

Add auto-version for remaining stx/openstack-armada-app packages
    
    Update remaining StarlingX packages with hardcoded TIS_PATCH_VER to
    use PKG_GITREVCOUNT where possible, with offsets as needed to ensure
    the version is incremented above the hardcoded version.
    
    Story: 2008455
    Task: 41455
    Signed-off-by: Don Penney <don.penney@windriver.com>
    Change-Id: Icdc9d71d1268a4d3dd9e569c8642717bceadda5e

commit fc68439414816b2384aed1e88120713f645db8d8
Author: Mihnea Saracin <Mihnea.Saracin@windriver.com>
Date:   Fri Sep 11 18:46:48 2020 +0300

Fix apply of stx-openstack when host is locked
    
    Currently, all of the stx-openstack services have the
    replica count set to the number of the controllers.
    If one of the controllers is locked their replicas
    number will still be 2 which is incorrect.
    We solve this by changing the number of replicas
    to be equal to the number of the active controllers.
    The rabbitmq and mariadb services cannot use this approach because
    they are unable to work properly if their replica number
    is decreased from 2 to 1. So a kubernetes toleration
    is used here to allow the rabbitmq and mariadb pods to be
    deployed on the locked controller.
    
    Change-Id: I15cf2a3f62525751435ddbe66760935f3ab21d2b
    Closes-Bug: 1879018
    Signed-off-by: Mihnea Saracin <Mihnea.Saracin@windriver.com>

tags:

added: in-f-centos8
