overlapping pinned CPUs after unshelve
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Unassigned |
Bug Description
It seems that after unshelve, the request for a dedicated CPU is occasionally ignored. More precisely, the first instance's pinned CPU does not seem to be marked as consumed, so the second instance may end up pinned to the same CPU. This was first observed on victoria (6 times out of 46 tries), but I was then able to reproduce it on master too (6 times out of 20 tries). The attached logs are from the victoria environment, which was a single-host all-in-one devstack running only the VMs used for this reproduction.
stable/victoria
devstack 3eb6e2d7
nova 1aca09b966
master
devstack b10c0602
nova 2aea80c0af
config:
[[post-
[DEFAULT]
scheduler_
[compute]
cpu_dedicated_set = 0,1
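The effective value on the compute host can also be double-checked directly (a quick sketch; /etc/nova/nova.conf is the devstack default path, adjust if yours differs):
$ sudo grep cpu_dedicated_set /etc/nova/nova.conf
cpu_dedicated_set = 0,1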
Confirming this config in placement:
$ openstack --os-placement-
+------------------+-------+
| Field            | Value |
+------------------+-------+
| allocation_ratio | 1.0   |
| min_unit         | 1     |
| max_unit         | 2     |
| reserved         | 0     |
| step_size        | 1     |
| total            | 2     |
+------------------+-------+
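Placement's view of how much of that PCPU inventory is actually consumed can be checked at any point and compared with the host pinning (a sketch; the provider UUID is the one reported by `openstack resource provider list`):
$ openstack resource provider usage show <compute RP uuid>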
Reproduction steps:
openstack flavor create cirros256-pinned --public --vcpus 1 --ram 256 --disk 1 --property hw_rng:allowed=True --property hw:cpu_
openstack server list -f value -c ID | xargs -r openstack server delete --wait
openstack server create --flavor cirros256-pinned --image cirros-
openstack server shelve vm0
sleep 10 # make sure shelve finished
openstack server create --flavor cirros256-pinned --image cirros-
openstack server shelve vm1
sleep 10
openstack server unshelve vm0 ; sleep 15 ; openstack server unshelve vm1 # the amount of sleep could easily be relevant
watch openstack server list # wait until both go ACTIVE
# both VMs ended up on the same CPU
$ for vm in $( sudo virsh list --name ) ; do sudo virsh dumpxml $vm | xmlstarlet sel -t -v '//vcpupin/@cpuset' ; echo ; done
0
0
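To flag the overlap without eyeballing the per-VM output, the same virsh/xmlstarlet query can be piped through uniq (a sketch; on a healthy run this prints nothing, here it prints the doubly-used CPU):
$ for vm in $( sudo virsh list --name ) ; do sudo virsh dumpxml $vm | xmlstarlet sel -t -v '//vcpupin/@cpuset' ; echo ; done | sort | uniq -d
0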
Data collected from the environment where the above reproduction triggered the bug:
$ openstack server list
+------
| ID | Name | Status | Networks | Image | Flavor |
+------
| 4734b8a5-
| e30de509-
+------
$ openstack server show vm0
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | active |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=10.0.0.6, fdfb:ab27:
| config_drive | |
| created | 2023-06-
| flavor | cirros256-pinned (a61fac7f-
| hostId | a660a4498447eda
| id | e30de509-
| image | cirros-
| key_name | None |
| name | vm0 |
| progress | 0 |
| project_id | 0ad9573f6c57482
| properties | |
| security_groups | name='default' |
| status | ACTIVE |
| updated | 2023-06-
| user_id | 971105d66de5425
| volumes_attached | |
+------
$ openstack server show vm1
+------
| Field | Value |
+------
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-
| OS-EXT-STS:vm_state | active |
| OS-SRV-
| OS-SRV-
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=10.0.0.27, fdfb:ab27:
| config_drive | |
| created | 2023-06-
| flavor | cirros256-pinned (a61fac7f-
| hostId | a660a4498447eda
| id | 4734b8a5-
| image | cirros-
| key_name | None |
| name | vm1 |
| progress | 0 |
| project_id | 0ad9573f6c57482
| properties | |
| security_groups | name='default' |
| status | ACTIVE |
| updated | 2023-06-
| user_id | 971105d66de5425
| volumes_attached | |
+------
$ openstack flavor show cirros256-pinned
+------
| Field | Value |
+------
| OS-FLV-
| OS-FLV-
| access_project_ids | None |
| disk | 1 |
| id | a61fac7f-
| name | cirros256-pinned |
| os-flavor-
| properties | hw:cpu_
| ram | 256 |
| rxtx_factor | 1.0 |
| swap | |
| vcpus | 1 |
+------
$ openstack server event list vm0
+------
| Request ID | Server ID | Action | Start Time |
+------
| req-e09149a8-
| req-4628f975-
| req-d7595149-
+------
$ openstack server event show vm0 req-e09149a8-
+------
| Field | Value |
+------
| action | unshelve |
| events | [{'event': 'compute_
| | '2023-06-
| | '2023-06-
| instance_uuid | e30de509-
| message | None |
| project_id | 0ad9573f6c57482
| request_id | req-e09149a8-
| start_time | 2023-06-
| user_id | 971105d66de5425
+------
$ openstack server event show vm0 req-4628f975-
+------
| Field | Value |
+------
| action | shelve |
| events | [{'event': 'compute_
| instance_uuid | e30de509-
| message | None |
| project_id | 0ad9573f6c57482
| request_id | req-4628f975-
| start_time | 2023-06-
| user_id | 971105d66de5425
+------
$ openstack server event list vm1
+------
| Request ID | Server ID | Action | Start Time |
+------
| req-9ab0eabc-
| req-1dc9b980-
| req-5071fd9f-
+------
$ openstack server event show vm1 req-9ab0eabc-
+------
| Field | Value |
+------
| action | unshelve |
| events | [{'event': 'compute_
| | '2023-06-
| | '2023-06-
| instance_uuid | 4734b8a5-
| message | None |
| project_id | 0ad9573f6c57482
| request_id | req-9ab0eabc-
| start_time | 2023-06-
| user_id | 971105d66de5425
+------
$ openstack server event show vm1 req-1dc9b980-
+------
| Field | Value |
+------
| action | shelve |
| events | [{'event': 'compute_
| instance_uuid | 4734b8a5-
| message | None |
| project_id | 0ad9573f6c57482
| request_id | req-1dc9b980-
| start_time | 2023-06-
| user_id | 971105d66de5425
+------
Logs that seemed relevant to me:
vm0 unshelve:
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
jún 29 10:45:19 devstack1v nova-compute[
vm1 unshelve:
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
jún 29 10:45:37 devstack1v nova-compute[
The first error in the logs came not during unshelve but well after:
jún 29 10:47:27 devstack1v nova-compute[
be a subset of free CPU set [1]
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:47:28 devstack1v nova-compute[
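For reference, this is roughly how I located that error (a sketch, using the same journalctl flags as the log collection script further below):
$ sudo journalctl -u devstack@n-cpu -S '2023-06-29 10:45:00' | grep -B2 'subset of free CPU set'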
Between the two unshelves the update_available_resource periodic task also ran (at 10:45:24):
$ egrep 'Running periodic task ComputeManager.
jún 29 10:35:16 devstack1v nova-compute[
jún 29 10:36:17 devstack1v nova-compute[
jún 29 10:37:17 devstack1v nova-compute[
jún 29 10:38:19 devstack1v nova-compute[
jún 29 10:39:20 devstack1v nova-compute[
jún 29 10:40:22 devstack1v nova-compute[
jún 29 10:41:22 devstack1v nova-compute[
jún 29 10:42:23 devstack1v nova-compute[
jún 29 10:43:24 devstack1v nova-compute[
jún 29 10:44:24 devstack1v nova-compute[
jún 29 10:45:24 devstack1v nova-compute[
jún 29 10:46:26 devstack1v nova-compute[
jún 29 10:47:27 devstack1v nova-compute[
jún 29 10:48:28 devstack1v nova-compute[
jún 29 10:49:28 devstack1v nova-compute[
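If it would help to rule the periodic task in or out, its frequency can be tuned; as far as I know this is controlled by the update_resources_interval option in nova.conf (a sketch, the value here is just an example):
[DEFAULT]
update_resources_interval = 10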
I also attached a .tar.gz with the following contents:
for service in n-api n-cond-cell1 n-cpu n-sch n-super-cond
do
sudo journalctl -u devstack@$service -S '2023-06-29 10:35:00' -U '2023-06-29 10:50:00' > ~/$service.log
done
$ sudo virsh list
 Id   Name                State
------------------------------------
 63   instance-0000001f   running
 64   instance-00000020   running
sudo virsh dumpxml instance-0000001f > ~/instance-
sudo virsh dumpxml instance-00000020 > ~/instance-
Let me know if I can collect more information.
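One more thing I can pull on request is nova's own idea of the pinned CPUs from the compute_nodes record (a rough sketch; it assumes devstack's nova_cell1 database and that local mysql access works, both of which may differ):
$ sudo mysql nova_cell1 -e 'SELECT numa_topology FROM compute_nodes WHERE deleted=0\G' | grep -o '"pinned_cpus":[^]]*]'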
By the way, here is the state of placement allocations when the bug is present (on master):
$ openstack resource provider list
+--------------------------------------+-----------+------------+--------------------------------------+----------------------+
| uuid                                 | name      | generation | root_provider_uuid                   | parent_provider_uuid |
+--------------------------------------+-----------+------------+--------------------------------------+----------------------+
| 513a47b7-38af-4fa0-9199-3f6d16d600a6 | devstack0 |        108 | 513a47b7-38af-4fa0-9199-3f6d16d600a6 | None                 |
+--------------------------------------+-----------+------------+--------------------------------------+----------------------+
$ openstack resource provider show --allocations 513a47b7-38af-4fa0-9199-3f6d16d600a6
+----------------------+------------------------------------------------------------------------------------------------------+
| Field                | Value                                                                                                |
+----------------------+------------------------------------------------------------------------------------------------------+
| uuid                 | 513a47b7-38af-4fa0-9199-3f6d16d600a6                                                                 |
| name                 | devstack0                                                                                            |
| generation           | 108                                                                                                  |
| root_provider_uuid   | 513a47b7-38af-4fa0-9199-3f6d16d600a6                                                                 |
| parent_provider_uuid | None                                                                                                 |
| allocations          | {'29ea91aa-c1eb-4ea6-bf1a-6459b68be55a': {'resources': {'DISK_GB': 1, 'MEMORY_MB': 256, 'PCPU': 1}, |
|                      | 'consumer_generation': 1}, '6bdb6725-8c3e-422d-8c95-1b7459b2f4bd': {'resources': {'DISK_GB': 1,     |
|                      | 'MEMORY_MB': 256, 'PCPU': 1}, 'consumer_generation': 1}}                                            |
+----------------------+------------------------------------------------------------------------------------------------------+
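The same data can also be dumped per consumer, which might be easier to compare against the server list (a sketch; the consumer UUID is the server ID):
$ openstack resource provider allocation show <server uuid>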