Incorrect Shelved_offloaded instance metrics on openstack usage show output

Bug #1913641 reported by Rodrigo Barbieri
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

env: bionic-ussuri and bionic-wallaby (devstack)

When a project contains only shelved_offloaded instances, "openstack usage show --project <project>" keeps accumulating usage metrics as if the instances were running, even though they are not. See the output below:

$ openstack server list
+--------------------------------------+------+-------------------+--------------------------------------------------------+--------------------------+-----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------+-------------------+--------------------------------------------------------+--------------------------+-----------+
| a8d3fbb6-1734-4e3f-81db-b1c42a462bf7 | ins1 | SHELVED_OFFLOADED | private=10.0.0.30, fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0 | cirros-0.5.1-x86_64-disk | cirros256 |
+--------------------------------------+------+-------------------+--------------------------------------------------------+--------------------------+-----------+

$ openstack usage show --project admin

Usage from 2020-12-31 to 2021-01-29 on project 1bfc9c13d7da4a4183c0b16cfa80020f:
+---------------+-------+
| Field | Value |
+---------------+-------+
| CPU Hours | 0.04 |
| Disk GB-Hours | 0.04 |
| RAM MB-Hours | 9.43 |
| Servers | 1 |
+---------------+-------+

$ openstack server show ins1

+-------------------------------------+-----------------------------------------------------------------+
| Field | Value |
+-------------------------------------+-----------------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | |
| OS-EXT-SRV-ATTR:host | None |
| OS-EXT-SRV-ATTR:hypervisor_hostname | None |
| OS-EXT-SRV-ATTR:instance_name | instance-00000001 |
| OS-EXT-STS:power_state | Shutdown |
| OS-EXT-STS:task_state | None |
| OS-EXT-STS:vm_state | shelved_offloaded |
| OS-SRV-USG:launched_at | 2021-01-28T19:33:34.000000 |
| OS-SRV-USG:terminated_at | None |
| accessIPv4 | |
| accessIPv6 | |
| addresses | private=10.0.0.30, fd6b:5cf:38bb:0:f816:3eff:fe66:c5b0 |
| config_drive | |
| created | 2021-01-28T19:33:25Z |
| flavor | cirros256 (c1) |
| hostId | |
| id | a8d3fbb6-1734-4e3f-81db-b1c42a462bf7 |
| image | cirros-0.5.1-x86_64-disk (9e09f573-99f7-4f7c-bf16-47d475320207) |
| key_name | None |
| name | ins1 |
| project_id | 1bfc9c13d7da4a4183c0b16cfa80020f |
| properties | |
| security_groups | name='default' |
| status | SHELVED_OFFLOADED |
| updated | 2021-01-28T19:34:10Z |
| user_id | 1330bb68cfe147a580cc6085705ea319 |
| volumes_attached | |
+-------------------------------------+-----------------------------------------------------------------+

This happens because the logic in [0] relies on the terminated_at field to stop tracking the metrics. As the "openstack server show" output shows, shelving the instance does not fill in the terminated_at field.

[0] https://github.com/openstack/nova/blob/6c0ceda3659405149b7c0b5c283275ef0a896269/nova/api/openstack/compute/simple_tenant_usage.py#L74
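
The effect can be illustrated with a simplified sketch. This is not the Nova code at [0]; the function name billable_hours and its parameters are hypothetical and chosen only for illustration. It shows why an instance whose terminated_at is never set keeps accruing hours up to the end of the queried period:

```python
from datetime import datetime, timedelta
from typing import Optional


def billable_hours(launched_at: Optional[datetime],
                   terminated_at: Optional[datetime],
                   period_start: datetime,
                   period_stop: datetime) -> float:
    """Hours an instance is counted for within [period_start, period_stop].

    Hypothetical helper: only launched_at and terminated_at are consulted,
    so the instance state (e.g. shelved_offloaded) has no influence.
    """
    if launched_at is None or launched_at > period_stop:
        return 0.0
    if terminated_at is not None and terminated_at < period_start:
        return 0.0
    start = max(launched_at, period_start)
    # Without terminated_at, counting runs to the end of the period.
    end = period_stop if terminated_at is None else min(terminated_at, period_stop)
    return max((end - start).total_seconds(), 0.0) / 3600.0


# Timestamps loosely based on the report; the query times are made up.
launched = datetime(2021, 1, 28, 19, 33, 34)
period_start = datetime(2020, 12, 31)
first_query = datetime(2021, 1, 28, 20, 33, 34)
second_query = first_query + timedelta(hours=1)

print(billable_hours(launched, None, period_start, first_query))   # 1.0
print(billable_hours(launched, None, period_start, second_query))  # 2.0
```

Under these assumptions, each later query reports more hours even though the instance was offloaded shortly after launch.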

Steps to reproduce:

1) create an instance
2) shelve it, wait for shelved_offloaded state
3) run "openstack usage show --project <project>" multiple times and notice the metrics increasing

Expected result: Metrics for a shelved_offloaded instance shouldn't accrue while it is in that state.

Tags: sts
tags: added: sts
melanie witt (melwitt) wrote :

This behavior is expected/intentional, but it has been like this for many years and it's possible we may want to revisit the logic.

Historically, shelved offloaded instances have counted toward usage and quota. However, as you noted in #openstack-nova today, when [quota]/count_usage_from_placement = True, shelved offloaded instances will *not* be counted against quota [1]:

"Behavior will be different for servers in SHELVED_OFFLOADED state. A server in SHELVED_OFFLOADED state will not have placement allocations, so it will not consume quota usage for cores and ram. Note that because of this, it will be possible for a request to unshelve a server to be rejected if the user does not have enough quota available to support the cores and ram needed by the server to be unshelved."

[1] https://docs.openstack.org/nova/latest/admin/quotas.html#id2

Changed in nova:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
status: New → In Progress
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

There is a small problem with implementing this fix: how to detect when to stop counting the usage.

For a regular instance, counting stops at the terminated_at value, which is the timestamp of when the instance is completely deleted.

Shelved instances do not have the terminated_at field filled. They have the field shelved_at in instance.system_metadata, which is the timestamp of when the instance was shelved, not of when it was offloaded. A shelved instance may be offloaded after some time (or never), according to the "shelved_offload_time" config option of the computes (which in some deployments may not be accessible to the API service, though it could be queried). There is also the potential extra time from "shelved_poll_interval", which defaults to 1h. So an instance can potentially have 1 extra hour of usage time depending on polling timing, which is impossible for the API to calculate accurately.
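
To make the uncertainty concrete, here is a rough sketch of the window in which the offload could have happened, given only shelved_at and the two config values. The function offload_window is hypothetical, the timestamps are made up, and it assumes that offloading is done by a periodic poll and that shelved_offload_time is a positive number of seconds (special values are not modeled):

```python
from datetime import datetime, timedelta
from typing import Tuple


def offload_window(shelved_at: datetime,
                   shelved_offload_time: int,
                   shelved_poll_interval: int = 3600) -> Tuple[datetime, datetime]:
    """Return (earliest, latest) possible offload time for a shelved instance.

    Hypothetical helper. Assumes the instance becomes eligible for offload
    shelved_offload_time seconds after shelved_at, and that the next poll
    happens at most shelved_poll_interval seconds after that.
    """
    if shelved_offload_time <= 0:
        # Special values (e.g. "never offload") are deliberately not modeled.
        raise ValueError("sketch only handles a positive shelved_offload_time")
    earliest = shelved_at + timedelta(seconds=shelved_offload_time)
    latest = earliest + timedelta(seconds=shelved_poll_interval)
    return earliest, latest


shelved_at = datetime(2021, 1, 28, 19, 34, 10)
earliest, latest = offload_window(shelved_at, shelved_offload_time=7200)
print(earliest)           # 2021-01-28 21:34:10
print(latest)             # 2021-01-28 22:34:10
print(latest - earliest)  # 1:00:00
```

The API can bound the offload time to this window at best; it cannot tell where inside the window the offload actually occurred.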

Looking at the code, the shelve method powers down the instance, but for some reason the shelve_offload_instance method also powers down the already shelved instance. I'm unaware of a scenario where a shelved instance may be running and using resources. I guess it will depend on how we would like to track usage. Essentially, a shelved, non-offloaded instance is the same as a powered-down instance (with a snapshot taken), which is currently tracked the same as a running instance, so to be consistent, it should be tracked.

Looking further at the code, the purpose of the shelved_at value is just so that the offload polling can check whether the amount of time specified in shelved_offload_time has passed to offload it, when the instance status is SHELVED.

Therefore there are 2 alternatives:
1) Create a new timestamp field for when the instance is offloaded.
2) Overload the shelved_at field setting a new value when the instance is successfully offloaded, given its meaning depends on the instance status field. It will only be read by the Usage API when the instance status is SHELVED_OFFLOADED.
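
A minimal sketch of how alternative 2 might look from the Usage API side, assuming shelved_at has been overwritten with the offload timestamp. The function usage_end and its parameters are hypothetical and are not part of Nova:

```python
from datetime import datetime
from typing import Optional


def usage_end(vm_state: str,
              terminated_at: Optional[datetime],
              shelved_at: Optional[datetime],
              period_stop: datetime) -> datetime:
    """Pick the timestamp at which usage counting should stop.

    Hypothetical helper for alternative 2: shelved_at is only trusted as an
    end-of-usage marker when the instance is in the shelved_offloaded state.
    """
    if terminated_at is not None:
        return min(terminated_at, period_stop)
    if vm_state == "shelved_offloaded" and shelved_at is not None:
        return min(shelved_at, period_stop)
    # Any other state (including plain "shelved") is counted to period end.
    return period_stop


period_stop = datetime(2021, 1, 29)
offloaded_at = datetime(2021, 1, 28, 19, 34, 10)

print(usage_end("shelved_offloaded", None, offloaded_at, period_stop))  # 2021-01-28 19:34:10
print(usage_end("shelved", None, offloaded_at, period_stop))            # 2021-01-29 00:00:00
```

Checking the state first keeps a plain SHELVED instance counted, consistent with how a powered-down instance is treated today.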

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

While working on the fix following alternative #2 described in the comment above, I hit another problem:

Whenever an instance is unshelved, the launched_at field is reset as if it were a new instance. This breaks the "openstack usage show" result, because all the run time accumulated before the shelve is no longer accounted for.
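
A small worked example of the lost time, using made-up timestamps and a hypothetical helper hours_between. It assumes usage is derived from launched_at alone:

```python
from datetime import datetime


def hours_between(start: datetime, end: datetime) -> float:
    """Elapsed hours between two timestamps (hypothetical helper)."""
    return (end - start).total_seconds() / 3600.0


# Illustrative timeline, not taken from the report.
first_launch = datetime(2021, 1, 1, 0, 0)
offloaded_at = datetime(2021, 1, 1, 10, 0)   # ran 10 hours, then shelved and offloaded
unshelved_at = datetime(2021, 1, 2, 0, 0)    # launched_at is reset to this value
query_time = datetime(2021, 1, 2, 5, 0)

# Time the instance actually ran: before the offload plus after the unshelve.
actual = hours_between(first_launch, offloaded_at) + hours_between(unshelved_at, query_time)
# Time visible when only the reset launched_at is available.
reported = hours_between(unshelved_at, query_time)

print(actual)    # 15.0
print(reported)  # 5.0
```

In this example the 10 hours before the shelve disappear from the result once launched_at is overwritten.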

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I logged a new bug, https://bugs.launchpad.net/nova/+bug/1915055, for the one I've hit. However, after discussing it with the nova team in the IRC channel, we came to the conclusion that neither this bug nor the new one is going to be fixed, as use of the usage API is discouraged.

Changed in nova:
status: In Progress → Won't Fix
Changed in nova:
assignee: Rodrigo Barbieri (rodrigo-barbieri2010) → nobody