Inefficient host_status lookup when listing servers with details (regression)

Bug #1830260 reported by Matt Riedemann on 2019-05-23
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Low
Matt Riedemann
Stein
Low
Matt Riedemann

Bug Description

We have a performance regression since Stein [1] when listing servers with details concerning the host_status field. The code used to rely on this method [2] to cache the host status information per host when iterating over a list of instances but now it fetches it per host per instance in the view builder [3]. Granted by default policy this would only affect performance for an admin, but if I'm an admin listing 1000 servers across all tenants using "nova list --all-tenants" (which is going to use a microversion high enough to hit this) it could be a noticeable slow down compared to before Stein.

[1] https://review.opendev.org/#/c/584590/
[2] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a3265f9/nova/compute/api.py#L4926
[3] https://github.com/openstack/nova/blob/c7e9e667426a6d88d396a59cb40d30763a3265f9/nova/api/openstack/compute/views/servers.py#L325

Matt Riedemann (mriedem) on 2019-05-23
Changed in nova:
status: New → Triaged
summary: Inefficient host_status lookup when listing servers with details
+ (regression)
Matt Riedemann (mriedem) wrote :

https://review.opendev.org/#/c/663502/ should probably be reverted when fixing this bug.

tags: added: api
Changed in nova:
importance: Medium → High
Matt Riedemann (mriedem) on 2019-06-18
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) wrote :

To try and get some idea of the performance regression around this I created an 8VCPU/8GB RAM devstack with 4 fake compute nodes and set API_WORKERS=1, and disabled some services I don't need like horizon/tempest/cinder to cut out their load on the host. I then created 150 servers across the 4 hosts:

stack@devstack:~$ openstack server list --host devstack1 | grep test-vm | wc -l
105
stack@devstack:~$ openstack server list --host devstack2 | grep test-vm | wc -l
10
stack@devstack:~$ openstack server list --host devstack3 | grep test-vm | wc -l
15
stack@devstack:~$ openstack server list --host devstack4 | grep test-vm | wc -l
20

Listing servers with the 2.16 microversion to get the host_status field takes somewhere between 1 and 2 seconds:

http://paste.openstack.org/show/753145/

I'll then try to fix this bug and compare the timings with the fix.

Fix proposed to branch: master
Review: https://review.opendev.org/666042

Changed in nova:
status: Triaged → In Progress
Matt Riedemann (mriedem) wrote :

With the patch applied, listing 150 servers as admin with microversion 2.16 didn't seem to make much difference. I'm going to try and get up to 500 servers and see if that is more noticeable.

Matt Riedemann (mriedem) wrote :

This time with the patch and 600 test servers it takes between 3.2 and 3.7 seconds to list servers with details as admin using 2.16:

http://paste.openstack.org/show/753149/

Without the patch I'm not getting much different timings:

http://paste.openstack.org/show/753150/

Matt Riedemann (mriedem) wrote :

This time with 1000 servers:

$ openstack --os-compute-api-version 2.73 server list | grep test-vm | wc -l
1000

Without the patch here are the timings over 10 runs:

http://paste.openstack.org/show/753152/

The average is 5.7822576.

With the patch here are the timings over 10 runs:

http://paste.openstack.org/show/753153/

The average is 5.2145055 which is better but not very noticeable, especially in an unrealistic scenario like I have where I have hundreds of servers on a single compute node.

Based on this I'll drop the severity on this bug to low.

Changed in nova:
importance: High → Low
Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric Fried (efried)
Eric Fried (efried) on 2019-07-03
Changed in nova:
assignee: Eric Fried (efried) → Matt Riedemann (mriedem)

Reviewed: https://review.opendev.org/666042
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ab7d923ae7ed4d7525649bc16ff012c1bedca0f5
Submitter: Zuul
Branch: master

commit ab7d923ae7ed4d7525649bc16ff012c1bedca0f5
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 18 11:13:32 2019 -0400

    Fix GET /servers/detail host_status performance regression

    Change I82b11b8866ac82b05eae04351605d52fa8b91453 moved the
    host_status extended server attribute processing from an
    extension to the main servers view builder. This, however,
    caused a regression in the detailed listing of servers because
    it didn't incorporate the caching mechanism used previously
    by the extension so now for each server with details when
    microversion 2.16 or greater is used (and the request passes
    the policy check), we get the host status per server even if
    we have multiple servers on the same host.

    This moves the host_status processing out of the show() method
    when listing servers with details and processes them in aggregate
    similar to security groups and attached volumes.

    One catch is the show() method handles instances from down cells
    for us so we have to handle that separately in the new host_status
    processing, but it's trivial (just don't get host_status for
    instances without a host field set).

    This reverts commit 0cecd2ac324dc9a20e7d60987b9ef6e5082c426d.

    Change-Id: I8278d4ea993ed1600919e34c9759600c8c7dbb41
    Closes-Bug: #1830260

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/669958
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ef10d8d9a67855895f4ea9f9c49e1976b84e1547
Submitter: Zuul
Branch: stable/stein

commit ef10d8d9a67855895f4ea9f9c49e1976b84e1547
Author: Matt Riedemann <email address hidden>
Date: Tue Jun 18 11:13:32 2019 -0400

    Fix GET /servers/detail host_status performance regression

    Change I82b11b8866ac82b05eae04351605d52fa8b91453 moved the
    host_status extended server attribute processing from an
    extension to the main servers view builder. This, however,
    caused a regression in the detailed listing of servers because
    it didn't incorporate the caching mechanism used previously
    by the extension so now for each server with details when
    microversion 2.16 or greater is used (and the request passes
    the policy check), we get the host status per server even if
    we have multiple servers on the same host.

    This moves the host_status processing out of the show() method
    when listing servers with details and processes them in aggregate
    similar to security groups and attached volumes.

    One catch is the show() method handles instances from down cells
    for us so we have to handle that separately in the new host_status
    processing, but it's trivial (just don't get host_status for
    instances without a host field set).

    NOTE(mriedem): This backport does not revert commit
    0cecd2ac324dc9a20e7d60987b9ef6e5082c426d since that
    change was only in Train.

    Change-Id: I8278d4ea993ed1600919e34c9759600c8c7dbb41
    Closes-Bug: #1830260
    (cherry picked from commit ab7d923ae7ed4d7525649bc16ff012c1bedca0f5)

This issue was fixed in the openstack/nova 19.0.2 release.

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers