hypervisor statistics could be incorrect

Bug #1692397 reported by Zhenyu Zheng on 2017-05-22
This bug affects 1 person
Affects                    Importance  Assigned to
OpenStack Compute (nova)   Low         Zhenyu Zheng
  Newton                   Low         Matt Riedemann
  Ocata                    Low         Matt Riedemann
Ubuntu Cloud Archive       Undecided   Unassigned
  Mitaka                   Low         Unassigned
nova (Ubuntu)              Undecided   Unassigned
  Xenial                   Low         Unassigned

Bug Description

[Impact]

If you deploy a nova-compute service to a node, delete that service via the API, and then deploy a new nova-compute service to the same node (i.e. with the same hostname), the database ends up with two service records for that host: one marked as deleted and one not. That is harmless until you run 'openstack hypervisor stats show', at which point the API aggregates the resource counts from both service records. The fix has been backported only as far back as Newton, so the problem still exists on Mitaka. I assume the patch was not backported to Mitaka because the code in nova.db.sqlalchemy.api.compute_node_statistics() changed quite a bit. However, fixing the issue in the old code only requires a one-line change that does the same thing as the new code.

[Test Case]

 * Deploy Mitaka with bundle http://pastebin.ubuntu.com/25968008/

 * Do 'openstack hypervisor stats show' and verify that count is 3

 * Do 'juju remove-unit nova-compute/2' to delete a compute service but not its physical host

 * Do 'openstack compute service delete <id>' to delete a compute service we just removed (choosing correct id)

 * Do 'openstack hypervisor stats show' and verify that count is 2

 * Do 'juju add-unit nova-compute --to <machine id of deleted unit>'

 * Do 'openstack hypervisor stats show' and verify that count is 3 (not 4 as it would be before fix)
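
The count checks in the steps above could be scripted. The following is only a sketch: the helper names are hypothetical, and it assumes the openstack client is on the PATH with credentials loaded and that it accepts the generic '-f json' output option.

```python
import json
import subprocess

# Assumption: the client supports the generic "-f json" formatter option.
STATS_CMD = ["openstack", "hypervisor", "stats", "show", "-f", "json"]


def hypervisor_count(stats_json):
    """Extract the 'count' field from the JSON emitted by the stats command."""
    return int(json.loads(stats_json)["count"])


def assert_hypervisor_count(expected):
    """Run the stats command and fail if the count differs from `expected`."""
    result = subprocess.run(
        STATS_CMD, check=True, capture_output=True, text=True
    )
    actual = hypervisor_count(result.stdout)
    if actual != expected:
        raise AssertionError("expected count %d, got %d" % (expected, actual))
    return actual
```

With this in place, the test case would call assert_hypervisor_count(3) after the initial deploy, assert_hypervisor_count(2) after the service delete, and assert_hypervisor_count(3) again after the add-unit step.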

[Regression Potential]

None anticipated other than for clients that were interpreting invalid counts as correct.

[Other Info]

===========================================================================

Hypervisor statistics could be incorrect:

When we kill a nova-compute service, delete the service from the nova DB, and then
start the nova-compute service again, the result of the hypervisor statistics API
(nova hypervisor-stats) is incorrect.

How to reproduce:

Step1. Check the correct statistics before we do anything:
root@SZX1000291919:/opt/stack/nova# nova hypervisor-stats
+----------------------+-------+
| Property | Value |
+----------------------+-------+
| count | 1 |
| current_workload | 0 |
| disk_available_least | 14 |
| free_disk_gb | 34 |
| free_ram_mb | 6936 |
| local_gb | 35 |
| local_gb_used | 1 |
| memory_mb | 7960 |
| memory_mb_used | 1024 |
| running_vms | 1 |
| vcpus | 8 |
| vcpus_used | 1 |
+----------------------+-------+

Step2. Kill the compute service:
root@SZX1000291919:/var/log/nova# ps -ef | grep nova-com
root 120419 120411 0 11:06 pts/27 00:00:00 sg libvirtd /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log
root 120420 120419 0 11:06 pts/27 00:00:07 /usr/bin/python /usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log

root@SZX1000291919:/var/log/nova# kill -9 120419
root@SZX1000291919:/var/log/nova# /usr/local/bin/stack: line 19: 120419 Killed sg libvirtd '/usr/local/bin/nova-compute --config-file /etc/nova/nova.conf --log-file /var/log/nova/nova-compute.log' > /dev/null 2>&1

root@SZX1000291919:/var/log/nova# nova service-list
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| 4 | nova-conductor | SZX1000291919 | internal | enabled | up | 2017-05-22T03:24:36.000000 | - |
| 6 | nova-scheduler | SZX1000291919 | internal | enabled | up | 2017-05-22T03:24:36.000000 | - |
| 7 | nova-consoleauth | SZX1000291919 | internal | enabled | up | 2017-05-22T03:24:37.000000 | - |
| 8 | nova-compute | SZX1000291919 | nova | enabled | down | 2017-05-22T03:23:38.000000 | - |
| 9 | nova-cert | SZX1000291919 | internal | enabled | down | 2017-05-17T02:50:13.000000 | - |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+

Step3. Delete the service from DB:

root@SZX1000291919:/var/log/nova# nova service-delete 8
root@SZX1000291919:/var/log/nova# nova service-list
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| 4 | nova-conductor | SZX1000291919 | internal | enabled | up | 2017-05-22T03:25:16.000000 | - |
| 6 | nova-scheduler | SZX1000291919 | internal | enabled | up | 2017-05-22T03:25:16.000000 | - |
| 7 | nova-consoleauth | SZX1000291919 | internal | enabled | up | 2017-05-22T03:25:17.000000 | - |
| 9 | nova-cert | SZX1000291919 | internal | enabled | down | 2017-05-17T02:50:13.000000 | - |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+

Step4. Start the compute service again:
root@SZX1000291919:/var/log/nova# nova service-list
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| Id | Binary | Host | Zone | Status | State | Updated_at | Disabled Reason |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+
| 4 | nova-conductor | SZX1000291919 | internal | enabled | up | 2017-05-22T03:48:55.000000 | - |
| 6 | nova-scheduler | SZX1000291919 | internal | enabled | up | 2017-05-22T03:48:56.000000 | - |
| 7 | nova-consoleauth | SZX1000291919 | internal | enabled | up | 2017-05-22T03:48:56.000000 | - |
| 9 | nova-cert | SZX1000291919 | internal | enabled | down | 2017-05-17T02:50:13.000000 | - |
| 10 | nova-compute | SZX1000291919 | nova | enabled | up | 2017-05-22T03:48:57.000000 | - |
+----+------------------+---------------+----------+---------+-------+----------------------------+-----------------+

Step5. Check the hypervisor statistics again; the result is incorrect:

root@SZX1000291919:/var/log/nova# nova hypervisor-stats
+----------------------+-------+
| Property | Value |
+----------------------+-------+
| count | 2 |
| current_workload | 0 |
| disk_available_least | 28 |
| free_disk_gb | 68 |
| free_ram_mb | 13872 |
| local_gb | 70 |
| local_gb_used | 2 |
| memory_mb | 15920 |
| memory_mb_used | 2048 |
| running_vms | 2 |
| vcpus | 16 |
| vcpus_used | 2 |
+----------------------+-------+
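
To make the discrepancy explicit, the two outputs from Step1 and Step5 can be compared property by property. This is a small sketch using only the values shown in this report; the function name is made up for illustration.

```python
# Values copied from the Step1 and Step5 tables in this report.
before = {
    "count": 1, "current_workload": 0, "disk_available_least": 14,
    "free_disk_gb": 34, "free_ram_mb": 6936, "local_gb": 35,
    "local_gb_used": 1, "memory_mb": 7960, "memory_mb_used": 1024,
    "running_vms": 1, "vcpus": 8, "vcpus_used": 1,
}
after = {
    "count": 2, "current_workload": 0, "disk_available_least": 28,
    "free_disk_gb": 68, "free_ram_mb": 13872, "local_gb": 70,
    "local_gb_used": 2, "memory_mb": 15920, "memory_mb_used": 2048,
    "running_vms": 2, "vcpus": 16, "vcpus_used": 2,
}


def inflation_factors(before, after):
    """Return after/before for every property with a non-zero baseline."""
    return {k: after[k] / before[k] for k in before if before[k] != 0}


factors = inflation_factors(before, after)
for name, factor in sorted(factors.items()):
    print("%-22s x%.1f" % (name, factor))
```

Every property with a non-zero baseline comes out multiplied by the same factor, which is what you would expect if the single compute node row were being counted once per matching service record.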

Changed in nova:
assignee: nobody → Zhenyu Zheng (zhengzhenyu)
description: updated
description: updated
description: updated

Fix proposed to branch: master
Review: https://review.openstack.org/467220

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem) wrote :

Looks like it doubled everything when you restarted the compute service. Is it reporting 2 compute node records instead of one for that single service (or vice versa)?

Matt Riedemann (mriedem) on 2017-05-25
Changed in nova:
importance: Undecided → Low
Changed in nova:
assignee: Zhenyu Zheng (zhengzhenyu) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/467220
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3
Submitter: Jenkins
Branch: master

commit 3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3
Author: Kevin_Zheng <email address hidden>
Date: Tue May 23 20:28:28 2017 +0800

    Exclude deleted service records when calling hypervisor statistics

    Hypervisor statistics could be incorrect if not
    exclude deleted service records from DB.

    User may stop 'nova-compute' service on some
    compute nodes and delete the service from nova.
    When delete 'nova-compute' service, it performs
    'soft-delete' to the corresponding db records in
    both 'service' table and 'compute_nodes' table if
    the compute_nodes record is old, i.e. it is linked
    to the service record. For modern compute_nodes
    records, they aren't linked to the services table
    so deleting the services record will not delete
    the compute_nodes record, and the ResourceTracker
    won't recreate the compute_nodes record if the host
    and hypervisor_hostname still match the existing
    record, but restarting the process after deleting
    the service will create a new services table record
    with the same host/binary/topic.

    If the 'nova-compute' service on that server
    re-starts, it will automatically add a record
    in 'compute_nodes' table (assuming it was deleted
    because it was an old-style record) and also a corresponding
    record in 'service' table, and if the host name
    of the compute node did not change, the newly
    created records in 'service' and 'compute_nodes'
    table will be identical to the previously soft-deleted
    records except the deleted row.

    When calling Hypervisor-statistics, the DB layer
    joined records across the whole deployment by
    comparing records' host field selected from
    service table and records' host field selected
    from compute_nodes table, and the calculated
    results could be multiplied if multiple records
    from service table have the same host field,
    and this scenario could happen if user perform
    the above actions.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: I9dfa15f69f8ef9c6cb36b2734a8601bd73e9d6b3
    Closes-Bug: #1692397
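
The join behaviour described in the commit message can be reproduced outside nova. The sketch below uses an in-memory SQLite database with a heavily simplified, assumed schema (only a handful of columns, and a hand-written query rather than nova's actual SQLAlchemy code). It models the case where the compute_nodes record survives and two services rows share the same host.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE services (
        id INTEGER PRIMARY KEY, host TEXT, "binary" TEXT, deleted INTEGER);
    CREATE TABLE compute_nodes (
        id INTEGER PRIMARY KEY, host TEXT, vcpus INTEGER, memory_mb INTEGER,
        deleted INTEGER);
    -- Service 8 was soft-deleted (non-zero 'deleted'); service 10 replaced it.
    INSERT INTO services VALUES (8, 'SZX1000291919', 'nova-compute', 8);
    INSERT INTO services VALUES (10, 'SZX1000291919', 'nova-compute', 0);
    -- The single compute node record survives the service deletion.
    INSERT INTO compute_nodes VALUES (1, 'SZX1000291919', 8, 7960, 0);
    """
)

# Simplified stand-in for the statistics query: join on host only.
BASE_QUERY = """
    SELECT COUNT(cn.id), SUM(cn.vcpus), SUM(cn.memory_mb)
    FROM compute_nodes AS cn
    JOIN services AS s ON cn.host = s.host
    WHERE s."binary" = 'nova-compute' AND cn.deleted = 0
"""

# Without a filter on services.deleted, the node matches both service rows.
buggy = conn.execute(BASE_QUERY).fetchone()
# Excluding soft-deleted service rows leaves exactly one match per node.
fixed = conn.execute(BASE_QUERY + " AND s.deleted = 0").fetchone()

print("without filter:", buggy)
print("with filter:   ", fixed)
```

The extra condition on the services table is the illustrative equivalent of excluding deleted service records; the exact form it takes in nova's DB API may differ.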

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) on 2017-05-26
Changed in nova:
assignee: Matt Riedemann (mriedem) → Zhenyu Zheng (zhengzhenyu)

Reviewed: https://review.openstack.org/468526
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=74e2a400b2ea3b011f88a7dbd8bb0fa3547b3bfa
Submitter: Jenkins
Branch: stable/ocata

commit 74e2a400b2ea3b011f88a7dbd8bb0fa3547b3bfa
Author: Kevin_Zheng <email address hidden>
Date: Tue May 23 20:28:28 2017 +0800

    Exclude deleted service records when calling hypervisor statistics

    Hypervisor statistics could be incorrect if not
    exclude deleted service records from DB.

    User may stop 'nova-compute' service on some
    compute nodes and delete the service from nova.
    When delete 'nova-compute' service, it performs
    'soft-delete' to the corresponding db records in
    both 'service' table and 'compute_nodes' table if
    the compute_nodes record is old, i.e. it is linked
    to the service record. For modern compute_nodes
    records, they aren't linked to the services table
    so deleting the services record will not delete
    the compute_nodes record, and the ResourceTracker
    won't recreate the compute_nodes record if the host
    and hypervisor_hostname still match the existing
    record, but restarting the process after deleting
    the service will create a new services table record
    with the same host/binary/topic.

    If the 'nova-compute' service on that server
    re-starts, it will automatically add a record
    in 'compute_nodes' table (assuming it was deleted
    because it was an old-style record) and also a corresponding
    record in 'service' table, and if the host name
    of the compute node did not change, the newly
    created records in 'service' and 'compute_nodes'
    table will be identical to the previously soft-deleted
    records except the deleted row.

    When calling Hypervisor-statistics, the DB layer
    joined records across the whole deployment by
    comparing records' host field selected from
    service table and records' host field selected
    from compute_nodes table, and the calculated
    results could be multiplied if multiple records
    from service table have the same host field,
    and this scenario could happen if user perform
    the above actions.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: I9dfa15f69f8ef9c6cb36b2734a8601bd73e9d6b3
    Closes-Bug: #1692397
    (cherry picked from commit 3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3)

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.

This issue was fixed in the openstack/nova 15.0.6 release.

Reviewed: https://review.openstack.org/468528
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6dc2a0ec1cfe70bbf0f50a36bca5d2794e34e1b1
Submitter: Jenkins
Branch: stable/newton

commit 6dc2a0ec1cfe70bbf0f50a36bca5d2794e34e1b1
Author: Kevin_Zheng <email address hidden>
Date: Tue May 23 20:28:28 2017 +0800

    Exclude deleted service records when calling hypervisor statistics

    Hypervisor statistics could be incorrect if not
    exclude deleted service records from DB.

    User may stop 'nova-compute' service on some
    compute nodes and delete the service from nova.
    When delete 'nova-compute' service, it performs
    'soft-delete' to the corresponding db records in
    both 'service' table and 'compute_nodes' table if
    the compute_nodes record is old, i.e. it is linked
    to the service record. For modern compute_nodes
    records, they aren't linked to the services table
    so deleting the services record will not delete
    the compute_nodes record, and the ResourceTracker
    won't recreate the compute_nodes record if the host
    and hypervisor_hostname still match the existing
    record, but restarting the process after deleting
    the service will create a new services table record
    with the same host/binary/topic.

    If the 'nova-compute' service on that server
    re-starts, it will automatically add a record
    in 'compute_nodes' table (assuming it was deleted
    because it was an old-style record) and also a corresponding
    record in 'service' table, and if the host name
    of the compute node did not change, the newly
    created records in 'service' and 'compute_nodes'
    table will be identical to the previously soft-deleted
    records except the deleted row.

    When calling Hypervisor-statistics, the DB layer
    joined records across the whole deployment by
    comparing records' host field selected from
    service table and records' host field selected
    from compute_nodes table, and the calculated
    results could be multiplied if multiple records
    from service table have the same host field,
    and this scenario could happen if user perform
    the above actions.

    Co-Authored-By: Matt Riedemann <email address hidden>

    Change-Id: I9dfa15f69f8ef9c6cb36b2734a8601bd73e9d6b3
    Closes-Bug: #1692397
    (cherry picked from commit 3d3e9cdd774efe96f468f2bcba6c09a40f5e71d3)
    (cherry picked from commit 74e2a400b2ea3b011f88a7dbd8bb0fa3547b3bfa)

This issue was fixed in the openstack/nova 14.0.8 release.

Edward Hope-Morley (hopem) wrote :

SRUing to Xenial/Mitaka (see bug 1719770 for more info)

Changed in nova (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Committed
status: Fix Committed → Fix Released
tags: added: sts sts-sru-needed
Edward Hope-Morley (hopem) wrote :
description: updated
tags: added: sts-sponsor
Changed in nova (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Low
Corey Bryant (corey.bryant) wrote :

Thanks Edward. I've uploaded to xenial [1] and it is now awaiting review by the SRU team.

[1] https://launchpad.net/ubuntu/xenial/+queue?queue_state=1&queue_text=

Hello Zhenyu, or anyone else affected,

Accepted nova into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/nova/2:13.1.4-0ubuntu4.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in nova (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-xenial
Edward Hope-Morley (hopem) wrote :

Verified using testcase in description.

tags: added: verification-done-xenial
removed: verification-needed-xenial
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nova - 2:13.1.4-0ubuntu4.2

---------------
nova (2:13.1.4-0ubuntu4.2) xenial; urgency=medium

  [ Seyeong Kim ]
  * Add supporting http_proxy_to_wsgi to api-paste.ini (LP: #1573766)
    - d/p/0001-Add-http_proxy_to_wsgi-to-api-paste.patch
    - d/p/0002-Add-proxy-middleware-to-application-pipeline.patch

  [ Edward Hope-Morley ]
  * Patch nova.db.sqlalchemy.api.compute_node_statistics() to
    exclude deleted services from stats count. This is the same
    fix as that backported to newton in bug 1692397 except that
    the actual patch is not backportable due to the underlying
    code changing extensively.
    - d/p/exlude-deleted-service-from-stats-count.patch (LP: #1692397)

 -- Corey Bryant <email address hidden> Fri, 08 Dec 2017 15:44:43 -0500

Changed in nova (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Hua Zhang (zhhuabj) wrote :

Hello Zhenyu, or anyone else affected,

Accepted nova into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Edward Hope-Morley (hopem) wrote :

Verified trusty-proposed/mitaka using test case in bug description.

tags: added: sts-sru-done verification-mitaka-done
removed: sts-sru-needed verification-mitaka-needed
tags: removed: sts-sponsor
Corey Bryant (corey.bryant) wrote :

This has also been regression tested successfully for trusty-mitaka-proposed and xenial-mitaka-proposed:

trusty-mitaka-proposed:

======
Totals
======
Ran: 102 tests in 767.0600 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 472.5363 sec.

xenial-mitaka-proposed:

======
Totals
======
Ran: 102 tests in 861.6688 sec.
 - Passed: 94
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 524.2865 sec.

The verification of the Stable Release Update for nova has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package nova - 2:13.1.4-0ubuntu4.2~cloud0
---------------

 nova (2:13.1.4-0ubuntu4.2~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 nova (2:13.1.4-0ubuntu4.2) xenial; urgency=medium
 .
   [ Seyeong Kim ]
   * Add supporting http_proxy_to_wsgi to api-paste.ini (LP: #1573766)
     - d/p/0001-Add-http_proxy_to_wsgi-to-api-paste.patch
     - d/p/0002-Add-proxy-middleware-to-application-pipeline.patch
 .
   [ Edward Hope-Morley ]
   * Patch nova.db.sqlalchemy.api.compute_node_statistics() to
     exclude deleted services from stats count. This is the same
     fix as that backported to newton in bug 1692397 except that
     the actual patch is not backportable due to the underlying
     code changing extensively.
     - d/p/exlude-deleted-service-from-stats-count.patch (LP: #1692397)
