Nova may erronously look up service version of a deleted service, when hostname have been reused

Bug #1778305 reported by Trygve Vea on 2018-06-22
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
melanie witt
Pike
Medium
Matt Riedemann
Queens
Medium
Matt Riedemann
Rocky
Medium
Matt Riedemann
Stein
Medium
Matt Riedemann

Bug Description

Prerequisites:

- A compute node running an old version of nova has been deleted. (In our case, version 9)
- The hostname of said compute node has been reused, and has been upgraded as per normal. (To version 16)
- The services table in the nova database contains both the old and the new node defined, where the deleted one are clearly marked as deleted - and with the old version specified in the version column. The new node also exist, upgraded as it is.
- One has at least one instance running on the upgraded node.
- Perform upgrade from ocata to pike
- Any projects with instances running on the upgraded node, may erronously get an error message that "ERROR (BadRequest): This service is older (v9) than the minimum (v16) version of the rest of the deployment. Unable to continue. (HTTP 400) (Request-ID: req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807)" --- when performing 'nova list'.

Example of how this may look in the database:

MariaDB [nova]> SELECT * FROM services WHERE host = 'node11.acme.org';
+---------------------+---------------------+---------------------+-----+-----------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+
| created_at | updated_at | deleted_at | id | host | binary | topic | report_count | disabled | deleted | disabled_reason | last_seen_up | forced_down | version | uuid |
+---------------------+---------------------+---------------------+-----+-----------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+
| 2017-10-17 13:06:10 | 2018-06-22 21:42:42 | NULL | 179 | node11.acme.org | nova-compute | compute | 2138069 | 0 | 0 | NULL | 2018-06-22 21:42:42 | 0 | 22 | 63e1cb55-ee00-4cb8-b304-160dd5c45fdd |
| 2016-08-13 08:20:05 | 2016-11-15 00:01:21 | 2016-11-27 15:11:30 | 104 | node11.acme.org | nova-compute | compute | 796220 | 1 | 104 | NULL | 2016-11-15 00:01:21 | 0 | 9 | NULL |
+---------------------+---------------------+---------------------+-----+-----------------+--------------+---------+--------------+----------+---------+-----------------+---------------------+-------------+---------+--------------------------------------+
2 rows in set (0.01 sec)

Removing the old service from the database is an effective workaround for this problem.

Andrey Volkov (avolkov) wrote :

It looks like both Service.get_by_host_and_binary and Service.get_minimum_version should not account deleted services.

Could you please provide a traceback as well?

Matt Riedemann (mriedem) wrote :

Which release are you on? Pike I guess? The traceback would be useful, but I'm guessing we're failing as a result of lazy-loading the instance.services field which does a joinedload in the DB API here:

https://github.com/openstack/nova/blob/stable/pike/nova/db/sqlalchemy/api.py#L1956

Although the backref says that should only pull non-deleted services for instance.services:

https://github.com/openstack/nova/blob/stable/pike/nova/db/sqlalchemy/models.py#L103

So yeah, a stacktrace with the error would be useful here.

Trygve Vea (trygve-vea-gmail) wrote :
Download full text (8.6 KiB)

We're running Pike (nova 16.1.4), on Centos 7.5 with the RDO packages.

This error message didn't come with a stack trace, it was just logged in nova-api.log

Running nova-api in debug mode yielded the following logs when making the call:

2018-06-22 23:26:50.348 22073 DEBUG nova.api.openstack.wsgi [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] Calling method '<bound method ServersController.detail of <nova.api.openstack.compute.servers.ServersController object at 0x7f92cdf14d90>>' _process_stack /usr/lib/python2.7/site-packages/nova/api/openstack/wsgi.py:612
2018-06-22 23:26:50.353 22073 DEBUG nova.compute.api [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] Searching by: {'deleted': False, 'project_id': u'2b633b0b2ca84e748e37b0b12bbb81fc'} get_all /usr/lib/python2.7/site-packages/nova/compute/api.py:2356
2018-06-22 23:26:50.376 22073 DEBUG oslo_db.sqlalchemy.engines [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] MySQL server mode set to STRICT_TRANS_TABLES,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,TRADITIONAL,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION _check_effective_sql_mode /usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/engines.py:285
2018-06-22 23:26:50.400 22073 DEBUG oslo_concurrency.lockutils [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] Lock "00000000-0000-0000-0000-000000000000" acquired by "nova.context.get_or_set_cached_cell_and_set_connections" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:270
2018-06-22 23:26:50.402 22073 DEBUG oslo_concurrency.lockutils [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] Lock "00000000-0000-0000-0000-000000000000" released by "nova.context.get_or_set_cached_cell_and_set_connections" :: held 0.001s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:282
2018-06-22 23:26:50.404 22073 WARNING oslo_db.sqlalchemy.engines [req-3e0ababe-e09b-4ef8-ba3a-43060bc1f807 f5aa981c393d716b7a4d6d4407a5e29445732b73c062edab167017160a65b680 2b633b0b2ca84e748e37b0b12bbb81fc - 5039e916e83448099a0de89c349fdaf8 5039e916e83448099a0de89c349fdaf8] URL mysql://nova:***@ctrl.acme.org/nova_cell0 does not contain a '+drivername' portion, and will make use of a default driver. A full dbname+drivername:// protocol is recommended. For MySQL, it is strongly recommended that mysql+pymysql:// be specified for maximum service compatibility
2018-06-2...

Read more...

melanie witt (melwitt) wrote :

Thanks for including the debug log.

This reminded me of another bug related to upgrading from Ocata => Pike and getting UUIDs generated for services for hosts on which instances are residing [1]. We can see the UUID generation happening in your nova-api log "Generated UUID 2dd86bf1-a805-42ef-9fb8-c94c9c3408fe for service 104 _from_db_object /usr/lib/python2.7/site-packages/nova/objects/service.py:245".

Having a look around the code, it looks like you have a situation where some of your instances are still associated in the database with the old, deleted=0 service record. In the link from Matt [2] about the table join, it's only filtering out Instance.deleted=0, *not* Service.deleted=0. So it looks like there's nothing preventing a join of the 'instances' and 'services' tables including deleted services.

During a 'nova list', the services associated with the hosts on which the instances reside will be queried and if they lack a UUID, one will be generated and saved. During the save [3] is where we hit the ServiceTooOld exception (which is logged as "This service is older (v9) than the minimum (v16) version of the rest of the deployment. Unable to continue.").

So, it seems like we need to do something to avoid pulling deleted services when we join the 'instances' and 'services' tables.

I'm also reminded of another bug related to this where we made a change to block deletion of a service that is still hosting instances [4]. That means it will force you to have to migrate or delete your instances associated with that compute host before allowing the service to be deleted. That patch landed in stable/pike a few days ago and has *not* yet been released.

[1] https://bugs.launchpad.net/nova/pike/+bug/1746509
[2] https://github.com/openstack/nova/blob/stable/pike/nova/db/sqlalchemy/models.py#L103
[3] https://github.com/openstack/nova/blob/16.1.4/nova/objects/service.py#L244-L246
[4] https://bugs.launchpad.net/nova/+bug/1763183

Related fix proposed to branch: master
Review: https://review.openstack.org/582407

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: New → In Progress
melanie witt (melwitt) on 2019-04-10
Changed in nova:
importance: Undecided → Medium
tags: added: upgrade
Changed in nova:
assignee: melanie witt (melwitt) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2019-07-30
Changed in nova:
assignee: Matt Riedemann (mriedem) → melanie witt (melwitt)

Reviewed: https://review.opendev.org/582407
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1d9aaef3d1c880b02d6b7c186e8de79cbaac9ae0
Submitter: Zuul
Branch: master

commit 1d9aaef3d1c880b02d6b7c186e8de79cbaac9ae0
Author: melanie witt <email address hidden>
Date: Thu Jul 12 20:46:24 2018 +0000

    Add functional regression test for bug 1778305

    This tests the scenario where an operator has upgraded from Ocata =>
    Pike but has old, deleted services that share the same hostname with
    currently existing, running services. The current behavior is that
    ServiceTooOld is raised when we try to generate a UUID for the old,
    deleted service while querying for a list of instances. After the bug
    is fixed, we should be able to successfully query for a list of
    instances.

    Related-Bug: #1778305

    Change-Id: I1c62a0a5be6c44176b5714af00cc181eb5437585

Reviewed: https://review.opendev.org/582408
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=16e163053ca39886f11fdb8a3af10a28619fc105
Submitter: Zuul
Branch: master

commit 16e163053ca39886f11fdb8a3af10a28619fc105
Author: melanie witt <email address hidden>
Date: Thu Jul 12 21:48:23 2018 +0000

    Don't generate service UUID for deleted services

    In Pike, we added a UUID field to services and during an upgrade from
    Ocata => Pike, when instances are accessed joined with their associated
    services, we generate a UUID for the services on-the-fly.

    This causes a problem in the scenario where an operator upgrades their
    cluster and has old, deleted services with hostnames matching existing
    services associated with instances. When we go to generate the service
    UUID for the old, deleted service, we hit a ServiceTooOld exception.

    This addresses the problem by not bothering to generate a UUID for a
    deleted service. One alternative would be to exclude deleted services
    when we join the 'instances' and 'services' tables, but I'm not sure
    whether that approach might cause unintended effects where service
    information that used to be viewable for instances becomes hidden.

    Closes-Bug: #1778305
    Closes-Bug: #1764556

    Change-Id: I347096a527c257075cefe7b81210622f6cd87daf

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/673813
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=06ae1ff21495a51d24376c457f51162e17ae3235
Submitter: Zuul
Branch: stable/stein

commit 06ae1ff21495a51d24376c457f51162e17ae3235
Author: melanie witt <email address hidden>
Date: Thu Jul 12 20:46:24 2018 +0000

    Add functional regression test for bug 1778305

    This tests the scenario where an operator has upgraded from Ocata =>
    Pike but has old, deleted services that share the same hostname with
    currently existing, running services. The current behavior is that
    ServiceTooOld is raised when we try to generate a UUID for the old,
    deleted service while querying for a list of instances. After the bug
    is fixed, we should be able to successfully query for a list of
    instances.

    Related-Bug: #1778305

    Change-Id: I1c62a0a5be6c44176b5714af00cc181eb5437585
    (cherry picked from commit 1d9aaef3d1c880b02d6b7c186e8de79cbaac9ae0)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/673814
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=11fde850e6163b68e006e746406e6c5e6ac980e0
Submitter: Zuul
Branch: stable/stein

commit 11fde850e6163b68e006e746406e6c5e6ac980e0
Author: melanie witt <email address hidden>
Date: Thu Jul 12 21:48:23 2018 +0000

    Don't generate service UUID for deleted services

    In Pike, we added a UUID field to services and during an upgrade from
    Ocata => Pike, when instances are accessed joined with their associated
    services, we generate a UUID for the services on-the-fly.

    This causes a problem in the scenario where an operator upgrades their
    cluster and has old, deleted services with hostnames matching existing
    services associated with instances. When we go to generate the service
    UUID for the old, deleted service, we hit a ServiceTooOld exception.

    This addresses the problem by not bothering to generate a UUID for a
    deleted service. One alternative would be to exclude deleted services
    when we join the 'instances' and 'services' tables, but I'm not sure
    whether that approach might cause unintended effects where service
    information that used to be viewable for instances becomes hidden.

    Closes-Bug: #1778305
    Closes-Bug: #1764556

    Change-Id: I347096a527c257075cefe7b81210622f6cd87daf
    (cherry picked from commit 16e163053ca39886f11fdb8a3af10a28619fc105)

Reviewed: https://review.opendev.org/673818
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=36246000949f29036ed5b9eed563975d063b29c9
Submitter: Zuul
Branch: stable/rocky

commit 36246000949f29036ed5b9eed563975d063b29c9
Author: melanie witt <email address hidden>
Date: Thu Jul 12 20:46:24 2018 +0000

    Add functional regression test for bug 1778305

    This tests the scenario where an operator has upgraded from Ocata =>
    Pike but has old, deleted services that share the same hostname with
    currently existing, running services. The current behavior is that
    ServiceTooOld is raised when we try to generate a UUID for the old,
    deleted service while querying for a list of instances. After the bug
    is fixed, we should be able to successfully query for a list of
    instances.

    Related-Bug: #1778305

    Change-Id: I1c62a0a5be6c44176b5714af00cc181eb5437585
    (cherry picked from commit 1d9aaef3d1c880b02d6b7c186e8de79cbaac9ae0)
    (cherry picked from commit 06ae1ff21495a51d24376c457f51162e17ae3235)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/673821
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=72f9aa720f9137a35d55b9f96e2220d2d0e5588d
Submitter: Zuul
Branch: stable/rocky

commit 72f9aa720f9137a35d55b9f96e2220d2d0e5588d
Author: melanie witt <email address hidden>
Date: Thu Jul 12 21:48:23 2018 +0000

    Don't generate service UUID for deleted services

    In Pike, we added a UUID field to services and during an upgrade from
    Ocata => Pike, when instances are accessed joined with their associated
    services, we generate a UUID for the services on-the-fly.

    This causes a problem in the scenario where an operator upgrades their
    cluster and has old, deleted services with hostnames matching existing
    services associated with instances. When we go to generate the service
    UUID for the old, deleted service, we hit a ServiceTooOld exception.

    This addresses the problem by not bothering to generate a UUID for a
    deleted service. One alternative would be to exclude deleted services
    when we join the 'instances' and 'services' tables, but I'm not sure
    whether that approach might cause unintended effects where service
    information that used to be viewable for instances becomes hidden.

    Closes-Bug: #1778305
    Closes-Bug: #1764556

    Conflicts:
          nova/tests/functional/regressions/test_bug_1764556.py

    NOTE(mriedem): The conflict is due to eadd78efe3 removing the
    func_fixtures import.

    Change-Id: I347096a527c257075cefe7b81210622f6cd87daf
    (cherry picked from commit 16e163053ca39886f11fdb8a3af10a28619fc105)
    (cherry picked from commit 8601ca75b1515e7434f1d3c563a0e65458c8c86a)

Reviewed: https://review.opendev.org/673825
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=57538f0d8d11189b348c5820e88169862e25800a
Submitter: Zuul
Branch: stable/queens

commit 57538f0d8d11189b348c5820e88169862e25800a
Author: melanie witt <email address hidden>
Date: Thu Jul 12 20:46:24 2018 +0000

    Add functional regression test for bug 1778305

    This tests the scenario where an operator has upgraded from Ocata =>
    Pike but has old, deleted services that share the same hostname with
    currently existing, running services. The current behavior is that
    ServiceTooOld is raised when we try to generate a UUID for the old,
    deleted service while querying for a list of instances. After the bug
    is fixed, we should be able to successfully query for a list of
    instances.

    Related-Bug: #1778305

    Change-Id: I1c62a0a5be6c44176b5714af00cc181eb5437585
    (cherry picked from commit 1d9aaef3d1c880b02d6b7c186e8de79cbaac9ae0)
    (cherry picked from commit 06ae1ff21495a51d24376c457f51162e17ae3235)
    (cherry picked from commit 36246000949f29036ed5b9eed563975d063b29c9)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/673827
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=97e78f8cc194c60d06a37b11c9c09405bf87b295
Submitter: Zuul
Branch: stable/queens

commit 97e78f8cc194c60d06a37b11c9c09405bf87b295
Author: melanie witt <email address hidden>
Date: Thu Jul 12 21:48:23 2018 +0000

    Don't generate service UUID for deleted services

    In Pike, we added a UUID field to services and during an upgrade from
    Ocata => Pike, when instances are accessed joined with their associated
    services, we generate a UUID for the services on-the-fly.

    This causes a problem in the scenario where an operator upgrades their
    cluster and has old, deleted services with hostnames matching existing
    services associated with instances. When we go to generate the service
    UUID for the old, deleted service, we hit a ServiceTooOld exception.

    This addresses the problem by not bothering to generate a UUID for a
    deleted service. One alternative would be to exclude deleted services
    when we join the 'instances' and 'services' tables, but I'm not sure
    whether that approach might cause unintended effects where service
    information that used to be viewable for instances becomes hidden.

    Closes-Bug: #1778305
    Closes-Bug: #1764556

    Change-Id: I347096a527c257075cefe7b81210622f6cd87daf
    (cherry picked from commit 16e163053ca39886f11fdb8a3af10a28619fc105)
    (cherry picked from commit 8601ca75b1515e7434f1d3c563a0e65458c8c86a)
    (cherry picked from commit d9094e54fbb6a25c8cf189c4fd9fbb8d66cec145)

This issue was fixed in the openstack/nova 19.0.2 release.

This issue was fixed in the openstack/nova 18.2.2 release.

This issue was fixed in the openstack/nova 17.0.12 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers