Nova services still have RPC version pinning after upgrade

Bug #1833069 reported by Mark Goddard
This bug affects 2 people
Affects                  Status         Importance  Assigned to   Milestone
kolla-ansible            Fix Committed  Medium      Mark Goddard
kolla-ansible (Queens)   Fix Committed  Medium      Mark Goddard
kolla-ansible (Rocky)    Fix Committed  Medium      Mark Goddard
kolla-ansible (Stein)    Fix Released   Medium      Mark Goddard
kolla-ansible (Train)    Fix Committed  Medium      Mark Goddard

Bug Description

During an upgrade, nova pins the version of RPC calls to the minimum seen across all services. This ensures that old services do not receive data they cannot handle. After the upgrade is complete, all nova services are supposed to be reloaded via SIGHUP to cause them to check the RPC versions of services again and use the new latest version, which should now be supported by all running services.

Due to a bug [1] in oslo.service, sending services SIGHUP is currently broken. We replaced the HUP with a restart for the nova_compute container for bug 1821362, but not other nova services. It seems we need to restart all nova services to allow the RPC version pin to be removed.

Testing in a Queens to Rocky upgrade, we find the following in the logs:

Automatically selected compute RPC version 5.0 from minimum service version 30

However, the service version in Rocky is 35.

[1] https://bugs.launchpad.net/oslo.service/+bug/1715374
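
The restart described above might look roughly like the following Ansible tasks. This is an illustrative sketch only, not the actual kolla-ansible code: container names follow kolla-ansible's conventions, and the real role drives containers through its own kolla_docker module rather than the docker CLI.

```yaml
# Sketch: restart every nova container after the upgrade so each service
# re-reads the service versions and drops the RPC version pin.
# Container names assume kolla-ansible's naming conventions.
- name: Restart nova services to remove the RPC version cap
  become: true
  command: docker restart {{ item }}
  loop:
    - nova_api
    - nova_scheduler
    - nova_conductor
    - nova_compute
```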

Mark Goddard (mgoddard)
description: updated
Mark Goddard (mgoddard)
Changed in kolla-ansible:
importance: Undecided → Medium
milestone: none → 9.0.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)

Fix proposed to branch: master
Review: https://review.opendev.org/665660

Changed in kolla-ansible:
assignee: nobody → Mark Goddard (mgoddard)
status: New → In Progress
Revision history for this message
Mark Goddard (mgoddard) wrote :

Turns out there is another thing going on here. When nova services start up, they store their service version in the services table in the database. If the SIGHUP/restart occurs before this has happened, then the version cap will not be removed, since some services will still appear to be running the old version. This is really all about nova-compute's version.

Testing suggests that it takes about 10 seconds for the version to be updated in an AIO setup on Rocky. This is likely to be more time than Ansible allows between these tasks.

Ideally there would be some nova mechanism (an API call or nova-manage command) to wait until all computes are running the latest version, and only restart services after that. In the absence of that, a sleep seems the best option right now.
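
The sleep-then-restart stopgap can be sketched as an Ansible task preceding the restart. This is illustrative only, not the actual kolla-ansible change; the 10-second figure was observed on an AIO setup, so a delay several times that leaves headroom for real deployments.

```yaml
# Sketch (not the actual kolla-ansible tasks): give nova-compute services
# time to record their new service version in the database before the
# restart that removes the RPC version cap. 10s was observed on an AIO
# Rocky setup; a larger value leaves headroom.
- name: Wait for nova services to update their service version
  pause:
    seconds: 30
```

A more robust alternative would be to poll nova's services table until every nova-compute reports the expected version, but nova exposes no supported interface for that check.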

Revision history for this message
Mark Goddard (mgoddard) wrote :

I raised a nova bug about this: https://bugs.launchpad.net/nova/+bug/1833542.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/665660
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=e6d2b92200d02715649d923b0ef2d6981905a6b9
Submitter: Zuul
Branch: master

commit e6d2b92200d02715649d923b0ef2d6981905a6b9
Author: Mark Goddard <email address hidden>
Date: Mon Jun 17 13:48:13 2019 +0100

    Restart all nova services after upgrade

    During an upgrade, nova pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all nova
    services are supposed to be reloaded via SIGHUP to cause them to check
    again the RPC versions of services and use the new latest version which
    should now be supported by all running services.

    Due to a bug [1] in oslo.service, sending services SIGHUP is currently
    broken. We replaced the HUP with a restart for the nova_compute
    container for bug 1821362, but not other nova services. It seems we need
    to restart all nova services to allow the RPC version pin to be removed.

    Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service
    version 30

    However, the service version in Rocky is 35.

    There is a second issue in that it takes some time for the upgraded
    services to update the nova services database table with their new
    version. We need to wait until all nova-compute services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in nova available for
    checking these versions [2], so as a workaround we use a configurable
    delay with a default duration of 30 seconds. Testing showed it takes
    about 10 seconds for the version to be updated, so this gives us some
    headroom.

    This change restarts all nova services after an upgrade, after a 30
    second delay.

    [1] https://bugs.launchpad.net/oslo.service/+bug/1715374
    [2] https://bugs.launchpad.net/nova/+bug/1833542

    Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
    Closes-Bug: #1833069
    Related-Bug: #1833542

Changed in kolla-ansible:
status: In Progress → Fix Released
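
Since the merged fix makes the delay configurable, operators with many or slow compute nodes can raise it in globals.yml. The variable name below is hypothetical, chosen for illustration; check the merged review above for the actual name in your release.

```yaml
# globals.yml (sketch) -- the variable name is hypothetical, see the
# merged review for the real one; the shipped default is 30 seconds.
nova_services_post_upgrade_restart_delay: 60
```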
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/667934

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/667936

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/667937

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/stein)

Reviewed: https://review.opendev.org/667934
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=96a96a26557092971604441a53f74a0cb98979a5
Submitter: Zuul
Branch: stable/stein

commit 96a96a26557092971604441a53f74a0cb98979a5
Author: Mark Goddard <email address hidden>
Date: Mon Jun 17 13:48:13 2019 +0100

    Restart all nova services after upgrade

    During an upgrade, nova pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all nova
    services are supposed to be reloaded via SIGHUP to cause them to check
    again the RPC versions of services and use the new latest version which
    should now be supported by all running services.

    Due to a bug [1] in oslo.service, sending services SIGHUP is currently
    broken. We replaced the HUP with a restart for the nova_compute
    container for bug 1821362, but not other nova services. It seems we need
    to restart all nova services to allow the RPC version pin to be removed.

    Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service
    version 30

    However, the service version in Rocky is 35.

    There is a second issue in that it takes some time for the upgraded
    services to update the nova services database table with their new
    version. We need to wait until all nova-compute services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in nova available for
    checking these versions [2], so as a workaround we use a configurable
    delay with a default duration of 30 seconds. Testing showed it takes
    about 10 seconds for the version to be updated, so this gives us some
    headroom.

    This change restarts all nova services after an upgrade, after a 30
    second delay.

    [1] https://bugs.launchpad.net/oslo.service/+bug/1715374
    [2] https://bugs.launchpad.net/nova/+bug/1833542

    Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
    Closes-Bug: #1833069
    Related-Bug: #1833542
    (cherry picked from commit e6d2b92200d02715649d923b0ef2d6981905a6b9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 8.0.0.0rc2

This issue was fixed in the openstack/kolla-ansible 8.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/rocky)

Reviewed: https://review.opendev.org/667936
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=e4b550e53756b097f893a2e44e82e6403d8d7535
Submitter: Zuul
Branch: stable/rocky

commit e4b550e53756b097f893a2e44e82e6403d8d7535
Author: Mark Goddard <email address hidden>
Date: Mon Jun 17 13:48:13 2019 +0100

    Restart all nova services after upgrade

    During an upgrade, nova pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all nova
    services are supposed to be reloaded via SIGHUP to cause them to check
    again the RPC versions of services and use the new latest version which
    should now be supported by all running services.

    Due to a bug [1] in oslo.service, sending services SIGHUP is currently
    broken. We replaced the HUP with a restart for the nova_compute
    container for bug 1821362, but not other nova services. It seems we need
    to restart all nova services to allow the RPC version pin to be removed.

    Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service
    version 30

    However, the service version in Rocky is 35.

    There is a second issue in that it takes some time for the upgraded
    services to update the nova services database table with their new
    version. We need to wait until all nova-compute services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in nova available for
    checking these versions [2], so as a workaround we use a configurable
    delay with a default duration of 30 seconds. Testing showed it takes
    about 10 seconds for the version to be updated, so this gives us some
    headroom.

    This change restarts all nova services after an upgrade, after a 30
    second delay.

    [1] https://bugs.launchpad.net/oslo.service/+bug/1715374
    [2] https://bugs.launchpad.net/nova/+bug/1833542

    Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
    Closes-Bug: #1833069
    Related-Bug: #1833542
    (cherry picked from commit e6d2b92200d02715649d923b0ef2d6981905a6b9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/queens)

Reviewed: https://review.opendev.org/667937
Committed: https://git.openstack.org/cgit/openstack/kolla-ansible/commit/?id=621a4d6fd3c45457da3433c91a3a1de4daa14260
Submitter: Zuul
Branch: stable/queens

commit 621a4d6fd3c45457da3433c91a3a1de4daa14260
Author: Mark Goddard <email address hidden>
Date: Mon Jun 17 13:48:13 2019 +0100

    Restart all nova services after upgrade

    During an upgrade, nova pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all nova
    services are supposed to be reloaded via SIGHUP to cause them to check
    again the RPC versions of services and use the new latest version which
    should now be supported by all running services.

    Due to a bug [1] in oslo.service, sending services SIGHUP is currently
    broken. We replaced the HUP with a restart for the nova_compute
    container for bug 1821362, but not other nova services. It seems we need
    to restart all nova services to allow the RPC version pin to be removed.

    Testing in a Queens to Rocky upgrade, we find the following in the logs:

    Automatically selected compute RPC version 5.0 from minimum service
    version 30

    However, the service version in Rocky is 35.

    There is a second issue in that it takes some time for the upgraded
    services to update the nova services database table with their new
    version. We need to wait until all nova-compute services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in nova available for
    checking these versions [2], so as a workaround we use a configurable
    delay with a default duration of 30 seconds. Testing showed it takes
    about 10 seconds for the version to be updated, so this gives us some
    headroom.

    This change restarts all nova services after an upgrade, after a 30
    second delay.

    [1] https://bugs.launchpad.net/oslo.service/+bug/1715374
    [2] https://bugs.launchpad.net/nova/+bug/1833542

    Change-Id: Ia6fc9011ee6f5461f40a1307b72709d769814a79
    Closes-Bug: #1833069
    Related-Bug: #1833542
    (cherry picked from commit e6d2b92200d02715649d923b0ef2d6981905a6b9)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 6.2.2

This issue was fixed in the openstack/kolla-ansible 6.2.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 7.1.2

This issue was fixed in the openstack/kolla-ansible 7.1.2 release.

Revision history for this message
Zdenek Dvorak (zdenek-dvorak) wrote :

Hello,
I have a question about this bug fix. We are testing an upgrade to Rocky. Our testbed consists of 3 nodes (1 controller and 2 computes). The task "[nova : Restart nova services to remove RPC version cap]" starts and finishes successfully, but only on the compute nodes. This results in a restart of the nova-compute containers only. The containers running on the controller node are NOT restarted (nova-api, nova-scheduler and nova-conductor in our test lab). Therefore the RPC version is not updated and the original 4.x version remains in use. This fix will run fine on an all-in-one deployment. Have you also tested on a multi-node deployment?
What do you think about this use case?

Regards Zdenek

Revision history for this message
Mark Goddard (mgoddard) wrote :

Hello Zdenek, I also noticed this issue as I have been making some large changes to the nova deployment. My test case for the original bug was a baremetal system, where the nova-compute service runs on the controllers, so I did not see the issue. Could you raise a new bug for this new issue?

Revision history for this message
Zdenek Dvorak (zdenek-dvorak) wrote :

Hello Mark,
I will create a new bug report as you suggested.

Regards Zdenek

Revision history for this message
Zdenek Dvorak (zdenek-dvorak) wrote :

Hello Mark,
New bug report submitted. Please have a look.
https://bugs.launchpad.net/kolla-ansible/+bug/1847990

regards Zdenek

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 9.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 9.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/821862

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/821862
Committed: https://opendev.org/openstack/kolla-ansible/commit/80a32c3c74cad4d46671faff067b274ffed74bba
Submitter: "Zuul (22348)"
Branch: master

commit 80a32c3c74cad4d46671faff067b274ffed74bba
Author: Mark Goddard <email address hidden>
Date: Wed Dec 15 16:07:50 2021 +0000

    cinder: restart services after upgrade

    This patch is roughly an adaptation of
    Ia6fc9011ee6f5461f40a1307b72709d769814a79 for cinder.

    During an upgrade, cinder pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all cinder
    services are supposed to be reloaded to cause them to check again the
    RPC versions of services and use the new latest version which should now
    be supported by all running services.

    There is a second issue in that it takes some time for the upgraded
    services to update the cinder services database table with their new
    version. We need to wait until all cinder services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in cinder available for
    checking these versions, so as a workaround we use a configurable
    delay with a default duration of 30 seconds, as we do for nova.

    This change restarts all cinder services after an upgrade, after a 30
    second delay.

    Closes-Bug: #1954932
    Related-Bug: #1833069

    Change-Id: I9164dc589386d2c2d4daf1bf84061b806ba9988d

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/834298

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/834299

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to kolla-ansible (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/834300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/834300
Committed: https://opendev.org/openstack/kolla-ansible/commit/d2b62b50c958640b905647e03dc3ba649239f353
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit d2b62b50c958640b905647e03dc3ba649239f353
Author: Mark Goddard <email address hidden>
Date: Wed Dec 15 16:07:50 2021 +0000

    cinder: restart services after upgrade

    This patch is roughly an adaptation of
    Ia6fc9011ee6f5461f40a1307b72709d769814a79 for cinder.

    During an upgrade, cinder pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all cinder
    services are supposed to be reloaded to cause them to check again the
    RPC versions of services and use the new latest version which should now
    be supported by all running services.

    There is a second issue in that it takes some time for the upgraded
    services to update the cinder services database table with their new
    version. We need to wait until all cinder services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in cinder available for
    checking these versions, so as a workaround we use a configurable
    delay with a default duration of 30 seconds, as we do for nova.

    This change restarts all cinder services after an upgrade, after a 30
    second delay.

    Closes-Bug: #1954932
    Related-Bug: #1833069

    Change-Id: I9164dc589386d2c2d4daf1bf84061b806ba9988d
    (cherry picked from commit 80a32c3c74cad4d46671faff067b274ffed74bba)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/834299
Committed: https://opendev.org/openstack/kolla-ansible/commit/4d61344c14f348eae3dcc7964c0489e9efe63c1e
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 4d61344c14f348eae3dcc7964c0489e9efe63c1e
Author: Mark Goddard <email address hidden>
Date: Wed Dec 15 16:07:50 2021 +0000

    cinder: restart services after upgrade

    This patch is roughly an adaptation of
    Ia6fc9011ee6f5461f40a1307b72709d769814a79 for cinder.

    During an upgrade, cinder pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all cinder
    services are supposed to be reloaded to cause them to check again the
    RPC versions of services and use the new latest version which should now
    be supported by all running services.

    There is a second issue in that it takes some time for the upgraded
    services to update the cinder services database table with their new
    version. We need to wait until all cinder services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in cinder available for
    checking these versions, so as a workaround we use a configurable
    delay with a default duration of 30 seconds, as we do for nova.

    This change restarts all cinder services after an upgrade, after a 30
    second delay.

    Closes-Bug: #1954932
    Related-Bug: #1833069

    Change-Id: I9164dc589386d2c2d4daf1bf84061b806ba9988d
    (cherry picked from commit 80a32c3c74cad4d46671faff067b274ffed74bba)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/834298
Committed: https://opendev.org/openstack/kolla-ansible/commit/60c80ffacd8ef20959696bcc1fc4ff03c691616a
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 60c80ffacd8ef20959696bcc1fc4ff03c691616a
Author: Mark Goddard <email address hidden>
Date: Wed Dec 15 16:07:50 2021 +0000

    cinder: restart services after upgrade

    This patch is roughly an adaptation of
    Ia6fc9011ee6f5461f40a1307b72709d769814a79 for cinder.

    During an upgrade, cinder pins the version of RPC calls to the minimum
    seen across all services. This ensures that old services do not receive
    data they cannot handle. After the upgrade is complete, all cinder
    services are supposed to be reloaded to cause them to check again the
    RPC versions of services and use the new latest version which should now
    be supported by all running services.

    There is a second issue in that it takes some time for the upgraded
    services to update the cinder services database table with their new
    version. We need to wait until all cinder services have done this
    before the restart is performed, otherwise the RPC version cap will
    remain in place. There is currently no interface in cinder available for
    checking these versions, so as a workaround we use a configurable
    delay with a default duration of 30 seconds, as we do for nova.

    This change restarts all cinder services after an upgrade, after a 30
    second delay.

    Closes-Bug: #1954932
    Related-Bug: #1833069

    Change-Id: I9164dc589386d2c2d4daf1bf84061b806ba9988d
    (cherry picked from commit 80a32c3c74cad4d46671faff067b274ffed74bba)

tags: added: in-stable-xena