Service version check breaks FFU

Bug #1958883 reported by Dan Smith
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Dan Smith
Milestone: (none listed)

Bug Description

As reported on the mailing list:

http://lists.openstack.org/pipermail/openstack-discuss/2022-January/026603.html

The service version check at startup can make fast-forward upgrades (FFUs) impossible without hacking the database. The check was implemented here:

https://review.opendev.org/c/openstack/nova/+/738482

We currently exclude "forced down" computes from the check, but we should probably also exclude those that have been down long enough to miss heartbeats (i.e. offline during the upgrade). However, in a fast-moving FFU where everything is switched from an old container to a new one, the check would easily still find computes that are considered "up", which effectively forces a wait.
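
For illustration only, the filtering idea described above might look roughly like the sketch below. This is not nova's code: `ComputeRecord`, `records_to_check`, and the `down_after` threshold are hypothetical names invented for this example. It assumes a record counts toward the version check only if it is not forced down and its last heartbeat is recent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


# Hypothetical stand-in for a compute service record; not nova's actual model.
@dataclass
class ComputeRecord:
    host: str
    version: int
    forced_down: bool
    last_heartbeat: datetime


def records_to_check(records, now, down_after=timedelta(seconds=60)):
    """Return the records that should count toward the version check.

    Skips computes that are forced down, and computes whose last
    heartbeat is older than ``down_after`` (an assumed threshold).
    """
    return [
        r for r in records
        if not r.forced_down and now - r.last_heartbeat <= down_after
    ]
```

Note that a compute stopped only a few seconds before the new control services start would still pass this filter, which is the "considered up" problem the description mentions.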

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/826097

Dan Smith (danms)
Changed in nova:
assignee: nobody → Dan Smith (danms)
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/826097
Committed: https://opendev.org/openstack/nova/commit/7d2e4815892ddd523c21bf1785cc113981871998
Submitter: "Zuul (22348)"
Branch: master

commit 7d2e4815892ddd523c21bf1785cc113981871998
Author: Dan Smith <email address hidden>
Date: Fri Jan 21 12:51:35 2022 -0800

    Add service version check workaround for FFU

    We recently added a hard failure to nova service startup for the case
    where computes were more than one version old (as indicated by their
    service record). This helps to prevent starting up new control
    services when a very old compute is still running. However, during an
    FFU, control services that have skipped multiple versions will be
    started and find the older compute records (which could not be updated
    yet due to their reliance on the control services being up) and refuse
    to start. This creates a cross-dependency which is not resolvable
    without hacking the database.

    This patch adds a workaround flag to allow turning that hard fail into
    a warning to proceed past the issue. This less-than-ideal solution
    is simple and backportable, but perhaps a better solution can be
    implemented for the future.

    Related-Bug: #1958883

    Change-Id: Iddbc9b2a13f19cea9a996aeadfe891f4ef3b0264
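
As a rough sketch of the behavior the commit message describes (a hard failure that a workaround flag can downgrade to a warning), something like the following could be imagined. All names here (`TooOldComputeService`, `check_compute_versions`, `warn_only`) are hypothetical and chosen for this example; they are not taken from the nova source, and the real option lives in nova's configuration rather than in a function argument.

```python
import logging

LOG = logging.getLogger(__name__)


class TooOldComputeService(Exception):
    """Hypothetical error raised when a compute record is too old."""


def check_compute_versions(compute_versions, oldest_supported, warn_only=False):
    """Check compute service versions at startup.

    Returns True if every version is supported. If some are too old,
    raises TooOldComputeService, unless ``warn_only`` is set, in which
    case a warning is logged and False is returned.
    """
    too_old = [v for v in compute_versions if v < oldest_supported]
    if not too_old:
        return True
    msg = "Oldest compute service version %d is below the supported minimum %d" % (
        min(too_old), oldest_supported)
    if warn_only:
        # Workaround path: log and let startup continue (e.g. during an FFU).
        LOG.warning(msg)
        return False
    raise TooOldComputeService(msg)
```

With `warn_only=False` a stale record stops startup; with `warn_only=True` the same record only produces a log line, which is what lets the new control services come up so the computes can be upgraded afterwards.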

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/xena)

Related fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/831174

melanie witt (melwitt) wrote :

Marking this as Confirmed with Medium priority given that a workaround has been implemented and is now available.

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
tags: added: upgrade
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/831174
Committed: https://opendev.org/openstack/nova/commit/3a5f8924ff42ce3f691f33abf2c2daee88e90fe5
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 3a5f8924ff42ce3f691f33abf2c2daee88e90fe5
Author: Dan Smith <email address hidden>
Date: Fri Jan 21 12:51:35 2022 -0800

    Add service version check workaround for FFU

    We recently added a hard failure to nova service startup for the case
    where computes were more than one version old (as indicated by their
    service record). This helps to prevent starting up new control
    services when a very old compute is still running. However, during an
    FFU, control services that have skipped multiple versions will be
    started and find the older compute records (which could not be updated
    yet due to their reliance on the control services being up) and refuse
    to start. This creates a cross-dependency which is not resolvable
    without hacking the database.

    This patch adds a workaround flag to allow turning that hard fail into
    a warning to proceed past the issue. This less-than-ideal solution
    is simple and backportable, but perhaps a better solution can be
    implemented for the future.

    Related-Bug: #1958883

    Change-Id: Iddbc9b2a13f19cea9a996aeadfe891f4ef3b0264
    (cherry picked from commit 7d2e4815892ddd523c21bf1785cc113981871998)

tags: added: in-stable-xena
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/844202

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/844202
Committed: https://opendev.org/openstack/nova/commit/e8b079a91ee723d0dc45e3d8b80f4efa2c1ce34d
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit e8b079a91ee723d0dc45e3d8b80f4efa2c1ce34d
Author: Dan Smith <email address hidden>
Date: Fri Jan 21 12:51:35 2022 -0800

    Add service version check workaround for FFU

    We recently added a hard failure to nova service startup for the case
    where computes were more than one version old (as indicated by their
    service record). This helps to prevent starting up new control
    services when a very old compute is still running. However, during an
    FFU, control services that have skipped multiple versions will be
    started and find the older compute records (which could not be updated
    yet due to their reliance on the control services being up) and refuse
    to start. This creates a cross-dependency which is not resolvable
    without hacking the database.

    This patch adds a workaround flag to allow turning that hard fail into
    a warning to proceed past the issue. This less-than-ideal solution
    is simple and backportable, but perhaps a better solution can be
    implemented for the future.

    Related-Bug: #1958883

    Change-Id: Iddbc9b2a13f19cea9a996aeadfe891f4ef3b0264
    (cherry picked from commit 7d2e4815892ddd523c21bf1785cc113981871998)
    (cherry picked from commit 3a5f8924ff42ce3f691f33abf2c2daee88e90fe5)

tags: added: in-stable-wallaby