Automatic configuration of TSX flag on cmdline

Bug #1916758 reported by David Vallee Delisle
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: David Vallee Delisle

Bug Description

Description
-----------

Fast-forward upgrade from OSP-13 (RHEL-7.9) to OSP-16.2 (RHEL-8.3)
fails[1] during live migration with:

    [...] libvirt.libvirtError: operation failed: guest CPU doesn't
    match specification: missing features: hle,rtm

The failure is due to RHEL-8.3 (destination host) disabling the Intel
"TSX" feature. Disabling TSX also disables the 'hle' and 'rtm' CPU
features.

This was discovered during OSP fast-forward upgrades testing[+] where a
guest was being live-migrated from RHEL-7.9 (with TSX=on) to RHEL-8.3
(breaking change: TSX=off), and the migration failed with the
above-mentioned error.

[+] https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c14 — Live
    migration during OSP16.2 hybrid state from RHEL7.9 to RHEL8.3 not
    working
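
The two features named in the libvirt error can be checked on a host by
reading the "flags" line of /proc/cpuinfo. The following is a minimal
sketch only: the helper name is hypothetical, and libvirt's real CPU
comparison is more involved than a flag lookup.

```python
def missing_tsx_features(cpuinfo_text, required=("hle", "rtm")):
    """Return the required CPU flags that are absent from cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # Lines look like "flags : fpu vme ..."; split on the first colon.
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return [name for name in required if name not in flags]


# Illustrative input; on a real host, pass the contents of /proc/cpuinfo.
sample = "processor : 0\nflags : fpu vme avx2\n"
print(missing_tsx_features(sample))
```

A non-empty result on the destination host, combined with an empty result
on the source host, would be consistent with the migration error above.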

Why?
----

RHEL-8.3 kernel disabled Intel TSX by default, because it is considered
a potential security risk:

    https://bugzilla.redhat.com/show_bug.cgi?id=1828642
    kernel: Disable Intel TSX by default on newer CPUs

Still, it is not acceptable for RHEL-8.3 kernel to break user-space in a
minor RHEL release. (See also:
https://bugzilla.redhat.com/show_bug.cgi?id=1921070#c16)

Workaround for OSP upgrades
---------------------------

This is unpalatable, but unfortunately there's no other option currently:

(1) have a TripleO config attribute that will enable TSX on the
    destination RHEL-8.3 host; set the following in /etc/default/grub:

        GRUB_CMDLINE_LINUX_DEFAULT="[...] tsx=on"

    ... and reboot the 8.3 host;

(2) live-migrate the guests from RHEL-7.9 to the RHEL-8.3 host;

(3) now turn off TSX on the RHEL-8.3 host kernel command-line;
    shut down the guests;

(4) reboot the 8.3 host again, and start the guests.
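
Steps (1) and (3) both amount to rewriting the tsx= argument in
/etc/default/grub. As a rough sketch, assuming the file contains a
double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, the edit could look like
this. The function names are hypothetical and this is not how TripleO
applies kernel arguments.

```python
import re


def set_tsx_arg(cmdline, value):
    """Return cmdline with any existing tsx= argument replaced by tsx=<value>."""
    args = [arg for arg in cmdline.split() if not arg.startswith("tsx=")]
    args.append("tsx=" + value)
    return " ".join(args)


def update_grub_default(text, value):
    """Rewrite the GRUB_CMDLINE_LINUX_DEFAULT line in grub defaults text."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")$', re.MULTILINE)
    return pattern.sub(
        lambda m: m.group(1) + set_tsx_arg(m.group(2), value) + m.group(3),
        text)


# Illustrative content; a real run would read and write /etc/default/grub.
before = 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX_DEFAULT="quiet tsx=off"\n'
print(update_grub_default(before, "on"))
```

The sketch only changes the text; the GRUB configuration still has to be
regenerated and the host rebooted for the new value to take effect.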

https://bugzilla.redhat.com/show_bug.cgi?id=1923165
https://bugzilla.redhat.com/show_bug.cgi?id=1921070

Changed in tripleo:
assignee: nobody → David Vallee Delisle (valleedelisle)
status: New → In Progress
tags: added: tripleo-heat-templates
tags: added: tripleo-kernel
Changed in tripleo:
importance: Undecided → High

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-ansible (master)

Change abandoned by "David Vallee Delisle <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-ansible/+/783969
Reason: After discussing this, we agreed to move this to a validation with a hard stop if operators have not explicitly added the TSX kernel flag to their KernelArgs during update/upgrade

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-validations (master)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to python-tripleoclient (stable/train)

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by "David Vallee Delisle <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/775729
Reason: https://review.opendev.org/c/openstack/python-tripleoclient/+/791089/

OpenStack Infra (hudson-openstack) wrote : Fix merged to python-tripleoclient (stable/train)

Reviewed: https://review.opendev.org/c/openstack/python-tripleoclient/+/791089
Committed: https://opendev.org/openstack/python-tripleoclient/commit/050c9aa99fc6a3818ae37fe6382318be9f59706d
Submitter: "Zuul (22348)"
Branch: stable/train

commit 050c9aa99fc6a3818ae37fe6382318be9f59706d
Author: David Vallee Delisle <email address hidden>
Date: Thu May 13 03:47:47 2021 +0000

    [train-only] post stack creation tsx validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This is explained in detail in this article [a]

    If operators don't want to add the TSX flag to the KernelArgs,
    they can always set "ForceNoTsx" to true.

    Adding this mandatory validation right after the stacks are
    updated is probably the earliest place where we can validate
    and fail if necessary. We'd rather fail quickly than too late
    as this will provide the best experience for our users.

    In addition to this, there's a tripleo-validation [b] in the
    works.

    This is meant to be train-only for now but we will have to
    refactor if (when?) we support FFU from queens to Wallaby+

    [a] https://access.redhat.com/solutions/6036141
    [b] https://review.opendev.org/c/openstack/tripleo-validations/+/790806

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: I35246fbf74394f6e315973283464085d2aef08b2
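
The rule described in the commit message above can be sketched as
follows. This is not the code merged in python-tripleoclient: the
function and exception names are hypothetical, and only the KernelArgs
and ForceNoTsx parameter names come from the commit message.

```python
class TsxValidationError(Exception):
    """Raised when the compute role has no explicit TSX setting."""


def check_tsx_kernel_args(kernel_args, force_no_tsx=False):
    """Fail unless KernelArgs sets tsx= explicitly or ForceNoTsx is true."""
    if force_no_tsx:
        # The operator has explicitly opted out of the check.
        return
    has_tsx = any(
        arg.startswith("tsx=") for arg in (kernel_args or "").split())
    if not has_tsx:
        raise TsxValidationError(
            "KernelArgs for the compute role must define the TSX flag "
            "explicitly (for example tsx=off), or ForceNoTsx must be true")


# Passes silently: the flag is defined explicitly.
check_tsx_kernel_args("intel_iommu=on tsx=off")
```

Raising an exception right after the stack update matches the stated goal
of failing early instead of during a later live migration.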

tags: added: in-stable-train

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-validations (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/793927

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-validations (stable/wallaby)

Change abandoned by "David Vallee Delisle <email address hidden>" on branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/793927

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/python-tripleoclient 12.6.0

This issue was fixed in the openstack/python-tripleoclient 12.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-validations (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-validations/+/790806
Committed: https://opendev.org/openstack/tripleo-validations/commit/ede25c3e36a751daf68ce151521b371bb25f50dc
Submitter: "Zuul (22348)"
Branch: master

commit ede25c3e36a751daf68ce151521b371bb25f50dc
Author: David Vallee Delisle <email address hidden>
Date: Wed May 19 04:05:38 2021 +0000

    Compute TSX validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This also impacts upstream CentOS systems.

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: Icfcfb1c07bbfbe05d27d67187d941c0c34fad2b2

Changed in tripleo:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-validations (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/795632

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-validations (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/795793

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-validations (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-validations/+/795794

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-validations 15.0.0

This issue was fixed in the openstack/tripleo-validations 15.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-validations (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-validations/+/793927
Committed: https://opendev.org/openstack/tripleo-validations/commit/74c30cf49147dedad2944da332b58079bc703200
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 74c30cf49147dedad2944da332b58079bc703200
Author: David Vallee Delisle <email address hidden>
Date: Wed May 19 04:05:38 2021 +0000

    Compute TSX validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This also impacts upstream CentOS systems.

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: Icfcfb1c07bbfbe05d27d67187d941c0c34fad2b2
    (cherry picked from commit ede25c3e36a751daf68ce151521b371bb25f50dc)

tags: added: in-stable-wallaby

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-validations (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-validations/+/795793
Committed: https://opendev.org/openstack/tripleo-validations/commit/7c937a03069ea939f87952f3532a42e0227c9c4b
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 7c937a03069ea939f87952f3532a42e0227c9c4b
Author: David Vallee Delisle <email address hidden>
Date: Wed May 19 04:05:38 2021 +0000

    Compute TSX validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This also impacts upstream CentOS systems.

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: Icfcfb1c07bbfbe05d27d67187d941c0c34fad2b2
    (cherry picked from commit ede25c3e36a751daf68ce151521b371bb25f50dc)
    (cherry picked from commit 74c30cf49147dedad2944da332b58079bc703200)
    (cherry picked from commit 1d8c110b52bf90db9efaf55b4f071a1233270163)

tags: added: in-stable-ussuri

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-validations (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/tripleo-validations/+/795632
Committed: https://opendev.org/openstack/tripleo-validations/commit/5207e0ede4ca8950d69cd3a264a9d2efa36bb94d
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 5207e0ede4ca8950d69cd3a264a9d2efa36bb94d
Author: David Vallee Delisle <email address hidden>
Date: Wed May 19 04:05:38 2021 +0000

    Compute TSX validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This also impacts upstream CentOS systems.

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: Icfcfb1c07bbfbe05d27d67187d941c0c34fad2b2
    (cherry picked from commit ede25c3e36a751daf68ce151521b371bb25f50dc)
    (cherry picked from commit 74c30cf49147dedad2944da332b58079bc703200)

tags: added: in-stable-victoria

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-validations (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-validations/+/795794
Committed: https://opendev.org/openstack/tripleo-validations/commit/1111c6d617289103fe4d3b2800e312066c27a714
Submitter: "Zuul (22348)"
Branch: stable/train

commit 1111c6d617289103fe4d3b2800e312066c27a714
Author: David Vallee Delisle <email address hidden>
Date: Wed May 19 04:05:38 2021 +0000

    Compute TSX validation

    RHEL-8.3 kernel disabled the Intel TSX (Transactional
    Synchronization Extensions) feature by default as a preemptive
    security measure, but it breaks live migration from RHEL-7.9
    (or even RHEL-8.1 or RHEL-8.2) to RHEL-8.3.

    Operators are expected to explicitly define the TSX flag in
    their KernelArgs for the compute role to prevent live-migration
    issues during the upgrade process.

    This also impacts upstream CentOS systems.

    Co-Authored-By: Martin Schuppert <email address hidden>
    Related: https://bugzilla.redhat.com/1923165
    Closes-Bug: #1916758
    Change-Id: Icfcfb1c07bbfbe05d27d67187d941c0c34fad2b2
    (cherry picked from commit ede25c3e36a751daf68ce151521b371bb25f50dc)
    (cherry picked from commit 74c30cf49147dedad2944da332b58079bc703200)
    (cherry picked from commit 1d8c110b52bf90db9efaf55b4f071a1233270163)
    (cherry picked from commit 3c80b58b45625de00bc6d6fca8fe74af24f4690e)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-validations 14.2.0

This issue was fixed in the openstack/tripleo-validations 14.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-validations 13.4.0

This issue was fixed in the openstack/tripleo-validations 13.4.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-validations 12.3.6

This issue was fixed in the openstack/tripleo-validations 12.3.6 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-validations train-eol

This issue was fixed in the openstack/tripleo-validations train-eol release.
