Swift rings are no longer synced

Bug #1892674 reported by Christian Schwede
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
tripleo | Fix Released | High | Christian Schwede |

Bug Description

Swift rings are synced by uploading them to, and downloading them from, the undercloud, which ensures every overcloud node starts with the same copy.

Before Train, the swift_copy_rings container had no explicit network setting, and this worked only because of defaults. Those defaults changed with [1][2], and the container now needs "net: host" to upload the rings to and download them from the undercloud successfully.

A regular deployment won't notice the failure until a node gets replaced or a manually modified ring is used.

An easy way to verify this is to either replace a controller node or manually tweak the ring config, for example:

./overcloud-deploy.sh
swift download overcloud-swift-rings
tar xzvf swift-rings.tar.gz
swift-ring-builder etc/swift/object.builder
swift-ring-builder etc/swift/object.builder set_info d0 127.0.0.1:6000
swift-ring-builder etc/swift/object.builder pretend_min_part_hours_passed
swift-ring-builder etc/swift/object.builder write_ring
tar cvzf swift-rings.tar.gz etc/
swift upload overcloud-swift-rings swift-rings.tar.gz
./overcloud-deploy.sh

Compare these rings with the rings on each controller node and make sure the .ring.gz files are identical.
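
The comparison can be scripted. The following is a minimal sketch, not part of TripleO: compare_rings is a hypothetical helper, and both directory arguments are placeholders for wherever the extracted rings and a node's copy of the rings have been placed. It assumes md5sum is available.

```shell
# Hypothetical helper: compare every *.ring.gz in two directories by MD5.
# Usage: compare_rings LOCAL_DIR NODE_DIR
# Returns 0 when every ring in LOCAL_DIR has an identical copy in NODE_DIR.
compare_rings() {
    local_dir=$1
    node_dir=$2
    status=0
    for ring in "$local_dir"/*.ring.gz; do
        # Skip the literal pattern when the directory holds no ring files.
        [ -e "$ring" ] || continue
        name=$(basename "$ring")
        if [ ! -f "$node_dir/$name" ]; then
            echo "MISSING  $name"
            status=1
            continue
        fi
        local_sum=$(md5sum < "$ring" | cut -d' ' -f1)
        node_sum=$(md5sum < "$node_dir/$name" | cut -d' ' -f1)
        if [ "$local_sum" = "$node_sum" ]; then
            echo "OK       $name"
        else
            echo "MISMATCH $name"
            status=1
        fi
    done
    return $status
}
```

For example, after copying a controller's ring files into a scratch directory, "compare_rings etc/swift /tmp/controller-0-rings" prints one line per ring and returns non-zero on any difference.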

[1] https://review.opendev.org/#/c/630631/
[2] https://review.opendev.org/#/c/670069/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747621

Changed in tripleo:
milestone: none → victoria-3
tags: added: train-backport-potential ussuri-backport-potential
tags: added: idempotency
Revision history for this message
Christian Schwede (cschwede) wrote :

My first conclusion was wrong: this has nothing to do with the net: setting.

The real reason for this race condition is the skip_consistency_check parameter, which was not set to True, despite looking as if it had been. When the check is not skipped, swift-recon is used to query all Swift object storage nodes, get the md5sum of the ring files and compare them with the local ring file md5sum.

After running a ton of deployments, node replacements and updates, this is what I found:

- Deployment and update within min_part_hours: all rings in sync
- Deployment and update after min_part_hours: NOT in sync
- Deployment and update after min_part_hours followed by another subsequent update: all rings in sync

This makes some sense: if the consistency check is executed, it fails because the rebalance changed the local ring copy, while all other object storage nodes still return the md5sum of the previous ring file (in a containerized TripleO deployment).
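
A minimal sketch of that decision, under stated assumptions: should_upload_rings is a hypothetical function that only models the behaviour described above, it is not the actual TripleO or Puppet code, and the md5 values are passed in as plain strings rather than collected with swift-recon.

```shell
# Hypothetical model of the upload gate.
# Usage: should_upload_rings LOCAL_MD5 REMOTE_MD5 SKIP_CONSISTENCY_CHECK
# Returns 0 when the rings would be uploaded to the undercloud.
should_upload_rings() {
    local_md5=$1
    remote_md5=$2
    skip=$3
    # With the check skipped, the rebalanced local ring is always uploaded.
    if [ "$skip" = "true" ]; then
        return 0
    fi
    # Otherwise the upload depends on the nodes reporting the same md5sum
    # as the local ring file.
    [ "$local_md5" = "$remote_md5" ]
}
```

After a rebalance the object servers still report the old md5sum, so the call with a skip value of "false" returns non-zero and the new ring is not uploaded; with "true" it is.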

To avoid regression, we need to ensure "swift-ring-builder pretend_min_part_hours_passed" is applied to the rings after an initial deployment and before a node replacement update (but only for testing, not for production environments).
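
For a test environment, this step could be applied to all builder files at once. The sketch below is an assumption-laden example: pretend_min_part_hours_passed_all and the RING_BUILDER variable are hypothetical names, and the default directory etc/swift matches the layout of the extracted tarball in the bug description. It only wraps the swift-ring-builder subcommand already shown there.

```shell
# Command used to invoke the ring builder; overridable for dry runs.
RING_BUILDER=${RING_BUILDER:-swift-ring-builder}

# Hypothetical helper, for testing only, never for production rings:
# run pretend_min_part_hours_passed on every *.builder file in a directory.
# Usage: pretend_min_part_hours_passed_all [BUILDER_DIR]
pretend_min_part_hours_passed_all() {
    builder_dir=${1:-etc/swift}
    for builder in "$builder_dir"/*.builder; do
        # Skip the literal pattern when no builder files exist.
        [ -e "$builder" ] || continue
        "$RING_BUILDER" "$builder" pretend_min_part_hours_passed || return 1
    done
}
```

Setting RING_BUILDER=echo before calling the function prints the commands instead of running them, which is a cheap way to check which builder files would be touched.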

Changed in tripleo:
importance: Critical → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/747621
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=13cc41a23f376c61af8fe78df32abf3781e107b0
Submitter: Zuul
Branch: master

commit 13cc41a23f376c61af8fe78df32abf3781e107b0
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Swift ring files are synchronized by up- and downloading them to the
    undercloud, making sure every node on the overcloud has the same copy to
    start with.

    One (optional) step in the process is to ensure rings are in sync before
    uploading them eventually. swift-recon is used to query all Swift object
    storage nodes, get the md5sum of the ring files and compare them with
    the local ring file md5sum.

    However, in containerized deployments this will fail, because Swift
    containers are not immediately restarted after rebalancing. The object
    server will return the md5sum of the previous ring version, which does
    not match with the rebalanced local file. TripleO is intended to skip
    this check by setting skip_consistency_check to true.

    However, the parameter was never set to true, and this patch fixes it.

    Running an overcloud update immediately after an initial deployment was
    not affected by this. Same for multiple overcloud updates - subsequent
    updates did fix this issue automatically. In the first case the rings
    were not rebalanced due to min_part_hours not passed, in the latter case
    they were synchronized on the subsequent update.

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/749883

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/749884

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/749885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/749886

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/749887

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/749885
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=962426ec33345b76e6e94a2d999f51e8221f6c99
Submitter: Zuul
Branch: stable/stein

commit 962426ec33345b76e6e94a2d999f51e8221f6c99
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566
    (cherry picked from commit 13cc41a23f376c61af8fe78df32abf3781e107b0)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.opendev.org/749886
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=c75880afcad5349d87a0feaff751429f60e2f7d4
Submitter: Zuul
Branch: stable/rocky

commit c75880afcad5349d87a0feaff751429f60e2f7d4
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566
    (cherry picked from commit 13cc41a23f376c61af8fe78df32abf3781e107b0)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/749884
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8d3a89dbf8b86a44934b16abc2692f07b1c641fd
Submitter: Zuul
Branch: stable/ussuri

commit 8d3a89dbf8b86a44934b16abc2692f07b1c641fd
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566
    (cherry picked from commit 13cc41a23f376c61af8fe78df32abf3781e107b0)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/749883
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=6b98944e369a7a0a0f2a2f8d1fe124f67a38a003
Submitter: Zuul
Branch: stable/train

commit 6b98944e369a7a0a0f2a2f8d1fe124f67a38a003
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566
    (cherry picked from commit 13cc41a23f376c61af8fe78df32abf3781e107b0)

tags: added: in-stable-train
tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/749887
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=02e6124c28620240ec3c9af4834fe398572e334f
Submitter: Zuul
Branch: stable/queens

commit 02e6124c28620240ec3c9af4834fe398572e334f
Author: Christian Schwede <email address hidden>
Date: Thu Sep 3 11:26:45 2020 +0200

    Fix Swift ring file synchronization issue

    Closes-Bug: 1892674
    Change-Id: Ib56f59b7d2a981196eab334108d42ca4390c0566
    (cherry picked from commit 13cc41a23f376c61af8fe78df32abf3781e107b0)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates rocky-eol

This issue was fixed in the openstack/tripleo-heat-templates rocky-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates stein-eol

This issue was fixed in the openstack/tripleo-heat-templates stein-eol release.
