ceph monitor scaling requires persistence of ceph-ansible fetch directory

Bug #1769769 reported by John Fulton
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: John Fulton

Bug Description

After deploying a ceph cluster with N monitors, it is not possible to scale up to N+K monitors. ceph-ansible relies on state information in its fetch_directory for this operation, but the mistral workflow that deploys ceph-ansible doesn't persist the fetch_directory between runs.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/567782

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/567786

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/570576

Revision history for this message
John Fulton (jfulton-org) wrote :

https://review.openstack.org/567786 and https://review.openstack.org/570576 can fix the issue for Pike and Queens when backported, but we also need a solution for Master/Rocky so that deploy_external_tasks interface with the swift container the same way.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/570576
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d51bd9e1bde10ae1d277fa527564536a92a645f9
Submitter: Zuul
Branch: master

commit d51bd9e1bde10ae1d277fa527564536a92a645f9
Author: John Fulton <email address hidden>
Date: Fri May 25 13:01:18 2018 +0000

    Add stack name to env() for OS::TripleO::WorkflowSteps

    In I4b576a6e7fbfb18fa13221e2d080bf7876a8303e state information
    will be persisted in Swift and the name of the Swift container
    should be a function of the Heat stack in case multiple stacks
    are deployed. This patch passes the name of the Heat stack to
    the Mistral environment so that the workflow may access the
    Heat stack name and name the Swift container accordingly.

    Change-Id: I995ad32345a39238ffb9cbcf9966dedc60c75ff8
    Related-Bug: #1769769
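
The naming idea described in this commit message can be sketched roughly as follows. This is an illustration only: the function name and the `_ceph_ansible_fetch_dir` suffix are assumptions made for the example and are not taken from the patch.

```python
def fetch_dir_container_name(stack_name: str,
                             suffix: str = "_ceph_ansible_fetch_dir") -> str:
    """Derive a per-stack Swift container name (hypothetical naming scheme)."""
    if not stack_name:
        raise ValueError("stack_name must be a non-empty string")
    return f"{stack_name}{suffix}"


# Two Heat stacks map to two distinct containers, so the state saved
# for one stack cannot overwrite the state saved for another.
name_a = fetch_dir_container_name("overcloud")
name_b = fetch_dir_container_name("overcloud-edge")
print(name_a)
print(name_b)
```

Making the container name a pure function of the stack name means the workflow only needs the stack name in its environment to find the right container on a later run.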

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/571705

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.openstack.org/571705
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=17b15a81d5e2e73c65d04e46e9d917ea125a0979
Submitter: Zuul
Branch: stable/queens

commit 17b15a81d5e2e73c65d04e46e9d917ea125a0979
Author: John Fulton <email address hidden>
Date: Fri May 25 13:01:18 2018 +0000

    Add stack name to env() for OS::TripleO::WorkflowSteps

    In I4b576a6e7fbfb18fa13221e2d080bf7876a8303e state information
    will be persisted in Swift and the name of the Swift container
    should be a function of the Heat stack in case multiple stacks
    are deployed. This patch passes the name of the Heat stack to
    the Mistral environment so that the workflow may access the
    Heat stack name and name the Swift container accordingly.

    Change-Id: I995ad32345a39238ffb9cbcf9966dedc60c75ff8
    Related-Bug: #1769769
    (cherry picked from commit d51bd9e1bde10ae1d277fa527564536a92a645f9)

tags: added: in-stable-queens
Changed in tripleo:
milestone: rocky-2 → rocky-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/582811

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/567782
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2f300dfd906ae747e626efa27f4dc0380aef361a
Submitter: Zuul
Branch: master

commit 2f300dfd906ae747e626efa27f4dc0380aef361a
Author: John Fulton <email address hidden>
Date: Fri May 11 03:01:48 2018 -0400

    Persist ceph-ansible fetch_directory using mistral

    When scaling ceph monitors, ceph-ansible uses context from the
    fetch_directory to prevent new monitors from behaving like they
    are the only monitors.

    Save the fetch_directory in swift after each ceph-ansible playbook
    run; and if there is a fetch directory in swift, restore it before
    each playbook run.

    Note that https://review.openstack.org/#/c/582811 only resolves
    1769769 for Master where config-download external deploy tasks
    run ceph-ansible. This change resolves 1769769 when using mistral
    for Queens/Pike (when backported).

    Change-Id: I4b576a6e7fbfb18fa13221e2d080bf7876a8303e
    Depends-On: I995ad32345a39238ffb9cbcf9966dedc60c75ff8
    Closes-Bug: #1769769
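
The save/restore cycle described in this commit message can be sketched as below. This is a minimal sketch under stated assumptions, not the Mistral workflow itself: a plain dict stands in for the Swift container, and the object name, function names, and the fake playbook are all hypothetical.

```python
import io
import tarfile
import tempfile
from pathlib import Path

OBJECT_NAME = "fetch_dir.tar.gz"  # hypothetical object name


def save_fetch_dir(fetch_dir: Path, container: dict) -> None:
    """Archive fetch_dir and store it after a playbook run."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(fetch_dir, arcname=".")
    container[OBJECT_NAME] = buf.getvalue()


def restore_fetch_dir(fetch_dir: Path, container: dict) -> bool:
    """Unpack a previously saved archive before a playbook run, if any."""
    data = container.get(OBJECT_NAME)
    if data is None:
        return False  # first deployment: nothing to restore
    fetch_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        tar.extractall(fetch_dir)
    return True


def run_with_persistence(fetch_dir: Path, container: dict, run_playbook) -> None:
    """Restore, run the playbook, then save, in that order."""
    restore_fetch_dir(fetch_dir, container)
    run_playbook(fetch_dir)
    save_fetch_dir(fetch_dir, container)


def fake_playbook(fetch_dir: Path) -> None:
    """Stand-in for ceph-ansible: counts how many runs it has seen."""
    fetch_dir.mkdir(parents=True, exist_ok=True)
    marker = fetch_dir / "cluster_state.txt"
    runs = int(marker.read_text()) if marker.exists() else 0
    marker.write_text(str(runs + 1))


container = {}  # stand-in for a Swift container (assumption, not the Swift API)
with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    # Each run starts with an empty working directory, as in the bug.
    run_with_persistence(Path(first) / "fetch", container, fake_playbook)
    run_with_persistence(Path(second) / "fetch", container, fake_playbook)
    runs_seen = (Path(second) / "fetch" / "cluster_state.txt").read_text()
print(runs_seen)
```

Without the restore step, the second run would start from an empty directory and behave like a first deployment, which is the failure this bug describes.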

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/583229

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/583229
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=fa22231bc7311184f9667558b7bff068755c5b37
Submitter: Zuul
Branch: stable/queens

commit fa22231bc7311184f9667558b7bff068755c5b37
Author: John Fulton <email address hidden>
Date: Fri May 11 03:01:48 2018 -0400

    Persist ceph-ansible fetch_directory using mistral

    When scaling ceph monitors, ceph-ansible uses context from the
    fetch_directory to prevent new monitors from behaving like they
    are the only monitors.

    Save the fetch_directory in swift after each ceph-ansible playbook
    run; and if there is a fetch directory in swift, restore it before
    each playbook run.

    Note that https://review.openstack.org/#/c/582811 only resolves
    1769769 for Master where config-download external deploy tasks
    run ceph-ansible. This change resolves 1769769 when using mistral
    for Queens/Pike (when backported).

    Change-Id: I4b576a6e7fbfb18fa13221e2d080bf7876a8303e
    Depends-On: I995ad32345a39238ffb9cbcf9966dedc60c75ff8
    Closes-Bug: #1769769
    (cherry picked from commit 2f300dfd906ae747e626efa27f4dc0380aef361a)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.2.0

This issue was fixed in the openstack/tripleo-common 9.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.6.4

This issue was fixed in the openstack/tripleo-common 8.6.4 release.

Changed in tripleo:
importance: High → Critical
Revision history for this message
John Fulton (jfulton-org) wrote :

I changed this from fix-released to in-progress because the fix for queens is different from the fix for rocky. e.g. https://review.openstack.org/#/c/582811 isn't merged yet.

Changed in tripleo:
milestone: rocky-3 → rocky-rc2
status: Fix Released → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/597221

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/597233

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by John Fulton (<email address hidden>) on branch: master
Review: https://review.openstack.org/597233
Reason: Opting to do this in one submission: https://review.openstack.org/#/c/597221

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart (master)

Change abandoned by John Fulton (<email address hidden>) on branch: master
Review: https://review.openstack.org/567786
Reason: served its purpose as a test only wip

Changed in tripleo:
milestone: rocky-rc2 → stein-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Juan Antonio Osorio Robles (<email address hidden>) on branch: master
Review: https://review.openstack.org/597221
Reason: abandoning temporarily to free up resources

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Juan Antonio Osorio Robles (<email address hidden>) on branch: master
Review: https://review.openstack.org/582811
Reason: abandoning temporarily to free up resources

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604772

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604773

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Juan Antonio Osorio Robles (<email address hidden>) on branch: master
Review: https://review.openstack.org/597221
Reason: Purging the gate to free up resources and address the timeout issues

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/597221
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=bdde61dec0ce2882bd9a591acfb94502b27a297e
Submitter: Zuul
Branch: master

commit bdde61dec0ce2882bd9a591acfb94502b27a297e
Author: John Fulton <email address hidden>
Date: Tue Aug 28 19:53:54 2018 +0000

    Update swift_rings_backup workflow to also backup ceph fetch dir

    Rename swift_rings_backup to swift_backup because we might wish
    to use swift on the undercloud to backup more than just the
    overcloud swift rings. For example the same workflow is useful
    for backing up the ceph-ansible fetch directory in the undercloud
    swift.

    Update deployment and plan management workflows to also create
    or update the ceph-ansible fetch directory swift container.

    Change-Id: Icce658f803a608ee4b7df34b0b8297ecabcdb0ee
    Related-Bug: #1769769

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/rocky)

Change abandoned by Alex Schultz (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/604773
Reason: http://lists.openstack.org/pipermail/openstack-dev/2018-September/135224.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/604773
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=5a55202258c009d65885170080885d16645144ee
Submitter: Zuul
Branch: stable/rocky

commit 5a55202258c009d65885170080885d16645144ee
Author: John Fulton <email address hidden>
Date: Tue Aug 28 19:53:54 2018 +0000

    Update swift_rings_backup workflow to also backup ceph fetch dir

    Rename swift_rings_backup to swift_backup because we might wish
    to use swift on the undercloud to backup more than just the
    overcloud swift rings. For example the same workflow is useful
    for backing up the ceph-ansible fetch directory in the undercloud
    swift.

    Update deployment and plan management workflows to also create
    or update the ceph-ansible fetch directory swift container.

    Change-Id: Icce658f803a608ee4b7df34b0b8297ecabcdb0ee
    Related-Bug: #1769769
    (cherry picked from commit bdde61dec0ce2882bd9a591acfb94502b27a297e)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.openstack.org/604772
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8af44652da35494bb6ca840003c38b7e2ade6802
Submitter: Zuul
Branch: stable/rocky

commit 8af44652da35494bb6ca840003c38b7e2ade6802
Author: John Fulton <email address hidden>
Date: Sun Jul 15 21:45:43 2018 +0000

    Persist ceph-ansible fetch_directory using config-download

    When scaling ceph monitors, ceph-ansible uses context from the
    fetch_directory to prevent new monitors from behaving like they
    are the only monitors.

    Save the fetch_directory after each ceph-ansible playbook run;
    and if there is a previously saved fetch directory, restore it
    before each playbook run.

    Fetch directory can be saved on the undercloud in Swift or if
    the new LocalCephAnsibleFetchDirectoryBackup parameter is passed
    then it will be saved in a directory local to the undercloud
    instead.

    Note that https://review.openstack.org/#/c/567782 only resolves
    1769769 for Queens/Pike where Mistral runs ceph-ansible. This
    change resolves 1769769 when using config-download.

    Change-Id: I0591be8419828cc32f976afce8be1b787b783c23
    Depends-On: Icce658f803a608ee4b7df34b0b8297ecabcdb0ee
    Related-Bug: #1769769
    (cherry picked from commit 7fc83987dc415228e42a9645320a49111c45b301)
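
The choice between the two backup locations described in this commit message can be sketched as follows. Only the parameter name `LocalCephAnsibleFetchDirectoryBackup` comes from the commit message; the function, the return shape, and the container naming are assumptions made for illustration.

```python
from typing import Mapping, Tuple


def choose_backup_target(parameters: Mapping[str, str],
                         stack_name: str) -> Tuple[str, str]:
    """Return (backend, location) for the fetch directory backup."""
    local_dir = parameters.get("LocalCephAnsibleFetchDirectoryBackup", "")
    if local_dir:
        # Parameter passed: keep the backup in a directory on the undercloud.
        return ("local", local_dir)
    # Default: keep the backup in Swift on the undercloud.
    # The container name below is a hypothetical naming scheme.
    return ("swift", f"{stack_name}_ceph_ansible_fetch_dir")


default_target = choose_backup_target({}, "overcloud")
local_target = choose_backup_target(
    {"LocalCephAnsibleFetchDirectoryBackup": "/var/lib/fetch_backup"},
    "overcloud",
)
print(default_target)
print(local_target)
```

The path `/var/lib/fetch_backup` is only an example value.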

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/582811
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=7fc83987dc415228e42a9645320a49111c45b301
Submitter: Zuul
Branch: master

commit 7fc83987dc415228e42a9645320a49111c45b301
Author: John Fulton <email address hidden>
Date: Sun Jul 15 21:45:43 2018 +0000

    Persist ceph-ansible fetch_directory using config-download

    When scaling ceph monitors, ceph-ansible uses context from the
    fetch_directory to prevent new monitors from behaving like they
    are the only monitors.

    Save the fetch_directory after each ceph-ansible playbook run;
    and if there is a previously saved fetch directory, restore it
    before each playbook run.

    Fetch directory can be saved on the undercloud in Swift or if
    the new LocalCephAnsibleFetchDirectoryBackup parameter is passed
    then it will be saved in a directory local to the undercloud
    instead.

    Note that https://review.openstack.org/#/c/567782 only resolves
    1769769 for Queens/Pike where Mistral runs ceph-ansible. This
    change resolves 1769769 when using config-download.

    Change-Id: I0591be8419828cc32f976afce8be1b787b783c23
    Depends-On: Icce658f803a608ee4b7df34b0b8297ecabcdb0ee
    Related-Bug: #1769769

Changed in tripleo:
milestone: stein-1 → stein-2
Revision history for this message
John Fulton (jfulton-org) wrote :

All patches related to this bug have merged:

 https://review.openstack.org/#/q/topic:bug/1769769+(status:open+OR+status:merged)

Also, our documentation includes backing up the undercloud's /srv/node directory [1], which contains all of the swift objects on the undercloud. The patches that fix this bug store the ceph-ansible fetch directory in swift, and swift is included in that backup, so a restored undercloud should contain the swift data, including the fetch directory. No additional changes to the documentation or backup procedures are required.

[1] Line 118 of https://github.com/openstack/tripleo-docs/blob/master/doc/source/install/controlplane_backup_restore/03_undercloud_restore.rst

Changed in tripleo:
status: In Progress → Fix Committed
milestone: stein-2 → stein-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/618589

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/618597

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/pike)

Change abandoned by John Fulton (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/618589
Reason: This change doesn't help the FFU case I was hoping to address.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/pike)

Change abandoned by John Fulton (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/618597
