M/N upgrades - Full HA -> HA NG migration might fail setting maintenance-mode

Bug #1628393 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari
Milestone: (none listed)

Bug Description

The upgrade step fails on the controller with the following output:

[root@overcloud-controller-0 ~]# more /var/lib/heat-config/deployed/d431aa70-bd9a-4b18-86c9-cba184716345.notify.json
{
  "deploy_stdout": "mysql upgrade required: 0\nERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting.\n",
  "deploy_stderr": "",
  "deploy_status_code": 1
}

This happens because the migration path currently does the following:
pcs property set maintenance-mode=true
# We are making sure here that the property has propagated everywhere
if ! timeout -k 10 300 crm_resource --wait; then
     echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
     exit 1
fi

crm_resource --wait can actually take forever under certain conditions, in which case the
300-second timeout expires and the upgrade step exits with status 1. Since we are going to shut
down the cluster later anyway, there is no point in using crm_resource --wait at this stage.
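
As a minimal sketch of what dropping the wait could look like (not the merged change itself; see the review linked in this report for that), the step can set the property and fail only when the pcs command itself fails. The function name enter_maintenance_mode and the echo_error stand-in below are hypothetical and introduced for illustration only; the real echo_error helper in the upgrade scripts may behave differently.

```shell
#!/bin/bash
# Sketch only: enter_maintenance_mode is a hypothetical helper, not a function
# from tripleo-heat-templates.

# Assumed stand-in for the echo_error helper used in the report: print to stderr.
echo_error() {
    echo "$@" >&2
}

# Set maintenance-mode and report failure only if the pcs command itself fails.
# There is deliberately no "crm_resource --wait" afterwards, so the step cannot
# hang waiting for the cluster to settle.
enter_maintenance_mode() {
    if ! pcs property set maintenance-mode=true; then
        echo_error "ERROR: could not set maintenance-mode, exiting."
        return 1
    fi
    return 0
}
```

On a cluster node this would be called as "enter_maintenance_mode || exit 1" before the later step that shuts the cluster down.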

Tags: upgrade
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/378317

Changed in tripleo:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/378317
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=35da6af8bd1629ea823aecdd0e15b73fcbccf7b0
Submitter: Jenkins
Branch: master

commit 35da6af8bd1629ea823aecdd0e15b73fcbccf7b0
Author: Michele Baldessari <email address hidden>
Date: Wed Sep 28 09:41:30 2016 +0200

    Full HA->HA NG migration might fail setting maintenance-mode

    Currently we do the following in the migration path:
    pcs property set maintenance-mode=true
    if ! timeout -k 10 300 crm_resource --wait; then
         echo_error "ERROR: cluster remained unstable after setting maintenance-mode for more than 300 seconds, exiting."
         exit 1
    fi

    crm_resource --wait can actually take forever under certain conditions.
    The property will be set atomically across the cluster nodes so we should be good
    without this.

    Change-Id: I8f531d63479b81d65b572c4431c9db6f610f7e04
    Closes-Bug: #1628393

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 5.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 5.0.0.0rc2 release candidate.
