Composable upgrade ansible-pacemaker default service start timeout is too short

Bug #1666604 reported by Marios Andreou
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Low
Assigned to: Marios Andreou

Bug Description

I hit this in my local environment yesterday and again today, so I am filing this bug. The current pacemaker service ansible tasks set a default start timeout of '200' seconds (https://github.com/openstack/tripleo-heat-templates/blob/be6a66042e14abfc3a1b649ed144bb755c750424/puppet/services/pacemaker.yaml#L146) for the PacemakerResources check (by default it checks ['rabbitmq','haproxy']). In my environment the timeout expired by a few seconds in both cases, and the check failed like this:

        Feb 21 11:08:27 overcloud-controller-0.localdomain os-collect-config[126779]: [2017-02-21 11:08:27,880] (heat-config) [INFO] {"deploy_stdout": "\nPLAY [localhost] ***************************************************************\n\nTASK [setup] *******************************************************************\nok: [localhost]\n\nTASK [Start pacemaker cluster] *************************************************\nchanged: [localhost]\n\nTASK [Check pacemaker resource] ************************************************\nok: [localhost] => (item=rabbitmq)\nfailed: [localhost] (item=haproxy) => {\"failed\": true, \"item\": \"haproxy\", \"msg\": \"Failed, the resource haproxy is not started\\n\"}\n\tto retry, use: --limit @/var/lib/heat-config/heat-config-ansible/2e011edd-78b5-4ac3-a992-ce83a2f755ff_playbook.retry\n\nPLAY RECAP *********************************************************************\nlocalhost : ok=2 changed=1 unreachable=0 failed=1 \n\n", "deploy_stderr": "", "deploy_status_code": 2}

... and then a few seconds later haproxy starts OK:

        Feb 21 11:08:46 host-192-0-2-7 crmd[226256]: notice: Result of start operation for haproxy on overcloud-controller-0: 0 (ok)
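
For reference, the gap between the two log lines above can be worked out from their timestamps. This is only a rough sketch in Python; the variable names are illustrative, and the first timestamp is when os-collect-config logged the failure, which may be slightly later than the moment the timeout actually expired.

```python
from datetime import datetime

# Timestamps copied from the two log lines above (both on Feb 21).
check_failed = datetime.strptime("11:08:27", "%H:%M:%S")
haproxy_started = datetime.strptime("11:08:46", "%H:%M:%S")

# Seconds between the failure being logged and haproxy reporting start OK.
gap_seconds = (haproxy_started - check_failed).total_seconds()
print(gap_seconds)  # 19.0
```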

The module defaults to 300 at https://github.com/redhat-openstack/ansible-pacemaker/blob/67b8b9e000cacfed33016e1509e3579f42f1f335/modules/pacemaker_resource.py#L47 so perhaps we should use the same?

I will post a review shortly and link it here. The upgrades CI does NOT seem to be hitting this issue, so it could be my environment (virt overcloud nodes with 5GB RAM + swap; the virt host only has 4 physical cores :/), but it may be worth bumping to the module default anyway?
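
To illustrate why a resource that is only slightly slow fails with 200 but would pass with 300, here is a minimal sketch of a poll-until-started check using simulated time. It is not the ansible-pacemaker module's implementation; `wait_for_start`, `haproxy_started` and the 219-second startup figure are hypothetical (219 is simply 200 plus the roughly 19-second gap seen in the logs, assuming the wait began about 200 seconds before the failure was logged).

```python
def wait_for_start(is_started, timeout, interval=1):
    """Poll is_started(elapsed) every `interval` seconds of simulated time.

    Returns True as soon as the resource reports started, or False once
    `timeout` seconds have passed without that happening.
    """
    elapsed = 0
    while elapsed <= timeout:
        if is_started(elapsed):
            return True
        elapsed += interval
    return False


# Hypothetical: the resource needs 219 seconds to come up on a slow host.
STARTUP_SECONDS = 219


def haproxy_started(elapsed):
    # Stand-in for asking the cluster whether the resource is started.
    return elapsed >= STARTUP_SECONDS


print(wait_for_start(haproxy_started, timeout=200))  # False
print(wait_for_start(haproxy_started, timeout=300))  # True
```

Under these assumptions the 200-second check gives up just before the resource comes up, while the module default of 300 leaves enough headroom.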

Changed in tripleo:
assignee: nobody → Marios Andreou (marios-b)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/436560

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/436560
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=8448c92203596ca578f85bdd7ffc96dd79adfe3e
Submitter: Jenkins
Branch: master

commit 8448c92203596ca578f85bdd7ffc96dd79adfe3e
Author: marios <email address hidden>
Date: Tue Feb 21 19:12:02 2017 +0200

    Increase ansible-pacemaker default service start timeout

    We are passing 200 but in some environments this has been seen to
    expire by a few seconds.

    Change-Id: I5c2270559339ea9ee0043b7a2e519e26d4d9d78a
    Closes-Bug: 1666604

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/437510

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/437510
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=039e7ef65efc1165d7518979159cb564b2e8a24e
Submitter: Jenkins
Branch: stable/ocata

commit 039e7ef65efc1165d7518979159cb564b2e8a24e
Author: marios <email address hidden>
Date: Tue Feb 21 19:12:02 2017 +0200

    Increase ansible-pacemaker default service start timeout

    We are passing 200 but in some environments this has been seen to
    expire by a few seconds.

    Change-Id: I5c2270559339ea9ee0043b7a2e519e26d4d9d78a
    Closes-Bug: 1666604
    (cherry picked from commit 8448c92203596ca578f85bdd7ffc96dd79adfe3e)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.0.0.0rc2

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0rc2 release candidate.

Changed in tripleo:
milestone: pike-1 → ongoing
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b1 development milestone.
