Race condition in hacluster charm that leaves pacemaker down

Bug #1654403 reported by David Ames on 2017-01-05
This bug affects 1 person
Affects (Importance, Assigned to):
- OpenStack hacluster charm: High, David Ames
- corosync (Ubuntu): Undecided, Unassigned
- hacluster (Juju Charms Collection): High, David Ames

Bug Description

Symptom: one or more hacluster nodes are left in an executing state.
Observing the process list on the affected nodes shows the command `crm node list` stuck in an infinite loop, and pacemaker is not started. On nodes where `crm node list` and the other crm commands complete, pacemaker is started.

See the artefacts from this run:
https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline/openstack/charm-percona-cluster/417131/1/1873/index.html

Hypothesis: There is a race that leads to crm node list being executed before pacemaker is started. It is also possible that something causes pacemaker to fail to start.

Suggested fix: check pacemaker health before any crm commands are run.
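Such a check could be a bounded retry rather than an unbounded loop, so a node that never becomes healthy fails loudly instead of hanging forever. A minimal sketch (the helper name, attempt count, and delay are illustrative, not the charm's actual code):

```shell
#!/bin/sh
# Retry a command until it succeeds or we give up.
# Usage: wait_for_cmd <max_attempts> <delay_seconds> <command...>
wait_for_cmd() {
    max_attempts=$1; delay=$2; shift 2
    attempt=1
    while ! "$@" >/dev/null 2>&1; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "command '$*' did not succeed after $max_attempts attempts" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Only proceed to crm commands once pacemaker answers; otherwise
# surface the failure to the operator instead of looping forever:
# wait_for_cmd 30 2 crm node list || exit 1
```

The key difference from the current behavior is the `max_attempts` bound: on a node where pacemaker never came up, the hook errors out and the unit shows a failure, instead of sitting in an executing state indefinitely.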

David Ames (thedac) on 2017-01-05
Changed in hacluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
milestone: none → 17.01
David Ames (thedac) wrote :

Root cause:

1) When corosync is restarted it may take up to a minute for it to finish setting up.

2) The systemd timeout value is exceeded.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.

3) Pacemaker is then started. The pacemaker systemd unit has a dependency on corosync, which may still be in the process of coming up.

4) Pacemaker fails to start due to the failed dependency:
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.

5) Pacemaker remains down.

6) Subsequently, the charm checks for pacemaker health by running `crm node list` in a loop until it succeeds.

7) Because pacemaker never starts, this loop never terminates.

Solutions

1) Adding corosync to this bug for a systemd unit timeout change.

2) The charm needs to better validate service restarts and clearly communicate to the end user when an error has occurred.
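For the systemd timeout change, one possible shape is a drop-in override raising corosync's start timeout. This is a sketch; the file path and the 180s value are assumptions, not necessarily what was shipped:

```ini
# /etc/systemd/system/corosync.service.d/timeout.conf (illustrative path)
[Service]
# systemd's default TimeoutStartSec is 90s; corosync can take longer
# than a minute to finish setting up, so give it more headroom.
TimeoutStartSec=180
```

After adding a drop-in like this, `systemctl daemon-reload` is needed before restarting the service.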

Current Work in Progress
https://review.openstack.org/#/c/419204/

Corey Bryant (corey.bryant) wrote :

David, what release of ubuntu/openstack does this affect? I'd like to see if we can get a package update in a PPA for you to test with.

David Ames (thedac) wrote :

Corey,

This is Mitaka on Xenial. I suspect that the package remains the same on Xenial for the other OpenStack releases. I'll try and confirm this.

Corey Bryant (corey.bryant) wrote :

This may have been fixed as of the 1.1.15-1 version of the pacemaker package. Prior to commit 071796e, "Restart=on-failure" was patched out. I've attached the diff of the commit that reverted that.

Corey Bryant (corey.bryant) wrote :

David, you could try adding "Restart=on-failure" back to the init file as a test. If it works, we could look into backporting that to xenial, however I'm hesitant to do that until we know better why they dropped the restart bits in the first place.
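One way to try that locally without editing the packaged unit file is a drop-in override (a sketch; the path is illustrative):

```ini
# /etc/systemd/system/pacemaker.service.d/restart.conf (illustrative path)
[Service]
# Restore the restart-on-failure behavior that was patched out
# prior to the 1.1.15-1 pacemaker package.
Restart=on-failure
```

Followed by `systemctl daemon-reload`, this would let systemd retry pacemaker after the dependency failure instead of leaving it down.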

tags: added: patch
David Ames (thedac) wrote :

Additional information from the charm:

Without cluster_count set to the number of units, a race occurs where the relation to the last hacluster node is not yet set, leading to an attempt to start corosync and pacemaker with only n-1 of n nodes.

The last node is aware of only one relation when there should be two:
relation-list -r hanode:0
hacluster/0

corosync.conf looks like the following when there should be 3 nodes:

nodelist {

        node {
                ring0_addr: 10.5.35.235
                nodeid: 1000
        }

        node {
                ring0_addr: 10.5.35.237
                nodeid: 1001
        }

}

The services themselves (not the charm) fail:
corosync logs thousands of RETRANSMIT errors.
pacemaker eventually times out after waiting on corosync.

Adding more documentation to push for setting cluster_count, and updating the amulet tests to include it.
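In practice this means declaring the expected cluster size up front, for example in a bundle fragment like the following (a sketch; the application name and count are examples, and older bundle formats use `services:` rather than `applications:`):

```yaml
applications:
  hacluster:
    charm: cs:hacluster
    options:
      # Tell the charm how many peers to expect so it waits for all
      # hanode relations before starting corosync and pacemaker.
      cluster_count: 3
```

With cluster_count set, the charm defers service startup until all three nodes appear in the hanode relation, avoiding the n-1/n corosync.conf shown above.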

Reviewed: https://review.openstack.org/419204
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=fda5176bd53f17a69f3e22b6b363bff96ff565c0
Submitter: Jenkins
Branch: master

commit fda5176bd53f17a69f3e22b6b363bff96ff565c0
Author: David Ames <email address hidden>
Date: Wed Jan 11 16:00:39 2017 -0800

    Fix pacemaker down crm infinite loop

    On corosync restart, corosync may take longer than a minute to come
    up. The systemd start script times out too soon. Then pacemaker which
    is dependent on corosync is immediately started and fails as corosync
    is still in the process of starting.

    Subsequently the charm would run crm node list to validate pacemaker.
    This would become an infinite loop.

    This change adds longer timeout values for systemd scripts and adds
    better error handling and communication to the end user.

    Change-Id: I7c3d018a03fddfb1f6bfd91fd7aeed4b13879e45
    Partial-Bug: #1654403

David Ames (thedac) on 2017-01-25
Changed in hacluster (Juju Charms Collection):
status: Triaged → Fix Committed
assignee: nobody → David Ames (thedac)
James Page (james-page) on 2017-02-23
Changed in charm-hacluster:
assignee: nobody → David Ames (thedac)
importance: Undecided → High
status: New → Fix Committed
Changed in hacluster (Juju Charms Collection):
status: Fix Committed → Invalid
James Page (james-page) on 2017-02-23
Changed in charm-hacluster:
milestone: none → 17.02
James Page (james-page) on 2017-02-23
Changed in charm-hacluster:
status: Fix Committed → Fix Released

Hi,
Corey mentioned a while ago that this might be fixed as of 1.1.15.
You have all the context - is it?

So would that be for corosync:
- Yakkety/Zesty Fixed
- Xenial SRU needed

Or is this totally solved by the charm changes you submitted?
Or ...

TL;DR please help me to understand what might be left on the corosync task of this bug :-)

For this particular bug, we have no description of why corosync was taking too long to start, only that it did, along with all the workarounds made to pacemaker initialization and charm handling. With that, I'm marking the corosync task as Incomplete for now, while I gather all the work to be done in the HA packages. Please re-open this if you disagree, so we can discuss this bug again. Thank you!

Changed in corosync (Ubuntu):
status: New → Incomplete