Race condition in hacluster charm that leaves pacemaker down
Bug #1654403 reported by David Ames
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | Fix Released | High | David Ames |
corosync (Ubuntu) | Fix Released | Undecided | Unassigned |
Xenial | Incomplete | Undecided | Unassigned |
hacluster (Juju Charms Collection) | Invalid | High | David Ames |
Bug Description
Symptom: one or more hacluster nodes are left in an executing state.
The process list on the affected nodes shows the command 'crm node list' stuck in an infinite loop, and pacemaker is not started. On nodes where 'crm node list' and the other crm commands complete, pacemaker is started.
See the artefacts from this run:
https:/
Hypothesis: There is a race that leads to crm node list being executed before pacemaker is started. It is also possible that something causes pacemaker to fail to start.
Suggest a check for pacemaker health before any crm commands are run.
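A minimal sketch of such a check, assuming the nodes run systemd and that `systemctl is-active --quiet pacemaker` is an acceptable health signal. The function names `unit_is_active` and `crm_node_list` are hypothetical and are not taken from the charm; the `run` and `check_output` parameters exist only so the logic can be exercised without a real cluster.

```python
import subprocess


def unit_is_active(unit, run=subprocess.call):
    """Return True if systemd reports the unit as active."""
    # `systemctl is-active --quiet UNIT` exits 0 only when the unit is active.
    return run(["systemctl", "is-active", "--quiet", unit]) == 0


def crm_node_list(run=subprocess.call, check_output=subprocess.check_output):
    """Run `crm node list` only when pacemaker is up; raise otherwise."""
    if not unit_is_active("pacemaker", run=run):
        # Fail fast instead of letting the crm command hang or spin.
        raise RuntimeError("pacemaker is not active; refusing to run crm commands")
    return check_output(["crm", "node", "list"])
```

With a guard like this, a node where pacemaker failed to start would surface an error rather than sitting in an executing state.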
Changed in hacluster (Juju Charms Collection):
status: New → Triaged
importance: Undecided → High
milestone: none → 17.01
tags: added: patch
Changed in hacluster (Juju Charms Collection):
status: Triaged → Fix Committed
assignee: nobody → David Ames (thedac)
Changed in charm-hacluster:
assignee: nobody → David Ames (thedac)
importance: Undecided → High
status: New → Fix Committed
Changed in hacluster (Juju Charms Collection):
status: Fix Committed → Invalid
Changed in charm-hacluster:
milestone: none → 17.02
Changed in charm-hacluster:
status: Fix Committed → Fix Released
Changed in corosync (Ubuntu Xenial):
status: New → Incomplete
Changed in corosync (Ubuntu):
status: Incomplete → Fix Released
Root cause:
1) When corosync is restarted it may take up to a minute for it to finish setting up.
2) The systemd timeout value is exceeded:
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: Failed to start Corosync Cluster Engine.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Unit entered failed state.
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: corosync.service: Failed with result 'timeout'.
3) Pacemaker is then started. The pacemaker systemd script has a dependency on corosync, which may still be in the process of coming up.
4) Pacemaker fails to start due to the dependency:
Jan 10 18:57:49 juju-39e3e2-percona-3 systemd[1]: pacemaker.service: Job pacemaker.service/start failed with result 'dependency'.
5) Pacemaker remains down.
6) Subsequently, the charm checks for pacemaker health by running `crm node list` in a loop until it succeeds.
7) Because pacemaker never starts, `crm node list` never succeeds, so this is an infinite loop.
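One way to avoid the unbounded loop in steps 6 and 7 is to cap the number of attempts. The sketch below is illustrative only: `wait_for`, its parameters, and the default retry count and delay are assumptions, not the charm's actual code.

```python
import time


def wait_for(check, retries=10, delay=5.0, sleep=time.sleep):
    """Call check() until it returns True or the retries are exhausted.

    Returns the attempt number that succeeded; raises TimeoutError otherwise.
    """
    for attempt in range(1, retries + 1):
        if check():
            return attempt
        if attempt < retries:
            # Pause between attempts, but not after the final failure.
            sleep(delay)
    raise TimeoutError(f"check did not succeed after {retries} attempts")
```

Here `check` would wrap whatever health probe is used, for example a function that runs `crm node list` and returns True on a zero exit status. When the limit is reached, the caller gets an exception it can report instead of a hook that never finishes.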
Solutions
1) Adding corosync to this bug for a systemd script timeout change.
2) The charm needs to better validate restarts of the services and better communicate to the end user when an error has occurred.
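As a rough illustration of the second point, the sketch below restarts corosync before pacemaker, confirms each unit is active before moving on, and reports a message when one is not. It assumes systemd and `systemctl`; `restart_and_validate` and the `report` callback are hypothetical names (in a charm, `report` might set a blocked workload status), and the retry count and delay are arbitrary.

```python
import subprocess
import time


def restart_and_validate(units=("corosync", "pacemaker"), run=subprocess.call,
                         report=print, retries=12, delay=5.0, sleep=time.sleep):
    """Restart units in order, verifying each is active before the next."""
    for unit in units:
        run(["systemctl", "restart", unit])
        for _ in range(retries):
            # Exit status 0 from is-active means the unit is running.
            if run(["systemctl", "is-active", "--quiet", unit]) == 0:
                break
            sleep(delay)
        else:
            # Retries exhausted: tell the user and stop before dependent units.
            report(f"{unit} failed to become active after restart")
            return False
    return True
```

Stopping at the first unit that fails keeps pacemaker from being started against a corosync that has not finished coming up.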
Current work in progress: https://review.openstack.org/#/c/419204/