hosts rebooting on install of masakari charms/bundle

Bug #1884284 reported by Syed Mohammad Adnan Karim
Affects                            Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm         Fix Released  Undecided   Liam Young   20.10
OpenStack Masakari Charm           Invalid       Undecided   Unassigned
OpenStack Masakari Monitors Charm  Invalid       Undecided   Unassigned
OpenStack Pacemaker Remote Charm   Invalid       Undecided   Unassigned

Bug Description

Upon installation of the masakari charms (masakari, masakari-monitors, pacemaker-remote, haproxy), I found that all the compute hosts were being rebooted one by one as they joined the haproxy-masakari clone-set/cluster.

This is probably due to enable-stonith: True being set in the bundle.
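
The effective value can be confirmed on the deployed model (assuming the application is named pacemaker-remote, as in the overlay):

$ juju config pacemaker-remote enable-stonith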

overlay-masakari.yaml: https://pastebin.canonical.com/p/rr3ZdhgRHh/
full-bundle.yaml: https://pastebin.canonical.com/p/6HMndPrYPQ/

The overlay-masakari.yaml was used to deploy the masakari charms onto an already-deployed bionic-stein OpenStack cloud.

The full-bundle.yaml was used to redeploy the cloud once the masakari functionality and configurations were verified to work properly.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please upload a sanitized bundle and a juju-crashdump for this deployment. Also include a more detailed description of the operator/user steps taken to reach the error condition. Thank you.

Changed in charm-pacemaker-remote:
status: New → Incomplete
Changed in charm-masakari-monitors:
status: New → Incomplete
Changed in charm-masakari:
status: New → Incomplete
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please avoid pastebins, as they are ephemeral and may limit visibility for upstream developers, who are important in the triage and collaboration processes. Thanks again.

Syed Mohammad Adnan Karim (karimsye)
description: updated
Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

First I deployed the cloud using juju deploy ./bundle-sanitized.yaml.
After all the charms reached an active state in juju status, I created a failover segment with:

$ openstack segment create segment1

Then I started to add all the compute hosts into the above segment with:

$ openstack segment host create <hostname> segment1

After adding all the nodes, I noticed them rebooting one by one as they joined the haproxy-masakari clone-set/cluster.

Revision history for this message
Liam Young (gnuoy) wrote :

I have tried to recreate this and was unable to. (The contents below are also replicated in a pastebin: https://paste.ubuntu.com/p/Xh6HGGMqBm/ )

(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment create segment1 auto COMPUTE
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2020-06-26T08:50:43.000000           |
| updated_at      | None                                 |
| uuid            | b99b7779-03b9-4e64-8eb4-d108c1ac1e32 |
| name            | segment1                             |
| description     | None                                 |
| id              | 1                                    |
| service_type    | COMPUTE                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack hypervisor list
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
|  1 | node-flyer.maas     | QEMU            | 10.245.168.33 | up    |
|  2 | node-jaeger.maas    | QEMU            | 10.245.168.31 | up    |
|  3 | node-ritz.maas      | QEMU            | 10.245.168.32 | up    |
+----+---------------------+-----------------+---------------+-------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment host create node-flyer.maas COMPUTE SSH b99b7779-03b9-4e64-8eb4-d108c1ac1e32
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2020-06-26T08:58:25.000000           |
| updated_at          | None                                 |
| uuid                | 14a6bab9-0901-4782-8cf2-72dbd4ca13af |
| name                | node-flyer.maas                      |
| type                | COMPUTE                              |
| control_attributes  | SSH                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | b99b7779-03b9-4e64-8eb4-d108c1ac1e32 |
+---------------------+--------------------------------------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment host create node-jaeger.maas COMPUTE SSH b99b7779-03b9-4e64-8eb4-d108c1ac1e32
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2020-06-26T08:59:13.000000           |
| updated_at          | None                                 |
| uuid                | f80c6473-889c-4271-8c90-e3ca82d0e4e2 |
| name                | node-jaeger.maas                     |
| type                | COMPUTE                              |
...


Revision history for this message
Liam Young (gnuoy) wrote :

Hosts rebooting was observed during a recent test run. Note that this was not connected to the post-deployment creation of segments mentioned in comment #6.

A possible work-around is to set enable-stonith=False on the pacemaker-remote charm during the deployment and then set it to True after deployment.
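
A minimal sketch of that sequence (assuming the file and application names used in this report; juju-wait is an optional plugin, and any equivalent wait on juju status works):

$ juju deploy ./bundle.yaml --overlay ./overlay-masakari.yaml  # overlay sets enable-stonith=False
$ juju-wait                                                    # wait until all units are active/idle
$ juju config pacemaker-remote enable-stonith=True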

Changed in charm-pacemaker-remote:
status: Incomplete → Confirmed
Liam Young (gnuoy)
Changed in charm-pacemaker-remote:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Urgh, sorry for the noise; the bug did not occur in the recent test run, so this issue still has not been reproduced.

Changed in charm-pacemaker-remote:
assignee: Liam Young (gnuoy) → nobody
Ryan Beisner (1chb1n)
Changed in charm-pacemaker-remote:
status: Confirmed → Incomplete
Liam Young (gnuoy)
Changed in charm-masakari:
status: Incomplete → Invalid
Changed in charm-masakari-monitors:
status: Incomplete → Invalid
Changed in charm-pacemaker-remote:
status: Incomplete → Invalid
Changed in charm-hacluster:
status: New → Confirmed
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

I saw this on a recent deploy and was able to reproduce it using a stub MAAS server. The problem appears to be that there is a window between a pacemaker remote resource being added and the location properties for that resource being added. In this window the resource is down and pacemaker fences the node.

The charm currently does:

1) Set stonith-enabled=true cluster property
2) Add a MAAS stonith device that controls a pacemaker remote node that has not yet been added.
3) Add pacemaker remote node
4) Add pacemaker location rules.

I think the following two fixes are needed:

For initial deploys, update the charm so it does not enable stonith until stonith resources and pacemaker remotes have been added.

For scale-out, do not add the new pacemaker remote stonith resource until the corresponding pacemaker remote resource has been added along with its location rules (see the sketch below).
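
Illustratively, in crm shell terms (a sketch only; the resource names and elided parameters are invented, not the charm's actual code), the safe ordering defers the cluster-wide stonith property until last:

# Racy order used today: fencing goes live before the remote node exists
$ sudo crm configure property stonith-enabled=true
$ sudo crm configure primitive st-node1 stonith:external/maas params ...  # device for a not-yet-added remote
# <-- window: the remote resource is down here and pacemaker fences the host

# Proposed order: add the remote resource and its location rules first
$ sudo crm configure primitive node1 ocf:pacemaker:remote params ...
$ sudo crm configure location loc-st-node1 st-node1 ...
$ sudo crm configure property stonith-enabled=true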

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749686

Changed in charm-hacluster:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/749686
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=e02c6257ae127c86d5c5bc41045b6cd841a46fbe
Submitter: Zuul
Branch: master

commit e02c6257ae127c86d5c5bc41045b6cd841a46fbe
Author: Liam Young <email address hidden>
Date: Wed Sep 9 09:30:46 2020 +0000

    Fix adding of stonith controlled resources.

    There appears to be a window between a pacemaker remote resource
    being added and the location properties for that resource being
    added. In this window the resource is down and pacemaker may fence
    the node.

    The window is present because the charm currently does:

    1) Set stonith-enabled=true cluster property
    2) Add maas stonith device that controls pacemaker remote node that
       has not yet been added.
    3) Add pacemaker remote node
    4) Add pacemaker location rules.

    I think the following two fixes are needed:

    1) For initial deploys update the charm so it does not enable stonith
       until stonith resources and pacemaker remotes have been added.

    2) For scale-out do not add the new pacemaker remote stonith resource
       until the corresponding pacemaker resource has been added along
       with its location rules.

    Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
    Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490
    Closes-Bug: #1884284

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
milestone: none → 20.10
Changed in charm-hacluster:
status: Fix Committed → Fix Released