hosts rebooting on install of masakari charms/bundle

Bug #1884284 reported by Syed Mohammad Adnan Karim
Affects                            Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm         Fix Released  Undecided   Liam Young   20.10
OpenStack Masakari Charm           Invalid       Undecided   Unassigned
OpenStack Masakari Monitors Charm  Invalid       Undecided   Unassigned
OpenStack Pacemaker Remote Charm   Invalid       Undecided   Unassigned

Bug Description

Upon installation of the masakari charms (masakari, masakari-monitors, pacemaker-remote, haproxy), I found that all the compute hosts were being rebooted one by one as they joined the haproxy-masakari clone-set/cluster.

This is probably due to enable-stonith: True being set in the bundle.
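
The effective value can be confirmed on the deployed model (assuming the application is named pacemaker-remote, as in the overlay):

$ juju config pacemaker-remote enable-stonith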

overlay-masakari.yaml: https://pastebin.canonical.com/p/rr3ZdhgRHh/
full-bundle.yaml: https://pastebin.canonical.com/p/6HMndPrYPQ/

The overlay-masakari.yaml was used to deploy the masakari charms onto an already-deployed bionic-stein OpenStack cloud.

The full-bundle.yaml was used to redeploy the cloud once the masakari functionality and configurations were verified to work properly.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please upload a sanitized bundle and a juju-crashdump for this deployment. Also include a more detailed description of the operator/user steps taken to reach the error condition. Thank you.

Changed in charm-pacemaker-remote:
status: New → Incomplete
Changed in charm-masakari-monitors:
status: New → Incomplete
Changed in charm-masakari:
status: New → Incomplete
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Please avoid pastebins, as they are ephemeral and may limit visibility for upstream developers, who are important in the triage and collaboration processes. Thanks again.

Syed Mohammad Adnan Karim (karimsye)
description: updated
Revision history for this message
Syed Mohammad Adnan Karim (karimsye) wrote :

First I deployed the cloud using juju deploy ./bundle-sanitized.yaml.
After all the charms reached an active state in juju status, I created a failover segment with:

$ openstack segment create segment1

Then I started to add all the compute hosts into the above segment with:

$ openstack segment host create <hostname> segment1

After adding all the nodes, I noticed them rebooting one by one as they joined the haproxy-masakari clone-set/cluster.

Revision history for this message
Liam Young (gnuoy) wrote :

I have tried to recreate this and was unable to. (The contents below are also replicated in a pastebin: https://paste.ubuntu.com/p/Xh6HGGMqBm/ )

(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment create segment1 auto COMPUTE
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2020-06-26T08:50:43.000000           |
| updated_at      | None                                 |
| uuid            | b99b7779-03b9-4e64-8eb4-d108c1ac1e32 |
| name            | segment1                             |
| description     | None                                 |
| id              | 1                                    |
| service_type    | COMPUTE                              |
| recovery_method | auto                                 |
+-----------------+--------------------------------------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack hypervisor list
+----+---------------------+-----------------+---------------+-------+
| ID | Hypervisor Hostname | Hypervisor Type | Host IP       | State |
+----+---------------------+-----------------+---------------+-------+
|  1 | node-flyer.maas     | QEMU            | 10.245.168.33 | up    |
|  2 | node-jaeger.maas    | QEMU            | 10.245.168.31 | up    |
|  3 | node-ritz.maas      | QEMU            | 10.245.168.32 | up    |
+----+---------------------+-----------------+---------------+-------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment host create node-flyer.maas COMPUTE SSH b99b7779-03b9-4e64-8eb4-d108c1ac1e32
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2020-06-26T08:58:25.000000           |
| updated_at          | None                                 |
| uuid                | 14a6bab9-0901-4782-8cf2-72dbd4ca13af |
| name                | node-flyer.maas                      |
| type                | COMPUTE                              |
| control_attributes  | SSH                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | b99b7779-03b9-4e64-8eb4-d108c1ac1e32 |
+---------------------+--------------------------------------+
(clients) jenkins@mosci-amd64-jslave-persist-840:~/charm-test-infra$ openstack segment host create node-jaeger.maas COMPUTE SSH b99b7779-03b9-4e64-8eb4-d108c1ac1e32
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2020-06-26T08:59:13.000000           |
| updated_at          | None                                 |
| uuid                | f80c6473-889c-4271-8c90-e3ca82d0e4e2 |
| name                | node-jaeger.maas                     |
| type                | COMPUTE                              |
...


Revision history for this message
Liam Young (gnuoy) wrote :

Hosts rebooting was observed during a recent test run. Note that this was not connected to the post-deployment creation of segments mentioned in comment #6.

A possible work-around is to set enable-stonith=False on the pacemaker-remote charm during the deployment and then set it to True after deployment.
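
A minimal sketch of that sequence (assuming the file and application names used in this report; juju-wait is an optional plugin, and any equivalent wait on juju status works):

$ juju deploy ./bundle.yaml --overlay ./overlay-masakari.yaml  # overlay sets enable-stonith=False
$ juju-wait                                                    # wait until all units are active/idle
$ juju config pacemaker-remote enable-stonith=True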

Changed in charm-pacemaker-remote:
status: Incomplete → Confirmed
Liam Young (gnuoy)
Changed in charm-pacemaker-remote:
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

Urgh, sorry for the noise; the bug did not occur in the recent test run, so this issue still has not been reproduced.

Changed in charm-pacemaker-remote:
assignee: Liam Young (gnuoy) → nobody
Ryan Beisner (1chb1n)
Changed in charm-pacemaker-remote:
status: Confirmed → Incomplete
Liam Young (gnuoy)
Changed in charm-masakari:
status: Incomplete → Invalid
Changed in charm-masakari-monitors:
status: Incomplete → Invalid
Changed in charm-pacemaker-remote:
status: Incomplete → Invalid
Changed in charm-hacluster:
status: New → Confirmed
assignee: nobody → Liam Young (gnuoy)
Revision history for this message
Liam Young (gnuoy) wrote :

I saw this on a recent deploy and was able to reproduce it using a stub MAAS server. The problem appears to be that there is a window between a pacemaker remote resource being added and the location properties for that resource being added. In this window the resource is down and pacemaker fences the node.

The charm currently does:

1) Set stonith-enabled=true cluster property
2) Add a MAAS stonith device that controls a pacemaker remote node that has not yet been added.
3) Add pacemaker remote node
4) Add pacemaker location rules.

I think the following two fixes are needed:

For initial deploys, update the charm so it does not enable stonith until stonith resources and pacemaker remotes have been added.

For scale-out, do not add the new pacemaker remote stonith resource until the corresponding pacemaker remote resource has been added along with its location rules (see the sketch below).
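
Illustratively, in crm shell terms (a sketch only; the resource names and elided parameters are invented, not the charm's actual code), the safe ordering defers the cluster-wide stonith property until last:

# Racy order used today: fencing goes live before the remote node exists
$ sudo crm configure property stonith-enabled=true
$ sudo crm configure primitive st-node1 stonith:external/maas params ...  # device for a not-yet-added remote
# <-- window: the remote resource is down here and pacemaker fences the host

# Proposed order: add the remote resource and its location rules first
$ sudo crm configure primitive node1 ocf:pacemaker:remote params ...
$ sudo crm configure location loc-st-node1 st-node1 ...
$ sudo crm configure property stonith-enabled=true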

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/749686

Changed in charm-hacluster:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/749686
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=e02c6257ae127c86d5c5bc41045b6cd841a46fbe
Submitter: Zuul
Branch: master

commit e02c6257ae127c86d5c5bc41045b6cd841a46fbe
Author: Liam Young <email address hidden>
Date: Wed Sep 9 09:30:46 2020 +0000

    Fix adding of stonith controlled resources.

    There appears to be a window between a pacemaker remote resource
    being added and the location properties for that resource being
    added. In this window the resource is down and pacemaker may fence
    the node.

    The window is present because the charm currently does:

    1) Set stonith-enabled=true cluster property
    2) Add maas stonith device that controls pacemaker remote node that
       has not yet been added.
    3) Add pacemaker remote node
    4) Add pacemaker location rules.

    I think the following two fixes are needed:

    1) For initial deploys update the charm so it does not enable stonith
       until stonith resources and pacemaker remotes have been added.

    2) For scale-out do not add the new pacemaker remote stonith resource
       until the corresponding pacemaker resource has been added along
       with its location rules.

    Depends-On: Ib8a667d0d82ef3dcd4da27e62460b4f0ce32ee43
    Change-Id: I7e2f568d829f6d0bfc7859a7d0ea239203bbc490
    Closes-Bug: #1884284

Changed in charm-hacluster:
status: In Progress → Fix Committed
Changed in charm-hacluster:
milestone: none → 20.10
Changed in charm-hacluster:
status: Fix Committed → Fix Released