percona-cluster with all nodes down doesn't properly start up without intervention

Bug #1744393 reported by Drew Freiberger on 2018-01-19
This bug affects 4 people
Affects: OpenStack percona-cluster charm
Importance: High
Assigned to: David Ames

Bug Description

We recently had a cloud-down issue caused by a hard power failure on all nodes of the cloud at the same point in time. This took all three of the percona-cluster units offline, and they all powered back up within a minute of each other. When found, all nodes were running in non-primary mode and refusing connections from the OpenStack services.

All three units noted this within 10 seconds of each other:

180108 16:25:31 [Note] WSREP: Setting initial position to 72d2bae9-5df8-11e6-bb62-cb546a1bb47f:930203610

This was then followed by https://pastebin.ubuntu.com/26418976/

180108 16:26:01 [Warning] WSREP: no nodes coming from prim view, prim not possible

and

180108 16:26:12 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
  at gcomm/src/pc.cpp:connect():141
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
180108 16:26:12 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1292: Failed to open channel 'juju_cluster' at 'gcomm://10.28.2.244,10.28.2.194,10.28.2.226': -110 (Connection timed out)

Both messages seem to be important pieces of this issue.

What I'm wondering is: if all three nodes of a percona-cluster determine that they are starting from the same initial position, why doesn't the cluster elect a primary member automatically?
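
For context, Galera persists each node's last known state in grastate.dat under the MySQL data directory; after an unclean shutdown the saved seqno is typically -1 on every node, so no node considers itself authoritative and none will form a new primary component on its own. A rough illustration of what grastate.dat tends to look like in this state (path and exact fields depend on the PXC/Galera version; the uuid below is simply the cluster state UUID from the log above):

    # cat /var/lib/mysql/grastate.dat
    # GALERA saved state
    version: 2.1
    uuid:    72d2bae9-5df8-11e6-bb62-cb546a1bb47f
    seqno:   -1

With every node in this state, each one waits indefinitely for an existing primary view to join, which is consistent with the "no nodes coming from prim view, prim not possible" warning.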

This resulted in extended cloud downtime, and manual recovery of many of the nova/neutron services was necessary after manually re-forming the percona-cluster.

Felipe Reyes (freyes) on 2018-01-22
tags: added: sts
Mario Splivalo (mariosplivalo) wrote:

Percona can't automatically recover from an all-node power failure; manual intervention is needed.

https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

Scenario 6 in the link above explains recovery from an all-node failure.
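
For reference, a hedged sketch of that recovery procedure (service names, paths and flags vary with the PXC version and init system; the Percona article above is the authoritative source):

    # On each node, recover the last committed position from the InnoDB logs:
    mysqld_safe --wsrep-recover
    grep 'Recovered position' /var/log/mysql/error.log | tail -n1

    # On the node with the highest recovered seqno, bootstrap a new cluster:
    service mysql bootstrap-pxc
    # (or, on systemd-based installs: systemctl start mysql@bootstrap.service)

    # Start MySQL normally on the remaining nodes; they rejoin via IST/SST:
    service mysql start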

Changed in charm-percona-cluster:
status: New → Invalid
Changed in charm-percona-cluster:
status: Invalid → Triaged
importance: Undecided → Wishlist
assignee: nobody → Aymen Frikha (aym-frikha)
milestone: none → 18.08
Changed in charm-percona-cluster:
status: Triaged → In Progress
James Page (james-page) on 2018-09-12
Changed in charm-percona-cluster:
milestone: 18.08 → 18.11
James Page (james-page) on 2018-11-20
Changed in charm-percona-cluster:
milestone: 18.11 → 19.04
David Ames (thedac) on 2019-04-17
Changed in charm-percona-cluster:
milestone: 19.04 → 19.07
Changed in charm-percona-cluster:
assignee: Aymen Frikha (aym-frikha) → nobody
status: In Progress → Confirmed
David Ames (thedac) on 2019-06-03
Changed in charm-percona-cluster:
assignee: nobody → David Ames (thedac)
importance: Wishlist → High
status: Confirmed → In Progress
tags: added: reboot-fail

Fix proposed to branch: master
Review: https://review.opendev.org/670163

Reviewed: https://review.opendev.org/670163
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=b97a0971c22f129b71674a22a65080a65c96af76
Submitter: Zuul
Branch: master

commit b97a0971c22f129b71674a22a65080a65c96af76
Author: David Ames <email address hidden>
Date: Wed Jul 10 12:01:06 2019 -0700

    Bootstrap action after a cold boot

    After a cold boot, percona-cluster will require administrative
    intervention. One node will need to bootstrap per upstream
    Percona Cluster documentation:
    https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/

    This change adds an action to bootstrap a single node. On the other
    nodes systemd will be attempting to start percona. Once the bootstrapped
    node is up the others will join automatically.

    Change-Id: Id9a860edc343ee5dbd7fc8c5ce3b4420ec6e134e
    Partial-Bug: #1744393
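
For illustration, running the new action on a deployment would look roughly like this (Juju 2.x syntax; the unit number is an example, pick whichever unit you want to bootstrap):

    juju run-action percona-cluster/0 bootstrap-pxc --wait
    juju status percona-cluster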

Reviewed: https://review.opendev.org/670675
Committed: https://git.openstack.org/cgit/openstack/charm-percona-cluster/commit/?id=b8c2213dfbd4ae417be95a8ce1b1c973eee9e55c
Submitter: Zuul
Branch: master

commit b8c2213dfbd4ae417be95a8ce1b1c973eee9e55c
Author: David Ames <email address hidden>
Date: Fri Jul 12 16:16:46 2019 -0700

    Notify bootstrapped action

    It turns out a subsequent required step after a cold boot bootstrap is
    notifying the cluster of the new bootstrap UUID.

    The notify-bootstrapped action should be run on a different node than
    the one which ran the bootstrap-pxc action.

    This action will ensure the cluster converges on the correct bootstrap
    UUID.

    A subsequent patch stacked on this one will include tests for the new
    cold boot actions.

    Change-Id: Idee12d5f7e28498c5ab6ccb9605f751c6427ac30
    Partial-Bug: #1744393
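
Continuing the illustration above, the follow-up action is then run on a different unit from the one that was bootstrapped (again Juju 2.x syntax with example unit numbers):

    juju run-action percona-cluster/1 notify-bootstrapped --wait
    juju status percona-cluster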

David Ames (thedac) on 2019-08-12
Changed in charm-percona-cluster:
milestone: 19.07 → 19.10
David Ames (thedac) on 2019-10-24
Changed in charm-percona-cluster:
milestone: 19.10 → 20.01
tags: added: cold-start