M/N upgrade fails as the galera cluster doesn't restart.

Bug #1612642 reported by Sofer Athlan-Guyot
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Sofer Athlan-Guyot

Bug Description

Hi,

A Mitaka to Newton upgrade fails because the galera cluster doesn't restart.

We can see:

    + node_states=' galera (ocf::heartbeat:galera): Started overcloud-controller-0
    * galera_start_0 on overcloud-controller-2 '\''not installed'\'' (5): call=240, status=complete, exitreason='\''Datadir /var/lib/mysql doesn'\''t exist'\'',
    * galera_start_0 on overcloud-controller-1 '\''not installed'\'' (5): call=240, status=complete, exitreason='\''Datadir /var/lib/mysql doesn'\''t exist'\'','
    + echo ' galera (ocf::heartbeat:galera): Started overcloud-controller-0
    * galera_start_0 on overcloud-controller-2 '\''not installed'\'' (5): call=240, status=complete, exitreason='\''Datadir /var/lib/mysql doesn'\''t exist'\'',
    * galera_start_0 on overcloud-controller-1 '\''not installed'\'' (5): call=240, status=complete, exitreason='\''Datadir /var/lib/mysql doesn'\''t exist'\'','

And I can confirm that on the nodes other than the bootstrap node, the
/var/lib/mysql directory is completely gone, apparently moved away as the backup directory.
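To check the state described above on each controller, a small helper along these lines could be run locally on every node. This is only a sketch: `datadir_status` is a hypothetical name, and the default path is simply the datadir quoted in the pacemaker error.

```shell
# Hypothetical helper (not part of the upgrade scripts): report whether the
# MariaDB data directory exists on the node where this is run.
datadir_status() {
    # Default to the datadir named in the pacemaker error message.
    local datadir="${1:-/var/lib/mysql}"
    if [ -d "$datadir" ]; then
        echo "present: $datadir"
    else
        echo "missing: $datadir"
        return 1
    fi
}
```

Calling `datadir_status` with no argument on a node that shows the "not installed" failure would be expected to report the directory as missing and return a non-zero status.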

description: updated

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/354713

Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: New → In Progress

Michele Baldessari (michele) wrote :

Hi Sofer,

Can you share more information about the upgrade? Was the galera/mariadb package updated or did it stay the same? Do we have full logs somewhere?

Thanks,
Michele

Changed in tripleo:
importance: Undecided → High
tags: added: upgrade-bugs

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Hi Michele,

Yes, the package was updated, and here are some relevant logs:

    | ID | Name | Status | Task State | Power State | Networks |
    | 862d268f-31a3-4703-bbaa-de7753a7ca92 | overcloud-cephstorage-0 | ACTIVE | - | Running | ctlplane=192.0.2.7 |
    | cf0e99f9-e8b7-4dc7-bc7a-c2fed6ddf178 | overcloud-controller-0 | ACTIVE | - | Running | ctlplane=192.0.2.11 |
    | 9ee5cbdd-dac5-40a2-a4e2-fba4d98e281c | overcloud-controller-1 | ACTIVE | - | Running | ctlplane=192.0.2.10 |
    | 9f66a408-904e-432a-bd78-113f2a07fd31 | overcloud-controller-2 | ACTIVE | - | Running | ctlplane=192.0.2.12 |
    | e0815864-8dee-4d40-8084-9740c011db35 | overcloud-novacompute-0 | ACTIVE | - | Running | ctlplane=192.0.2.9 |
    | 733dfa06-9392-4e41-90f5-d3d5e360bc2d | overcloud-novacompute-1 | ACTIVE | - | Running | ctlplane=192.0.2.8 |

    ** 192.0.2.11 up and running

    overcloud-controller-1: Starting Cluster...
    overcloud-controller-0: Starting Cluster...
    overcloud-controller-2: Starting Cluster...
    Error: cluster is not currently running on this node
    OFFLINE: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    OFFLINE: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    OFFLINE: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    ip-192.0.2.6 has started
    ip-172.16.2.5 has started
    ip-172.16.3.4 has started
    ip-10.0.0.4 has started
    ip-172.16.2.4 has started
    ip-172.16.1.4 has started
    galera has started
    mongod has started
    HTTP/1.1 503 Service Unavailable
    Content-Type: text/plain
    Connection: close
    Content-Length: 36

    Galera cluster node is not synced.
    HTTP/1.1 503 Service Unavailable
    Content-Type: text/plain
    Connection: close
    Content-Length: 36

    ..... (this goes on and on)

    Galera cluster node is not synced.
    HTTP/1.1 503 Service Unavailable
    Content-Type: text/plain
    Connection: close
    Content-Length: 36

    Galera cluster node is not synced.
    ERROR galera sync timed out

I don't have any more logs left for this one, but the pacemaker logs posted above are quite clear about the failure, no?

With the proposed patch I was able to get past this error.
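The repeated 503 responses followed by "galera sync timed out" in the log above look like a poll-until-synced loop hitting its deadline. A minimal sketch of that pattern is shown here; `wait_until` is a hypothetical helper, not the function used by the upgrade scripts, and it assumes the check command exits non-zero while the node is not yet synced.

```shell
# Hypothetical helper: retry a check command until it succeeds or the
# timeout (in seconds) is reached. The interval should be a positive
# number of seconds, otherwise the elapsed-time counter never advances.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR timed out after ${waited}s: $*" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
}
```

For example, `wait_until 600 5 my_sync_check` would retry a sync check every 5 seconds for up to roughly 10 minutes; `my_sync_check` stands in for whatever command reports the galera sync state on the node.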

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/354713
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=514a81eda9bbd31095a0b0a50373a1735b236b52
Submitter: Jenkins
Branch: master

commit 514a81eda9bbd31095a0b0a50373a1735b236b52
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Aug 12 15:09:56 2016 +0200

    M/N upgrade fix galera restart.

    We have to recreate the /var/lib/mysql directory on all controller nodes,
    not just the bootstrap node, for the cluster to be able to restart.

    Adding a warning on the fact that those scripts are local and know
    nothing about the good upgrade state of the other nodes.

    Closes-Bug: 1612642
    Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
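The idea in the commit message, recreating the data directory on every controller rather than only on the bootstrap node, might be sketched as follows. This is an illustration under stated assumptions and not the merged patch: `recreate_datadir` is a hypothetical name, and the mode and owner used here are assumptions supplied by the caller.

```shell
# Hypothetical sketch, not the merged patch: recreate an empty data
# directory if it is missing, leaving an existing directory untouched.
recreate_datadir() {
    local datadir="$1" owner="${2:-}"
    # Nothing to do if the directory is already there.
    [ -d "$datadir" ] && return 0
    mkdir -p "$datadir" || return 1
    chmod 0755 "$datadir"            # assumed mode
    if [ -n "$owner" ]; then
        chown "$owner" "$datadir"    # assumed owner, e.g. mysql:mysql
    fi
}
```

Run as root on each controller, this could be called as `recreate_datadir /var/lib/mysql mysql:mysql`. On SELinux-enabled hosts the security context of the new directory may also need to be restored. As the commit message warns, a script like this is local to one node and knows nothing about the upgrade state of the others.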

Changed in tripleo:
status: In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/397102

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/mitaka)

Reviewed: https://review.openstack.org/397102
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=191df2bf1f2c0ae36be503285e84f81f8173a6b8
Submitter: Jenkins
Branch: stable/mitaka

commit 191df2bf1f2c0ae36be503285e84f81f8173a6b8
Author: Sofer Athlan-Guyot <email address hidden>
Date: Fri Aug 12 15:09:56 2016 +0200

    M/N upgrade fix galera restart.

    We have to recreate the /var/lib/mysql directory on all controller nodes,
    not just the bootstrap node, for the cluster to be able to restart.

    Adding a warning on the fact that those scripts are local and know
    nothing about the good upgrade state of the other nodes.

    Closes-Bug: 1612642
    Change-Id: I48e2812d7df80bbf2db53a8b71dc434d4209a160
    (cherry picked from commit 514a81eda9bbd31095a0b0a50373a1735b236b52)

tags: added: in-stable-mitaka

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 2.2.0

This issue was fixed in the openstack/tripleo-heat-templates 2.2.0 release.
