I have observed the following during one of my HA N->O upgrade runs (note the lost connection):
1. Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man
....
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B@172.16.2.11/nova?bind_address=172.16.2.13', 'name': None, 'transport_url': u'rabbit://guest:<email address hidden>:5672,guest:9bakaEc<email address hidden>:5672,guest:<email address hidden>:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call
2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed
3. Yet galera was not ready during that window; the first time it actually reported ready was afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started
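(Note that the resource agent's "Galera started" message does not by itself mean the cluster is writable. A quick way to confirm a node is really usable, and then to re-run the mapping that failed in point 1 by hand, is something like the following sketch; the exact socket/credentials used by mysql and nova-manage are deployment-dependent:)

# Confirm galera on this controller is synced and part of a full cluster.
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_ready'"                # expect: ON
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'"  # expect: Synced
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size'"         # expect: number of controllers

# Once it is, the mapping that failed in point 1 can simply be re-run and verified:
nova-manage cell_v2 map_cell_and_hosts
nova-manage cell_v2 list_cells   # the newly created cell mapping should show up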
So we need to double-check that the code in puppet/services/pacemaker.yaml that is supposed to wait for all services is actually working at Step4. It clearly did not wait for galera to be promoted to master everywhere, and that is most likely what caused this failure.
It might be either a) ansible-pacemaker needs to make sure the resource is master on all nodes, *or* b) at https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 we are missing all the other pacemaker-managed resources.
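Whichever of a) or b) it turns out to be, the Step4 gate has to amount to something like the following before any DB-writing task runs. This is only a rough bash sketch of the kind of check needed, not the actual ansible-pacemaker/THT code; the resource name "galera-master" and a controller count of 3 are assumptions about a typical HA deployment:

# Run as root on a controller: block until galera is promoted on all controllers.
EXPECTED_MASTERS=3
TIMEOUT=600
elapsed=0
while true; do
    # pcs status prints e.g. "Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]"
    masters=$(pcs status | grep -A1 'galera-master' | sed -n 's/.*Masters: \[\(.*\)\]/\1/p' | wc -w)
    [ "${masters:-0}" -ge "$EXPECTED_MASTERS" ] && break
    if [ "$elapsed" -ge "$TIMEOUT" ]; then
        echo "timed out: galera is master on only ${masters:-0}/${EXPECTED_MASTERS} nodes" >&2
        exit 1
    fi
    sleep 10
    elapsed=$((elapsed + 10))
done
echo "galera promoted on all ${EXPECTED_MASTERS} controllers, safe to run DB migrations"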
Ansible-pacemaker change: https://review.gerrithub.io/#/c/350823/
THT change: https://review.openstack.org/438947