N-O upgrades do not wait for galera to be fully up

Bug #1668372 reported by Michele Baldessari
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Michele Baldessari

Bug Description

I have observed the following during one of my HA N->O upgrade runs (note the lost connection):
1. Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: TASK [Setup cell_v2 (migrate hosts)] *******************************************
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["nova-manage", "cell_v2", "map_cell_and_hosts"], "delta": "0:00:04.102342", "end": "2017-02-27 18:35:17.504998", "failed": true, "rc": 1, "start": "2017-02-27 18:35:13.402656", "stderr": "", "stdout": "An error has occurred:\nTraceback (most recent call last):\n File \"/usr/lib/python2.7/site-packages/nova/cmd/manage.py\", line 1594, in main\n ret = fn(*fn_args, **fn_kwargs)\n File \"/usr/lib/python2.7/site-packages/nova/cmd/man
....
Feb 27 18:35:17 overcloud-controller-0 os-collect-config[2078]: _query_result\n result.read()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1312, in read\n first_packet = self.connection._read_packet()\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 971, in _read_packet\n packet_header = self._read_bytes(4)\n File \"/usr/lib/python2.7/site-packages/pymysql/connections.py\", line 1008, in _read_bytes\n 2013, \"Lost connection to MySQL server during query\")\nDBConnectionError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: u'INSERT INTO cell_mappings (created_at, updated_at, uuid, name, transport_url, database_connection) VALUES (%(created_at)s, %(updated_at)s, %(uuid)s, %(name)s, %(transport_url)s, %(database_connection)s)'] [parameters: {'database_connection': u'mysql+pymysql://nova:N8AUkJGgVewYzdCdC6rPTfr8B@172.16.2.11/nova?bind_address=172.16.2.13', 'name': None, 'transport_url': u'rabbit://guest:<email address hidden>:5672,guest:9bakaEc<email address hidden>:5672,guest:<email address hidden>:5672/?ssl=0', 'created_at': datetime.datetime(2017, 2, 27, 18, 35, 16, 812201), 'updated_at': None, 'uuid': 'ce8d3b0d-a969-4de4-82af-67e3bd9d11e5'}]", "stdout_lines": ["An error has occurred:", "Traceback (most recent call

2. While Step4 on controller-0 started at:
2017-02-27 18:33:38Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_IN_PROGRESS state changed
and finished at:
2017-02-27 18:34:31Z [overcloud-AllNodesDeploySteps-ozxvts356czy.ControllerUpgrade_Step4]: CREATE_COMPLETE state changed

3. Yet galera was not ready at that point; the first time it reported being started was only afterwards:
galera(galera)[341034]: 2017/02/27_18:35:33 INFO: Galera started
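
To make the ordering above concrete, here is a small sketch that computes the gaps between the three timestamps quoted in this report. It assumes all three share one timezone (the Heat events are marked Z, and the node logs are assumed to be UTC as well), and it truncates the nova-manage start time to whole seconds. The helper name seconds_between is only illustrative.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Timestamps copied from the report above.
# Assumption: all three are in the same timezone (UTC).
step4_complete = datetime.strptime("2017-02-27 18:34:31", FMT)
nova_manage_start = datetime.strptime("2017-02-27 18:35:13", FMT)  # truncated to seconds
galera_started = datetime.strptime("2017-02-27 18:35:33", FMT)


def seconds_between(earlier, later):
    """Return the elapsed time from `earlier` to `later` in seconds."""
    return (later - earlier).total_seconds()


# Step4 was reported complete before nova-manage ran ...
print(seconds_between(step4_complete, nova_manage_start))  # 42.0
# ... but galera only reported "started" after both of them.
print(seconds_between(step4_complete, galera_started))     # 62.0
print(seconds_between(nova_manage_start, galera_started))  # 20.0
```

In other words, under these assumptions Step4 was declared complete about a minute before galera logged that it had started, and the failing nova-manage call ran inside that window.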

So we need to double check that the code that is supposed to wait for all services in puppet/services/pacemaker.yaml is working correctly at Step4. It clearly did not wait for galera to be master everywhere and that is likely what caused this issue.

The cause might be either a) ansible-pacemaker, which needs to make sure that the resource is master on all nodes, *or* b) the fact that here https://github.com/openstack/tripleo-heat-templates/blob/master/puppet/services/pacemaker.yaml#L93 we are missing all the other pacemaker-managed resources.
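
As an illustration of option a), the following is a minimal, generic sketch of "wait until the resource is master on every node". It is not the ansible-pacemaker implementation: the function names, the role strings, the node names, and the injected get_roles callable are all hypothetical, and how the per-node roles would actually be obtained from the cluster is deliberately left out.

```python
import time


def all_nodes_master(roles):
    """True only if at least one node is known and every node reports 'Master'."""
    return bool(roles) and all(role == "Master" for role in roles.values())


def wait_for_master_everywhere(get_roles, timeout=300.0, interval=5.0,
                               sleep=time.sleep, clock=time.monotonic):
    """Poll get_roles() until every node is Master or the timeout expires.

    get_roles is a hypothetical callable returning {node_name: role}.
    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout
    while True:
        if all_nodes_master(get_roles()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


# Usage with canned snapshots standing in for real cluster state
# (node names and roles here are made up for the example).
snapshots = iter([
    {"controller-0": "Master", "controller-1": "Slave"},
    {"controller-0": "Master", "controller-1": "Master"},
])
ok = wait_for_master_everywhere(lambda: next(snapshots),
                                timeout=10, interval=0,
                                sleep=lambda seconds: None)
print(ok)  # True
```

The point of the sketch is that the check succeeds only when every node reports the master role; a check that passes as soon as the resource is started or promoted on a single node would let later steps run while the cluster is still settling.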

Changed in tripleo:
status: New → Triaged
milestone: none → ocata-rc2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/438947

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Triaged → In Progress
Changed in tripleo:
milestone: ocata-rc2 → pike-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/438947
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=841d30549bd27a8b5669955196e14085025dafad
Submitter: Jenkins
Branch: master

commit 841d30549bd27a8b5669955196e14085025dafad
Author: Michele Baldessari <email address hidden>
Date: Tue Feb 28 13:25:59 2017 +0100

    Upgrades: wait for galera to be settled

    We also need to wait for the galera resource to settle down
    before we proceed starting up with the other services.

    Note that before merging this, we need to land the following
    change in ansible-pacemaker:
    https://review.gerrithub.io/#/c/351387/

    Change-Id: Id71c9cb41cfd4c17685c922db2683e28ab7588fd
    Closes-Bug: #1668372

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/444928

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/444928
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bc8dcd1054f95620d69c4595eebb9157df9c7e7b
Submitter: Jenkins
Branch: stable/ocata

commit bc8dcd1054f95620d69c4595eebb9157df9c7e7b
Author: Michele Baldessari <email address hidden>
Date: Tue Feb 28 13:25:59 2017 +0100

    Upgrades: wait for galera to be settled

    We also need to wait for the galera resource to settle down
    before we proceed starting up with the other services.

    Note that before merging this, we need to land the following
    change in ansible-pacemaker:
    https://review.gerrithub.io/#/c/351387/

    D-O is needed for upgrades to work against stable/* branches.
    Depends-On: I712abe71f97c22ee3d55d9db2f641096f8a7350c

    Change-Id: Id71c9cb41cfd4c17685c922db2683e28ab7588fd
    Closes-Bug: #1668372
    (cherry picked from commit 841d30549bd27a8b5669955196e14085025dafad)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.1.0

This issue was fixed in the openstack/tripleo-heat-templates 6.1.0 release.
