N ->O upgrade: after running major-upgrade-composable-steps.yaml nova-api cannot connect to Galera on 2/3 controllers

Bug #1675359 reported by Sofer Athlan-Guyot
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sofer Athlan-Guyot

Bug Description

Reported originally there: https://bugzilla.redhat.com/show_bug.cgi?id=1434955

newton -> ocata upgrade:

after running major-upgrade-composable-steps.yaml nova-api cannot
connect to MySQL on 2/3 controllers. This results in timeouts and 500
ERROR (ClientException): Unknown Error (HTTP 504) when calling the
nova api, making it very difficult to manage the nova instances prior
to upgrading the compute nodes.

Even after running the upgrade converge step, 2/3 controllers cannot
reach MySQL, leaving the upgraded environment in a semi-working state.

Steps to Reproduce:
1. Deploy Newton with 3 controllers, 2 computes, and 3 Ceph nodes
2. Upgrade to Ocata

Actual results:
controller-1 and controller-2 report in /var/log/nova/nova-api.log:

    2017-03-22 17:25:43.189 377885 WARNING oslo_db.sqlalchemy.engines [req-41cd5904-8031-48d1-9e12-56e73c62f5cd - - - - -] SQL connection failed. -123 attempts left.

Running:

    mysql -u nova_api -p -h <galera_vip> -e 'show grants;'

works fine from every node in the cluster, so this is not a network connectivity issue.

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

So after a discussion with Damien and Michele, the problem was found.

It appears that when the nova cell is created, the connection parameters present in /etc/nova/nova.conf on the node where the command is run are hardcoded into the database. Running it on controller-0, for instance, gives you this in the database:

    mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge@10.0.0.19/nova?bind_address=10.0.0.20

This is later used to create a connection to the database for nova cell information. This obviously fails on the 2 other nodes, as they don't have the 10.0.0.20 address.
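The mismatch described above can be checked mechanically. The following is a minimal Python sketch, not part of nova: the function name `bind_address_is_local`, the `local_addresses` argument, and the 10.0.0.21 address of a second controller are assumptions made for illustration, and the password is replaced by a placeholder.

```python
from urllib.parse import parse_qs, urlsplit


def bind_address_is_local(connection_url, local_addresses):
    """Return True when the URL has no bind_address or this node owns it."""
    query = parse_qs(urlsplit(connection_url).query)
    bind = query.get("bind_address")
    if not bind:
        # No bind_address: the connection is usable from any node.
        return True
    return bind[0] in local_addresses


# Connection string as stored in the database (password is a placeholder).
stored = "mysql+pymysql://nova:secret@10.0.0.19/nova?bind_address=10.0.0.20"

# Node that created the cell and owns 10.0.0.20.
on_bootstrap_node = bind_address_is_local(stored, {"10.0.0.20"})
# Another controller; 10.0.0.21 is a hypothetical address.
on_other_node = bind_address_is_local(stored, {"10.0.0.21"})
```

Only the node that owns the address named in `bind_address` passes the check, which matches the 2/3 failure pattern reported here.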

To prevent this issue, a workaround has been applied: https://review.openstack.org/#/c/436192/ removes the bind_address parameter from the configuration line.
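As an illustration of what removing the parameter means for a connection string, here is a small Python sketch. It is not the content of the review linked above; `strip_bind_address` is a hypothetical helper, and the password is a placeholder.

```python
from urllib.parse import urlsplit


def strip_bind_address(connection_url):
    """Return the URL with any bind_address query parameter removed."""
    parts = urlsplit(connection_url)
    # Filter the raw query string so the remaining parameters stay verbatim.
    kept = [
        pair
        for pair in parts.query.split("&")
        if pair and pair.split("=", 1)[0] != "bind_address"
    ]
    return parts._replace(query="&".join(kept)).geturl()


node_specific = "mysql+pymysql://nova:secret@10.0.0.19/nova?bind_address=10.0.0.20"
portable = strip_bind_address(node_specific)
```

The query string is filtered as raw text rather than decoded and re-encoded, so other parameters such as file paths are left untouched.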

The sequence of events seems correct:
 1. update hiera data;
 2. create nova cell with database option.

From the journalctl logs:

    Mar 22 19:23:44 overcloud-controller-0.localdomain os-collect-config[4197]: [2017-03-22 19:23:44,806] (heat-config) [DEBUG] Running /usr/libexec/heat-config/hooks/hiera < /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json
    Mar 22 19:23:44 overcloud-controller-0.localdomain os-collect-config[4197]: [2017-03-22 19:23:44,852] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json < /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.notify.json

    $ grep nova::database_connection /var/lib/heat-config/deployed/e4e9bd8e-4b7b-41da-b040-d4f563f2fd48.json
            "nova::database_connection": "mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge@10.0.0.19/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo",

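The grep above can also be done structurally. The Python sketch below searches a parsed JSON document for a key at any depth. The layout of the deployed heat-config file is not shown in this report, so the `sample` document (including the `datafiles` and `service_configs` names) is a made-up stand-in, and `find_key` is a hypothetical helper.

```python
import json


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON document."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


# Hypothetical stand-in for the deployed JSON file; password is a placeholder.
sample = json.loads("""
{
  "datafiles": {
    "service_configs": {
      "nova::database_connection": "mysql+pymysql://nova:secret@10.0.0.19/nova?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo"
    }
  }
}
""")

connections = list(find_key(sample, "nova::database_connection"))
```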
    [root@overcloud-controller-0 e]# journalctl | grep 'nova-manage cell_v2'
    Mar 22 19:39:52 overcloud-controller-0.localdomain ansible-command[440226]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_cell0 removes=None creates=None chdir=None
    Mar 22 19:39:55 overcloud-controller-0.localdomain ansible-command[440632]: Invoked with warn=True executable=None _uses_shell=True _raw_params=nova-manage cell_v2 create_cell --name='default' --database_connection=$(hiera nova::database_connection) removes=None creates=None chdir=None
    Mar 22 19:40:09 overcloud-controller-0.localdomain ansible-command[443480]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_cell_and_hosts removes=None creates=None chdir=None
    Mar 22 19:40:12 overcloud-controller-0.localdomain ansible-command[443950]: Invoked with warn=True executable=None _uses_shell=True _raw_params=nova-manage cell_v2 list_cells | sed -e '1,3d' -e '$d' | awk -F ' *| *' '$2 == "default" {print $4}' removes=None creates=None chdir=None
    Mar 22 19:40:15 overcloud-controller-0.localdomain ansible-command[444382]: Invoked with warn=True executable=None _uses_shell=False _raw_params=nova-manage cell_v2 map_instances --cell_uuid 7f04f00d-4b9d-478d-941f-d93a7be145e7 removes=None creates=None chdir=None
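For readers unfamiliar with the `sed`/`awk` pipeline in the fourth log line, here is a rough Python equivalent. The table layout in `sample` is an assumption about what `nova-manage cell_v2 list_cells` prints (a bordered table with Name and UUID columns); the UUIDs are the ones that appear elsewhere in this report, and `cell_uuid` is a hypothetical helper.

```python
def cell_uuid(table_text, name="default"):
    """Return the UUID of the named cell from a bordered text table, or None."""
    lines = table_text.strip().splitlines()
    # Skip the top border, header and header border (sed '1,3d'),
    # and the bottom border (sed '$d').
    for line in lines[3:-1]:
        fields = line.split()
        # Whitespace splitting gives: '|', name, '|', uuid, '|'
        # (awk's $2 and $4 are fields[1] and fields[3] here).
        if len(fields) >= 4 and fields[1] == name:
            return fields[3]
    return None


# Assumed output layout, for illustration only.
sample = """
+---------+--------------------------------------+
| Name    | UUID                                 |
+---------+--------------------------------------+
| cell0   | 00000000-0000-0000-0000-000000000000 |
| default | 7f04f00d-4b9d-478d-941f-d93a7be145e7 |
+---------+--------------------------------------+
"""

default_uuid = cell_uuid(sample)
```

The value returned for the `default` cell is what the following `map_instances --cell_uuid` call consumes.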

    INSERT INTO `cell_mappings` VALUES
    ('2017-03-22 19:39:54',NULL,2,'00000000-0000-0000-0000-000000000000','cell0','none:/...


Changed in tripleo:
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/449080

Changed in tripleo:
assignee: nobody → Sofer Athlan-Guyot (sofer-athlan-guyot)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/449093

Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

For the NULL entry, another Launchpad bug has been created: https://bugs.launchpad.net/tripleo/+bug/1675418

Changed in tripleo:
milestone: ongoing → pike-1
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/449080
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=4883e8b229c7fa3b8fe828bd3e06aa16e852d95c
Submitter: Jenkins
Branch: master

commit 4883e8b229c7fa3b8fe828bd3e06aa16e852d95c
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Mar 23 12:10:48 2017 +0100

    [N->O] Fix wrong database connection for cell0 during upgrade.

    During upgrade the cell0 database has the connection pointing to

       mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge@10.0.0.19/nova_cell0?bind_address=10.0.0.20

    where 10.0.0.20 was the ip of the bootstrap node. This makes the
    nova-api fails on 2/3 node at the end of the
    major-upgrade-composable-steps.yaml step.

    We do have the right value in the hiera database so make sure we use
    it for cell0 creation and not the nova.conf file which hasn't been
    updated yet.

    Change-Id: I09775206cb8fc5e15934f7e4475506a7fe17271e
    Closes-Bug: #1675359
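To illustrate the idea in the commit message, the Python sketch below derives a cell0 connection string from the hiera value quoted earlier in this report instead of from nova.conf. It is not the actual patch: `cell0_connection` is a hypothetical helper, the `nova_cell0` database name is taken from the URL in the commit message, and the password is a placeholder.

```python
from urllib.parse import urlsplit


def cell0_connection(main_connection, cell0_db="nova_cell0"):
    """Return the main connection URL with its database name swapped for cell0's."""
    parts = urlsplit(main_connection)
    # Only the path (the database name) changes; host and options are kept.
    return parts._replace(path="/" + cell0_db).geturl()


# Value of nova::database_connection from hiera (password is a placeholder).
hiera_value = (
    "mysql+pymysql://nova:secret@10.0.0.19/nova"
    "?read_default_file=/etc/my.cnf.d/tripleo.cnf&read_default_group=tripleo"
)
cell0_url = cell0_connection(hiera_value)
```

Because the hiera value carries no `bind_address`, the derived URL is usable from every controller.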

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ocata)

Reviewed: https://review.openstack.org/449093
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=d99a06705489b1d63d72b0f2296d7236b3ccf7aa
Submitter: Jenkins
Branch: stable/ocata

commit d99a06705489b1d63d72b0f2296d7236b3ccf7aa
Author: Sofer Athlan-Guyot <email address hidden>
Date: Thu Mar 23 12:10:48 2017 +0100

    [N->O] Fix wrong database connection for cell0 during upgrade.

    During upgrade the cell0 database has the connection pointing to

       mysql+pymysql://nova:c2cdagE8PyAbnpers3AD88Hge@10.0.0.19/nova_cell0?bind_address=10.0.0.20

    where 10.0.0.20 was the ip of the bootstrap node. This makes the
    nova-api fails on 2/3 node at the end of the
    major-upgrade-composable-steps.yaml step.

    We do have the right value in the hiera database so make sure we use
    it for cell0 creation and not the nova.conf file which hasn't been
    updated yet.

    Change-Id: I09775206cb8fc5e15934f7e4475506a7fe17271e
    Closes-Bug: #1675359
    (cherry picked from commit c9c3813b6a0811a262068d0aab28d0bd535be3e1)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 6.1.0

This issue was fixed in the openstack/tripleo-heat-templates 6.1.0 release.
