NCC looks for old mysql IP after upgrade to innodb charm

Bug #1892904 reported by Michael Quiniola
This bug affects 1 person
Affects: OpenStack Nova Cloud Controller Charm
Status: Fix Released
Importance: High
Assigned to: Unassigned
Milestone: 20.10

Bug Description

After upgrading from percona charm to mysql-innodb-cluster following the guide found at https://docs.openstack.org/project-deploy-guide/charm-deployment-guide/latest/app-series-upgrade-specific-procedures.html#percona-cluster-charm-series-upgrade-to-focal, N-C-C has been looking for the old mysql IP address.

At first the upgrade worked fine and using the openstack CLI client worked. But when accessing via horizon I would get errors in the nova-api log:

Can't connect to MySQL server on '10.0.3.116' ([Errno 113] EHOSTUNREACH)

I couldn't find ANYWHERE this IP address was defined in any charm. The old sql server was removed and the charm was removed from the model.

I completely removed Nova Cloud Controller and redeployed the charm. It is STILL looking for the old IP address, but now it's ALSO affecting the rest of the charm:

2020-08-25 18:04:24 ERROR juju-log shared-db:316: db sync failed
b'ERROR: Could not access cell0.\nHas the nova_api database been created?\nHas the nova_cell0 database been created?\nHas "nova-manage api_db sync" been run?\nHas "nova-manage cell_v2 map_cell0" been run?\nIs [api_database]/connection set in nova.conf?\nIs the cell0 database connection URL correct?\nError: (pymysql.err.OperationalError) (2003, "Can\'t connect to MySQL server on \'10.0.3.116\' ([Errno 113] EHOSTUNREACH)")\n(Background on this error at: http://sqlalche.me/e/e3q8)\n'

I looked through all the config files and can't find ANYWHERE this IP address is still defined.

nova-cloud-controller 21.0.0 (charm: nova-cloud-controller, revision 346)

description: updated
description: updated
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hi Michael,

Thank you for reporting this. You mentioned "At first the upgrade worked fine and using the openstack CLI client worked." Can you expand on that a bit and what may have gone on in between when things were working and then not working?

Thank you,
Corey

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Corey,

I can use the CLI client and interact with all services in openstack without issue.

But I could not access anything via the horizon dashboard. I traced the issue back to errors in Nova-Cloud-Controller; the only error showed up in the nova-api log, and it was the same error as above: it is trying to talk to a missing SQL server.

Revision history for this message
Michael Quiniola (qthepirate) wrote :

I attempted to redeploy the entire charm, but it's still looking for the old SQL server.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I know you mentioned the relation no longer exists, but just to cover that base, can you show the output of juju status --relations?

Revision history for this message
Michael Quiniola (qthepirate) wrote :

No original percona relations exist. I triple-checked.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Some additional notes from IRC discussion with Michael in #openstack-charms:
* he checked all connection strings in /etc/nova/nova.conf and they had 127.0.0.1
* he redeployed nova-cloud-controller, openstack-dashboard, and related memcached with no changes in behavior
* the problem seems to be an issue with nova-c-c. nova-api-wsgi is the only service having this issue.
* he confirmed the steps taken were: deploy bionic-ussuri with percona, then follow the charm guide upgrade steps to get to focal with mysql8. A small caveat is that he hit [1] during his upgrade.

[1] https://bugs.launchpad.net/charm-mysql-router/+bug/1884460

Revision history for this message
Michael Quiniola (qthepirate) wrote :

I'm leaning towards the issue being in juju, because mysqlrouter.conf uses these configurations to pull the IP addresses of the primary/secondary nodes in mysql-innodb-cluster:

[routing:jujuCluster_rw]
bind_address=127.0.0.1
bind_port=3306
socket=/var/lib/mysql/keystone-mysql-router/mysql.sock
destinations=metadata-cache://jujuCluster/?role=PRIMARY
routing_strategy=first-available
protocol=classic

[routing:jujuCluster_ro]
bind_address=127.0.0.1
bind_port=3307
socket=/var/lib/mysql/keystone-mysql-router/mysqlro.sock
destinations=metadata-cache://jujuCluster/?role=SECONDARY
routing_strategy=round-robin-with-fallback
protocol=classic

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Verified it is NOT the mysql-router for nova via this log:

2020-08-25 22:16:58 metadata_cache INFO [7f4eafae7700] Starting Metadata Cache
2020-08-25 22:16:58 metadata_cache INFO [7f4eafae7700] Connections using ssl_mode 'PREFERRED'
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] Starting metadata cache refresh thread
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] Potential changes detected in cluster 'jujuCluster' after metadata refresh
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] Metadata for cluster 'jujuCluster' has 1 replicasets:
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] 'default' (3 members, single-master)
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] 10.0.3.93:3306 / 33060 - mode=RW
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] 10.0.3.91:3306 / 33060 - mode=RO
2020-08-25 22:16:58 metadata_cache INFO [7f4eac2e0700] 10.0.3.96:3306 / 33060 - mode=RO
2020-08-25 22:16:58 routing INFO [7f4eac2e0700] Routing routing:jujuCluster_x_ro listening on 3309 and named socket /var/lib/mysql/nova-mysql-router/mysqlxro.sock got request to disconnect invalid connections: metadata change
2020-08-25 22:16:58 routing INFO [7f4eac2e0700] Routing routing:jujuCluster_x_rw listening on 3308 and named socket /var/lib/mysql/nova-mysql-router/mysqlx.sock got request to disconnect invalid connections: metadata change
2020-08-25 22:16:58 routing INFO [7f4eac2e0700] Routing routing:jujuCluster_ro listening on 3307 and named socket /var/lib/mysql/nova-mysql-router/mysqlro.sock got request to disconnect invalid connections: metadata change
2020-08-25 22:16:58 routing INFO [7f4eac2e0700] Routing routing:jujuCluster_rw listening on 3306 and named socket /var/lib/mysql/nova-mysql-router/mysql.sock got request to disconnect invalid connections: metadata change

Revision history for this message
Michael Quiniola (qthepirate) wrote :

SOLVED! Error was in the nova_api database.

Database: nova_api
Table: cell_mappings

The issue is that the cells did not update to the new cluster, leaving cell0 in an empty state and cell1 in an unusable state.
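
For reference, a minimal sketch of how to check this from the database (pymysql is the same driver nova uses; the host, user, and password below are placeholders, not values from this deployment):

import pymysql

# Sketch only: inspect nova_api.cell_mappings for stale connection strings.
# Replace host/user/password with your deployment's values.
conn = pymysql.connect(host='127.0.0.1', user='nova',
                       password='<NOVA_DB_PASSWORD>', database='nova_api')
try:
    with conn.cursor() as cur:
        cur.execute("SELECT name, database_connection, transport_url "
                    "FROM cell_mappings")
        for name, db_url, transport in cur.fetchall():
            # A row still pointing at the old percona IP (e.g. 10.0.3.116)
            # is what produces the EHOSTUNREACH errors in nova-api.
            print(name, db_url, transport)
finally:
    conn.close()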

Revision history for this message
Corey Bryant (corey.bryant) wrote :

That's great to hear! Somehow I missed this update until now. Is there any chance that you didn't mysqldump and restore-mysqldump the nova_api database?

I attempted to reproduce the issue but wasn't able to. I'm going to attach the steps from my attempt since they may be useful to others. Note that my initial deployment had all services at focal-ussuri except for percona-cluster, which was at bionic-ussuri.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Or perhaps missed the nova_cell0 database?

Revision history for this message
Corey Bryant (corey.bryant) wrote :

e.g. I used:
juju scp /home/ubuntu/mysqldump-nova-nova_api-nova_cell0-glance-cinder-neutron-keystone-placement-202008261837.gz mysql-innodb-cluster/1:/home/ubuntu

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Sorry, wrong command. I used:
juju run-action --wait mysql/0 mysqldump databases=nova,nova_api,nova_cell0,glance,cinder,neutron,keystone,placement

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Michael, For now I'm going to mark this as invalid. If this is a bug please feel free to move back to New status.

Changed in charm-nova-cloud-controller:
status: New → Invalid
Revision history for this message
Michael Quiniola (qthepirate) wrote :

Corey,

I'm not sure honestly.

I did backup and restore all the databases. One thing that should be made clear in ALL documentation is you HAVE to do the databases separately/individually and not allow mysqldump to just do "all databases".

The other thing is that this is a database entry that did not seem to be updated on restore to the new mysql-innodb-cluster.

So, I think the root cause here is the migration from the percona to mysql-innodb charms. I would have to set up a test environment (which I don't have at the moment) to try it again and see if that entry in the cell_mappings table is updated upon restore (which I think it may not be).

If this is true, this could be a HUGE issue for people moving from mysql 5/percona to mysql 8/innodb.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

After some discussion in #openstack-charms the issue is pointing at db connection records that get created for nova cells. e.g. the nova-manage commands seem to create db records for the cell's db connection: https://docs.openstack.org/nova/latest/user/cells.html#setup-of-cells-v2

Michael says to run the same steps as before, then test via horizon while watching the nova-api-wsgi logs, then hop into the nova_api db and look at the cell_mappings table.

Changed in charm-nova-cloud-controller:
status: Invalid → New
Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've confirmed that the nova_cell0 entry in the cell_mappings table of the nova_api database does contain the old database IP address after upgrading and dumping/restoring the database tables. I used the same steps as in my prior attachment, with additional steps to dig into the database; see the new attachment.

Changed in charm-nova-cloud-controller:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Michael Quiniola (qthepirate) wrote :

The current workaround would be to edit the entry to point to 127.0.0.1 prior to linking the N-C-C charm to nova-mysql-router.

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Steps for the workaround:

To be run PRIOR to connecting nova-cloud-controller to nova-mysql-router.

1) Get mysql password:
juju run --unit mysql-innodb-cluster/leader leader-get

2) Log into mysql R/W node:
juju ssh mysql-innodb-cluster/*

3) Log into mysql with password:
mysql -u root -p

4) Find and adjust entry:

mysql> use nova_api;
mysql> select * from cell_mappings;
mysql> update cell_mappings set database_connection = 'mysql+pymysql://nova:<PASSWORD>@127.0.0.1/nova_cell0' where name = 'cell0';
mysql> exit

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Yes, that should work although I think we probably want to use nova-manage to make the update:

sudo nova-manage cell_v2 list_cells

sudo nova-manage cell_v2 update_cell --cell_uuid 00000000-0000-0000-0000-000000000000 --database_connection mysql+pymysql://nova:6fz2bd5Npp8mpqmkPwzhbSHsmVkqYRjk@127.0.0.1/nova_cell0 --transport-url none:///

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Maybe have the command run once the relation to nova-mysql-router is added? Whether or not the original entry is there, it would end up as the same entry either way.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

There's already a function in nova_cc_utils.py called update_cell_database() that calls 'nova-manage cell_v2 update_cell --cell_uuid <cell1_uuid>'.

A few things to note:
  * that function currently only works with cell1
  * --database_connection and --transport-url are required, at least for the cell0 call [1]

[1]
$ sudo nova-manage cell_v2 update_cell --cell_uuid 00000000-0000-0000-0000-000000000000
--transport-url not provided in the command line, using the value [DEFAULT]/transport_url from the configuration file
--database_connection not provided in the command line, using the value [database]/connection from the configuration file
The specified transport_url and/or database_connection combination already exists for another cell with uuid 81b99d13-1c20-4864-9aeb-df21c0227586. # <-- that's cell1

Revision history for this message
Corey Bryant (corey.bryant) wrote :

It should be pretty simple to get update_cell_database() to handle both cell0 and cell1, and pass --database_connection and --transport-url to the update_cell calls (a rough sketch follows the example below). There's also a function called get_cell_details() that puts those details in a dict:

>>> cells['cell0']['db']
'mysql+pymysql://nova:6fz2bd5Npp8mpqmkPwzhbSHsmVkqYRjk@127.0.0.1/nova_cell0'
>>> cells['cell0']['amqp']
'none:///'
>>> cells['cell0']['uuid']
'00000000-0000-0000-0000-000000000000'
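
As a rough sketch of the shape such a call could take (illustrative only; the wrapper name and signature here are assumptions, not the charm's actual update_cell_database()):

import subprocess

def update_cell_connection(cell_uuid, db_connection, transport_url):
    # Sketch only: pass the connection info explicitly so cell0 (and cell1)
    # are updated with the new values instead of whatever the config file
    # or the stale cell_mappings rows contain.
    subprocess.check_call([
        'nova-manage', 'cell_v2', 'update_cell',
        '--cell_uuid', cell_uuid,
        '--database_connection', db_connection,
        '--transport-url', transport_url,
    ])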

Revision history for this message
Corey Bryant (corey.bryant) wrote :

The get_cell_details() output above is the old data, though; the new amqp/db data needs to be pulled from the relations.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hmm, there's actually an update_child_cell_records() function already that looks like it should be taking care of the nova-manage update_cell calls in the way we want, with the --transport-url and --database_connection options. It's called in the following hooks:
* shared-db-cell-relation-changed
* amqp-cell-relation-changed
* nova-cell-api-relation-changed

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Interesting. I wonder why it's not called? Or, if it is called, it hasn't errored or changed anything.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

It appears that update_child_cell() isn't getting called in my upgrade for some reason; at least I'm not seeing any logs from that function, and it is quite verbose with its logging.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

All of the update_child_cell_records() calls end up calling nova_cell_api_relation_changed(). However, nova_cell_api_relation_changed() never calls update_child_cell() unless nova-cell-controller is related. [1]

I think we could fix this by either extending nova_cell_api_relation_changed() to call update_child_cell() for all cells (roughly sketched after the code below), or by expanding update_cell_database(). They seem to duplicate some code.

[1]
@hooks.hook('nova-cell-api-relation-changed')
def nova_cell_api_relation_changed(rid=None, unit=None):
    data = hookenv.relation_get(rid=rid, unit=unit)
    ch_neutron.log("Data: {}".format(data, level=hookenv.DEBUG))
    if not data.get('cell-name'):
        return # <---- short-circuits here because cell-name is Null
    cell_updated = ncc_utils.update_child_cell(
        name=data['cell-name'],
        db_service=data['db-service'],
        amqp_service=data['amqp-service'])
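
As a rough sketch only (the helper name and the 'cells' data here are placeholders, assuming it lives in the same hooks file where ncc_utils is already imported), the first option would look something like:

def update_all_child_cells(cells):
    # Sketch: refresh every known cell's record whenever the db/amqp data
    # changes, rather than returning early when cell-name is unset.
    for cell in cells:
        ncc_utils.update_child_cell(
            name=cell['cell-name'],
            db_service=cell['db-service'],
            amqp_service=cell['amqp-service'])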

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I have a patch up for this at: https://review.opendev.org/#/c/750539/1

With the patch I tested the upgrade from percona-cluster -> mysql-innodb-cluster/mysql-router charms successfully and errored instances are correctly being added to the nova_cell0 database post-upgrade.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-nova-cloud-controller (master)

Reviewed: https://review.opendev.org/750539
Committed: https://git.openstack.org/cgit/openstack/charm-nova-cloud-controller/commit/?id=7983d1aaa958c09c5a39321b1059ba8fd187cf0f
Submitter: Zuul
Branch: master

commit 7983d1aaa958c09c5a39321b1059ba8fd187cf0f
Author: Corey Bryant <email address hidden>
Date: Tue Sep 8 19:40:10 2020 +0000

    Ensure cell0 updated on db/amqp change

    Prior to this change, cell1 was updated when there was a database
    or rabbitmq-server relation change, but cell0 wasn't. Ensure that
    cell0 is also updated.

    Change-Id: I670d0295ea339b21166ef7b18509b04a5beaa959
    Closes-Bug: #1892904

Changed in charm-nova-cloud-controller:
status: In Progress → Fix Committed
Revision history for this message
Michael Quiniola (qthepirate) wrote :

Woo! I was useful!

Changed in charm-nova-cloud-controller:
milestone: none → 20.10
Changed in charm-nova-cloud-controller:
status: Fix Committed → Fix Released