nova-manage: db sync failure leaves DB in an inconsistent state

Bug #1641609 reported by Alexei Sheplyakov on 2016-11-14
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Confirmed
Importance: Wishlist
Assigned to: MOS Nova (mos-nova)
Milestone: 11.0

Bug Description

Steps to reproduce:

1. Start deploying a cluster (one controller node and one compute node is enough)
2. Manually interrupt nova-manage db sync
3. Try to re-run nova-manage db sync

The expected result: the 2nd run of nova-manage db sync completes successfully

The actual result: the 2nd run of nova-manage db sync fails (backtrace: [1]).
Thus a failed db sync leaves the cluster in a badly broken state.
Recovering from such a state requires manual inspection of tables and selectively running
DB migration scripts to add missing tables, columns, constraints, etc.
This is not a big deal for a new deployment (one can wipe out all tables and start over),
however this is a real disaster for upgrades.

Notes: nova-manage db sync expects DDL statements to be transactional (which is quite natural);
however, they are not transactional in MySQL (the default DB in MOS).
Therefore migration scripts can't trust the version recorded in the DB and should explicitly
check whether all tables/columns/constraints corresponding to that version actually exist.
This involves using 'CREATE TABLE IF NOT EXISTS' (as opposed to plain 'CREATE TABLE'), and so on.

[1]
root@node-1:~# nova-manage db sync
Option "verbose" from group "DEFAULT" is deprecated for removal. Its value may be silently ignored in the future.
Option "notification_driver" from group "DEFAULT" is deprecated. Use option "driver" from group "oslo_messaging_notifications".
Option "notification_topics" from group "DEFAULT" is deprecated. Use option "topics" from group "oslo_messaging_notifications".
error: (_mysql_exceptions.OperationalError) (1050, "Table 'instances' already exists") [SQL: u"\nCREATE TABLE instances (\n\tcreated_at DATETIME, \n\tupdated_at DATETIME, \n\tdeleted_at DATETIME, \n\tid INTEGER NOT NULL AUTO_INCREMENT, \n\tinternal_id INTEGER, \n\tuser_id VARCHAR(255), \n\tproject_id VARCHAR(255), \n\timage_ref VARCHAR(255), \n\tkernel_id VARCHAR(255), \n\tramdisk_id VARCHAR(255), \n\tlaunch_index INTEGER, \n\tkey_name VARCHAR(255), \n\tkey_data MEDIUMTEXT, \n\tpower_state INTEGER, \n\tvm_state VARCHAR(255), \n\tmemory_mb INTEGER, \n\tvcpus INTEGER, \n\thostname VARCHAR(255), \n\thost VARCHAR(255), \n\tuser_data MEDIUMTEXT, \n\treservation_id VARCHAR(255), \n\tscheduled_at DATETIME, \n\tlaunched_at DATETIME, \n\tterminated_at DATETIME, \n\tdisplay_name VARCHAR(255), \n\tdisplay_description VARCHAR(255), \n\tavailability_zone VARCHAR(255), \n\tlocked BOOL, \n\tos_type VARCHAR(255), \n\tlaunched_on MEDIUMTEXT, \n\tinstance_type_id INTEGER, \n\tvm_mode VARCHAR(255), \n\tuuid VARCHAR(36), \n\tarchitecture VARCHAR(255), \n\troot_device_name VARCHAR(255), \n\taccess_ip_v4 VARCHAR(39), \n\taccess_ip_v6 VARCHAR(39), \n\tconfig_drive VARCHAR(255), \n\ttask_state VARCHAR(255), \n\tdefault_ephemeral_device VARCHAR(255), \n\tdefault_swap_device VARCHAR(255), \n\tprogress INTEGER, \n\tauto_disk_config BOOL, \n\tshutdown_terminate BOOL, \n\tdisable_terminate BOOL, \n\troot_gb INTEGER, \n\tephemeral_gb INTEGER, \n\tcell_name VARCHAR(255), \n\tnode VARCHAR(255), \n\tdeleted INTEGER, \n\tlocked_by ENUM('owner','admin'), \n\tcleaned INTEGER, \n\tPRIMARY KEY (id), \n\tCHECK (locked IN (0, 1)), \n\tCHECK (auto_disk_config IN (0, 1)), \n\tCHECK (shutdown_terminate IN (0, 1)), \n\tCHECK (disable_terminate IN (0, 1))\n)ENGINE=InnoDB CHARSET=utf8\n\n"]

Oleksiy Molchanov (omolchanov) wrote :

Nova team, can you comment here?

Changed in fuel:
assignee: nobody → MOS Nova (mos-nova)
Roman Podoliaka (rpodolyaka) wrote :

It's not Nova-specific; the problem is actually the same for *all* OpenStack projects. MySQL does not support transactional DDL: each DDL statement like CREATE TABLE implicitly issues a COMMIT, ending the ongoing DB transaction, so one can no longer ROLLBACK the previous changes in case of errors. If application of migration scripts fails in the middle for whatever reason (e.g. "2006 MySQL server has gone away"), the DB is potentially left in an inconsistent state: some of the script's actions may already have been performed (e.g. tables created) while the schema version number has not been updated yet. The next time the migration scripts run, alembic or sqlalchemy-migrate will execute the very same script and may try to create tables that already exist.

While the best solution is transactional DDL, which would guarantee the DB is always in a consistent state in case of errors, that does not look likely to happen anytime soon (PostgreSQL supports it, but MySQL Galera is one of the cornerstones of the MOS reference architecture).

We could try to patch all the existing migration scripts in OpenStack to add conditions to DDL statements (like CREATE TABLE IF NOT EXISTS), but that's going to be a cross-project effort, which is potentially error-prone.

FWIW, I do not think the problem is that severe: all new migration scripts are usually written with conditional statements to ensure they are idempotent even in the absence of support from the RDBMS side (at least in Nova), so this should mitigate problems on upgrade of an existing schema. And for new deployments, one can always drop the existing schema completely (as it's empty — there are no rows in the tables yet) and re-create it from scratch by means of migration scripts.

The bottom line is, I see this as a valid request for improvement, but not as a severe issue.

tags: added: area-mos
Changed in fuel:
status: New → Confirmed
importance: High → Wishlist
Alexei Sheplyakov (asheplyakov) wrote :

> And for new deployments, one can always drop the existing schema completely (as it's empty — there are no rows in the tables yet) and re-create it from scratch by means of migration scripts.

This requires manual intervention and is not acceptable.
(Leaving aside that it's next to impossible to tell that a deployment failed due to a db sync error, given Fuel's overly generic error messages.)

Changed in fuel:
importance: Wishlist → High
Roman Podoliaka (rpodolyaka) wrote :

Alexei,

I don't see this happening in 9.2 in downstream. From my perspective it makes little sense to only try to "harden" Nova migration scripts, as all OpenStack projects potentially suffer from the very same problem.

Changed in fuel:
assignee: MOS Nova (mos-nova) → nobody
Dmitry Mescheryakov (dmitrymex) wrote :

Alexey,

That is a rather rare case which is very hard to fix. So while technically the bug is valid, it is irrational to spend time on it. In case a user experiences it, our verdict is "the issue should be manually resolved by operations people".

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
status: Invalid → New
Alexei Sheplyakov (asheplyakov) wrote :

Dmitry,

> That is a rather rare case

It happens once in 4–6 deployment attempts. I don't think that qualifies as "rather rare".

> which is very hard to fix

Just because the bug is "hard to fix" doesn't mean it's invalid.

> it is irrational to spend time on it

It's irrational to spend time arguing that bugs should not be fixed.

> In case a user experiences it, our verdict is "the issue should be manually resolved by operations people".

In reality these issues get resolved as "bye-bye Fuel, we'll pay our $$$ to Amazon/Red Hat/Canonical".
And until you guys take the quality of the product seriously, it's going to happen that way over and over again.

Changed in fuel:
status: New → Confirmed
importance: High → Wishlist
assignee: nobody → MOS Nova (mos-nova)
milestone: 9.2 → 11.0