Instances not visible when hidden=NULL

Bug #1862205 reported by Mohammed Naser
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Critical
Dan Smith

Bug Description

During an upgrade of a cloud from Stein to Train, there is a migration which adds the `hidden` field to the database.

In that migration, it was assumed that it does not backfill all of the columns. However, upon verifying, it actually does backfill all columns and the order of operations *seems* to be:

1. Create new column for `hidden`
2. Update database migration version
3. Start backfilling all existing instances with hidden=0

In my case, the migration did create the column but failed to backfill all existing instances because of the large number of instances. However, running the migrations again seems to simply continue and not block on that migration, but leaving all columns with hidden=NULL.

====================
Feb 06 14:06:13 control02-nova-api-container-f89ad8b4 nova-manage[10596]: 2020-02-06 14:06:13.566 10596 INFO migrate.versioning.api [req-34f0c5a6-2983-4c8e-9b9d-14167851c984 - - - - -] 398 -> 399...
Feb 06 14:07:18 control02-nova-api-container-f89ad8b4 nova-manage[10596]: 2020-02-06 14:07:18.129 10596 ERROR oslo_db.sqlalchemy.exc_filters [req-34f0c5a6-2983-4c8e-9b9d-14167851c984 - - - - -] DBAPIError exception wrapped from (pymysql.err.InternalError) (1180, 'Got error 90 "Message too long" during COMMIT')
Feb 06 14:07:18 control02-nova-api-container-f89ad8b4 nova-manage[10596]: 2020-02-06 14:07:18.132 10596 ERROR oslo_db.sqlalchemy.exc_filters [req-34f0c5a6-2983-4c8e-9b9d-14167851c984 - - - - -] DB exception wrapped.: sqlalchemy.exc.ResourceClosedError: This Connection is closed
Feb 06 14:10:22 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:22.930 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] 398 -> 399...
Feb 06 14:10:22 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:22.985 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] done
Feb 06 14:10:22 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:22.985 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] 399 -> 400...
Feb 06 14:10:22 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:22.995 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] done
Feb 06 14:10:22 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:22.995 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] 400 -> 401...
Feb 06 14:10:23 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:23.145 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] done
Feb 06 14:10:23 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:23.145 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] 401 -> 402...
Feb 06 14:10:23 control02-nova-api-container-f89ad8b4 nova-manage[14139]: 2020-02-06 14:10:23.244 14139 INFO migrate.versioning.api [req-032e5b40-88c9-4f4b-8ab0-525c50389967 - - - - -] done
====================

This issue is two-part, because now it seems that Nova does not assume that hidden=NULL means that the instance is not hidden and no longer displays the instance via API or any other operations.

The "very silly" confirmation of this behaviour of backfilling was my attempt at patching things up resulted in the same error:

==================
MariaDB [nova]> update instances set hidden=0;
ERROR 1180 (HY000): Got error 90 "Message too long" during COMMIT
===================

Ideally, Nova shouldn't try and backfill values and it should treat hidden=NULL as 0.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Some more context:

(11:29:22 AM) mriedem: i sort of remember this discussion with jaypipes on https://review.opendev.org/#/c/557958/
(11:29:26 AM) mriedem: which uses server_default
(11:30:09 AM) dansmith: I guess server_default would be one option, but it involves another schema change which I think is less good since we have to backport this
(11:34:30 AM) mriedem: https://docs.sqlalchemy.org/en/13/core/metadata.html#sqlalchemy.schema.Column.params.default
(11:34:41 AM) mriedem: "which will be invoked upon insert if this column is otherwise not specified in the VALUES clause of the insert"
(11:34:45 AM) mriedem: didn't realize it would backfill though
(11:35:03 AM) dansmith: right, I always assume it fills on read too, but just insert
(11:35:07 AM) mriedem: https://docs.sqlalchemy.org/en/13/core/metadata.html#sqlalchemy.schema.Column.params.server_default - "A FetchedValue instance, str, Unicode or text() construct representing the DDL DEFAULT value for the column."
(11:35:23 AM) mriedem: so yeah, i guess never use default in migrations

tags: added: db
tags: added: upgrade
Dan Smith (danms)
Changed in nova:
importance: Undecided → Critical
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/706331

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/706351

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/706582

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/706331
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=001f3a7bfe6b2c8af135daff8e154a708792070e
Submitter: Zuul
Branch: master

commit 001f3a7bfe6b2c8af135daff8e154a708792070e
Author: Dan Smith <email address hidden>
Date: Thu Feb 6 09:21:38 2020 -0800

    Fix instance.hidden migration and querying

    It was discovered that default= on a Column definition in a schema migration
    will attempt to update the table with the provided value, instead of just
    translating on read, which is often the assumption. The Instance.hidden=False
    change introduced in Train[1] used such a default on the new column, which caused
    at least one real-world deployment to time out rewriting the instances table
    due to size. Apparently SQLAlchemy-migrate also does not consider such a timeout
    to be a failure and proceeds on. The end result is that some existing instances
    in the database have hidden=NULL values, and the DB model layer does not convert
    those to hidden=False when we read/query them, causing those instances to be
    excluded from the API list view.

    This change alters the 399 schema migration to remove the default=False
    specification. This does not actually change the schema, but /will/ prevent
    users who have not yet upgraded to Train from rewriting the table.

    This change also makes the instance_get_all_by_filters() code handle hidden
    specially, including false and NULL in a query for non-hidden instances.

    A future change should add a developer trap test to ensure that future migrations
    do not add default= values to new columns to avoid this situation in the future.

    [1] Iaffb27bd8c562ba120047c04bb62619c0864f594

    Change-Id: Iace3f653b42c20887b40ee0105c8e9a4edeff1f7
    Closes-Bug: #1862205

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/706582
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8363905a6a6d5c8b2488619bdf807c5dc17b2842
Submitter: Zuul
Branch: stable/train

commit 8363905a6a6d5c8b2488619bdf807c5dc17b2842
Author: Dan Smith <email address hidden>
Date: Thu Feb 6 09:21:38 2020 -0800

    Fix instance.hidden migration and querying

    It was discovered that default= on a Column definition in a schema migration
    will attempt to update the table with the provided value, instead of just
    translating on read, which is often the assumption. The Instance.hidden=False
    change introduced in Train[1] used such a default on the new column, which caused
    at least one real-world deployment to time out rewriting the instances table
    due to size. Apparently SQLAlchemy-migrate also does not consider such a timeout
    to be a failure and proceeds on. The end result is that some existing instances
    in the database have hidden=NULL values, and the DB model layer does not convert
    those to hidden=False when we read/query them, causing those instances to be
    excluded from the API list view.

    This change alters the 399 schema migration to remove the default=False
    specification. This does not actually change the schema, but /will/ prevent
    users who have not yet upgraded to Train from rewriting the table.

    This change also makes the instance_get_all_by_filters() code handle hidden
    specially, including false and NULL in a query for non-hidden instances.

    A future change should add a developer trap test to ensure that future migrations
    do not add default= values to new columns to avoid this situation in the future.

    [1] Iaffb27bd8c562ba120047c04bb62619c0864f594

    Change-Id: Iace3f653b42c20887b40ee0105c8e9a4edeff1f7
    Closes-Bug: #1862205
    (cherry picked from commit 001f3a7bfe6b2c8af135daff8e154a708792070e)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.1.1

This issue was fixed in the openstack/nova 20.1.1 release.

Revision history for this message
Tyler Bishop (tyler-bishop) wrote :

This needs backported to the train tag... any user who upgrades will be impacted by this. I was.

Revision history for this message
Dan Smith (danms) wrote :

It was backported almost immediately to stable/train (see the two comments above yours) and released as 20.1.1.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.