archive_deleted_rows_for_table relies on reflection to access the "default" for soft-delete columns, but this is not a server default

Bug #1431571 reported by Mike Bayer
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Timofey Durakov

Bug Description

Running subsets of Nova tests or individual tests within test_db_api reveals a simple error in several of the tests within ArchiveTestCase.

A test such as test_archive_deleted_rows_2_tables attempts the following:

1. places six rows into instance_id_mappings
2. places six rows into instances
3. runs the archive_deleted_rows() routine with a max of 7 rows to archive
4. runs a SELECT of instances and instance_id_mappings, and confirms that only 5 remain.

Running this test directly with PYTHONHASHSEED=random will very easily encounter failures such as:

Traceback (most recent call last):
  File "/Users/classic/dev/redhat/openstack/nova/nova/tests/unit/db/test_db_api.py", line 7869, in test_archive_deleted_rows_2_tables
    self.assertEqual(len(iim_rows) + len(i_rows), 5)
  File "/Users/classic/dev/redhat/openstack/nova/.tox/py27/lib/python2.7/site-packages/testtools/testcase.py", line 350, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/Users/classic/dev/redhat/openstack/nova/.tox/py27/lib/python2.7/site-packages/testtools/testcase.py", line 435, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 8 != 5

or

Traceback (most recent call last):
  File "/Users/classic/dev/redhat/openstack/nova/nova/tests/unit/db/test_db_api.py", line 7872, in test_archive_deleted_rows_2_tables
    self.assertEqual(len(iim_rows) + len(i_rows), 5)
  File "/Users/classic/dev/redhat/openstack/nova/.tox/py27/lib/python2.7/site-packages/testtools/testcase.py", line 350, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/Users/classic/dev/redhat/openstack/nova/.tox/py27/lib/python2.7/site-packages/testtools/testcase.py", line 435, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: 10 != 5

The reason is that the archive_deleted_rows() routine looks for rows in *all* tables, in *non-deterministic order*, i.e. by iterating through "models.__dict__.itervalues()". In the "8 != 5" case, rows are also present in the instance_types table. By PDBing into archive_deleted_rows during the test, we can see here:

ARCHIVED 4 ROWS FROM TABLE instances
ARCHIVED 3 ROWS FROM TABLE instance_types
Traceback (most recent call last):
...
testtools.matchers._impl.MismatchError: 8 != 5

that is, the archiver locates seven rows just between instances and instance_types, then stops. It never even gets to the instance_id_mappings table.
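The budget behavior can be sketched with a toy model (hypothetical row counts; the real routine passes max_rows - rows_archived to each table in turn):

```python
# A toy model of the archiver's budget loop: each table consumes part of the
# max_rows budget in iteration order, so a different (hash-randomized) table
# order leaves different rows behind.
def archive_rows(deleted_per_table, table_order, max_rows):
    archived = []
    budget = max_rows
    for table in table_order:
        n = min(deleted_per_table[table], budget)
        if n:
            archived.append((table, n))
        budget -= n
        if budget == 0:
            break
    return archived

deleted = {"instances": 4, "instance_types": 3, "instance_id_mappings": 6}
# one hash order exhausts the budget before instance_id_mappings is reached...
print(archive_rows(deleted, ["instances", "instance_types", "instance_id_mappings"], 7))
# ...while another order archives from instance_id_mappings first
print(archive_rows(deleted, ["instance_id_mappings", "instances", "instance_types"], 7))
```

Depending on the ordering, the same data and the same max_rows leave different rows behind, which is exactly the non-determinism the test trips over.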

The serious problem with the way this test is designed is that ignoring certain tables, fixing the ordering, or any similar tweak will never keep the test from breaking again whenever a new table is added that contains rows when the test fixtures start.

The only solution to making these tests runnable in their current form is to limit the listing of tables that are searched in archive_deleted_rows; that is, the test needs to inject a fixture into it. The most straightforward way to achieve this would look like this:

 @require_admin_context
-def archive_deleted_rows(context, max_rows=None):
+def archive_deleted_rows(context, max_rows=None, _limit_tablenames_fixture=None):
     """Move up to max_rows rows from production tables to the corresponding
     shadow tables.

@@ -5870,6 +5870,9 @@ def archive_deleted_rows(context, max_rows=None):
         if hasattr(model_class, "__tablename__"):
             tablenames.append(model_class.__tablename__)
     rows_archived = 0
+    if _limit_tablenames_fixture:
+        tablenames = set(tablenames).intersection(_limit_tablenames_fixture)
+
     for tablename in tablenames:
         rows_archived += archive_deleted_rows_for_table(context, tablename,
                                          max_rows=max_rows - rows_archived)

Revision history for this message
Mike Bayer (zzzeek) wrote :

console output illustrating a series of test runs of just the archive tests where non-deterministic failures are demonstrated.

Revision history for this message
Mike Bayer (zzzeek) wrote :

with fix in place, passes consistently:

classics-MacBook-Pro:nova classic$ .tox/py27/bin/py.test nova/tests/unit/db/test_db_api.py -k "ArchiveTestCase" -v
====================================================================================== test session starts =======================================================================================
platform darwin -- Python 2.7.5 -- py-1.4.26 -- pytest-2.6.4 -- /Users/classic/dev/redhat/openstack/nova/.tox/py27/bin/python
collected 629 items

nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_archive_deleted_rows PASSED
nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_archive_deleted_rows_2_tables PASSED
nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_archive_deleted_rows_fk_constraint PASSED
nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_archive_deleted_rows_for_every_uuid_table PASSED
nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_archive_deleted_rows_no_id_column PASSED
nova/tests/unit/db/test_db_api.py::ArchiveTestCase::test_shadow_tables PASSED

========================================================================== 623 tests deselected by '-kArchiveTestCase' ===========================================================================
============================================================================ 6 passed, 623 deselected in 6.03 seconds ============================================================================
classics-MacBook-Pro:nova classic$

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/163990

Changed in nova:
assignee: nobody → Mike Bayer (zzzeek)
status: New → In Progress
Revision history for this message
Mike Bayer (zzzeek) wrote : Re: ArchiveTestCase erroneously assumes the tables that are populated
Revision history for this message
Mike Bayer (zzzeek) wrote :

looking more closely, maybe the problem is that we shouldn't be deleting from instance_types. It seems that "archive_deleted_rows_for_table" is checking the "deleted" column for NULL, but when the fixtures run and we run 216_havana.py, this table has zeros for the "deleted" column - _populate_instance_types() explicitly sets it to zero:

(Pdb) conn.connection.execute("select id, name, deleted from instance_types where deleted is not null").fetchall()
[(1, u'm1.large', 0), (2, u'm1.medium', 0), (3, u'm1.small', 0), (4, u'm1.tiny', 0), (5, u'm1.xlarge', 0)]

but archive_deleted_rows_for_table relies on a very shaky scheme: it looks at "deleted_column.default" and assumes it's set to something:

    query_insert = sql.select([table],
                          deleted_column != deleted_column.default).\
                          order_by(column).limit(max_rows)
    query_delete = sql.select([column],
                          deleted_column != deleted_column.default).\
                          order_by(column).limit(max_rows)

but the SELECT comes out with an "IS NOT NULL":

DELETE FROM instance_types WHERE instance_types.id in (SELECT T1.id FROM (SELECT instance_types.id
FROM instance_types
WHERE instance_types.deleted IS NOT NULL ORDER BY instance_types.id
 LIMIT ? OFFSET ?) as T1)

which... is why it's deleting those rows. That "default" looks like it should be a zero, however, so I'm trying to see what's going on there.
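The failure mode can be sketched in plain Python (hypothetical helper name; the real code builds SQLAlchemy expressions, where "col != None" compiles to "col IS NOT NULL"):

```python
# Hypothetical sketch: how the WHERE criterion degenerates depending on what
# reflection reports for the soft-delete column's default.
def deleted_criterion(column_name, reflected_default):
    if reflected_default is None:
        # SQLAlchemy renders "col != NULL" as "col IS NOT NULL", which
        # matches every non-NULL row -- including live rows with deleted=0.
        return "%s IS NOT NULL" % column_name
    return "%s != %s" % (column_name, reflected_default)

print(deleted_criterion("deleted", None))  # the buggy reflected-on-SQLite case
print(deleted_criterion("deleted", 0))     # what the code assumed it would get
```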

Revision history for this message
Mike Bayer (zzzeek) wrote :

Oh. The archival routine works on reflection and assumes that SQLite reflection gives us a zero for that default. It does not:

(Pdb) from sqlalchemy import inspect
(Pdb) inspect(conn).get_columns('instance_types')
[.... {'name': u'deleted', 'nullable': True, 'autoincrement': True, 'primary_key': 0, 'type': INTEGER(), 'default': None}]

(Pdb) conn.execute('pragma table_info("instance_types")').fetchall()
[ ... (15, u'deleted', u'INTEGER', 0, None, 0)]

column 4 (zero based) is "dflt_value":
(Pdb) cursor.description
(('cid', None, None, None, None, None, None), ('name', None, None, None, None, None, None), ('type', None, None, None, None, None, None), ('notnull', None, None, None, None, None, None), ('dflt_value', None, None, None, None, None, None), ('pk', None, None, None, None, None, None))

so this assumption is broken in Nova and has to be changed. I'd suggest we don't use reflection for the primary table since it's already set up.

summary: - ArchiveTestCase erroneously assumes the tables that are populated
+ archive_de
summary: - archive_de
+ archive_deleted_rows_for_table relies on inaccurate reflection to get at
+ the required SQL criteria for determining deleted rows, will definitely
+ make mistakes on SQLite
Revision history for this message
Mike Bayer (zzzeek) wrote :

The column has a Python-side default, so I've no idea how people's MySQL databases aren't getting blown away by this archival routine. Here's the column in oslo.db:

class SoftDeleteMixin(object):
    deleted_at = Column(DateTime)
    deleted = Column(Integer, default=0)

that's a client-side default.
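The distinction is easy to demonstrate with stdlib sqlite3 alone: a client-side "default=0" is applied by the ORM at INSERT time and never appears in the DDL, so PRAGMA table_info reports no dflt_value (the schema below is a minimal stand-in for the real table):

```python
import sqlite3

# The model declares deleted = Column(Integer, default=0), but that default is
# applied client-side at INSERT time; the emitted CREATE TABLE carries no
# DEFAULT clause, so reflection sees dflt_value = None.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instance_types (id INTEGER PRIMARY KEY, deleted INTEGER)")
info = conn.execute('PRAGMA table_info("instance_types")').fetchall()
# column 4 (zero-based) of each pragma row is dflt_value
defaults = {row[1]: row[4] for row in info}
print(defaults)
```

Only a server_default (which emits a DEFAULT clause in DDL) would be visible to reflection; that is why deleted_column.default comes back as None here.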

summary: - archive_deleted_rows_for_table relies on inaccurate reflection to get at
- the required SQL criteria for determining deleted rows, will definitely
- make mistakes on SQLite
+ archive_deleted_rows_for_table relies on reflection to access the
+ "default" for soft-delete columns, but this is not a server default
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Bayer (<email address hidden>) on branch: master
Review: https://review.openstack.org/163990
Reason: looking into a more direct failure in the archive routine.

Revision history for this message
Joe Gordon (jogo) wrote :

nova-manage db archive_deleted_rows 1000 is not working at all right now:

Output: http://paste.ubuntu.com/10588278/

Changed in nova:
importance: Undecided → Critical
Revision history for this message
Joe Gordon (jogo) wrote :

I confirmed that running nova-manage db archive_deleted_rows can break your system. I just did it to my devstack box.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164178

Changed in nova:
assignee: Mike Bayer (zzzeek) → Sean Dague (sdague)
Sean Dague (sdague)
Changed in nova:
milestone: none → kilo-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Bayer (<email address hidden>) on branch: master
Review: https://review.openstack.org/164009
Reason: ah I see that there are some "deleted" columns that aren't ints, OK

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/164263

Changed in nova:
assignee: Sean Dague (sdague) → Timofey Durakov (tdurakov)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/164178
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2c1784f150026c7b903fc920d480898ff51c424d
Submitter: Jenkins
Branch: master

commit 2c1784f150026c7b903fc920d480898ff51c424d
Author: Sean Dague <email address hidden>
Date: Fri Mar 13 07:52:27 2015 -0400

    Add shadow table empty verification

    The ArchiveDatabase tests only test that expected content is moved
    into shadow tables. They don't check that unexpected content in
    unrelated tables *is not moved*. Like... instance_types. The lack of
    this test allowed a regression to get in that could corrupt your
    entire database.

    This adds an _assert_shadow_tables_empty_except function that should
    be called at the end of archiving db tests that asserts that there is
    not content in any of the shadow_tables except the ones we expect.

    Closes-Bug: #1431571

    Change-Id: I49431c5413cc700c46d28ee47a9ed678987f17ff
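The kind of check the commit describes can be sketched as follows (hypothetical helper operating on precomputed row counts, not the actual Nova test code):

```python
# Hypothetical sketch: after an archive run, every shadow table except the
# explicitly expected ones must be empty, otherwise unrelated data was moved.
def assert_shadow_tables_empty_except(shadow_row_counts, allowed):
    for table, count in shadow_row_counts.items():
        if table not in allowed and count:
            raise AssertionError("unexpected rows in %s" % table)

counts = {"shadow_instances": 4, "shadow_instance_types": 0}
assert_shadow_tables_empty_except(counts, allowed={"shadow_instances"})
print("ok")
```

Had such a check existed earlier, the instance_types corruption would have failed the archive tests immediately.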

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/164263
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fa2a6afe7639726fad201a51832b8fcb06c95a75
Submitter: Jenkins
Branch: master

commit fa2a6afe7639726fad201a51832b8fcb06c95a75
Author: Timofey Durakov <email address hidden>
Date: Fri Mar 6 17:09:38 2015 +0300

    Fixed archiving of deleted records

    Added specific column order for insert clause
    in insert from select query, used in archivation.
    Changed source for tables metadata from db to models.

    Change-Id: Idff93ab4fa3397746999a836de54354235919471
    Closes-Bug: #1431571
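The column-order point in the commit can be demonstrated with a toy schema (hypothetical table layouts): INSERT ... SELECT copies values by position, so both sides must name the same columns in the same order even when the tables themselves declare columns differently.

```python
import sqlite3

# Toy demonstration: the shadow table declares its columns in a different
# physical order, so an explicit, matching column list on both the INSERT and
# the SELECT is required for values to land in the right columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (id INTEGER, deleted INTEGER)")
conn.execute("CREATE TABLE shadow_instances (deleted INTEGER, id INTEGER)")
conn.execute("INSERT INTO instances VALUES (1, 1)")
cols = "id, deleted"
conn.execute("INSERT INTO shadow_instances (%s) SELECT %s FROM instances "
             "WHERE deleted != 0" % (cols, cols))
print(conn.execute("SELECT id, deleted FROM shadow_instances").fetchall())
```

Without the explicit column lists, the positional copy would have swapped id and deleted here.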

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0