db-archiving fails to clear some deleted rows from instances table

Bug #1183523 reported by David Ripton
86
This bug affects 16 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Medium
Cédric LECOMTE
Mitaka
Fix Released
Medium
Matt Riedemann

Bug Description

Downstream bug report from Red Hat Bugzilla against Grizzly: https://bugzilla.redhat.com/show_bug.cgi?id=960644

In unit tests, db-archiving moves all 'deleted' rows to the shadow tables. However, in the real-world test, some deleted rows got stuck in the instances table.

I suspect a bug in the way we deal with foreign key constraints.

David Ripton (dripton)
Changed in nova:
assignee: nobody → David Ripton (dripton)
tags: added: db
David Ripton (dripton)
Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Zhidong Yu (zhidong) wrote :

I ran into the same bug. It is because instance's uuid was used in some other tables as foreign key.

Not sure if it is caused by an inconsistent database.

Revision history for this message
David Ripton (dripton) wrote :

The bug is that when we "soft-delete" a row from instances (by setting deleted to nonzero), we do not also soft-delete the rows pointing to that instance's uuid from fixed_ips and instance_system_metadata.

db-archiving can't archive the instances row because the FK constraint forces it to archive the others first, and it can't archive the others because they haven't been deleted.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/32742

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
David Ripton (dripton) wrote :

https://bugs.launchpad.net/nova/+bug/1190206 may be the root cause of this bug. When its patch gets in, I will retest and see if it helped.

Revision history for this message
Hans Lindgren (hanlind) wrote :
Revision history for this message
David Ripton (dripton) wrote :

Hans, yes. We're adding more things that pin the instances rows faster than we remove things that pin instances rows. I'm afraid this bug will end up a WONTFIX.

Revision history for this message
David Ripton (dripton) wrote :

comstud thinks we can fix this but we need to do instance_type data differently. Maybe embedded JSON blobs so we have all the information we need without a reference to the instances row. (My opinion: yuck.) So this bug is staying open for now, but it requires some significant redesign to fix.

Revision history for this message
Matt Riedemann (mriedem) wrote :

The patch was abandoned so resetting the state here.

Changed in nova:
status: In Progress → Confirmed
assignee: David Ripton (dripton) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that nova doesn't do cascading deletes on it's foreign keys, when I asked about this in IRC no one had a good answer why, just that nova doesn't delete anything anyway so it wasn't a big deal, which I don't think is a very good answer.

Revision history for this message
Mike Bayer (zzzeek) wrote :

here's the likely cause of this issue: https://bugs.launchpad.net/nova/+bug/1431571

Revision history for this message
Mike Bayer (zzzeek) wrote :

well part of it. the routine also doesn't seem to try to be smart about foreign keys as well as mentioned here. really needs a rewrite.

Revision history for this message
Matt Riedemann (mriedem) wrote :

On a master devstack (mitaka) I have 20 deleted instances:

http://paste.openstack.org/show/475609/

I tried archiving 10 of them and it didn't fail but it didn't archive anything either:

http://paste.openstack.org/show/475610/

Results after nova-manage db archive_deleted_rows 10:

http://paste.openstack.org/show/475611/

So nothing was deleted.

As previously pointed out, it's really odd that the DB API doesn't attempt to sort the tables based on dependencies, like what the 267 migration does here:

https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/migrate_repo/versions/267_instance_uuid_non_nullable.py#L73

That's sorting the tables based on dependency since we're deleting entries so we want to process tables that don't have foreign keys first.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Maybe my problem is that I'm specifying max_rows as 10 and the code actually keeps track of how many rows it's deleted from each table, which seems weird. For example, if I have 20 tables and I say delete max_rows=10, then is it only going to delete 10 rows from the first table it encounters? Or one row in the first 10 tables it encounters? Or up to 10 rows from all 20 tables?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Heh, nope, the warning was lost in the debug log output:

2015-10-07 14:18:39.265 WARNING nova.db.sqlalchemy.api [req-149506be-fbdd-452c-bd35-76f4ba269bc3 None None] IntegrityError detected when archiving table instances

So yeah, that's the foreign key constraint issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/232097

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/232097
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f880fc8f437a82782f5c5ef0ed21ccbfb20facb
Submitter: Jenkins
Branch: master

commit 9f880fc8f437a82782f5c5ef0ed21ccbfb20facb
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 7 09:33:05 2015 -0700

    Log DBReferenceError in archive_deleted_rows_for_table

    We should add the foreign key constraint integrity error to the warning
    since it includes detailed information about the reference table that
    caused the constraint failure, which is useful for knowing things are
    out of order without having to check your database.

    You get something like this now:

    2015-10-07 15:29:38.238 WARNING nova.db.sqlalchemy.api
    [req-6115aef5-2a82-4d42-bf67-eaab2e8eee63 None None] IntegrityError
     detected when archiving table instances: (pymysql.err.IntegrityError)
     (1451, u'Cannot delete or update a parent row: a foreign key constraint
     fails (`nova`.`instance_actions`, CONSTRAINT
     `fk_instance_actions_instance_uuid` FOREIGN KEY (`instance_uuid`)
     REFERENCES `instances` (`uuid`))') [SQL: u'DELETE FROM instances WHERE
     instances.id in (SELECT T1.id FROM (SELECT instances.id \nFROM
     instances \nWHERE instances.deleted != %s ORDER BY instances.id \n
     LIMIT %s) as T1)'] [parameters: (0, 10)]

    Related-Bug: #1183523

    Change-Id: I41b10898c6e9cc48da58c4d2daf7df362fc9fc36

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/246635

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

So it looks like we aren't soft deleting instance_actions when we delete instances, that's what's currently tripping the archive code up:

http://paste.openstack.org/show/479188/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/246635
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=91b989701d88ca4a223a172b05f978527d352dd3
Submitter: Jenkins
Branch: master

commit 91b989701d88ca4a223a172b05f978527d352dd3
Author: Matt Riedemann <email address hidden>
Date: Tue Nov 17 14:38:32 2015 -0800

    Reverse sort tables before archiving

    sqlalchemy metadata provides a way to sort tables based on foreign
    key dependency relationships and then we can reverse sort those to
    process leaf nodes first when archiving deleted records. Use this
    to at least attempt to archive things in a sane order.

    Adds a functional test to cover this since the unit tests for archive
    are setup in such a way that they don't stress many of the tables
    related to a created instance record and thus miss some fundamental
    referential constraint failures during archive.

    This does not completely fix archive_deleted_rows because of the issue
    with instance_actions not being soft deleted when instances are deleted,
    which we still need to sort out, but this at least gives us a functional
    test that stresses a real usage of the DB API. And since the tables are
    now sorted we can reliably assert the archive results.

    Partial-Bug: #1183523

    Change-Id: I948b149239f01cca2d4156611bb883580a216a83

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/299474

Changed in nova:
assignee: Matt Riedemann (mriedem) → Cédric LECOMTE (cedric-lecomte)
Revision history for this message
Matt Riedemann (mriedem) wrote :

See the ML thread discussion here:

http://lists.openstack.org/pipermail/openstack-dev/2015-November/079701.html

That talks about the blocking issues for making this work and suggestions for how to fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Cédric LECOMTE (<email address hidden>) on branch: master
Review: https://review.openstack.org/299474

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/299474
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=59439059ed818eb8476a8c61683a02770fe9f821
Submitter: Jenkins
Branch: master

commit 59439059ed818eb8476a8c61683a02770fe9f821
Author: Cedric LECOMTE <email address hidden>
Date: Wed Mar 30 15:18:22 2016 +0000

    Archive instance_actions and instance_actions_event

    This try to handle the soft-delete of instance_actions and
    instance_actions_event before the archive of deleted rows. So in
    this way theses tables will be archived like others.

    Change-Id: Ia0fe130a35a6c9bb16a940f3f4cea6b1aaee9668
    Partial-Bug: #1183523

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

As we use the "direct-release" model in Nova we don't use the
"Fix Comitted" status for merged bug fixes anymore. I'm setting
this manually to "Fix Released" to be consistent.

[1] "[openstack-dev] [release][all] bugs will now close automatically
    when patches merge"; Doug Hellmann; 2015-12-07;
    http://lists.openstack.org/pipermail/openstack-dev/2015-December/081612.html

Changed in nova:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/326730

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/mitaka)

Reviewed: https://review.openstack.org/326730
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2787b39dc0a7e12ec83b70ca55027b4dc09243d8
Submitter: Jenkins
Branch: stable/mitaka

commit 2787b39dc0a7e12ec83b70ca55027b4dc09243d8
Author: Cedric LECOMTE <email address hidden>
Date: Wed Mar 30 15:18:22 2016 +0000

    Archive instance_actions and instance_actions_event

    This try to handle the soft-delete of instance_actions and
    instance_actions_event before the archive of deleted rows. So in
    this way theses tables will be archived like others.

    Change-Id: Ia0fe130a35a6c9bb16a940f3f4cea6b1aaee9668
    Partial-Bug: #1183523
    (cherry picked from commit 59439059ed818eb8476a8c61683a02770fe9f821)

tags: added: in-stable-mitaka
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.