fill_virtual_interface_list migration fails on second attempt

Bug #1824435 reported by iain MacDonnell on 2019-04-11
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Medium
melanie witt
Stein
Medium
melanie witt
Train
Medium
melanie witt

Bug Description

On attempting to run online_data_migrations on Stein for the second time (and beyond), fill_virtual_interface_list fails as below. I find two rows in the security_groups table which have name='default' and project_id NULL.

2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage [req-6cb533e2-58b5-41db-a455-29dae8efef31 - - - - -] Error attempting to run <function fill_virtual_interface_list at 0x7f1ed1b6f500>: TypeError: 'NoneType' object has no attribute '__getitem__'
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage Traceback (most recent call last):
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/cmd/manage.py", line 686, in _run_migration
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage found, done = migration_meth(ctxt, count)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/enginefacade.py", line 1012, in wrapper
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return fn(*args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/objects/virtual_interface.py", line 279, in fill_virtual_interface_list
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage _set_or_delete_marker_for_migrate_instances(cctxt, marker)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return f(context, *args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/objects/virtual_interface.py", line 305, in _set_or_delete_marker_for_migrate_instances
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage instance.create()
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return fn(self, *args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/objects/instance.py", line 600, in create
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage db_inst = db.instance_create(self._context, updates)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/api.py", line 748, in instance_create
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return IMPL.instance_create(context, values)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 170, in wrapper
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return f(*args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 154, in wrapper
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage ectxt.value = e.inner_exc
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage self.force_reraise()
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage six.reraise(self.type_, self.value, self.tb)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/oslo_db/api.py", line 142, in wrapper
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return f(*args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return f(context, *args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 1728, in instance_create
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage security_group_ensure_default(context)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 4039, in security_group_ensure_default
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return _security_group_ensure_default(context)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 227, in wrapped
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage return f(context, *args, **kwargs)
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage File "/usr/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 4050, in _security_group_ensure_default
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage default_group = _security_group_get_by_names(context, ['default'])[0]
2019-04-11 03:51:27.632 22147 ERROR nova.cmd.manage TypeError: 'NoneType' object has no attribute '__getitem__'

mysql> select * from security_groups;
+---------------------+------------+------------+----+---------+-------------+----------------------------------+----------------------------------+---------+
| created_at | updated_at | deleted_at | id | name | description | user_id | project_id | deleted |
+---------------------+------------+------------+----+---------+-------------+----------------------------------+----------------------------------+---------+
| 2019-04-09 23:23:10 | NULL | NULL | 1 | default | default | 7ed07a8869044a49b03a6e303e548198 | 1b57d6726e574f3db7fb94ceb3955591 | 0 |
| 2019-04-11 01:11:05 | NULL | NULL | 2 | default | default | NULL | NULL | 0 |
| 2019-04-11 01:11:05 | NULL | NULL | 3 | default | default | NULL | NULL | 0 |
+---------------------+------------+------------+----+---------+-------------+----------------------------------+----------------------------------+---------+
3 rows in set (0.00 sec)

Matt Riedemann (mriedem) wrote :
tags: added: db nova-manage upgrade
Changed in nova:
status: New → Triaged
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

(5:33:08 PM) mriedem: hmm yeah i'm not sure how https://github.com/openstack/nova/blob/03322bb517925a9f5a04ebdb41c3fd31e7962440/nova/db/sqlalchemy/api.py#L3874 can't either return at least 1 or raise
(5:35:47 PM) efried: oh, that part is because there's two rows in the database with the "right" project_id (NULL)
(5:36:05 PM) efried: mriedem: So the initial check (==) fails because there's *more* db rows than expected
(5:36:18 PM) efried: and then the for loop doesn't hit because all the names match (because they're the same)
(5:36:23 PM) mriedem: ah yup
(5:36:35 PM) mriedem: i don't know why the unique constraint doesn't blow up - because NULL isn't considered unique?
(5:36:39 PM) efried: so the problem is that we're somehow creating two rows, and I don't know how that ....
(5:36:42 PM) mriedem: schema.UniqueConstraint('project_id', 'name', 'deleted',
(5:36:42 PM) mriedem: name='uniq_security_groups0project_id0'
(5:36:42 PM) mriedem: 'name0deleted'),

Matt Riedemann (mriedem) wrote :

I'm not exactly sure how to reproduce this though. I was able to recreate the unique constraint issue in the database after creating a server and then running the online_data_migrations command a few times:

http://paste.openstack.org/show/749218/

But I don't get errors from the command.

iain MacDonnell (imacdonn) wrote :

Seems there are two issues here...

1) Duplicate marker row created due to unique constraint ignoring null value

2) _security_group_get_by_names() takes a list of group names, does a DB query, and expects that the result will have the same number of items. In this case, the list of group names is just '[default'], but there are two items with the name "default", so the list length comparison fails. The subsequent attempt to find a missing item iterates over the list of group names in search of one that is not present in the query result, but there is none, so it falls out the bottom (ironically, past the "Not Reached" comment).

I wonder if (2) should be addressed, or if it's reasonable to assume that there will never be multiple query results with the same name (and make it so by fixing (1))?

Matt Riedemann (mriedem) wrote :

Iain reports that https://review.openstack.org/653065 at least works around the issue. Also a potential recreate:

(2:44:55 PM) imacdonn: mriedem: FWIW, I can reproduce my original problem with this sequence: 1) create an instance 2) run migrations 3) archive 4) run migrations

Which I'm doing in this functional test https://review.openstack.org/#/c/653098/ but that might be missing something (and uses sqlite) so I'll try Iain's recreate in devstack.

Matt Riedemann (mriedem) wrote :

imacdonn's recreate works for me in devstack: http://paste.openstack.org/show/749391/

Surya Seetharaman (tssurya) wrote :

I ran into the same bug today on my devstack.

Matt Riedemann (mriedem) on 2019-05-23
Changed in nova:
importance: High → Medium

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/653065
Reason: I'm not working on this, someone else will have to pick up working on the bug.

Related fix proposed to branch: master
Review: https://review.opendev.org/688205

Changed in nova:
assignee: nobody → melanie witt (melwitt)
status: Triaged → In Progress
melanie witt (melwitt) wrote :

I've proposed a different fix for this after investigating locally in devstack using Iain's repro steps. I found that our code for "get or create default security group" is creating a duplicate default group for project_id=NULL (this works despite a unique constraint on project_id because unique constraints are only enforced on non-NULL values, as mentioned in earlier comments on this bug [1]) because the group create happens in a separate database transaction but a later read happens in the same/current database transaction. The current transaction cannot "see" the earlier inserted security group record because of transaction isolation [2]. The bug is not reproducible with SQLite because SQLite provides no isolation between transactions on the same database connection (which will always be the case if the insert and later read are happening in the same instance_create method call [3].

Based on all this, the fix I propose is to do the read in a separate transaction, similar to how the create is done. I am attempting to do the test coverage in our post test hook, for lack of a better idea.

[1] https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-unique
[2] http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-10-11.log.html#t2019-10-11T19:05:21
[3] https://www.sqlite.org/isolation.html

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/678776
Reason: https://review.opendev.org/688206 is much simpler and has test coverage.

Reviewed: https://review.opendev.org/688205
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7c41365f193fd5b08d0174b6cd8e349ac95b7907
Submitter: Zuul
Branch: master

commit 7c41365f193fd5b08d0174b6cd8e349ac95b7907
Author: melanie witt <email address hidden>
Date: Fri Oct 11 20:58:03 2019 +0000

    Add regression test for bug 1824435

    This adds a regression test in our post test hook. We are not able to
    do a similar test in the unit or functional tests because SQLite does
    not provide any isolation between transactions on the same database
    connection [1] and the bug can only be reproduced with the isolation
    that is present when using a real MySQL database.

    Related-Bug: #1824435

    [1] https://www.sqlite.org/isolation.html

    Change-Id: I204361d6ff7c2323bc744878d8a9fa2d20a480b1

Reviewed: https://review.opendev.org/688206
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6ea945e3b126879a86fe78d9c0537f5d258cf91a
Submitter: Zuul
Branch: master

commit 6ea945e3b126879a86fe78d9c0537f5d258cf91a
Author: melanie witt <email address hidden>
Date: Fri Oct 11 21:08:33 2019 +0000

    Remove redundant call to get/create default security group

    In the instance_create DB API method, it ensures the (legacy) default
    security group gets created for the specified project_id if it does
    not already exist. If the security group does not exist, it is created
    in a separate transaction.

    Later in the instance_create method, it reads the default security group
    back that it wrote earlier (via the same ensure default security group
    code). But since it was written in a separate transaction, the current
    transaction will not be able to see it and will get back 0 rows. So, it
    creates a duplicate default security group record if project_id=NULL
    (which it will be, if running nova-manage db online_data_migrations,
    which uses an anonymous RequestContext with project_id=NULL). This
    succeeds despite the unique constraint on project_id because in MySQL,
    unique constraints are only enforced on non-NULL values [1].

    To avoid creation of a duplicate default security group for
    project_id=NULL, we can use the default security group object that was
    returned from the first security_group_ensure_default call earlier in
    instance_create method and remove the second, redundant call.

    This also breaks out the security groups setup code from a nested
    method as it was causing confusion during code review and is not being
    used for any particular purpose. Inspection of the original commit
    where it was added in 2012 [2] did not contain any comments about the
    nested method and it appeared to either be a way to organize the code
    or a way to reuse the 'models' module name as a local variable name.

    Closes-Bug: #1824435

    [1] https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-unique
    [2] https://review.opendev.org/#/c/8973/2/nova/db/sqlalchemy/api.py@1339

    Change-Id: Idb205ab5b16bbf96965418cd544016fa9cc92de9

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/691402
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=73b6cb897ecbfe95e50eee632e5e6f5b5c29f7c4
Submitter: Zuul
Branch: stable/train

commit 73b6cb897ecbfe95e50eee632e5e6f5b5c29f7c4
Author: melanie witt <email address hidden>
Date: Fri Oct 11 20:58:03 2019 +0000

    Add regression test for bug 1824435

    This adds a regression test in our post test hook. We are not able to
    do a similar test in the unit or functional tests because SQLite does
    not provide any isolation between transactions on the same database
    connection [1] and the bug can only be reproduced with the isolation
    that is present when using a real MySQL database.

    Related-Bug: #1824435

    [1] https://www.sqlite.org/isolation.html

     Conflicts:
     gate/post_test_hook.sh

    NOTE(melwitt): The conflict is due to change
    I993c9661c37da012cc975ee8c04daa0eb9216744 which is not in Train.

    Change-Id: I204361d6ff7c2323bc744878d8a9fa2d20a480b1
    (cherry picked from commit 7c41365f193fd5b08d0174b6cd8e349ac95b7907)

tags: added: in-stable-train

Reviewed: https://review.opendev.org/691403
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=416290f1938631a95aaaf92a41f455ebf75b0732
Submitter: Zuul
Branch: stable/train

commit 416290f1938631a95aaaf92a41f455ebf75b0732
Author: melanie witt <email address hidden>
Date: Fri Oct 11 21:08:33 2019 +0000

    Remove redundant call to get/create default security group

    In the instance_create DB API method, it ensures the (legacy) default
    security group gets created for the specified project_id if it does
    not already exist. If the security group does not exist, it is created
    in a separate transaction.

    Later in the instance_create method, it reads the default security group
    back that it wrote earlier (via the same ensure default security group
    code). But since it was written in a separate transaction, the current
    transaction will not be able to see it and will get back 0 rows. So, it
    creates a duplicate default security group record if project_id=NULL
    (which it will be, if running nova-manage db online_data_migrations,
    which uses an anonymous RequestContext with project_id=NULL). This
    succeeds despite the unique constraint on project_id because in MySQL,
    unique constraints are only enforced on non-NULL values [1].

    To avoid creation of a duplicate default security group for
    project_id=NULL, we can use the default security group object that was
    returned from the first security_group_ensure_default call earlier in
    instance_create method and remove the second, redundant call.

    This also breaks out the security groups setup code from a nested
    method as it was causing confusion during code review and is not being
    used for any particular purpose. Inspection of the original commit
    where it was added in 2012 [2] did not contain any comments about the
    nested method and it appeared to either be a way to organize the code
    or a way to reuse the 'models' module name as a local variable name.

    Closes-Bug: #1824435

    [1] https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-unique
    [2] https://review.opendev.org/#/c/8973/2/nova/db/sqlalchemy/api.py@1339

    Change-Id: Idb205ab5b16bbf96965418cd544016fa9cc92de9
    (cherry picked from commit 6ea945e3b126879a86fe78d9c0537f5d258cf91a)

Reviewed: https://review.opendev.org/692906
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=844209b064ab4fd4d66ec8d9fc7dc106ae6d6d4b
Submitter: Zuul
Branch: stable/stein

commit 844209b064ab4fd4d66ec8d9fc7dc106ae6d6d4b
Author: melanie witt <email address hidden>
Date: Fri Oct 11 20:58:03 2019 +0000

    Add regression test for bug 1824435

    This adds a regression test in our post test hook. We are not able to
    do a similar test in the unit or functional tests because SQLite does
    not provide any isolation between transactions on the same database
    connection [1] and the bug can only be reproduced with the isolation
    that is present when using a real MySQL database.

    Related-Bug: #1824435

    [1] https://www.sqlite.org/isolation.html

    NOTE(melwitt): The differences from the Train change is because change
    If133b12bf02d708c099504a88b474dce0bdb0f00 is not in Stein. We need to
    pass the cell1 config to the archive command because we don't have
    --all-cells.

    Change-Id: I204361d6ff7c2323bc744878d8a9fa2d20a480b1
    (cherry picked from commit 7c41365f193fd5b08d0174b6cd8e349ac95b7907)
    (cherry picked from commit 73b6cb897ecbfe95e50eee632e5e6f5b5c29f7c4)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/692907
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a80eb82f675ae4c933158fc097094cbbbb96f9dd
Submitter: Zuul
Branch: stable/stein

commit a80eb82f675ae4c933158fc097094cbbbb96f9dd
Author: melanie witt <email address hidden>
Date: Fri Oct 11 21:08:33 2019 +0000

    Remove redundant call to get/create default security group

    In the instance_create DB API method, it ensures the (legacy) default
    security group gets created for the specified project_id if it does
    not already exist. If the security group does not exist, it is created
    in a separate transaction.

    Later in the instance_create method, it reads the default security group
    back that it wrote earlier (via the same ensure default security group
    code). But since it was written in a separate transaction, the current
    transaction will not be able to see it and will get back 0 rows. So, it
    creates a duplicate default security group record if project_id=NULL
    (which it will be, if running nova-manage db online_data_migrations,
    which uses an anonymous RequestContext with project_id=NULL). This
    succeeds despite the unique constraint on project_id because in MySQL,
    unique constraints are only enforced on non-NULL values [1].

    To avoid creation of a duplicate default security group for
    project_id=NULL, we can use the default security group object that was
    returned from the first security_group_ensure_default call earlier in
    instance_create method and remove the second, redundant call.

    This also breaks out the security groups setup code from a nested
    method as it was causing confusion during code review and is not being
    used for any particular purpose. Inspection of the original commit
    where it was added in 2012 [2] did not contain any comments about the
    nested method and it appeared to either be a way to organize the code
    or a way to reuse the 'models' module name as a local variable name.

    Closes-Bug: #1824435

    [1] https://dev.mysql.com/doc/refman/8.0/en/create-index.html#create-index-unique
    [2] https://review.opendev.org/#/c/8973/2/nova/db/sqlalchemy/api.py@1339

     Conflicts:
     gate/post_test_hook.sh

    NOTE(melwitt): The conflict is because of the difference from Train in
    the patch below this one where we need to pass the cell1 config to the
    archive command because change
    If133b12bf02d708c099504a88b474dce0bdb0f00 is not in Stein.

    Change-Id: Idb205ab5b16bbf96965418cd544016fa9cc92de9
    (cherry picked from commit 6ea945e3b126879a86fe78d9c0537f5d258cf91a)
    (cherry picked from commit 416290f1938631a95aaaf92a41f455ebf75b0732)

This issue was fixed in the openstack/nova 20.0.1 release.

This issue was fixed in the openstack/nova 19.1.0 release.

melanie witt (melwitt) wrote :

Just wanted to add a note on here, if you have already begun an upgrade (and run nova-manage db online_data_migrations) on a version prior to this fix landing and hit the error, there is a manual step you need to take in order for the upgrade (i.e. nova-manage db online_data_migrations) to work after you have applied the fix.

If you have a duplicate security_groups record with NULL user_id and project_id, you need to delete one of them. In this case, you would delete the record with id = 3.

> select id,name,description,user_id,project_id,deleted from security_groups;
+----+---------+-------------+----------------------------------+----------------------------------+---------+
| id | name | description | user_id | project_id | deleted |
+----+---------+-------------+----------------------------------+----------------------------------+---------+
| 1 | default | default | da452be8c38f42fe9527646f7da1683e | 5e9830de3b6b40b6b1269b6bd9659502 | 0 |
| 2 | default | default | NULL | NULL | 0 |
| 3 | default | default | NULL | NULL | 0 |
+----+---------+-------------+----------------------------------+----------------------------------+---------+

After you have deleted the duplicate security_groups record, try again and the nova-manage db online_data_migrations command should succeed.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers