Occasional deadlock during db_sync --contract during Newton to Pike live upgrade

Bug #1755906 reported by Andy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Identity (keystone)
Expired
Medium
Unassigned

Bug Description

We are testing upgrade from Newton to Pike (skip Ocata). The Newton system has two controllers. On controller0 (active) we upgrade controller1 to Pike. Then we swact to make controller1 active and controller0 standby, then we upgrade controller0 to Pike. After that we upgrade the other nodes to Pike.

Now with all the nodes running Pike and controller1 as active, we finalize the upgrade by running "keystone-manage db_sync --contract". Occasionally the contract failed with a keystone DB deadlock.

The log from postgres during the deadlock:
------------------------------------------
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-1] db=keystone,user=admin-keystone ERROR: deadlock detected
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-2] db=keystone,user=admin-keystone DETAIL: Process 76959 waits for AccessExclusiveLock on relation 17886 of database 16401; blocked by process 76955.
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-3] Process 76955 waits for AccessShareLock on relation 17776 of database 16401; blocked by process 76959.
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-4] Process 76959: ALTER TABLE local_user ADD CONSTRAINT local_user_user_id_fkey FOREIGN KEY(user_id, domain_id) REFERENCES "user" (id, domain_id) ON DELETE CASCADE ON UPDATE CASCADE
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-5] Process 76955: SELECT local_user.id AS local_user_id, local_user.user_id AS local_user_user_id, local_user.domain_id AS local_user_domain_id, local_user.name AS local_user_name, local_user.failed_auth_count AS local_user_failed_auth_count, local_user.failed_auth_at AS local_user_failed_auth_at, anon_1.user_id AS anon_1_user_id, anon_1.user_domain_id AS anon_1_user_domain_id
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-6] FROM (SELECT "user".id AS user_id, "user".domain_id AS user_domain_id
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-7] FROM "user"
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-8] WHERE "user".id = '20bd4216910340bc8e6019f6d826f9d7') AS anon_1 JOIN local_user ON anon_1.user_id = local_user.user_id AND anon_1.user_domain_id = local_user.domain_id ORDER BY anon_1.user_id, anon_1.user_domain_id
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-9] db=keystone,user=admin-keystone HINT: See server log for query details.
2018-02-22T21:26:47.000 controller-1 postgres[76959]: warning [2-10] db=keystone,user=admin-keystone STATEMENT: ALTER TABLE local_user ADD CONSTRAINT local_user_user_id_fkey FOREIGN KEY(user_id, domain_id) REFERENCES "user" (id, domain_id) ON DELETE CASCADE ON UPDATE CASCADE

The log from keystone during the deadlock:
------------------------------------------

keystone:log 2018-02-22 21:26:47.447 76427 CRITICAL keystone [-] Unhandled error: DbMigrationError: (psycopg2.extensions.TransactionRollbackError) deadlock detected
DETAIL: Process 76959 waits for AccessExclusiveLock on relation 17886 of database 16401; blocked by process 76955.
Process 76955 waits for AccessShareLock on relation 17776 of database 16401; blocked by process 76959.
HINT: See server log for query details.
 [SQL: 'ALTER TABLE local_user ADD CONSTRAINT local_user_user_id_fkey FOREIGN KEY(user_id, domain_id) REFERENCES "user" (id, domain_id) ON DELETE CASCADE ON UPDATE CASCADE']
2018-02-22 21:26:47.447 76427 ERROR keystone Traceback (most recent call last):
2018-02-22 21:26:47.447 76427 ERROR keystone File "/bin/keystone-manage", line 10, in <module>
2018-02-22 21:26:47.447 76427 ERROR keystone sys.exit(main())
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/manage.py", line 44, in main
2018-02-22 21:26:47.447 76427 ERROR keystone cli.main(argv=sys.argv, config_files=config_files)
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/cli.py", line 1337, in main
2018-02-22 21:26:47.447 76427 ERROR keystone CONF.command.cmd_class.main()
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/cmd/cli.py", line 528, in main
2018-02-22 21:26:47.447 76427 ERROR keystone upgrades.contract_schema()
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/common/sql/upgrades.py", line 346, in contract_schema
2018-02-22 21:26:47.447 76427 ERROR keystone _sync_repo(repo_name=CONTRACT_REPO)
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/keystone/common/sql/upgrades.py", line 187, in _sync_repo
2018-02-22 21:26:47.447 76427 ERROR keystone init_version=init_version, sanity_check=False)
2018-02-22 21:26:47.447 76427 ERROR keystone File "/usr/lib/python2.7/site-packages/oslo_db/sqlalchemy/migration.py", line 81, in db_sync
2018-02-22 21:26:47.447 76427 ERROR keystone raise exception.DbMigrationError(ex)
2018-02-22 21:26:47.447 76427 ERROR keystone DbMigrationError: (psycopg2.extensions.TransactionRollbackError) deadlock detected
2018-02-22 21:26:47.447 76427 ERROR keystone DETAIL: Process 76959 waits for AccessExclusiveLock on relation 17886 of database 16401; blocked by process 76955.
2018-02-22 21:26:47.447 76427 ERROR keystone Process 76955 waits for AccessShareLock on relation 17776 of database 16401; blocked by process 76959.
2018-02-22 21:26:47.447 76427 ERROR keystone HINT: See server log for query details.
2018-02-22 21:26:47.447 76427 ERROR keystone [SQL: 'ALTER TABLE local_user ADD CONSTRAINT local_user_user_id_fkey FOREIGN KEY(user_id, domain_id) REFERENCES "user" (id, domain_id) ON DELETE CASCADE ON UPDATE CASCADE']
2018-02-22 21:26:47.447 76427 ERROR keystone

Looks like the two processes involves are the keystone-management command and a query from a openstack service. The keystone tables involved are "user" and "local_user". But we do see the deadlock happened on nonlocal_user table as well.

When re-run keystone-manage db_sync --contract, it would failed and complaining:
--------------------------------------------------------------------------------
2018-02-23T20:27:27.000 controller-1 postgres[13174]: warning [2-1] db=keystone,user=admin-keystone ERROR: relation "ixu_user_id_domain_id" already exists
2018-02-23T20:27:27.000 controller-1 postgres[13174]: warning [2-2] db=keystone,user=admin-keystone STATEMENT: ALTER TABLE "user" ADD CONSTRAINT ixu_user_id_domain_id UNIQUE (id, domain_id)

This seems to be because the previous failed (deadlock) run successfully created the unique constraint in step 014 of contract_repo, just before the deadlock happens.

If the deadlock can't be fixed completely, making db_sync --contract re-runnable seems to be a reasonable solution as well.

Andy (andy.wrs)
description: updated
Colleen Murphy (krinkle)
Changed in keystone:
status: New → Confirmed
importance: Undecided → High
Revision history for this message
Lance Bragstad (lbragstad) wrote :
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

The referenced MySQL issue in comment #1 is not related to this bug. In that case the domain.id didn't exist but the user object referenced it.

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

Unfortunately, as per https://governance.openstack.org/tc/resolutions/20170613-postgresql-status.html please note PGSQL is not as well tested as MySQL is. We make a "best effort" to ensure support, but without extensive use and testing there will be times we have these issues.

I do believe this is a real concern and real issue. I'd like to get some more information about the scenarios that this happens in:

1) Is it consistent with no load / moderate load / high load
2) What kind of workload is being performed.
3) We may need a bit more PGSQL know-how to debug this (unfortunately, I'm not as confident in PGSQL-isms and replication as I am with MySQL).

making db-sync re-runnable is not really possible due to the way the migration tools work, we need to address the problem in the contract code that is causing the deadlock. We may need to summon in the real expert in such things :)

Changed in keystone:
importance: High → Medium
Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

I confirmed with Mike Bayer (one of our experts), PGSQL is extra lock-y at the DDL level. Unfortunately due to constraints on resources his recommendation is do not run contract phases with the application (keystone) running [1].

While we don't have the resources to solve this issue, we will happily accept external help and include fixes that prevent this from outside resources. If you have time and energy to help fix the problem, I'm more than happy to help you get it included in the code base. Barring that, we need to recommend no live-schema changes with PGSQL.

I am marking this bug as incomplete, please feel free to move it back to confirmed if you have code or recommendations (and please include/submit that code/recommendations) for fixing this issue.

[1] http://eavesdrop.openstack.org/irclogs/%23openstack-keystone/%23openstack-keystone.2018-03-27.log.html#t2018-03-27T18:43:08

Changed in keystone:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Identity (keystone) because there has been no activity for 60 days.]

Changed in keystone:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.