Cluster stuck in deleting

Bug #1861445 reported by Duc Truong
This bug affects 1 person
Affects: senlin
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

When deleting a large number of clusters, some clusters end up stuck in DELETING state. This can be reproduced by creating 100 clusters with 20 nodes each and deleting all 100 clusters at the same time. After the deletion of the clusters is started, we observe the following error in the senlin-engine logs:

ERROR oslo_db.api DB exceeded retry limit.: DB Deadlock: (pymysql.err.OperationalError) (1213, u'Deadlock found when trying to get lock; try restarting transaction') [SQL: u'DELETE FROM dependency WHERE dependency.depended = %(depended_1)s'] [parameters: {u'depended_1': 'f641b91b-cd3d-42c1-a892-a4e71284b663'}] (Background on this error at: http://sqlalche.me/e/e3q8)

The problem is that the cluster delete will create the node delete actions. When the node delete actions finish, they will be removed from the dependency table. However, in this case a deadlock occurred while trying to perform the delete.

After enabling deadlock logging in MySQL, we observed that the deadlock is caused by another transaction that is deleting rows from the action table:

DELETE FROM action WHERE action.target = '1ed3be01-1599-4571-b3f7-85410ad7f48f' AND action.project = 'f9df3d498504461f861a44a39cefd89b' AND action.action NOT IN ('NODE_DELETE') AND action.status IN ('SUCCEEDED', 'FAILED')

This happens because, as part of a cluster/node delete, any actions for that cluster/node are automatically removed from the action table. The removal takes place while the cluster/node delete is still ongoing. See https://github.com/openstack/senlin/blob/8d6d50898917bf284e75804d404fb46b6d5d9e5f/senlin/engine/actions/cluster_action.py#L485
The action delete statement only targets a specific node. In this case the targeted node is a different one than the node whose rows are being deleted from the dependency table. However, the action delete statement has a long WHERE clause, which ends up locking more than the actions of the targeted node. This is because InnoDB locks all the rows scanned as part of a DELETE statement (see http://mitchdickinson.com/mysql-innodb-row-locking-in-delete/).
By locking the additional rows scanned as part of the DELETE, InnoDB ends up locking the node delete action that is part of the insert statement into the dependency table. This is the root cause of the deadlock, because the dependent and depended fields in the dependency table are foreign keys into the action table. Since the action delete statement locks more rows than necessary and thereby inadvertently locks the node delete action, the other delete statement deadlocks and fails.
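
To make the lock interaction easier to follow, here is a minimal Python sketch that models the two transactions as a wait-for graph and looks for a cycle. It does not talk to MySQL. The lock names, the transaction names and the exact lock order are assumptions chosen for illustration; the report does not include the full InnoDB deadlock trace, so this shows only the general shape of the problem, not the actual trace.

```python
from typing import Dict, List, Optional, Set


def find_deadlock(holds: Dict[str, Set[str]],
                  wants: Dict[str, str]) -> Optional[List[str]]:
    """Return a cycle of transaction names in the wait-for graph, or None."""
    # Map each lock to the transaction that currently holds it.
    owner = {lock: txn for txn, locks in holds.items() for lock in locks}
    # Edge: a transaction waits for whoever owns the lock it wants.
    waits_for = {
        txn: owner[lock]
        for txn, lock in wants.items()
        if lock in owner and owner[lock] != txn
    }
    for start in waits_for:
        path = [start]
        current = start
        while current in waits_for:
            current = waits_for[current]
            if current in path:
                return path[path.index(current):]
            path.append(current)
    return None


# Hypothetical lock requests (assumed order, for illustration only).
wants = {
    "action_cleanup": "dependency:row_for_node_B",
    "dependency_delete": "action:NODE_DELETE_node_B",
}

# Wide scan: the action clean-up for node A also locks node B's
# NODE_DELETE action row because InnoDB locks every scanned row.
wide_scan = {
    "action_cleanup": {"action:old_action_node_A",
                       "action:NODE_DELETE_node_B"},
    "dependency_delete": {"dependency:row_for_node_B"},
}

# Narrow scan: the clean-up locks only node A's own finished actions.
narrow_scan = {
    "action_cleanup": {"action:old_action_node_A"},
    "dependency_delete": {"dependency:row_for_node_B"},
}

wide_result = find_deadlock(wide_scan, wants)
narrow_result = find_deadlock(narrow_scan, wants)
```

With the wide scan, each transaction waits on a lock the other one holds, so a cycle is found. With the narrow scan, the dependency delete is not blocked by the clean-up and no cycle exists.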

OpenStack Infra (hudson-openstack) wrote : Fix merged to senlin (master)

Reviewed: https://review.opendev.org/705095
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=d243ea253bb31753ad7fefb6a3ef42da3dab7c53
Submitter: Zuul
Branch: master

commit d243ea253bb31753ad7fefb6a3ef42da3dab7c53
Author: Duc Truong <email address hidden>
Date: Thu Jan 30 11:16:00 2020 -0800

    Remove clean-up of cluster/node action when cluster/node is deleted

    The simultaneous delete of cluster/node actions while actual cluster/node
    deletion is on-going causes DB deadlocks. Removing the clean-up and instead
    having users run the senlin-manage script periodically to purge the action
    table is better way to handle this.

    Change-Id: If65d8006150788fff35ecd6eddec0a761e9b4cff
    Closes-Bug: 1861445
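
As a rough illustration of what a periodic purge of the action table could look like, the sketch below deletes finished actions in small batches by primary key, so each transaction touches only a few rows. This is not the senlin-manage implementation. It uses sqlite3 so that it runs standalone, and the simplified schema (id, status, end_time), the function name and the batch size are all assumptions.

```python
import sqlite3

FINISHED = ("SUCCEEDED", "FAILED")


def purge_finished_actions(conn, older_than, batch_size=100):
    """Delete finished actions that ended before `older_than`.

    Rows are selected first and then deleted by primary key in
    batches, one short transaction per batch. Returns the row count.
    """
    total = 0
    while True:
        ids = [row[0] for row in conn.execute(
            "SELECT id FROM action "
            "WHERE status IN (?, ?) AND end_time < ? LIMIT ?",
            (*FINISHED, older_than, batch_size))]
        if not ids:
            return total
        placeholders = ",".join("?" * len(ids))
        with conn:  # commits on success, rolls back on error
            conn.execute(
                f"DELETE FROM action WHERE id IN ({placeholders})", ids)
        total += len(ids)


# Demo with a hypothetical, simplified action table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE action (id TEXT PRIMARY KEY, status TEXT, end_time REAL)")
with conn:
    conn.executemany(
        "INSERT INTO action VALUES (?, ?, ?)",
        [("a1", "SUCCEEDED", 10.0),
         ("a2", "FAILED", 20.0),
         ("a3", "SUCCEEDED", 30.0),
         ("a4", "SUCCEEDED", 500.0),   # finished, but too recent
         ("a5", "RUNNING", 10.0)])     # old, but not finished

deleted = purge_finished_actions(conn, older_than=100.0, batch_size=2)
remaining = sorted(r[0] for r in conn.execute("SELECT id FROM action"))
```

Running the purge outside the cluster/node delete path, as the commit describes, keeps the clean-up from competing with in-flight node delete actions.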

Changed in senlin:
status: In Progress → Fix Released