Cluster stuck in deleting
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
senlin | Fix Released | Undecided | Unassigned |
Bug Description
When deleting a large number of clusters, some clusters end up stuck in DELETING state. This can be reproduced by creating 100 clusters with 20 nodes each and deleting all 100 clusters at the same time. After the deletion of the clusters is started, we observe the following error in the senlin-engine logs:
ERROR oslo_db.api DB exceeded retry limit.: DB Deadlock: (pymysql.
1284b663'}] (Background on this error at: http://
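The reproduction described above can be sketched as follows. This is only an illustrative sketch: `delete_cluster` is a hypothetical callable standing in for whatever client call submits a cluster delete request, and `delete_all_concurrently` is a name made up for this example. Because cluster deletion is processed by the engine after the request is accepted, this only submits the requests at the same time; the resulting cluster status has to be checked separately.

```python
from concurrent.futures import ThreadPoolExecutor


def delete_all_concurrently(cluster_ids, delete_cluster, max_workers=100):
    """Submit one delete request per cluster in parallel.

    `delete_cluster` is a hypothetical callable (not a Senlin API) that
    takes a cluster id and submits the delete request for it.
    Returns a dict mapping each cluster id to None on success, or to the
    exception raised while submitting the request.
    """
    def _delete(cluster_id):
        try:
            delete_cluster(cluster_id)
            return cluster_id, None
        except Exception as exc:  # record the failure, keep deleting the rest
            return cluster_id, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(_delete, cluster_ids))
```

Called with the ids of the 100 clusters, this issues all delete requests without waiting for earlier ones to finish, which is the condition under which the deadlock was observed.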
The problem is that the cluster delete will create the node delete actions. When the node delete actions finish, they will be removed from the dependency table. However, in this case a deadlock occurred while trying to perform the delete.
After enabling deadlock logging in MySQL, we observe that the deadlock is caused by another transaction that is deleting rows from the action table:
DELETE FROM action WHERE action.target = '1ed3be01-
This happens because, as part of a cluster/node delete, any actions for that cluster/node are automatically removed from the action table. The removal takes place while the cluster/node delete is still ongoing. See https:/
The DELETE statement on the action table targets only a specific node. In this case, that node is a different one from the node whose rows are being deleted from the dependency table. However, the action DELETE statement involves a long WHERE clause, which ends up locking more than the actions of the targeted node. This is because InnoDB locks all the rows scanned as part of the DELETE statement (See http://
By locking the additional rows scanned as part of the DELETE, InnoDB ends up locking the node delete action that is part of the insert statement into the dependency table. This is the root cause of the deadlock because the dependent and depended fields in the dependency table are foreign keyed into the action table. Since the action delete statement ends up locking more rows than necessary and thereby inadvertently locking the node delete action, the other delete statement is deadlocked and fails.
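The effect of locking scanned rows rather than only matching rows can be illustrated with a toy model. This is not how InnoDB is implemented and it does not reproduce the full deadlock cycle; the row ids, the `ActionRow` type, and the `scans_only_matching` switch are all assumptions made up for this sketch. It only shows why a DELETE aimed at one node can end up holding a lock on another node's action row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRow:
    id: str
    target: str


def locked_by_delete(rows, target, scans_only_matching):
    """Return the ids of rows locked by DELETE ... WHERE target = <target>.

    Toy assumption: every scanned row is locked. If the scan is limited to
    matching rows, only those are locked; otherwise the whole table is.
    """
    if scans_only_matching:
        scanned = [r for r in rows if r.target == target]
    else:
        scanned = list(rows)
    return {r.id for r in scanned}


# Hypothetical contents of the action table.
ROWS = [
    ActionRow("action-1", "node-1"),
    ActionRow("action-2", "node-2"),
    ActionRow("action-3", "node-3"),
]

# Action row that the dependency table change for node-2 refers to
# through its foreign key (hypothetical id).
NEEDED_BY_DEPENDENCY_CHANGE = "action-2"

narrow = locked_by_delete(ROWS, "node-1", scans_only_matching=True)
wide = locked_by_delete(ROWS, "node-1", scans_only_matching=False)

print(NEEDED_BY_DEPENDENCY_CHANGE in narrow)  # False: no conflict
print(NEEDED_BY_DEPENDENCY_CHANGE in wide)    # True: node-2's action is locked too
```

In the wide case, the clean-up for node-1 holds a lock on a row that the transaction working on node-2 needs, which is the overlap described above.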
Reviewed: https://review.opendev.org/705095
Committed: https://git.openstack.org/cgit/openstack/senlin/commit/?id=d243ea253bb31753ad7fefb6a3ef42da3dab7c53
Submitter: Zuul
Branch: master
commit d243ea253bb31753ad7fefb6a3ef42da3dab7c53
Author: Duc Truong <email address hidden>
Date: Thu Jan 30 11:16:00 2020 -0800
Remove clean-up of cluster/node action when cluster/node is deleted
The simultaneous delete of cluster/node actions while actual cluster/node
deletion is on-going causes DB deadlocks. Removing the clean-up and instead
having users run the senlin-manage script periodically to purge the action
table is a better way to handle this.
Change-Id: If65d8006150788fff35ecd6eddec0a761e9b4cff
Closes-Bug: 1861445