Following are some approaches to solve this issue. Please suggest which would be the best way.
1) As suggested by Paul Murray, we can modify the resize operation to set the migration status to 'failed' when the resize operation fails.
In this case, we need to modify the periodic task _cleanup_incomplete_migrations to filter migrations on the 'failed' status instead of 'error'.
2) We can add a new migration status 'cleaned', which will be set by the periodic task _cleanup_incomplete_migrations.
The task can filter migrations that are in 'error' or 'failed' status, and once the instance files are deleted from the compute node (either source or destination node), set the new 'cleaned' status so that the same record is not picked up again in subsequent periodic task runs.
3) As suggested by Nikola Dipanov, it is reasonable to have retry logic on the self.driver.live_migration call. In that case, if the retry logic does not succeed (i.e. the situation is unrecoverable), the migration status would ultimately be set to 'error' by _rollback_live_migration. As of now, however, we don't have retry logic on the live_migration driver call.
4) We can stick with the patch currently under review and replace the migration status 'failed' with 'error' wherever required.
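To make option 2 concrete, here is a minimal sketch of the filtering and status transition it proposes. The function name, record shape, and the delete_instance_files callable are illustrative assumptions, not actual Nova code; only the statuses ('error', 'failed', 'cleaned') come from the proposal above.

```python
# Hypothetical sketch of option 2: after the instance files for a
# stuck migration are deleted, mark the record 'cleaned' so the
# periodic task skips it on later runs. The record shape and the
# delete_instance_files callable are assumptions for illustration.

def cleanup_incomplete_migrations(migrations, delete_instance_files):
    """Process migration records stuck in 'error' or 'failed' status."""
    for migration in migrations:
        if migration['status'] not in ('error', 'failed'):
            continue
        # Delete leftover instance files on this compute node
        # (source or destination), then mark the record 'cleaned'
        # so the next periodic run filters it out.
        delete_instance_files(migration['instance_uuid'])
        migration['status'] = 'cleaned'
    return migrations
```

The key point is the terminal 'cleaned' state: without it, the same 'error'/'failed' records would be re-processed on every periodic run.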