Comment 16 for bug 1499669

Zane Bitter (zaneb) wrote :

The database errors turn out to be a sqlalchemy issue: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on

Surprisingly enough, though, those aren't actually the cause of the problem here. For the most part we deal with an error writing to the DB quite gracefully. The reason the root stack is "hanging" IN_PROGRESS (it's not really hanging; it will eventually time out normally) is that the child stack (the ResourceGroup) doesn't start deleting after we've cancelled its update. And the reason it doesn't start deleting is that we don't wait long enough for the running threads to stop: we give up and never start the delete.

The length of time we wait is configurable as engine_life_check_timeout. The default is 2s, but it turns out that it takes at least 4-5s to cancel a stack of this size. A user could work around this problem by increasing engine_life_check_timeout; however, it's probably just inappropriate for us to be using this value here (I think it happened by historical accident).
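
To illustrate the timing problem described above, here is a minimal, self-contained sketch. It is not Heat's actual code: the function name cancel_then_delete and its parameters are hypothetical, and a sleeping thread stands in for the update threads that are winding down after the cancel. It only shows how a wait that is shorter than the time needed to stop the threads leads to the delete never being started.

```python
import threading
import time


def cancel_then_delete(cancel_duration, wait_timeout):
    """Return True if the delete would start, False if we give up.

    Hypothetical stand-in for the behaviour described in the comment;
    both arguments are in seconds.
    """
    stopped = threading.Event()

    def worker():
        # Stand-in for the running update threads being stopped after cancel.
        time.sleep(cancel_duration)
        stopped.set()

    thread = threading.Thread(target=worker)
    thread.start()

    # Wait at most wait_timeout seconds for the threads to stop.
    # Event.wait() returns False if the timeout expires first.
    delete_started = stopped.wait(timeout=wait_timeout)

    thread.join()  # tidy up so the sketch leaves no thread behind
    return delete_started
```

Calling cancel_then_delete(4.5, 2.0) models the case in this bug (a 2s wait against a cancel that needs 4-5s) and reports that the delete is not started, while a wait longer than the cancel time, as with a raised engine_life_check_timeout, lets it proceed.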

We're much less likely to encounter this issue now that https://review.openstack.org/369827 has merged, but a fix would not only benefit master but also be easily backportable to earlier stable branches.