bad timeout caused kill-controller to leave resources behind
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
juju-ci-tools | Fix Released | High | Curtis Hovey |
Bug Description
We observed that Azure sometimes had instances left behind, preventing subsequent tests from getting enough instances to test with. Andrew reviewed the logs and saw that Juju was still waiting for instances to be reclaimed when CI killed the long-running process. CI must let Juju try to finish. Azure can take a long time to reclaim resources: 10 minutes is not enough even for a trivial deployment, and Juju can take as long as 30 minutes to clean up.
Sane output looks like this example:
2016-07-18 10:01:50 INFO cmd cmd.go:141 admin@local/
Waiting on 1 model
2016-07-18 10:01:52 INFO cmd cmd.go:141 admin@local/
All hosted models reclaimed, cleaning up controller machines
If the console log is missing "All hosted models reclaimed, cleaning up controller machines", then Juju did not clean up; and if CI's log shows that it went on to collect timings, we can see that CI prematurely interrupted Juju.
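The check described above can be sketched as a simple scan of the console log for the reclaim marker. This is illustrative only: the function name `juju_cleaned_up` is hypothetical, not part of juju-ci-tools.

```python
# Marker line that Juju prints once all hosted models are reclaimed.
RECLAIMED_MARKER = (
    'All hosted models reclaimed, cleaning up controller machines')


def juju_cleaned_up(console_log_text):
    """Return True if the console log shows Juju finished reclaiming
    all hosted models before the process exited (hypothetical helper)."""
    return RECLAIMED_MARKER in console_log_text
```

A log that ends at "Waiting on 1 model" with no reclaim marker indicates CI interrupted Juju mid-teardown.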
This issue is currently masked by the Azure cleanup script that reclaims resources older than 6 hours.
Changed in juju-ci-tools:
assignee: Leo Zhang (nealpzhang) → Curtis Hovey (sinzui)
status: Triaged → In Progress

Changed in juju-ci-tools:
status: In Progress → Fix Committed

Changed in juju-ci-tools:
status: Fix Committed → Fix Released
EnvJujuClient.kill_controller() sets a 600-second timeout for all calls to bring down the controller/state-server and their machines. This is twice the time GCE needs, and 4-5x the time needed by the other clouds.
Azure is the exception: a trivial stack of 3 machines takes 666 seconds, and it can take 30 minutes to bring down a large deployment. We could change the timeout to 1800 seconds for everyone, but I prefer to pass 1800 only when client.config['type'] is 'azure'.
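The proposed fix can be sketched as below. This is a minimal sketch, not the actual jujupy code: the helper name `kill_controller_timeout` and the plain-dict `config` parameter are assumptions standing in for the real EnvJujuClient internals; only the 600/1800 values and the 'azure' provider-type check come from the discussion above.

```python
# Default teardown timeout (seconds): ample for GCE and the other clouds.
DEFAULT_KILL_TIMEOUT = 600
# Azure reclaims resources slowly; large deployments can take ~30 minutes.
AZURE_KILL_TIMEOUT = 1800


def kill_controller_timeout(config):
    """Pick the kill-controller timeout from the client config.

    `config` is assumed to be a dict-like with a 'type' key naming the
    cloud provider, mirroring client.config['type'] in the report.
    """
    if config.get('type') == 'azure':
        return AZURE_KILL_TIMEOUT
    return DEFAULT_KILL_TIMEOUT
```

With this, only Azure runs wait up to 30 minutes; the other clouds keep the existing 600-second budget.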