2.7.7+2.7-5323b92: vSphere machine stuck in allocating after juju logs show it destroyed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Invalid | Low | Unassigned |
Bug Description
While release-testing juju 2.7.7+2.7-5323b92, a Kubernetes bundle deploy was attempted.
Juju status shows machine 5 (juju-721cbe-5) stuck in allocating with the message "powering on VM: 44.00%".
The controller machine log shows the machine was started, then powered off and destroyed.
```
grep juju-721cbe-5 machine-0.log
2020-06-19 17:16:22 INFO juju.provider.
2020-06-19 17:16:51 DEBUG juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
```

Machine 9 was also powered off, with the status message:

```
cannot record provisioning info for "juju-721cbe-9": cannot set instance data for machine "9": read tcp 127.0.0.
2020-06-19 17:15:50 INFO juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
```
Please see the logs available through the link at the bottom of the test run page:
https:/
description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
I've spent quite a bit of time going through the crashdump from this.
I think there is a symptom, a real bug, and an underlying issue; all three are different.
The bug we see is what causes the test to fail.
This is from foundation.log:
ERROR:root:ERROR model "kubernetes" has been removed from the controller, run 'juju models' and switch to one of them.
This appears to be a bug in the dependency engine configuration for the controller: the model cache should be initialized before the apiserver starts, but here the apiserver is starting before the cache has been populated with the models, so the model can't be found in the cache and the lookup fails. This we definitely need to fix.
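The intended ordering can be expressed as a gate that opens only once the cache is populated. Juju's dependency engine implements this in Go; below is a minimal Python sketch of the same gating pattern, with illustrative names (`model_cache_ready`, `start_apiserver`) that are not Juju's actual API:

```python
import threading

# Gate that opens once the model cache has been populated.
model_cache_ready = threading.Event()

def initialize_model_cache(models, cache):
    # Populate the cache from the database, then open the gate.
    for model in models:
        cache[model["uuid"]] = model
    model_cache_ready.set()

def start_apiserver(cache, timeout=5.0):
    # The apiserver must not serve requests until the gate is open;
    # otherwise lookups like the "kubernetes" model fail spuriously.
    if not model_cache_ready.wait(timeout):
        raise RuntimeError("model cache not initialized; refusing to start")
    return {"serving": True, "models": list(cache)}
```

With this ordering, a client asking for a model either gets an answer from a fully initialized cache or the apiserver refuses to start at all, rather than returning a false "model has been removed" error.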
The symptom is that the machine is shut down due to an inability to set instance data; this is the i/o timeout. You'll also see that machine-10 was powered off and destroyed, yet juju-status.yaml shows machine-10 up and running, so the powering off is a red herring: automatic provisioning retries would have taken care of it.
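That retry behaviour can be sketched as a simple retry-with-backoff loop around the failing call. The helper names here are hypothetical; Juju's real provisioner is Go code with its own retry strategy:

```python
import time

def set_instance_data_with_retry(set_instance_data, machine_id, data,
                                 attempts=3, delay=0.01):
    # Retry transient failures (e.g. "read tcp ... i/o timeout" against
    # the controller's mongodb) before giving up and powering off the VM.
    last_error = None
    for attempt in range(attempts):
        try:
            return set_instance_data(machine_id, data)
        except IOError as err:
            last_error = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

A transient timeout on the first attempt then succeeds on a later one, which is why a powered-off machine that later shows as running is expected behaviour rather than the bug itself.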
The underlying cause here is the slow disk or lack of memory on the instance the controller is running on. The test brings up 70 units on 22 machines, and my gut feeling is that the controller's 3.5 GB of memory isn't enough: it puts pressure on the underlying mongodb. If you look at the mongod entries in /var/log/syslog, you can see that queries for very simple things end up taking multiple seconds. This is a clear indication of either disk i/o issues or memory issues. If mongo isn't able to keep the working set in memory, everything slows down, particularly if the machine's disk isn't local.
At 17:16:51 the "state" worker restarted due to a ping i/o timeout, meaning mongo wasn't able to respond to a simple ping in a timely manner. Restarting the "state" worker also restarts the apiserver and a bunch of other workers, but the model-cache-initialized gate wasn't restarted, so the apiserver started prematurely. This is the bug; the underlying cause is the disk i/o and memory constraints of the controller itself.
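The failure mode described above, a worker restart that leaves the gate open, can be reproduced in a small illustrative sketch (not Juju's dependency engine): if the restart empties the cache but does not clear the gate, a dependent worker starts against an empty cache.

```python
import threading

def restart_state_worker(gate, cache, clear_gate):
    # Simulate the "state" worker bouncing: the cache is emptied and
    # must be repopulated before the apiserver may start again.
    cache.clear()
    if clear_gate:
        gate.clear()  # the fix: force dependents to wait again

def apiserver_can_start(gate):
    # With the gate still set from before the restart, the apiserver
    # starts immediately against an empty cache -- the observed bug.
    return gate.is_set()
```

Clearing the gate on restart forces the apiserver to wait for cache re-initialization, which is the fix the comment above calls for.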