2.7.7+2.7-5323b92: vSphere machine stuck in allocating after juju logs show it destroyed

Bug #1884315 reported by John George
Affects: Canonical Juju
Status: Invalid
Importance: Low
Assigned to: Unassigned

Bug Description

While release testing juju 2.7.7+2.7-5323b92, a Kubernetes bundle deploy was attempted.

Juju status shows machine 5 (juju-721cbe-5) stuck in allocating with the message "powering on VM: 44.00%".

The controller machine log shows the machine was started, then powered off and destroyed:
grep juju-721cbe-5 machine-0.log
2020-06-19 17:16:22 INFO juju.provider.vmware environ_broker.go:94 started instance "juju-721cbe-5"
2020-06-19 17:16:51 DEBUG juju.provider.vmware client.go:177 powering off "juju-721cbe-5"
2020-06-19 17:16:52 DEBUG juju.provider.vmware client.go:186 destroying "juju-721cbe-5"

Machine 9 was also powered off, with the status message:
cannot record provisioning info for "juju-721cbe-9": cannot set instance data for machine "9": read tcp 127.0.0.1:43528->127.0.0.1:37017: i/o timeout

2020-06-19 17:15:50 INFO juju.provider.vmware environ_broker.go:94 started instance "juju-721cbe-9"
2020-06-19 17:16:52 DEBUG juju.provider.vmware client.go:177 powering off "juju-721cbe-9"
2020-06-19 17:16:52 DEBUG juju.provider.vmware client.go:186 destroying "juju-721cbe-9"

Please see the logs available through the link at the bottom of the test run page:
https://solutions.qa.canonical.com/#/qa/testRun/8240f026-13d1-43e2-b49f-c7b039d2473f

John George (jog)
description: updated
Tim Penhey (thumper) wrote:

I've spent quite a bit of time going through the crashdump from this.

I think there is a symptom, a real bug, and the underlying issue. All three are different.

The bug we see is what causes the test to fail.

This is from foundation.log:
ERROR:root:ERROR model "kubernetes" has been removed from the controller, run 'juju models' and switch to one of them.

This appears to be a bug in the dependency engine configuration for the controller. The model cache should be initialized before the apiserver starts. What we are seeing here is that the apiserver is starting before the cache has been initialized with the models. The error here is because the model can't be found in the cache. This we definitely need to fix.
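
Roughly, the intended dependency-engine ordering is the following. This is an illustrative sketch only, not Juju's actual manifold code; the toy engine and the exact worker names are made up for this comment:

package main

import "fmt"

// A toy dependency-engine configuration, to illustrate the intended ordering.
// Each worker names the inputs it depends on, and the engine only starts it
// once those inputs are already running.
type manifold struct {
	inputs []string
	start  func()
}

func main() {
	manifolds := map[string]manifold{
		"state": {start: func() { fmt.Println("state: connected to mongo") }},
		"model-cache": {inputs: []string{"state"},
			start: func() { fmt.Println("model-cache: models loaded") }},
		"model-cache-initialized": {inputs: []string{"model-cache"},
			start: func() { fmt.Println("gate: unlocked") }},
		// The apiserver lists the initialised gate as an input, so it should
		// never come up before the cache has been populated with the models.
		"api-server": {inputs: []string{"state", "model-cache-initialized"},
			start: func() { fmt.Println("api-server: serving") }},
	}

	// Naive resolution: keep starting anything whose inputs are all running.
	started := map[string]bool{}
	for len(started) < len(manifolds) {
		for name, m := range manifolds {
			if started[name] {
				continue
			}
			ready := true
			for _, in := range m.inputs {
				ready = ready && started[in]
			}
			if ready {
				m.start()
				started[name] = true
			}
		}
	}
}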

The symptom is that the machine is shut down due to an inability to set instance data. This is the i/o timeout. You'll also see that machine-10 was powered off and destroyed, but juju-status.yaml shows machine-10 up and running, so the powering off is a red herring. The automatic provisioning retries would have taken care of that.
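
For reference, this is the shape of failure a slow mongod produces for any client whose socket timeout expires. The snippet below is a standalone mgo sketch, not Juju's code; it skips the TLS and credentials the controller's mongod actually requires, and the collection and field names are only illustrative:

package main

import (
	"fmt"
	"time"

	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	// Juju's controller mongod listens on port 37017.
	session, err := mgo.DialWithTimeout("127.0.0.1:37017", 5*time.Second)
	if err != nil {
		fmt.Println("dial:", err)
		return
	}
	defer session.Close()

	// Any operation mongod cannot answer within this window comes back as a
	// net error like "read tcp 127.0.0.1:43528->127.0.0.1:37017: i/o timeout".
	session.SetSocketTimeout(10 * time.Second)

	machines := session.DB("juju").C("machines")
	err = machines.UpdateId("9", bson.M{"$set": bson.M{"pending": false}})
	if err != nil {
		// Under enough disk/memory pressure, this kind of error is what the
		// provisioner reported when setting instance data for machine 9.
		fmt.Println("update:", err)
	}
}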

The underlying cause here is slow disk or lack of memory on the instance the controller is running on. The test brings up 70 units on 22 machines. I have a gut feeling that the 3.5GB of memory the controller has isn't enough, and it is putting pressure on the underlying MongoDB. If you look at the mongod entries in /var/log/syslog, you can see that the query times for very simple things end up taking multiple seconds. This is a clear indication of either disk i/o issues or memory issues. If mongo isn't able to keep the working set in memory, then everything will get slower, particularly if the actual disk for the machine isn't local.

At 17:16:51 the "state" worker restarted due to a ping i/o timeout, meaning mongo wasn't able to respond to a simple ping in a timely manner. Restarting the "state" worker also restarts the apiserver and a bunch of other workers, but the model-cache-initialized gate wasn't restarted, so the apiserver started prematurely. This is the bug, but the underlying cause is the disk i/o and memory constraints of the controller itself.
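
As a hypothetical sketch of that sequence (made-up Gate and worker functions, not Juju's actual gate or worker code), the problem is that a one-shot gate that is never re-locked lets a restarted consumer through immediately:

package main

import (
	"fmt"
	"sync"
)

// Gate stands in for the model-cache-initialized gate: a one-shot latch that,
// once unlocked, is never re-locked by anything in this sketch.
type Gate struct {
	once     sync.Once
	unlocked chan struct{}
}

func NewGate() *Gate                      { return &Gate{unlocked: make(chan struct{})} }
func (g *Gate) Unlock()                   { g.once.Do(func() { close(g.unlocked) }) }
func (g *Gate) Unlocked() <-chan struct{} { return g.unlocked }

// apiserver waits for the gate, then answers from whatever is in the cache.
func apiserver(gate *Gate, cache map[string]bool) {
	<-gate.Unlocked()
	if !cache["kubernetes"] {
		fmt.Println(`ERROR model "kubernetes" has been removed from the controller`)
		return
	}
	fmt.Println("apiserver: model found in cache, serving")
}

func main() {
	gate := NewGate()

	// First start: the cache is loaded, the gate is unlocked, and only then
	// does the apiserver come up. Everything is fine.
	cache := map[string]bool{"kubernetes": true}
	gate.Unlock()
	apiserver(gate, cache)

	// The mongo ping times out: the state worker bounces, taking the cache
	// and apiserver workers with it, but the gate is not restarted, so it
	// stays unlocked while the cache is still being repopulated.
	cache = map[string]bool{} // cache worker restarted, models not loaded yet
	apiserver(gate, cache)    // returns immediately: premature start
}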

Heather Lanigan (hmlanigan) wrote:

Small update:

The api-server worker does wait for the model cache to be initialized:
https://github.com/juju/juju/blob/091efcf08ed828834b5b9df6495539d158e081ae/cmd/jujud/agent/machine/manifolds.go#L682

So why does it appear not to wait?

Tim Penhey (thumper) wrote:

That is the $250k question.

Pen Gale (pengale)
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Canonical Juju QA Bot (juju-qa-bot) wrote:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot
Konstantinos Kaskavelis (kaskavel) wrote:

Closing this due to inactivity (low number of occurrences, and no hits for more than one year).

tags: added: solutions-qa-expired
Changed in juju:
status: Triaged → Invalid