2.7.7+2.7-5323b92: vSphere machine stuck in allocating after juju logs show it destroyed
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Invalid | Low | Unassigned |
Bug Description
While release-testing juju 2.7.7+2.7-5323b92, a Kubernetes bundle deploy was attempted.
Juju status shows machine 5 (juju-721cbe-5) stuck in allocating with the message "powering on VM: 44.00%".
The controller machine log shows the machine was started, then powered off and destroyed.
```
grep juju-721cbe-5 machine-0.log
2020-06-19 17:16:22 INFO juju.provider.
2020-06-19 17:16:51 DEBUG juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
```

Machine 9 was also powered off, with the status message:

```
cannot record provisioning info for "juju-721cbe-9": cannot set instance data for machine "9": read tcp 127.0.0.
2020-06-19 17:15:50 INFO juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
2020-06-19 17:16:52 DEBUG juju.provider.
```
Please see the logs available through the link at the bottom of the test run page:
https:/
description: updated
Changed in juju:
status: New → Triaged
importance: Undecided → Medium
I've spent quite a bit of time going through the crashdump from this.
I think there is a symptom, a real bug, and an underlying issue; all three are different.
The bug we see is what causes the test to fail.
This is from foundation.log:
ERROR:root:ERROR model "kubernetes" has been removed from the controller, run 'juju models' and switch to one of them.
This appears to be a bug in the dependency engine configuration for the controller: the model cache should be initialized before the apiserver starts, but here the apiserver is starting before the cache has been populated with the models, so the model can't be found in the cache and the lookup fails. This we definitely need to fix.
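The intended ordering can be expressed as a gate that opens only once the cache is populated. Juju's dependency engine implements this in Go; below is a minimal Python sketch of the same gating pattern, with illustrative names (`model_cache_ready`, `start_apiserver`) that are not Juju's actual API:

```python
import threading

# Gate that opens once the model cache has been populated.
model_cache_ready = threading.Event()

def initialize_model_cache(models, cache):
    # Populate the cache from the database, then open the gate.
    for model in models:
        cache[model["uuid"]] = model
    model_cache_ready.set()

def start_apiserver(cache, timeout=5.0):
    # The apiserver must not serve requests until the gate is open;
    # otherwise lookups like the "kubernetes" model fail spuriously.
    if not model_cache_ready.wait(timeout):
        raise RuntimeError("model cache not initialized; refusing to start")
    return {"serving": True, "models": list(cache)}
```

With this ordering, a client asking for a model either gets an answer from a fully initialized cache or the apiserver refuses to start at all, rather than returning a false "model has been removed" error.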
The symptom is that the machine is shut down due to an inability to set instance data; this is the i/o timeout. You'll also see that machine-10 was powered off and destroyed, yet juju-status.yaml shows machine-10 up and running, so the powering off is a red herring: automatic provisioning retries would have taken care of it.
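That retry behaviour can be sketched as a simple retry-with-backoff loop around the failing call. The helper names here are hypothetical; Juju's real provisioner is Go code with its own retry strategy:

```python
import time

def set_instance_data_with_retry(set_instance_data, machine_id, data,
                                 attempts=3, delay=0.01):
    # Retry transient failures (e.g. "read tcp ... i/o timeout" against
    # the controller's mongodb) before giving up and powering off the VM.
    last_error = None
    for attempt in range(attempts):
        try:
            return set_instance_data(machine_id, data)
        except IOError as err:
            last_error = err
            time.sleep(delay * (2 ** attempt))  # exponential backoff
    raise last_error
```

A transient timeout on the first attempt then succeeds on a later one, which is why a powered-off machine that later shows as running is expected behaviour rather than the bug itself.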
The underlying cause here is the slow disk or lack of memory on the instance the controller is running on. The test brings up 70 units on 22 machines, and my gut feeling is that the controller's 3.5 GB of memory isn't enough: it puts pressure on the underlying mongodb. If you look at the mongod entries in /var/log/syslog, you can see that queries for very simple things end up taking multiple seconds. This is a clear indication of either disk i/o issues or memory issues. If mongo isn't able to keep the working set in memory, everything slows down, particularly if the machine's disk isn't local.
At 17:16:51 the "state" worker restarted due to a ping i/o timeout, meaning mongo wasn't able to respond to a simple ping in a timely manner. Restarting the "state" worker also restarts the apiserver and a bunch of other workers, but the model-cache-initialized gate wasn't restarted, so the apiserver started prematurely. This is the bug; the underlying cause is the disk i/o and memory constraints of the controller itself.
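The failure mode described above, a worker restart that leaves the gate open, can be reproduced in a small illustrative sketch (not Juju's dependency engine): if the restart empties the cache but does not clear the gate, a dependent worker starts against an empty cache.

```python
import threading

def restart_state_worker(gate, cache, clear_gate):
    # Simulate the "state" worker bouncing: the cache is emptied and
    # must be repopulated before the apiserver may start again.
    cache.clear()
    if clear_gate:
        gate.clear()  # the fix: force dependents to wait again

def apiserver_can_start(gate):
    # With the gate still set from before the restart, the apiserver
    # starts immediately against an empty cache -- the observed bug.
    return gate.is_set()
```

Clearing the gate on restart forces the apiserver to wait for cache re-initialization, which is the fix the comment above calls for.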