Comment 1 for bug 1826297

John A Meinel (jameinel) wrote : Re: [Bug 1826297] Re: application units can't get resource from controller

I'm guessing the issue is that when multiple units request the same
resource, we aren't handling the caching and queuing on the controller
correctly. The controller downloads resources on demand (some charms have
very large resources, so we don't want to cache them unless they are
needed). My guess is that multiple concurrent requests for the same
resource are confusing the queuing system: instead of one request starting
the download and the rest blocking until it finishes, some requests fail.
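For illustration, the intended behaviour would be a request-coalescing
("singleflight") pattern. The following is only a hedged sketch in Python,
not Juju's controller code (which is written in Go); every name in it is
hypothetical:

    import threading

    class ResourceFetcher:
        """Hypothetical sketch: the first request for a resource starts
        the download; concurrent requests for the same resource block
        until it finishes, then read the cached copy."""

        def __init__(self, download):
            self._download = download          # callable: name -> bytes
            self._lock = threading.Lock()
            self._inflight = {}                # name -> threading.Event
            self._cache = {}                   # name -> bytes

        def fetch(self, name):
            with self._lock:
                if name in self._cache:
                    return self._cache[name]
                event = self._inflight.get(name)
                leader = event is None
                if leader:
                    # First requester: mark the download as in flight.
                    event = self._inflight[name] = threading.Event()
            if leader:
                try:
                    data = self._download(name)
                    with self._lock:
                        self._cache[name] = data
                    return data
                finally:
                    with self._lock:
                        del self._inflight[name]
                    event.set()                # wake waiters even on failure
            event.wait()
            with self._lock:
                # KeyError here means the leading download failed; a real
                # implementation would propagate the error to the waiters.
                return self._cache[name]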

On Thu, Apr 25, 2019 at 6:15 AM james beedy <email address hidden> wrote:

> ** Description changed:
>
>   When multiple units try to pull a resource at once, everything seems
>   to lock up: some units are able to get the resource, and some fail to
>   pull it from the controller. I have worked around this in my spark
>   charm to some degree by putting units that can't get the resource into
>   a blocked state and having them naturally retry later. This eventually
>   works itself out (all of my units end up getting the resource), but it
>   is for sure an extreme hack.
>
>   This can be reproduced by running the following command:
>
>   juju deploy cs:~omnivector/spark --constraints "instance-type=t3.medium" -n 10
>
>   Exhibited in this juju show: https://youtu.be/lirfA5a9Xik?t=1351
>
>   The charm code that accounts for this demented block and return spin
>   lock mechanism is here:
>   https://github.com/omnivector-solutions/layer-spark-base/blob/master/lib/charms/layer/spark_base.py#L27,L30
>   and here:
>   https://github.com/omnivector-solutions/layer-spark-base/blob/master/reactive/spark_base.py#L76,L81
>
>   Similarly for layer-hadoop-base:
>   https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L25,L28
>   and
>   https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L24,L28
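For reference, the block-and-retry workaround described in the quoted
report follows roughly this shape. This is a hedged approximation, not the
actual code at the URLs above; it assumes the usual charms.reactive /
charmhelpers APIs, and the resource name and flag name are placeholders:

    from charms.reactive import when_not, set_flag
    from charmhelpers.core.hookenv import resource_get, status_set


    @when_not('spark.resource.fetched')
    def acquire_resource():
        # resource_get() asks the controller for the named resource and
        # returns its local path, or a falsy value on failure.
        path = resource_get('spark')
        if not path:
            # Under contention this fails for some units; go blocked and
            # let a later hook invocation retry (the "demented spin lock").
            status_set('blocked', 'could not fetch spark resource, will retry')
            return
        status_set('maintenance', 'spark resource fetched')
        set_flag('spark.resource.fetched')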