application units can't get resource from controller

Bug #1826297 reported by james beedy
This bug affects 4 people
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

When multiple units try to pull a resource (or several resources) at once, everything seems to lock up: some units get the resource, and others fail to pull it from the controller. I have worked around this to some degree in my spark charm by putting units that can't get the resource into a blocked state and having them retry when it's their time. This ends up working itself out, i.e. all of my units eventually get the resource, but it's for sure an extreme hack.

This can be reproduced by running the following command:

juju deploy cs:~omnivector/spark --constraints "instance-type=t3.medium" -n 10

The problem is exhibited in this Juju Show episode: https://youtu.be/lirfA5a9Xik?t=1351

The charm code that accounts for this demented block-and-return spin-lock mechanism is here: https://github.com/omnivector-solutions/layer-spark-base/blob/master/lib/charms/layer/spark_base.py#L27,L30 and here: https://github.com/omnivector-solutions/layer-spark-base/blob/master/reactive/spark_base.py#L76,L81

Similarly for layer-hadoop-base: https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L25,L28 and https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L24,L28
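
For readers who don't want to follow the links, the block-and-retry workaround described above amounts to roughly the following. This is a minimal sketch, not the actual charm code: `try_fetch_resource` and its `fetch` argument are hypothetical names, and in a real charm `fetch` would presumably wrap the `resource-get` hook tool. The function makes one attempt and returns the workload status the unit should report; a unit that ends up `blocked` simply tries again the next time the handler runs.

```python
from typing import Callable, Optional, Tuple


def try_fetch_resource(
    fetch: Callable[[str], Optional[str]],  # hypothetical: returns a local path, or None on failure
    name: str,
) -> Tuple[str, str]:
    """Make one fetch attempt and return the (status, message) to report."""
    try:
        path = fetch(name)
    except Exception as exc:
        # The controller refused or timed out: stay blocked and retry later.
        return "blocked", f"could not fetch resource {name!r}: {exc}; will retry"
    if not path:
        # No error was raised, but no file was delivered either.
        return "blocked", f"resource {name!r} not available yet; will retry"
    return "active", f"resource {name!r} available at {path}"
```

The point of returning a status instead of looping is that the unit never spins inside a hook; the retry happens naturally on a later invocation.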

james beedy (jamesbeedy)
description: updated
John A Meinel (jameinel) wrote : Re: [Bug 1826297] Re: application units can't get resource from controller

I'm guessing the issue is that when multiple units request the
same resource, we aren't handling the caching and queuing on the controller
correctly. The controller should download the resource on demand
(some charms have very large resources, so we don't want to cache them
unless they are needed). My guess is that multiple requests for the same
resource are confusing the queuing system, rather than one request
starting the download and the rest blocking until it has finished.
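
The behaviour guessed at above (one request starts the download, the rest block until it has finished) can be illustrated with a small sketch. This is not Juju's controller code, which is written in Go; it is only a Python illustration of the idea, and `SingleFlight`, `_Call` and the `download` callable are hypothetical names.

```python
import threading
from typing import Callable, Dict, Optional


class _Call:
    """Bookkeeping for one in-progress download."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Optional[bytes] = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapse concurrent requests for the same key into one download."""

    def __init__(self, download: Callable[[str], bytes]) -> None:
        self._download = download  # hypothetical fetch-from-store function
        self._lock = threading.Lock()
        self._in_flight: Dict[str, _Call] = {}

    def get(self, key: str) -> bytes:
        with self._lock:
            call = self._in_flight.get(key)
            leader = call is None
            if leader:
                # First requester for this key: register the call.
                call = _Call()
                self._in_flight[key] = call
        if leader:
            try:
                call.result = self._download(key)
            except BaseException as exc:
                call.error = exc
            finally:
                # Unregister first, then wake everyone waiting on this call.
                with self._lock:
                    del self._in_flight[key]
                call.done.set()
        else:
            # Someone else is already downloading: wait for their result.
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result
```

Nothing is kept once the download completes, so a request that arrives later triggers a fresh download; that matches the on-demand, don't-cache-unless-needed intent, but whether and how long to cache is a separate decision.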

Changed in juju:
status: New → Triaged
importance: Undecided → High
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot