Canonical Juju

application units can't get resource from controller

Bug #1826297 reported by james beedy on 2019-04-25

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Triaged	Low	Unassigned	

Bug Description

When multiple units try to pull a resource (or several resources) at once, everything seems to lock up: some units are able to get the resource, and some fail to pull it from the controller. I have worked around this in my spark charm to some degree by putting units that can't get the resource into a blocked state and having them retry when their turn comes. This ends up working itself out, i.e. all of my units eventually get the resource, but it's for sure an extreme hack.

This can be reproduced by running the following command:

juju deploy cs:~omnivector/spark --constraints "instance-type=t3.medium" -n 10

The problem is exhibited in this Juju Show episode: https://youtu.be/lirfA5a9Xik?t=1351

The charm code that accounts for this demented block and return spin lock mechanism is here https://github.com/omnivector-solutions/layer-spark-base/blob/master/lib/charms/layer/spark_base.py#L27,L30 and here https://github.com/omnivector-solutions/layer-spark-base/blob/master/reactive/spark_base.py#L76,L81

Similarly for layer-hadoop-base: https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L25,L28 and https://github.com/omnivector-solutions/layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L24,L28
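
For readers who don't want to click through, the block-and-retry idea can be sketched roughly as below. This is not the code from the linked layers; it is a minimal illustration under assumptions. The names `try_install_resource`, `fetch`, and `install` are hypothetical, and `fetch` stands in for whatever wraps the `resource-get` hook tool in a real charm. Each call models one hook invocation: if the resource isn't available, the unit reports blocked and simply tries again the next time it runs.

```python
from typing import Callable, Optional, Tuple


def try_install_resource(
    fetch: Callable[[], Optional[str]],
    install: Callable[[str], None],
) -> Tuple[str, str]:
    """Model one hook invocation of the workaround (hypothetical helper).

    fetch   -- returns a local path to the resource, or None / raises on failure
    install -- consumes the path once the resource is available
    Returns a (status, message) pair the charm could report.
    """
    try:
        path = fetch()
    except Exception:
        # Treat any fetch error the same as "resource not available yet".
        path = None

    if not path:
        # Do not fail the hook; stay blocked and retry on the next invocation.
        return ("blocked", "could not get resource from controller, will retry")

    install(path)
    return ("active", "resource installed")
```

Calling this once per hook run means every unit eventually installs the resource once a fetch succeeds, which matches the behaviour described above, but it only hides the controller-side problem.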

Tags:

james beedy (jamesbeedy) on 2019-04-25

description:	updated

james beedy (jamesbeedy) on 2019-04-25

description:	updated
description:	updated

John A Meinel (jameinel) wrote on 2019-04-25: Re: [Bug 1826297] Re: application units can't get resource from controller

I'm guessing the issue could be that if multiple units are requesting the
same resource, we aren't handling the caching and queuing on the controller
correctly. The controller should be downloading the resource on demand
(some charms have very large resources, so we don't want to cache them
unless they are needed). My guess is that multiple requests for the same
resource are confusing the queuing system, so that we don't end up with one
request starting the download and the rest blocking until it has finished.
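
To make the intended behaviour concrete, here is a minimal sketch of "one request starts the download, the rest block until it is finished". It is not Juju's controller code (which is written in Go); the `SingleFlight` class, its `do` method, and the `download` callable are hypothetical names used only for illustration. The sketch also only deduplicates requests that are in flight at the same time and deliberately does no caching afterwards.

```python
import threading
from typing import Any, Callable, Dict, Optional


class _Call:
    """State shared by every caller waiting on one in-flight download."""

    def __init__(self) -> None:
        self.done = threading.Event()
        self.result: Any = None
        self.error: Optional[BaseException] = None


class SingleFlight:
    """Collapse concurrent requests for the same key into one download."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, _Call] = {}

    def do(self, key: str, download: Callable[[], Any]) -> Any:
        # Decide under the lock whether this caller leads or follows.
        with self._lock:
            call = self._inflight.get(key)
            leader = call is None
            if leader:
                call = _Call()
                self._inflight[key] = call

        if leader:
            try:
                call.result = download()
            except BaseException as exc:
                call.error = exc
            finally:
                # Forget the key first, then wake every waiting follower.
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            # Followers block until the leader's download has finished.
            call.done.wait()

        if call.error is not None:
            raise call.error
        return call.result
```

With something like this in front of the resource fetch, ten units asking for the same resource at once would trigger a single download, and all ten would receive the same result or the same error.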

On Thu, Apr 25, 2019 at 6:15 AM james beedy <jamesbeedy@gmail.com> wrote:

> ** Description changed:
>
>   When multiple units try and pull a resource(s) at once everything seems
>   to lock up, some units are able to get the resource, and some fail
>   pulling it from the controller. I have worked around this in my spark
>   charm to some degree by putting units that can't get the resource in a
>   blocked state and have them naturally retry again when its their time.
>   This ends up working itself out e.g. all of my units end up eventually
>   getting the resource, but its for sure an extreme hack.
>
>   This can be reproduced by running the following command:
>
>   juju deploy cs:~omnivector/spark --constraints "instance-type=t3.medium"
>   -n 10
>
>   Exhibited in this juju show here https://youtu.be/lirfA5a9Xik?t=1351
> +
> + The charm code that accounts for the block and return demented spin lock
> + mechanism is here https://github.com/omnivector-solutions/layer-spark-
> + base/blob/master/lib/charms/layer/spark_base.py#L27,L30 and here
> + https://github.com/omnivector-solutions/layer-spark-
> + base/blob/master/reactive/spark_base.py#L76,L81
> +
> + similarly for layer-hadoop-base, https://github.com/omnivector-solutions
> + /layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L25,L28
> + and https://github.com/omnivector-solutions/layer-hadoop-
> + base/blob/master/lib/charms/layer/hadoop_base.py#L24,L28
>
> ** Description changed:
>
>   When multiple units try and pull a resource(s) at once everything seems
>   to lock up, some units are able to get the resource, and some fail
>   pulling it from the controller. I have worked around this in my spark
>   charm to some degree by putting units that can't get the resource in a
>   blocked state and have them naturally retry again when its their time.
>   This ends up working itself out e.g. all of my units end up eventually
>   getting the resource, but its for sure an extreme hack.
>
>   This can be reproduced by running the following command:
>
>   juju deploy cs:~omnivector/spark --constraints "instance-type=t3.medium"
>   -n 10
>
>   Exhibited in this juju show here https://youtu.be/lirfA5a9Xik?t=1351
>
> - The charm code that accounts for the block and return demented spin lock
> - mechanism is here https://github.com/omnivector-solutions/layer-spark-
> - base/blob/master/lib/charms/layer/spark_base.py#L27,L30 and here
> + The charm code that accounts for this demented block and return spin
> + lock mechanism is here https://github.com/omnivector-solutions/layer-
> + spark-base/blob/master/lib/charms/layer/spark_base.py#L27,L30 and here
>   https://github.com/omnivector-solutions/layer-spark-
>   base/blob/master/reactive/spark_base.py#L76,L81
>
>   similarly for layer-hadoop-base, https://github.com/omnivector-solutions
>   /layer-hadoop-base/blob/master/lib/charms/layer/hadoop_base.py#L25,L28
>   and https://github.com/omnivector-solutions/layer-hadoop-
>   base/blob/master/lib/charms/layer/hadoop_base.py#L24,L28

Richard Harding (rharding) on 2019-04-29

Changed in juju:
status:	New → Triaged
importance:	Undecided → High

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	High → Low
tags:	added: expirebugs-bot
