juju 2.2.4 and 2.2.6 actions dissappear when state is changed from running to complete

Bug #1729880 reported by Matt Dirba on 2017-11-03
24
This bug affects 4 people
Affects Status Importance Assigned to Milestone
juju
High
Eric Claude Jones

Bug Description

Description: After 20ish hours of calling actions in a while loop that take either a minute, 30 minutes, an hour, or 4 hours to run. The result of an action disappears and juju loop.run gets stuck in an infinite loop.

Code:
    def run_action(self, unit, action, params):
        from juju import loop
        app = self.model.units[unit].application
        action = loop.run(self.model.applications[app].units[0].run_action(action, **params))
        logger.info("Running action %s on %s UUID %s with Params %s", action.data['name'],
                    action.data['receiver'], action.data['id'], params)
        loop.run(action.wait())
        self.update_status()
        return action

    while True:
        datetime.now()
        self.run_action('sshdefault/0', 'benchmark', params={'time': 1800})

Log from above code with timestamp of the start of the action:
    jujubench2.jujubench - 2017-11-02 17:16:00,747 Running action benchmark on sshdefault/0 UUID b57d6c41-6cf6-4f92-81f1-558887cd2062 with Params {'time': 1800}

$ juju debug-log -m sshdefault
unit-sshdefault-0: 17:46:47 ERROR juju.worker.uniter resolver loop error: executing operation "run action b57d6c41-6cf6-4f92-81f1-558887cd2062": running action "benchmark": action "b57d6c41-6cf6-4f92-81f1-558887cd2062" not found
unit-sshdefault-0: 17:46:47 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: executing operation "run action b57d6c41-6cf6-4f92-81f1-558887cd2062": running action "benchmark": action "b57d6c41-6cf6-4f92-81f1-558887cd2062" not found

$ juju show-action-status -m sshdefault b57d6c41-6cf6-4f92-81f1-558887cd2062
ERROR no actions found matching prefix "b57d6c41-6cf6-4f92-81f1-558887cd2062"

Note: The errors show up in the juju debug-log after the action completes, but I can not find the action using juju show-action-status. This indicates to me that when the action is moved from the running to completed state it gets lost.

Tags: sts Edit Tag help
Matt Dirba (5qxm) wrote :

I have tried this with both a default bootstrap of the juju controller, as well as a bootstrap of a controller with the following config values.

automatically-retry-hooks: false
enable-os-refresh-update: false
enable-os-upgrade: false
firewall-mode: none
test-mode: true
transmit-vendor-metrics: false

I do believe we prune old action data in newer releases of Juju. I'd have
to ask Eric to know what the trigger is for pruning old actions. Its also
possible there is just a bug in the pruning logic.

On Fri, Nov 3, 2017 at 7:40 PM, Matt Dirba <email address hidden> wrote:

> I have tried this with both a default bootstrap of the juju controller,
> as well as a bootstrap of a controller with the following config values.
>
> automatically-retry-hooks: false
> enable-os-refresh-update: false
> enable-os-upgrade: false
> firewall-mode: none
> test-mode: true
> transmit-vendor-metrics: false
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1729880
>
> Title:
> juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> running to complete
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1729880/+subscriptions
>

Ian Booth (wallyworld) wrote :

There are model parameters which are used to control how action results are pruned.

max-action-results-age
- the maximum age of actions to keep when pruning (default 2 weeks)

max-action-results-size
- the maximum size the actions collection can grow to before it is pruned (default 5G)

If the action results are required to be kept for longer, either of the above can be tweaked.

If the results are being pruned after only 20h, that does seem too soon.

Changed in juju:
assignee: nobody → Eric Claude Jones (ecjones)
milestone: none → 2.3-beta3
importance: Undecided → High
status: New → Triaged
Matt Dirba (5qxm) wrote :

The previous 20 hours worth of actions still exist. It is the most recent action result that goes missing. The juju.loop waits for the action to go to either the failed or completed state, but the action never makes it to that state.

Matt Dirba (5qxm) wrote :

The previous 20 hours worth of actions still exist. It is the most recent action result that goes missing. juju.loop waits for the action to go to either the failed or completed state, but the action never makes it to that state.

Changed in juju:
milestone: 2.3-beta3 → none
Changed in juju:
status: Triaged → In Progress
Eric Claude Jones (ecjones) wrote :

Hello Matt Dirba,

I suspect that the issue here is a bug with the action pruner - and a nasty one perhaps. Once your log grows to a particular size the action pruner will indeed cleanse your log of the oldest actions until your size limit is respected. However, it may be the case that new actions that have not completed are hanging out in the database with a `completed time` initialized to the zero value of the epoch time, making them the oldest values. So if the pruner happens to run while the `uniter` is waiting to run those actions, then those actions may get wiped out.

Eric Claude Jones (ecjones) wrote :

To shrink the issue I set the action pruner to run every 30 seconds, then I ran a loop similar to the reported loop:

`for i in {1..100}; do juju run --unit=ubuntu/0 sleep 5 && hostname; done`

indeed,

`unit-ubuntu-0: 17:53:49 ERROR juju.worker.dependency "uniter" manifold worker returned unexpected error: executing operation "run action fb3613e4-6858-42d5-8c03-92e0009bbfc0": running action "juju-run": action "fb3613e4-6858-42d5-8c03-92e0009bbfc0" not found`

happens every once in a while.

If we tell the pruner to ignoring age:

juju model-config max-action-results-age=0h

We see the problem goes away.

Eric Claude Jones (ecjones) wrote :

The solution is to tell the root pruner transaction to ignore action documents with completed times of 0s past the Unix Epoch or of time 0.

John A Meinel (jameinel) wrote :

Should we be using a *time.Time and setting it to nil along with
bson:"...,omitempty" so that we don't have any value in the database rather
than an ugly value?

John
=:->

On Nov 22, 2017 03:50, "Eric Claude Jones" <email address hidden>
wrote:

> https://github.com/juju/juju/pull/8116/files
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1729880
>
> Title:
> juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> running to complete
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1729880/+subscriptions
>

Changed in juju:
status: In Progress → Fix Committed
Matt Dirba (5qxm) wrote :

I have rebuilt juju with your patch. Do I need to upgrade the controller/model? If yes, how do I do that? juju upgrade-juju -m test says no upgrades available

Eric Claude Jones (ecjones) wrote :

Hello Matt Dirba,

`juju upgrade-juju -m <the-controller-model> --build-agent`. This should serve to upgrade the controller agent and all of its model's agents using the new code with the patch.

Matt Dirba (5qxm) wrote :

I have verified that this fixes my issue in the development branch. Thank you.

Matt Dirba (5qxm) wrote :

How would I build a model agent for a different architectures? Some of my test machines are aarch64 and others are amd64.

John A Meinel (jameinel) wrote :

If you can't use the published builds, I believe Andrew Wilkins has a tool
that lets you cross compile and then upload them to the controller.

We should probably look to move that into the official Juju repo under
scripts/ so people know where to find it.

John
=:->

On Dec 4, 2017 21:41, "Matt Dirba" <email address hidden> wrote:

> How would I build a model agent for a different architectures? Some of
> my test machines are aarch64 and others are amd64.
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1729880
>
> Title:
> juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> running to complete
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1729880/+subscriptions
>

Matt Dirba (5qxm) wrote :
Download full text (4.0 KiB)

After I bootstrap or upgrade my controller with --build-agent it changes
the agent-version from 2.2.6 to 2.2.6.1 (I backported the patch), and I am
able to add machines with the same architecture from my MaaS server, but
when I attempt to add a machine with a different architecture wild things
happen. Every machine matching the constraints is added and then almost
immediately released once juju realizes that the architecture is different
and it does not have an agent already built. Is there a way to hard code
the build version such that when I bootstrap with build-agent it shows as
2.2.6?

Regards,
Matt Dirba
512-779-4696

On Mon, Dec 4, 2017 at 9:38 PM, John A Meinel <email address hidden>
wrote:

> If you can't use the published builds, I believe Andrew Wilkins has a tool
> that lets you cross compile and then upload them to the controller.
>
> We should probably look to move that into the official Juju repo under
> scripts/ so people know where to find it.
>
> John
> =:->
>
> On Dec 4, 2017 21:41, "Matt Dirba" <email address hidden> wrote:
>
> > How would I build a model agent for a different architectures? Some of
> > my test machines are aarch64 and others are amd64.
> >
> > --
> > You received this bug notification because you are subscribed to juju.
> > Matching subscriptions: juju bugs
> > https://bugs.launchpad.net/bugs/1729880
> >
> > Title:
> > juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> > running to complete
> >
> > To manage notifications about this bug go to:
> > https://bugs.launchpad.net/juju/+bug/1729880/+subscriptions
> >
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1729880
>
> Title:
> juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> running to complete
>
> Status in juju:
> Fix Committed
>
> Bug description:
> Description: After 20ish hours of calling actions in a while loop that
> take either a minute, 30 minutes, an hour, or 4 hours to run. The
> result of an action disappears and juju loop.run gets stuck in an
> infinite loop.
>
> Code:
> def run_action(self, unit, action, params):
> from juju import loop
> app = self.model.units[unit].application
> action = loop.run(self.model.applications[app].units[0].run_action(action,
> **params))
> logger.info("Running action %s on %s UUID %s with Params %s",
> action.data['name'],
> action.data['receiver'], action.data['id'], params)
> loop.run(action.wait())
> self.update_status()
> return action
>
> while True:
> datetime.now()
> self.run_action('sshdefault/0', 'benchmark', params={'time':
> 1800})
>
> Log from above code with timestamp of the start of the action:
> jujubench2.jujubench - 2017-11-02 17:16:00,747 Running action
> benchmark on sshdefault/0 UUID b57d6c41-6cf6-4f92-81f1-558887cd2062 with
> Params {'time': 1800}
>
>
> $ juju debug-log -m sshdefault
> unit-sshdefault-0: 17:46:47 ERROR juju.worker.uniter resolver loop
> error: executing operation "run action b57d6c41-6cf6-4f92-81f1-558887...

Read more...

John A Meinel (jameinel) wrote :
Download full text (4.9 KiB)

The tool Andrew has is: https://github.com/axw/juju-tools

I don't know a lot about the parameters, etc of how to use it, but I
believe you would build it, and put the 'juju-tools' binary into your PATH
and then run "juju tools --help".

On Tue, Dec 5, 2017 at 6:24 PM, Matt Dirba <email address hidden> wrote:

> After I bootstrap or upgrade my controller with --build-agent it changes
> the agent-version from 2.2.6 to 2.2.6.1 (I backported the patch), and I am
> able to add machines with the same architecture from my MaaS server, but
> when I attempt to add a machine with a different architecture wild things
> happen. Every machine matching the constraints is added and then almost
> immediately released once juju realizes that the architecture is different
> and it does not have an agent already built. Is there a way to hard code
> the build version such that when I bootstrap with build-agent it shows as
> 2.2.6?
>
>
> Regards,
> Matt Dirba
> 512-779-4696
>
> On Mon, Dec 4, 2017 at 9:38 PM, John A Meinel <email address hidden>
> wrote:
>
> > If you can't use the published builds, I believe Andrew Wilkins has a
> tool
> > that lets you cross compile and then upload them to the controller.
> >
> > We should probably look to move that into the official Juju repo under
> > scripts/ so people know where to find it.
> >
> > John
> > =:->
> >
> > On Dec 4, 2017 21:41, "Matt Dirba" <email address hidden> wrote:
> >
> > > How would I build a model agent for a different architectures? Some of
> > > my test machines are aarch64 and others are amd64.
> > >
> > > --
> > > You received this bug notification because you are subscribed to juju.
> > > Matching subscriptions: juju bugs
> > > https://bugs.launchpad.net/bugs/1729880
> > >
> > > Title:
> > > juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> > > running to complete
> > >
> > > To manage notifications about this bug go to:
> > > https://bugs.launchpad.net/juju/+bug/1729880/+subscriptions
> > >
> >
> > --
> > You received this bug notification because you are subscribed to the bug
> > report.
> > https://bugs.launchpad.net/bugs/1729880
> >
> > Title:
> > juju 2.2.4 and 2.2.6 actions dissappear when state is changed from
> > running to complete
> >
> > Status in juju:
> > Fix Committed
> >
> > Bug description:
> > Description: After 20ish hours of calling actions in a while loop that
> > take either a minute, 30 minutes, an hour, or 4 hours to run. The
> > result of an action disappears and juju loop.run gets stuck in an
> > infinite loop.
> >
> > Code:
> > def run_action(self, unit, action, params):
> > from juju import loop
> > app = self.model.units[unit].application
> > action = loop.run(self.model.applications[app].units[0].
> run_action(action,
> > **params))
> > logger.info("Running action %s on %s UUID %s with Params %s",
> > action.data['name'],
> > action.data['receiver'], action.data['id'], params)
> > loop.run(action.wait())
> > self.update_status()
> > return action
> >
> > while True:
> > datetime.now()
> > self.run_action('sshdefa...

Read more...

Matt Dirba (5qxm) wrote :

Thanks. I was able to build juju and juju-tools in go on a different architecture. Build the tarball with juju tools build. Replace FORCE-VERSION file with 2.2.6.1 and then go back to the machine I deployed my controller from and run juju tools upgrade juju-2.2.6.1-xenial-arm64.tgz.

However, I think most of this was unnecessary. I should have copied an existing 2.2.6-xenial-arm64, modified/created FORCE-VERSION with 2.2.6.1 and renamed the tarball to 2.2.6.1-xenial-arm64.tgz and uploaded that with juju tools upgrade since the patch only affects the controller

Changed in juju:
milestone: none → 2.3.5
Felipe Reyes (freyes) on 2018-03-07
tags: added: sts
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers