WatchAllModels sends notifications for empty changes

Bug #1747708 reported by Stuart Bishop on 2018-02-06
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
juju
Medium
Anastasia

Bug Description

Using the WatchAllModels API call, I receive notifications from a model containing a single idle cs:postgresql unit every 5 minutes when the update-status hook runs. I suspect it is the charm calling status-set to ensure the workload status message is set correctly, or some similar no-op change.

Using WatchAllModels on a production controller with hundreds of models and thousands of units, I receive notifications faster than I can keep up with.

Stuart Bishop (stub) wrote :

It looks like when the charm calls status-set, the 'since' field in the application and unit status is updated, even if there was no change made.

Here are two batches of the notifications I'm seeing sent every five minutes from a minimal deployment. The deltas are identical, apart from the since field in the statuses:

[application change map[subordinate:false charm-url:cs:postgresql-165 owner-tag: min-units:0 life:alive constraints:map[] status:map[current:active message:Live master (9.5.10) since:2018-02-06T16:34:06.401035688Z version:] workload-version:9.5.10 model-uuid:94f4722f-4a1e-47d0-8fe8-985f9f45da08 name:postgresql exposed:false]]
[unit change map[series:xenial charm-url:cs:postgresql-165 private-address:10.0.4.244 machine-id:0 subordinate:false workload-status:map[version: current:active message:Live master (9.5.10) since:2018-02-06T16:34:06.401035688Z] name:postgresql/0 application:postgresql public-address:10.0.4.244 ports:[map[protocol:tcp number:5432]] port-ranges:[map[from-port:5432 to-port:5432 protocol:tcp]] agent-status:map[current:idle message: since:2018-02-06T16:29:53.494424567Z version:] model-uuid:94f4722f-4a1e-47d0-8fe8-985f9f45da08]]

[application change map[model-uuid:94f4722f-4a1e-47d0-8fe8-985f9f45da08 exposed:false owner-tag: status:map[version: current:active message:Live master (9.5.10) since:2018-02-06T16:38:33.795944943Z] workload-version:9.5.10 subordinate:false name:postgresql charm-url:cs:postgresql-165 life:alive min-units:0 constraints:map[]]]
[unit change map[application:postgresql series:xenial public-address:10.0.4.244 private-address:10.0.4.244 ports:[map[protocol:tcp number:5432]] subordinate:false workload-status:map[current:active message:Live master (9.5.10) since:2018-02-06T16:38:33.795944943Z version:] agent-status:map[message: since:2018-02-06T16:29:53.494424567Z version: current:idle] model-uuid:94f4722f-4a1e-47d0-8fe8-985f9f45da08 name:postgresql/0 charm-url:cs:postgresql-165 machine-id:0 port-ranges:[map[protocol:tcp from-port:5432 to-port:5432]]]]

So this may be a bug in status handling, rather than the WatchAllModels call. It is very common for charms to blindly reset their workload status.

tags: added: canonical-is
Anastasia (anastasia-macmood) wrote :

We do seem to have a few *interesting* behaviors around time on status. I suspect that they are all linked to https://bugs.launchpad.net/juju/+bug/1577606.

I will look into this to see if there is a way forward.

Changed in juju:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Anastasia (anastasia-macmood)
tags: added: status
tags: added: status-time
removed: status
Anastasia (anastasia-macmood) wrote :

After further investigations, I do not actually think that the activity you are seeing is related to whether the time on the StatusInfo is a pointer or not, i.e.e the bug I linked is not relevant.

What is relevant is that we do check if the status changed since last time we have set it [1], but seem to update the status regardless [2].

Looking into making sure we do not update the status if nothing but the timestamp changed...

[1]
https://github.com/juju/juju/blob/develop/state/status.go#L351

[2]
https://github.com/juju/juju/blob/develop/state/status.go#L283

Anastasia (anastasia-macmood) wrote :

@Stuart Bishop,

What version of Juju are you using? This seems to have been tackled in Juju 2.3 with https://github.com/juju/juju/commit/34327f07b42643a17ffbdf290daa579da315be35

Changed in juju:
status: In Progress → Incomplete
assignee: Anastasia (anastasia-macmood) → nobody
Anastasia (anastasia-macmood) wrote :

Argh, that commit deals with status-history not the actual status updates.

I'll work on the change.

Changed in juju:
status: Incomplete → In Progress
assignee: nobody → Anastasia (anastasia-macmood)
Stuart Bishop (stub) wrote :

I'm duplicating locally with Juju 2.3, and seeing it on production with Juju 2.2

We do specifically track the last time that we got a message from the
charm, and we *did* get a status update from the charm, and are changing
the document tracking it to let you know that it told us its status is the
same. (If we didn't, you wouldn't know if the status was "available 5 hours
ago, and update-status just never ran").
Juju itself doesn't update that particular status field without getting a
message from the unit.

On Wed, Feb 7, 2018 at 9:59 AM, Anastasia <email address hidden>
wrote:

> Argh, that commit deals with status-history not the actual status
> updates.
>
> I'll work on the change.
>
> ** Changed in: juju
> Status: Incomplete => In Progress
>
> ** Changed in: juju
> Assignee: (unassigned) => Anastasia (anastasia-macmood)
>
> --
> You received this bug notification because you are subscribed to juju.
> Matching subscriptions: juju bugs
> https://bugs.launchpad.net/bugs/1747708
>
> Title:
> WatchAllModels sends notifications for empty changes
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1747708/+subscriptions
>

Stuart Bishop (stub) wrote :

'since' seems a poor name for 'last time the charm told us its status'. If a charm has been 'active, live master' for several months, I would expect 'since' to show me several months, rather than always <= 5 minutes on a normally functioning unit and 'active, live master' with a larger 'since' indicating a failure somewhere. The current behaviour and naming makes it look like things are flapping when things are good and stable when things have failed. Maybe we need a separate timestamp for the 'keepalive', if people make use of it. FWIW, juju-wait uses 'since' as part of its heuristic to determine stability of an application and will fail if someone sets the threshold higher than the update-status frequency.

I imagine if we change things here they can't happen until at least 2.4, since it would be a behaviour change rather than a bug fix. For now, I'll need to cache the last seen change delta for every unit and application to detect if the model change notification actually represents a model change or just a keepalive to be ignored.

Anastasia (anastasia-macmood) wrote :

I agree that what you are currently doing is sub-optimal.

I also agree that 'Since' should indicate when this status was first received and we should have a separate field for 'last-time-i've-checked'. At the moment, this will have to reside on the status itself, although I do wonder if we want to have a separate means to indicate the last time any hook(s) ran?... Something like 'juju hooks <unit-name>/<application-name>' to give you this information...

And, yes, whatever we decide, the work itself will be done post 2.3.

Stuart Bishop (stub) wrote :

I think knowing 'last time a hook was run' is more useful than knowing 'last time status-set was run'. The latter requires knowledge about the charms internals to interpret, while knowing the last time a hook was run is immediately useful. I also still want to know what hooks are due to be run on a unit, but that is certainly outside of this bug reports scope ;)

Stuart Bishop (stub) wrote :

I'm also getting bad charms flapping their statuses (active -> blocked -> active during an update-status hook). This pollutes bot show-status-log and the WatchAllModels stream. I'm not sure if Juju should be handling this more gracefully and only using the final status, or if this is legitimate and I need to deal with it at my end.

Anastasia (anastasia-macmood) wrote :

@Stuart Bishop (stub),

Tim and I agree that 'since' field in status should reflect when this status was received. So, if the status has not changed next time the hook is ran, we will not update status record. In other words, if status was changed to 'active' 5hrs ago, 'since' field will have a timestamp for 5hrs ago. I will work on that.

As for your desire to know when a particular hook was run, we provide that via status-log (aka status-history). It was specifically designed to track several types of executions, see 'juju help show-status-log'. Agent execution specifically tracks hooks that have been run. You can filter the output to discover when a particular hook was last run.

Anastasia (anastasia-macmood) wrote :

The substance of the change to support correct meaning of 'since' field is in PR against develop (heading into 2.4): https://github.com/juju/juju/pull/8350

I have not addressed all existing unit tests yet - I do anticipate failures in some places as we correct the behavior. However, I want to get a buy-in for the code change first.

Changed in juju:
status: In Progress → Fix Committed
milestone: none → 2.4-beta1
Changed in juju:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers