Machine agent state changes not included in the mega-watcher

Bug #1453096 reported by Adam Collard
This bug affects 2 people
Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Triaged  Low         Unassigned
juju-gui        Triaged  High        Unassigned

Bug Description

Using Juju GUI (cs:trusty/juju-gui-27) on the MAAS provider with Juju 1.22.1, I have a service deployed with 3 units across different machines. When I power down one of the machines (outside of Juju), I expect the GUI to show me that one of the units is in error, or otherwise call my attention to it. Instead, the service bar does not change and the unit is still green.

The Juju CLI knows that the agent is down:

$ juju status --format=short openstack-dashboard

- openstack-dashboard/0: 10.1.11.5 (down) 80/tcp, 443/tcp
  - hacluster-openstack-dashboard/0: 10.1.11.5 (down)
  - landscape-client/17: 10.1.11.5 (down)
- openstack-dashboard/1: 10.1.11.8 (started) 80/tcp, 443/tcp
  - hacluster-openstack-dashboard/1: 10.1.11.8 (started)
  - landscape-client/15: 10.1.11.8 (started)
- openstack-dashboard/2: 10.1.11.13 (started) 80/tcp, 443/tcp
  - hacluster-openstack-dashboard/2: 10.1.11.13 (started)
  - landscape-client/16: 10.1.11.13 (started)

description: updated
Richard Harding (rharding) wrote :

We'll investigate whether we can get this information over the watcher. If it's not exposed, maybe we can request that core support it.

Changed in juju-gui:
status: New → Triaged
importance: Undecided → High
Francesco Banconi (frankban) wrote :

I confirm this information is not sent as part of the mega-watcher data, so there is currently no way for the GUI to detect the machine was stopped/destroyed from outside of Juju.

See https://github.com/juju/juju/blob/master/apiserver/client/status.go#L746

Assigning this bug to juju-core.

summary: - Agent down doesn't change service bar, unit is still green
+ Machine agent state changes not included in the mega-watcher
Curtis Hovey (sinzui)
tags: added: api
Changed in juju-core:
status: New → Triaged
importance: Undecided → High
milestone: none → 1.25.0
Cheryl Jennings (cherylj) wrote :

Looked at this a bit with Menno, and we had expected that this was just not implemented. However, it looks like the machine agent status should be included, and there might just be a bug preventing it from being reported properly. Will look more in the morning.

Changed in juju-core:
assignee: nobody → Cheryl Jennings (cherylj)
Cheryl Jennings (cherylj) wrote :

From Menno:

I remember seeing this when I was working on status just after I started with Canonical. The "down" status isn't set or decided within state and is never reflected in the database. Instead, the API server switches out the status when generating the result of the FullStatus API call. See the large comment I wrote to explain this towards the bottom of processAgent in apiserver/client/status.go. It even mentions that the down status won't be seen by clients using a watcher.

So the "down" status is kind of synthetic when it really shouldn't be. To fix this ticket, "down" needs to be somehow reflected in the database. If that's done, then the AllWatcher API will start reporting machine or unit change events when the agent goes down (I don't think anything needs to change with the watcher code at all).
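To make the distinction concrete, here is a minimal Go sketch of a status that is synthesized at read time. It is not the Juju code: `storedStatus`, `agentAlive` and `fullStatus` are hypothetical stand-ins for the database, the presence check and the FullStatus call, and the rule "any agent that is not alive is reported as down" is a simplifying assumption.

```go
package main

import "fmt"

// Status is a simplified stand-in for an agent status value.
type Status string

const (
	StatusStarted Status = "started"
	StatusDown    Status = "down"
)

// storedStatus plays the role of the database (hypothetical).
// A watcher only ever sees changes made to this data.
var storedStatus = map[string]Status{
	"openstack-dashboard/0": StatusStarted,
}

// agentAlive plays the role of the presence check (hypothetical).
var agentAlive = map[string]bool{
	"openstack-dashboard/0": false, // machine powered off outside of Juju
}

// fullStatus mimics a read-time view: it swaps in "down" for an agent
// that is not alive, but never writes that value back to storedStatus.
func fullStatus(agent string) Status {
	if !agentAlive[agent] {
		return StatusDown
	}
	return storedStatus[agent]
}

func main() {
	agent := "openstack-dashboard/0"
	fmt.Println("status call reports:", fullStatus(agent))
	fmt.Println("stored value (what a watcher sees):", storedStatus[agent])
}
```

In this sketch the status call reports "down" while the stored value stays "started", so nothing changes in the data a watcher observes and no change event is produced.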

This might not be easy, especially to do efficiently (this might require some changes in state/presence). Something running in the state server needs to notice when any agent presence has changed and then update the agent's status in the database. It'll need to remember the previous status so that when the agent comes back the old status can be restored (the agent might not set the status back to "started" again itself).

This is the kind of thing that is worth talking to Will or John about as they will probably have some thoughts on how it should be done.
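The approach outlined above (notice a presence change, persist "down", remember the previous status, restore it when the agent returns) could be sketched roughly as follows. This is only an illustration under stated assumptions: `statusStore`, `presenceReconciler` and `PresenceChanged` are hypothetical names, not Juju APIs, and the in-memory map stands in for the database.

```go
package main

import "fmt"

// Status is a simplified stand-in for an agent status value.
type Status string

const (
	StatusStarted Status = "started"
	StatusDown    Status = "down"
)

// statusStore is a hypothetical stand-in for the persisted status data.
// Each Set is a write that a watcher could turn into a change event.
type statusStore struct {
	current map[string]Status
	writes  int
}

func (s *statusStore) Set(agent string, st Status) {
	s.current[agent] = st
	s.writes++
}

// presenceReconciler (hypothetical) persists "down" when an agent
// disappears and restores the remembered status when it comes back.
type presenceReconciler struct {
	store    *statusStore
	previous map[string]Status
}

func newPresenceReconciler(store *statusStore) *presenceReconciler {
	return &presenceReconciler{store: store, previous: map[string]Status{}}
}

// PresenceChanged is called whenever an agent's presence is observed.
// Repeated notifications with the same value cause no extra writes.
func (r *presenceReconciler) PresenceChanged(agent string, alive bool) {
	cur := r.store.current[agent]
	switch {
	case !alive && cur != StatusDown:
		r.previous[agent] = cur // remember what to restore later
		r.store.Set(agent, StatusDown)
	case alive && cur == StatusDown:
		if prev, ok := r.previous[agent]; ok {
			r.store.Set(agent, prev)
			delete(r.previous, agent)
		}
	}
}

func main() {
	store := &statusStore{current: map[string]Status{"machine-5": StatusStarted}}
	r := newPresenceReconciler(store)

	r.PresenceChanged("machine-5", false)
	fmt.Println("after agent lost:", store.current["machine-5"])

	r.PresenceChanged("machine-5", true)
	fmt.Println("after agent returned:", store.current["machine-5"])
}
```

The sketch leaves out the parts the comment flags as hard: detecting presence changes efficiently, and handling an agent that sets its own status while it is marked down.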

Cheryl Jennings (cherylj) wrote :

Sent an email to John and Will today to start the conversation around how to address this issue.

Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.0 → 1.26.0
Cheryl Jennings (cherylj) wrote :

The conversation around this continues, as it is turning out to be quite a challenge to handle this appropriately.

Cheryl Jennings (cherylj) wrote :

There is still a lot of discussion around this, and we are aiming to have it completed as part of the larger observability work in 1.26.

Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → nobody
Changed in juju-core:
milestone: 1.26.0 → 2.0-beta5
Changed in juju-core:
milestone: 2.0-beta5 → 2.0-beta4
Cheryl Jennings (cherylj) wrote :

Fixing this bug requires a substantial change to how we manage / track agent presence, and will likely need to be addressed in a feature in a future release.

Changed in juju-core:
milestone: 2.0-beta4 → 2.1.0
affects: juju-core → juju
Changed in juju:
milestone: 2.1.0 → none
milestone: none → 2.1.0
Anastasia (anastasia-macmood) wrote :

Although the code referred to here has been refactored, the issue was not addressed in Juju 2.1.
I reproduced it on AWS following the original steps.

Anastasia (anastasia-macmood) wrote :

Removing 2.1 milestone as we will not be addressing this issue in 2.1.

Changed in juju:
milestone: 2.1.0 → none
Anastasia (anastasia-macmood) wrote :

An update to comment #2: on develop (2.3-alpha1), the code can now be found at https://github.com/juju/juju/blob/develop/apiserver/facades/client/client/status.go#L722

Menno Finlay-Smits (menno.smits) wrote :

This has indeed not been fixed yet. The analysis in comment #4 is still accurate.

Changed in juju:
milestone: none → 2.5-beta1
Changed in juju:
assignee: nobody → Richard Harding (rharding)
Changed in juju:
milestone: 2.5-beta1 → 2.5-beta2
Fabrice Matrat (fabricematrat) wrote :

This is affecting JAAS metrics.
Any news on this bug?

Richard Harding (rharding) wrote :

@fabricematrat

Just that this has moved to the beta2 milestone and is still something we're working to fit into the polish of 2.5 final. We're looking to cut beta1 today, so this didn't make the deadline this week.

Changed in juju:
milestone: 2.5-beta2 → 2.5.1
Changed in juju:
assignee: Richard Harding (rharding) → nobody
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.5.1 → 2.5.2
Changed in juju:
milestone: 2.5.2 → 2.5.3
Changed in juju:
milestone: 2.5.3 → 2.5.4
Changed in juju:
milestone: 2.5.4 → 2.5.5
Changed in juju:
milestone: 2.5.6 → 2.7-beta1
Changed in juju:
milestone: 2.7-beta1 → 2.7-rc1
Changed in juju:
milestone: 2.7-rc1 → none
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot