juju status <service_name> broken

Bug #1516989 reported by Jacek Nykis
This bug affects 13 people
Affects              Status        Importance  Assigned to  Milestone
juju-core            Invalid       Critical    Jesse Meek
  1.25               Fix Released  Critical    Jesse Meek
juju-core (Ubuntu)   Confirmed     Undecided   Unassigned
Nominated for Wily by Felipe Reyes

Bug Description

I have a juju environment where "juju status" works fine when used without arguments, but I can't get status for a single service:

$ juju status wordpress
ERROR could not filter units: could not filter units: cannot get status: unit not found

Service "wordpress" is definitely there and full "juju status" output shows that it has 2 units:
$ juju status|grep wordpress\/
      wordpress/0:
      wordpress/1:

Here is what is logged in machine-0.log when I make "juju status wordpress" call:

2015-11-17 11:32:53 INFO juju.apiserver apiserver.go:276 [13B] API connection from 1.2.3.4:35770
2015-11-17 11:32:53 DEBUG juju.apiserver utils.go:71 validate env uuid: state server environment - xxx
2015-11-17 11:32:53 DEBUG juju.apiserver apiserver.go:257 <- [13B] <unknown> {"RequestId":1,"Type":"Admin","Version":2,"Request":"Login","Params":"'params redacted'"}
2015-11-17 11:32:53 DEBUG juju.apiserver admin.go:149 hostPorts: [[1.1.1.1:17070 127.0.0.1:17070 [::1]:17070]]
2015-11-17 11:32:53 DEBUG juju.apiserver apiserver.go:271 -> [13B] user-admin@local 60.299337ms {"RequestId":1,"Response":"'body redacted'"} Admin[""].Login
2015-11-17 11:32:53 DEBUG juju.apiserver apiserver.go:257 <- [13B] user-admin@local {"RequestId":3,"Type":"Client","Request":"FullStatus","Params":"'params redacted'"}
2015-11-17 11:32:53 DEBUG juju.apiserver.client status.go:124 Services: map[wordpress-plugin-teams-integration:wordpress-plugin-teams-integration nrpe-mysql:nrpe-mysql mysql:mysql squid-reverseproxy:squid-reverseproxy wordpress-plugin-launchpad-integration:wordpress-plugin-launchpad-integration wordpress-plugin-openstack-objectstorage:wordpress-plugin-openstack-objectstorage swift-log-archive:swift-log-archive storage:storage wordpress-theme:wordpress-theme nrpe:nrpe apache2-subordinate:apache2-subordinate block-storage-broker:block-storage-broker wordpress-plugin-openid:wordpress-plugin-openid ssl-terminator-subordinate:ssl-terminator-subordinate wordpress:wordpress]
2015-11-17 11:32:53 DEBUG juju.apiserver apiserver.go:271 -> [13B] user-admin@local 11.210941ms {"RequestId":3,"Error":"could not filter units: could not filter units: cannot get status: unit not found","ErrorCode":"not found","Response":"'body redacted'"} Client[""].FullStatus
2015-11-17 11:32:53 INFO juju.apiserver apiserver.go:280 [13B] user-admin@local API connection terminated after 78.271433ms

Not sure if this is relevant, but this environment was recently upgraded from juju 1.20.14.

$ juju --version
1.24.4-trusty-amd64

Cheryl Jennings (cherylj) wrote :

Can you attach the following?
1 - Full machine-0.log
2 - Full output of juju status
3 - Output of juju status wordpress --debug

Cheryl Jennings (cherylj) wrote :

Setting to incomplete, pending the additional info requested in comment #1

Changed in juju-core:
status: New → Incomplete
Jacek Nykis (jacekn) wrote :

Some info in those logs is sensitive so I prefer not to publish them here. I archived them and let developers know in private where to find them.

Changed in juju-core:
status: Incomplete → New
Cheryl Jennings (cherylj) wrote :

Reviewed the logs and wasn't able to find any obvious answer. Will need to look into it later. Any devs interested in assisting can contact me for the logs.

Changed in juju-core:
status: New → Triaged
importance: Undecided → Medium
Changed in juju-core:
importance: Medium → Critical
milestone: none → 1.25.3
assignee: nobody → Cheryl Jennings (cherylj)
Cheryl Jennings (cherylj) wrote :

I see what the problem is now. This was injected when the unit status was split into unit / unit agent status. Starting in 1.24, we key the unit status with an id of "u#<unit name>#charm", whereas older versions just used "u#<unit name>". The key "u#<unit name>" is now used for the unit agent status.

There does not seem to be an upgrade step to create the unit / workload statuses, so when we try to query the unit status, we get an error back. In running the full status, this error is ignored, and a nil status is displayed for the workload. You can see this when we just run juju status: http://paste.ubuntu.com/13538895/
In the above paste, the ubuntu and ubuntu2 workload-status is nil. But, ubuntu3, which was deployed post-upgrade to 1.25, has a valid status.

When running juju status to query a specific service, it will error because the workload status does not exist.
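The difference between the two code paths can be illustrated with a small, self-contained sketch. This is not juju's implementation: the status collection is modelled as a plain dict, and every function name here is hypothetical. Only the two key formats ("u#<unit name>" and "u#<unit name>#charm") come from the comment above.

```python
class NotFound(Exception):
    """Stands in for the 'unit not found' error (hypothetical name)."""


def unit_agent_key(unit):
    # From 1.24 on, the plain key holds the unit agent status.
    return f"u#{unit}"


def unit_workload_key(unit):
    # From 1.24 on, the workload (unit) status lives under the "#charm" key.
    return f"u#{unit}#charm"


def get_status(statuses, key):
    try:
        return statuses[key]
    except KeyError:
        raise NotFound("unit not found") from None


def full_status(statuses, units):
    """Unfiltered status: a missing workload doc is ignored and shown as None."""
    result = {}
    for unit in units:
        try:
            workload = get_status(statuses, unit_workload_key(unit))
        except NotFound:
            workload = None
        result[unit] = {
            "agent": get_status(statuses, unit_agent_key(unit)),
            "workload": workload,
        }
    return result


def filtered_status(statuses, units, service):
    """Status for one service: a missing workload doc is a hard error."""
    result = {}
    for unit in units:
        if unit.split("/")[0] != service:
            continue
        result[unit] = {
            "agent": get_status(statuses, unit_agent_key(unit)),
            "workload": get_status(statuses, unit_workload_key(unit)),
        }
    return result
```

With a collection that holds only `"u#wordpress/0"` (the state an upgraded environment is left in), `full_status` returns a `None` workload for the unit, while `filtered_status` raises `NotFound`.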

An upgrade step is needed to correctly create the unit / workload status. If needed, it may be possible to create a script to insert the values into an existing environment. But, I need to defer that to someone who has more knowledge of mongodb internals. Going to ask Menno for help on this part.

Cheryl Jennings (cherylj) wrote :

This is also a test escape that should have a testcase added to the juju test suite.

Changed in juju-core:
milestone: 1.25.3 → 1.25.2
Felipe Reyes (freyes)
tags: added: sts
Cheryl Jennings (cherylj) wrote :

fwereade is working on a script to repair an environment after this is hit in an upgrade.

Changed in juju-core:
assignee: Cheryl Jennings (cherylj) → Jesse Meek (waigani)
Jesse Meek (waigani) wrote :

When targeting the upgrade step in 1.25, I cannot see a way to distinguish between 1. a correct agent statusDoc and 2. a unit statusDoc with an agent-style id, as the id is the only distinguishing feature between the two.

I imagine fwereade will hit this when writing the fix-it script.

unitAgentGlobalKey was introduced in 1.22-alpha1; before then we just had unitGlobalKey. That is, prior to 1.22-alpha1 we didn't have an agent status. If we targeted 1.22, then the upgrade step could grab all docs from statusC with the format u#<name> and update them to u#<name>#charm.
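As a rough illustration of the rename idea, here is a sketch that works on a plain list of status ids rather than on the real statusC collection. The function name and the sample id "m#0" are hypothetical. The rename is only safe on pre-1.22 data, where no agent status docs exist yet.

```python
def migrate_legacy_keys(status_ids):
    """Rename every 'u#<name>' id to 'u#<name>#charm'.

    Assumes pre-1.22 data, where every 'u#<name>' doc is a unit status.
    On later data the same id may already be an agent status doc, and
    nothing in the id itself tells the two apart.
    """
    migrated = []
    for status_id in status_ids:
        if status_id.startswith("u#") and not status_id.endswith("#charm"):
            migrated.append(status_id + "#charm")
        else:
            # Ids that are not unit ids, or are already migrated, stay as they are.
            migrated.append(status_id)
    return migrated
```

For example, `migrate_legacy_keys(["u#wordpress/0", "m#0"])` yields `["u#wordpress/0#charm", "m#0"]`. Running it against a 1.25 environment would also rename genuine agent docs, which is the ambiguity described above.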

Jesse Meek (waigani) wrote :

Cheryl just pointed out that each unit needs to have a unit agent status doc and a workload (unit status) doc. So we can:

1 - generate a list of all the units
2 - query the status for each unit (not agent status)
3 - if it returns not found, then create the status
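The three steps above can be sketched as follows. This is a minimal illustration, not the actual upgrade step: the status collection is a plain dict, the function name is hypothetical, and the default status content is supplied by the caller because the thread does not say what value the real fix writes.

```python
def ensure_workload_statuses(units, statuses, default_status):
    """Create a workload status doc for every unit that lacks one.

    units          -- iterable of unit names, e.g. "wordpress/0" (step 1)
    statuses       -- dict standing in for the status collection
    default_status -- dict written for each missing doc (caller's choice)

    Returns the list of keys that were created.
    """
    created = []
    for unit in units:
        key = f"u#{unit}#charm"  # workload (unit) status key, not the agent key
        if key in statuses:      # step 2: query the unit status
            continue
        statuses[key] = dict(default_status)  # step 3: create it when not found
        created.append(key)
    return created
```

Because existing docs are skipped, the function is idempotent: a second run creates nothing, and units deployed after the upgrade keep their real status.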

Cheryl Jennings (cherylj) wrote :

The fix for this is underway, and should be included in 1.25.2.

William kindly put together a script to directly modify mongo to add in the missing statuses to work around this problem in the meantime. You can contact me directly to get the script until I figure out where to put it.

I have verified that running the script after an upgrade resolves the unit not found error.

Jesse Meek (waigani) wrote :
Changed in juju-core:
status: Triaged → In Progress
Jesse Meek (waigani)
Changed in juju-core:
status: In Progress → Fix Committed
Changed in juju-core:
milestone: 1.25.2 → 1.26-alpha3
status: Fix Committed → In Progress
Johan Ehnberg (johan-ehnberg) wrote :

I confirm the fix is effective. For large environments where hooks have failed, here's a one-liner after running the fix to re-run all the failed hooks, which should return the agents to normal (idle) state:
for i in $(juju status --format tabular | awk '/failed/ {print $1}'); do echo "Fixing $i"; juju resolved -r "$i"; done

Cheryl Jennings (cherylj) wrote :

Just a note to those using the script to insert the workload status - you'll see a bogus time in the "since" field: 31 Dec 1969 18:00:00-06:00
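That value is what a zero Unix timestamp looks like when rendered in a UTC-6 timezone. The sketch below shows only that arithmetic. Whether the script actually writes a zero timestamp is an assumption, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def render_timestamp(seconds, utc_offset_hours):
    """Render a Unix timestamp in a fixed-offset timezone as ISO 8601."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(seconds, tz).isoformat()


# Timestamp 0 is 1970-01-01 00:00:00 UTC, which is six hours earlier
# on the wall clock at UTC-6, i.e. the evening of 31 Dec 1969.
print(render_timestamp(0, -6))  # 1969-12-31T18:00:00-06:00
```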

tags: added: canonical-bootstack
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in juju-core (Ubuntu):
status: New → Confirmed
Changed in juju-core:
milestone: 1.26-alpha3 → 2.0-alpha1
Cheryl Jennings (cherylj) wrote :

Not fixing in 2.0, as users will be required to upgrade to 1.25.2 before upgrading to 2.0

Changed in juju-core:
status: In Progress → Invalid
status: Invalid → Won't Fix
status: Won't Fix → Invalid
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 2.0-alpha1 → none
Tim Penhey (thumper) wrote :

WARNING: don't use the script in comment #16, but instead use this one:

http://pastebin.ubuntu.com/24186473/

This is a copy of the earlier script but adds in the txn-revno and txn-queue fields that are required for when juju wants to update the status to something new.
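A quick way to check hand-inserted documents for this problem is to test for the two field names mentioned above. The sketch below works on plain dicts. The function name and the sample document contents are hypothetical, and it says nothing about what values the fields should hold.

```python
REQUIRED_TXN_FIELDS = ("txn-revno", "txn-queue")


def missing_txn_fields(doc):
    """Return the transaction bookkeeping fields absent from a status doc."""
    return [field for field in REQUIRED_TXN_FIELDS if field not in doc]


# A doc shaped like one inserted without the bookkeeping fields
# (field values here are placeholders, not taken from either script).
doc = {"_id": "u#wordpress/0#charm", "status": "unknown"}
print(missing_txn_fields(doc))  # ['txn-revno', 'txn-queue']
```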

John A Meinel (jameinel) wrote :

Should I mark comment #16 as "hidden" so it doesn't give people the wrong idea?

Anastasia (anastasia-macmood) wrote :

Great idea, John \o/ Marked as hidden now!
