"unknown operation kind run-action" (1.26alpha3)

Bug #1534103 reported by Cory Johns
This bug affects 6 people
Affects          Status        Importance  Assigned to     Milestone
Canonical Juju   Fix Released  Critical    Michael Foord
juju-core        Won't Fix     Undecided   Unassigned
juju-core 1.25   Won't Fix     Undecided   Unassigned

Bug Description

Relevant log entries:

2016-07-27 14:51:59 INFO juju.worker.uniter.actions resolver.go:62 "run-action" hook is nil

2016-07-27 14:51:59 DEBUG juju.worker.uniter agent.go:17 [AGENT-STATUS] failed: resolver loop error

2016-07-27 14:51:59 INFO juju.worker.uniter uniter.go:355 unit "nova-compute-c06/0" shutting down: unknown operation kind run-action

An action was run and got stuck in "running". Upon inspection, the log was full of "unknown operation kind run-action" errors, which started just after the action finished (the last line of the action code is "benchmark-raw", a helper that calls action-set).

This same juju, charm, and action worked fine on AWS.

uniter: http://pastebin.ubuntu.com/14495300/
juju status: http://pastebin.ubuntu.com/14495292/
juju action status: http://pastebin.ubuntu.com/14495294/

Revision history for this message
Cheryl Jennings (cherylj) wrote :

Cory - what provider are you using where this fails?

Changed in juju-core:
status: New → Triaged
importance: Undecided → High
tags: added: actions
tags: added: 2.0-count
Revision history for this message
Casey Marshall (cmars) wrote :

I just got this with beta8 and the lxd provider, after doing a juju run command. The funny thing is, I'd used the juju run command successfully many times on this service until this happened:

2016-06-08 23:37:10 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-foo-0
2016-06-08 23:37:10 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-beta9.1-trusty-amd64
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:431 joining relation "foo:database mongodb:database"
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:467 joined relation "foo:database mongodb:database"
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:431 joining relation "nginx:reverseproxy foo:website"
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:467 joined relation "nginx:reverseproxy foo:website"
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:431 joining relation "prometheus:scrape foo:target"
2016-06-08 23:37:10 INFO juju.worker.uniter.relation relations.go:467 joined relation "prometheus:scrape foo:target"
2016-06-08 23:37:10 INFO juju.worker.uniter uniter.go:161 unit "foo/0" started
2016-06-08 23:37:10 INFO juju.worker.uniter uniter.go:200 hooks are retried true
2016-06-08 23:37:10 INFO juju.worker.uniter.actions resolver.go:62 "run-action" hook is nil
2016-06-08 23:37:10 INFO juju.worker.uniter uniter.go:352 unit "foo/0" shutting down: unknown operation kind run-action
2016-06-08 23:37:10 ERROR juju.worker.dependency engine.go:526 "uniter" manifold worker returned unexpected error: unknown operation kind run-action

I can't even destroy the service in this state. I've tried rebooting the container and `killall jujud`; nothing gets the unit agent out of this state.

Revision history for this message
Cory Johns (johnsca) wrote :

I somehow missed Cheryl's reply way back when. The provider I was using when I originally hit this was GCE. I haven't hit this since, but I also haven't been working with Juju 2.0 as often.

Revision history for this message
Matt Rae (mattrae) wrote :

I'm also seeing this happen with Juju 2.0 beta7 and MAAS 2.0 RC2.

After using juju run --service, eventually agents for the service show up as 'failed' in juju status: https://gist.github.com/raema/dfed067cb9120c14682966996cedf1f7#file-gistfile1-txt-L37

The agent log is continually printing the following lines:

2016-07-27 14:51:58 DEBUG juju.worker.dependency engine.go:465 "uniter" manifold worker started
2016-07-27 14:51:58 INFO worker.uniter.jujuc tools.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-nova-compute-c06-0
2016-07-27 14:51:58 INFO worker.uniter.jujuc tools.go:40 was a symlink, now looking at /var/lib/juju/tools/2.0-beta7-xenial-amd64
2016-07-27 14:51:58 DEBUG worker.uniter.jujuc tools.go:44 jujud path /var/lib/juju/tools/2.0-beta7-xenial-amd64/jujud
2016-07-27 14:51:58 INFO juju.worker.uniter.relation relations.go:431 joining relation "ntp:juju-info nova-compute-c06:juju-info"
2016-07-27 14:51:58 INFO juju.worker.uniter.relation relations.go:467 joined relation "ntp:juju-info nova-compute-c06:juju-info"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:shared-db percona-cluster:shared-db"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:shared-db percona-cluster:shared-db"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-cloud-controller:cloud-compute nova-compute-c06:cloud-compute"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-cloud-controller:cloud-compute nova-compute-c06:cloud-compute"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:ceph ceph:client"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:ceph ceph:client"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "ceilometer-agent:container nova-compute-c06:juju-info"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "ceilometer-agent:container nova-compute-c06:juju-info"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:compute-peer"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:compute-peer"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:amqp rabbitmq:amqp"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:amqp rabbitmq:amqp"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:image-service glance:image-service"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:image-service glance:image-service"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:431 joining relation "nova-compute-c06:neutron-plugin neutron-calico-c06:neutron-plugin"
2016-07-27 14:51:59 INFO juju.worker.uniter.relation relations.go:467 joined relation "nova-compute-c06:neutro...


description: updated
Changed in juju-core:
milestone: none → 2.0.0
tags: added: sts
Changed in juju-core:
milestone: 2.0.0 → 2.0-beta16
Changed in juju-core:
milestone: 2.0-beta16 → 2.0-beta17
Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Rick

Is there a workaround that we can suggest while we are addressing the issue?

Changed in juju-core:
assignee: nobody → Richard Harding (rharding)
Revision history for this message
William Reade (fwereade) wrote :

This renders unit agents permanently non-functional and should never have survived 8 months without resolution.

Changed in juju-core:
importance: High → Critical
Changed in juju-core:
milestone: 2.0-beta17 → 2.0-beta16
Revision history for this message
William Reade (fwereade) wrote :

Initial guess: it seems there's a path where we can load a uniter/operation.State file and send it down a code path that can't handle run-action.

The easy fix is to find that problematic switch and add handling to Do The Right Thing.

The hard fix is to properly isolate decision-making, so that the top-level resolver is responsible only for delegating *appropriate information* to other resolvers -- so that the other resolvers don't get passed global state, and only know what they need to.

(whatever resolver it is obviously never even considered the possibility that it might get invoked with action state, but the "let's make resolver an interface!" approach makes these bugs inevitable)

Changed in juju-core:
assignee: Richard Harding (rharding) → Michael Foord (mfoord)
Changed in juju-core:
status: Triaged → In Progress
affects: juju-core → juju
Changed in juju:
milestone: 2.0-beta16 → none
milestone: none → 2.0-beta16
Changed in juju-core:
status: New → Won't Fix
Revision history for this message
Michael Foord (mfoord) wrote :

Is there a reliable way to reproduce this issue?

Revision history for this message
Bert JW Regeer (bregeer-ctl) wrote :

@mfoord:

I am the customer that Matt Rae has been working with, and the logs above were from one of our failed Juju agents.

The way I was able to get the Juju agents into this state was fairly simple:

juju run --service nova-compute 'sudo shutdown -r now'

Hit Ctrl+C, because this will hang forever: no data is returned at all, since the node is rebooting.

Wait for the node to fully come back up. juju status will now display "failed" and the service will not be able to run.

After that, juju show-action-status --name juju-run will show that the juju-run action is still "running" on the nova-compute nodes. Any attempt to juju run another command adds it to the show-action-status list as "pending".
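
Put as a single shell transcript, the sequence above looks like this (the commands are the ones quoted; the comments describe what we observed rather than captured output):

juju run --service nova-compute 'sudo shutdown -r now'
# hangs forever, since the node is rebooting and nothing comes back; hit Ctrl+C

# wait for the node to fully come back up, then:
juju status
# the nova-compute unit agents now show "failed"

juju show-action-status --name juju-run
# the juju-run action is still "running"; any further `juju run` is queued as "pending"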

There is a work-around, provided by Jorge and posted by Doug to our open Salesforce ticket for this case, that gets the units unstuck and back to operational. It fixes the issue; please see below the dashes.

In the lab we are running Juju beta15; in production we are running Juju beta7. There is no change in behavior between those versions.

----

Using Jorge's procedure with the updates below, we were able to free up the nova-compute juju agents, so the procedure is a remedy for the problem for which lp1534103 was filed.

On nova-compute unit:
Replace /var/lib/juju/agents/unit-nova-compute-*/state/uniter with <<EOF
leader: false
started: true
stopped: false
installed: true
status-set: true
op: continue
opstep: done
EOF

restart jujud-unit-nova-compute-*

On the Juju client, and presuming the unit is in "failed" status:

juju resolved -r <nova-compute unit>

It will take a few minutes for the hooks to finish executing, but the unit should return to a normal state ("idle" or "executing").
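
For reference, here is the whole remedy as one shell session, using unit nova-compute/4 (the affected unit in the status output below) as the example. The use of sudo tee and systemctl and the exact service name are assumptions for a systemd-based machine; adjust them for your own unit and init system.

# On the affected nova-compute machine:
sudo tee /var/lib/juju/agents/unit-nova-compute-4/state/uniter > /dev/null <<'EOF'
leader: false
started: true
stopped: false
installed: true
status-set: true
op: continue
opstep: done
EOF
sudo systemctl restart jujud-unit-nova-compute-4

# From the Juju client, once the unit shows "failed":
juju resolved -r nova-compute/4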

-----

This will return the unit to a sane status:

UNIT WORKLOAD AGENT MACHINE PUBLIC-ADDRESS PORTS MESSAGE
nova-compute/4 active idle 13 10.189.69.19 Unit is ready
  ceilometer-agent/3 active idle 10.189.69.19 Unit is ready
  ntp/13 unknown idle 10.189.69.19

Previously:

UNIT WORKLOAD AGENT MACHINE PUBLIC-ADDRESS PORTS MESSAGE
nova-compute/4 active failed 13 10.189.69.19 Unit is ready
  ceilometer-agent/3 active idle 10.189.69.19 Unit is ready
  ntp/13 unknown idle 10.189.69.19

Revision history for this message
Bert JW Regeer (bregeer-ctl) wrote :

Note that using the work-around will then also execute any remaining pending commands, and clear up the backlog in juju show-action-status --name juju-run.

Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta16 → 2.0-beta17
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.0-beta17 → 2.0-beta18
Revision history for this message
Michael Foord (mfoord) wrote :

I can reproduce this with a simple charm that has an action that never returns. Running shutdown while that action is in progress then triggers this bug.
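
For anyone else trying to reproduce it, a minimal sketch of such an action, assuming the classic charm layout where the action is declared in actions.yaml and implemented as an executable script under actions/ (the "hang" name is illustrative, not necessarily what Michael used):

#!/bin/bash
# actions/hang: never return, leaving the uniter stuck in the middle of a
# run-action operation.
sleep infinity

Kick it off with something like `juju run-action <unit> hang`, reboot the machine while the action is still running, and once the machine comes back the agent should be wedged in the failed state described above.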

Changed in juju:
milestone: 2.0-beta18 → 2.0-rc1
Michael Foord (mfoord)
Changed in juju:
status: In Progress → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released