"juju run" timeouts on single units of an application

Bug #1733463 reported by Paul Collins
16
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Canonical Juju
Expired
High
Unassigned

Bug Description

We've been observing a strange problem with Juju 2.2.6 (controller and model) where a single unit of an application does not seem to run its command until many minutes have elapsed. Here is an example with the timeout extended so that we get output from all commands, but the with the default 5m timeout the unit would not get to run the command in time for "juju run" to collect the results. The problem occurred with multiple applications in the model, not just this one, both primary and subordinate.

(mojo-ols-prod-scasnap)prodstack-ols@wekufe:/srv/prodstack/ols/specs/switch-deploys/scripts/autodeploy$ juju run --timeout 15m --debug --application sca-conn-check 'date -u'
01:29:31 INFO juju.cmd supercommand.go:63 running juju [2.2.6 gc go1.8]
01:29:31 DEBUG juju.cmd supercommand.go:64 args: []string{"juju", "run", "--timeout", "15m", "--debug", "--application", "sca-conn-check", "date -u"}
01:29:31 INFO juju.juju api.go:67 connecting to API addresses: [10.25.2.110:17070 10.25.2.109:17070 10.25.2.111:17070]
01:29:31 DEBUG juju.api apiclient.go:863 successfully dialed "wss://10.25.2.110:17070/model/4c3d68ed-15d3-4002-825a-f2361ebe9048/api"
01:29:31 INFO juju.api apiclient.go:617 connection established to "wss://10.25.2.110:17070/model/4c3d68ed-15d3-4002-825a-f2361ebe9048/api"
- Stdout: |
    Tue Nov 21 01:29:39 UTC 2017
  UnitId: sca-conn-check/20
- Stdout: |
    Tue Nov 21 01:29:40 UTC 2017
  UnitId: sca-conn-check/21
- Stdout: |
    Tue Nov 21 01:29:40 UTC 2017
  UnitId: sca-conn-check/23
- Stdout: |
    Tue Nov 21 01:29:40 UTC 2017
  UnitId: sca-conn-check/24
- Stdout: |
    Tue Nov 21 01:29:40 UTC 2017
  UnitId: sca-conn-check/26
- Stdout: |
    Tue Nov 21 01:29:40 UTC 2017
  UnitId: sca-conn-check/27
- Stdout: |
    Tue Nov 21 01:29:50 UTC 2017
  UnitId: sca-conn-check/25
- Stdout: |
    Tue Nov 21 01:39:49 UTC 2017
  UnitId: sca-conn-check/22

01:39:50 DEBUG juju.api monitor.go:35 RPC connection died
01:39:50 INFO cmd supercommand.go:465 command finished

And here is the corresponding action for the command that took 10 minutes to begin execution:

(mojo-ols-prod-scasnap)prodstack-ols@wekufe:~$ juju show-action-output 2739aa35-bf62-44a7-86d2-1178e719f777
results:
  Code: "0"
  Stderr: ""
  Stdout: |
    Tue Nov 21 01:39:49 UTC 2017
status: completed
timing:
  completed: 2017-11-21 01:39:49 +0000 UTC
  enqueued: 2017-11-21 01:29:33 +0000 UTC
  started: 2017-11-21 01:39:46 +0000 UTC
(mojo-ols-prod-scasnap)prodstack-ols@wekufe:~$

Paul Collins (pjdc)
description: updated
description: updated
Paul Collins (pjdc)
description: updated
Revision history for this message
Adam Collard (adam-collard) wrote :

AIUI juju run with a unit will take the lock that's used for hook execution. Was there a hook that was running on the given unit during the period that it was waiting?

Revision history for this message
Paul Collins (pjdc) wrote :

I don't believe there were hooks running. here are some statushistory records I pulled for the affected unit and its subordinates in the example above, covering a roughly four-hour window and a bunch of "juju run" attempts, but IIRC it should be the most recent instance: https://pastebin.canonical.com/203712/ (Canonical-only link, sorry!)

Witold Krecicki (wpk)
Changed in juju:
importance: Undecided → High
status: New → Triaged
no longer affects: juju/2.2
no longer affects: juju/2.3
Revision history for this message
Tim Penhey (thumper) wrote :

This needs some investigation to figure out what is happening, and whether or not the event is hitting the unit in a reasonable time frame.

@Paul, what is the frequency of "juju run" timing out? Is it every time? Is it just some apps?

Changed in juju:
assignee: nobody → Eric Claude Jones (ecjones)
Revision history for this message
Eric Claude Jones (ecjones) wrote :

I cannot reproduce the issue. I've deployed several charms and setups and I haven't been able to see this issue with a 2.2.6 controller. We may have to look at @Paul's setup to determine the cause of the problem. I will mark this bug as incomplete for now.

Changed in juju:
status: Triaged → Incomplete
Changed in juju:
assignee: Eric Claude Jones (ecjones) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired
Revision history for this message
Tim Penhey (thumper) wrote :

@Paul, if you observe this again, or in more recent models, please let us know.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.