[2.0.1] Juju status hangs for all models and builds are stuck
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
Critical
|
Michael Foord | ||
2.1 |
Fix Released
|
High
|
Michael Foord |
Bug Description
We have a number of deployment builds running for separate models and a single controller. Sometimes after 9:02 UTC, juju status started to hang for all the models.
This seems recreatable since it's 2nd time and this was the juju status process hanging for the 1st occurrence: https:/
On the controller, I see:
ubuntu@erchitu:~$ free mem
total used free shared buff/cache available
Mem: 32897860 25051228 3324256 50004 4522376 7232276
Swap: 8388604 171616 8216988
ubuntu@erchitu:~$ free -m
total used free shared buff/cache available
Mem: 32126 24464 3243 48 4418 7062
Swap: 8191 167 8024
and top sometimes reports some very high numbers for jujud:
top - 14:01:37 up 2 days, 16:10, 1 user, load average: 1.89, 1.82, 1.83
Tasks: 178 total, 2 running, 176 sleeping, 0 stopped, 0 zombie
%Cpu(s): 94.9 us, 4.1 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32897860 total, 3283916 free, 25074604 used, 4539340 buff/cache
KiB Swap: 8388604 total, 8216988 free, 171616 used. 7209136 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26635 root 20 0 14.134g 9.535g 55200 R 763.8 30.4 2429:08 jujud
26493 root 20 0 22.772g 0.013t 27008 S 28.6 43.9 2029:31 mongod
11454 ubuntu 20 0 40524 3680 3052 R 0.7 0.0 0:00.09 top
34 root rt 0 0 0 0 S 0.3 0.0 0:26.52 migration/5
1 root 20 0 185400 6120 4084 S 0.0 0.0 0:17.95 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:12.08 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root 20 0 0 0 0 S 0.0 0.0 2:10.63 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.80 rcu_bh
10 root rt 0 0 0 0 S 0.0 0.0 0:19.74 migration/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.93 watchdog/0
12 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 0:24.44 migration/1
14 root 20 0 0 0 0 S 0.0 0.0 0:35.85 ksoftirqd/1
16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
18 root rt 0 0 0 0 S 0.0 0.0 0:00.88 watchdog/2
19 root rt 0 0 0 0 S 0.0 0.0 0:22.76 migration/2
20 root 20 0 0 0 0 S 0.0 0.0 0:46.84 ksoftirqd/2
22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
23 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/3
24 root rt 0 0 0 0 S 0.0 0.0 0:28.38 migration/3
25 root 20 0 0 0 0 S 0.0 0.0 0:32.87 ksoftirqd/3
27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
28 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/4
29 root rt 0 0 0 0 S 0.0 0.0 0:21.02 migration/4
30 root 20 0 0 0 0 S 0.0 0.0 0:50.88 ksoftirqd/4
32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
33 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/5
35 root 20 0 0 0 0 S 0.0 0.0 0:31.28 ksoftirqd/5
37 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/5:0H
38 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/6
39 root rt 0 0 0 0 S 0.0 0.0 0:21.69 migration/6
But on averag:
top - 14:02:25 up 2 days, 16:11, 1 user, load average: 2.23, 1.93, 1.87
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.8 us, 1.3 sy, 0.0 ni, 84.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32897860 total, 3283832 free, 25072700 used, 4541328 buff/cache
KiB Swap: 8388604 total, 8216988 free, 171616 used. 7211120 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26493 root 20 0 22.772g 0.013t 27008 S 100.7 43.9 2030:19 mongod
26635 root 20 0 14.134g 9.535g 55200 S 23.6 30.4 2429:46 jujud
11454 ubuntu 20 0 40524 3680 3052 R 0.7 0.0 0:00.24 top
31172 root 20 0 0 0 0 S 0.3 0.0 0:00.79 kworker/u64:2
1 root 20 0 185400 6120 4084 S 0.0 0.0 0:17.95 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:12.08 ksoftirqd/0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
8 root 20 0 0 0 0 S 0.0 0.0 2:10.66 rcu_sched
9 root 20 0 0 0 0 S 0.0 0.0 0:00.80 rcu_bh
10 root rt 0 0 0 0 S 0.0 0.0 0:19.74 migration/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.93 watchdog/0
12 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/1
13 root rt 0 0 0 0 S 0.0 0.0 0:24.46 migration/1
14 root 20 0 0 0 0 S 0.0 0.0 0:35.85 ksoftirqd/1
16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
18 root rt 0 0 0 0 S 0.0 0.0 0:00.88 watchdog/2
19 root rt 0 0 0 0 S 0.0 0.0 0:22.77 migration/2
20 root 20 0 0 0 0 S 0.0 0.0 0:46.84 ksoftirqd/2
22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
23 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/3
24 root rt 0 0 0 0 S 0.0 0.0 0:28.39 migration/3
25 root 20 0 0 0 0 S 0.0 0.0 0:32.87 ksoftirqd/3
27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
28 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/4
29 root rt 0 0 0 0 S 0.0 0.0 0:21.02 migration/4
30 root 20 0 0 0 0 S 0.0 0.0 0:50.89 ksoftirqd/4
32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
33 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/5
34 root rt 0 0 0 0 S 0.0 0.0 0:26.53 migration/5
35 root 20 0 0 0 0 S 0.0 0.0 0:31.28 ksoftirqd/5
37 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/5:0H
38 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/6
I've collected logs now as it's happening and I also have the last logs collected which would have been around the time where last successful juju status happened.
They can be accessed at https:/
Changed in juju: | |
status: | New → Triaged |
importance: | Undecided → Critical |
milestone: | none → 2.0.3 |
assignee: | nobody → Richard Harding (rharding) |
tags: | added: cdo-qa-blocker |
Changed in juju: | |
status: | Fix Committed → Fix Released |
Attached to issue #1632362 is a custom build of juju with changed locking semantics. It uses juju/mutex instead of file based locking, so stale lock files are not possible. It also has a longer timeout and no exponential back off so should be less liable to lock contention.
https:/ /bugs.launchpad .net/juju- wait/+bug/ 1632362
Please try that version of juju if possible and see if it appears to fix the issue. If there's no feedback by Monday (I'm off Friday) I'll snap it up to make it easier to try. This version is based off the 2.1 branch. Porting the fix to 2.0.x is trivial as it's just changing the dependency to a newer version of juju/persistent -cookiejar where the change is made.