[2.0.1] Juju status hangs for all models and builds are stuck

Bug #1639577 reported by Larry Michel
Affects          Status        Importance  Assigned to     Milestone
Canonical Juju   Fix Released  Critical    Michael Foord
  2.1 (series)   Fix Released  High        Michael Foord

Bug Description

We have a number of deployment builds running against separate models on a single controller. Sometime after 9:02 UTC, juju status started to hang for all of the models.

This appears reproducible, as it is the second time it has happened. This pastebin shows the hanging juju status process from the first occurrence: https://pastebin.canonical.com/169696/
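For reference, a hang like this can be detected mechanically. Below is a minimal watchdog sketch (not part of our CI; the model names are just examples from this report) that flags any model whose status call exceeds a deadline:

```python
import shutil
import subprocess

def hangs(cmd, timeout):
    """Return True if `cmd` fails to complete within `timeout` seconds."""
    try:
        # Exit status is irrelevant here; we only care whether it returns.
        subprocess.run(cmd, capture_output=True, timeout=timeout, check=False)
        return False
    except subprocess.TimeoutExpired:
        return True

# Only attempt this where the juju CLI is actually installed.
if shutil.which("juju"):
    for model in ["ci-oil-slave3", "ci-oil-slave8"]:  # example model names
        if hangs(["juju", "status", "-m", model], timeout=60):
            print(f"juju status hung for {model}")
```

A deadline of 60 seconds is an arbitrary choice here; a healthy controller normally answers status in a few seconds.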

On the controller, I see:

ubuntu@erchitu:~$ free mem
              total used free shared buff/cache available
Mem: 32897860 25051228 3324256 50004 4522376 7232276
Swap: 8388604 171616 8216988
ubuntu@erchitu:~$ free -m
              total used free shared buff/cache available
Mem: 32126 24464 3243 48 4418 7062
Swap: 8191 167 8024

and top sometimes reports very high CPU numbers for jujud:

top - 14:01:37 up 2 days, 16:10, 1 user, load average: 1.89, 1.82, 1.83
Tasks: 178 total, 2 running, 176 sleeping, 0 stopped, 0 zombie
%Cpu(s): 94.9 us, 4.1 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32897860 total, 3283916 free, 25074604 used, 4539340 buff/cache
KiB Swap: 8388604 total, 8216988 free, 171616 used. 7209136 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26635 root 20 0 14.134g 9.535g 55200 R 763.8 30.4 2429:08 jujud
26493 root 20 0 22.772g 0.013t 27008 S 28.6 43.9 2029:31 mongod
11454 ubuntu 20 0 40524 3680 3052 R 0.7 0.0 0:00.09 top
   34 root rt 0 0 0 0 S 0.3 0.0 0:26.52 migration/5
    1 root 20 0 185400 6120 4084 S 0.0 0.0 0:17.95 systemd
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:12.08 ksoftirqd/0
    5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
    8 root 20 0 0 0 0 S 0.0 0.0 2:10.63 rcu_sched
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.80 rcu_bh
   10 root rt 0 0 0 0 S 0.0 0.0 0:19.74 migration/0
   11 root rt 0 0 0 0 S 0.0 0.0 0:00.93 watchdog/0
   12 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/1
   13 root rt 0 0 0 0 S 0.0 0.0 0:24.44 migration/1
   14 root 20 0 0 0 0 S 0.0 0.0 0:35.85 ksoftirqd/1
   16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
   18 root rt 0 0 0 0 S 0.0 0.0 0:00.88 watchdog/2
   19 root rt 0 0 0 0 S 0.0 0.0 0:22.76 migration/2
   20 root 20 0 0 0 0 S 0.0 0.0 0:46.84 ksoftirqd/2
   22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
   23 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/3
   24 root rt 0 0 0 0 S 0.0 0.0 0:28.38 migration/3
   25 root 20 0 0 0 0 S 0.0 0.0 0:32.87 ksoftirqd/3
   27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
   28 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/4
   29 root rt 0 0 0 0 S 0.0 0.0 0:21.02 migration/4
   30 root 20 0 0 0 0 S 0.0 0.0 0:50.88 ksoftirqd/4
   32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
   33 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/5
   35 root 20 0 0 0 0 S 0.0 0.0 0:31.28 ksoftirqd/5
   37 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/5:0H
   38 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/6
   39 root rt 0 0 0 0 S 0.0 0.0 0:21.69 migration/6

But on average:

top - 14:02:25 up 2 days, 16:11, 1 user, load average: 2.23, 1.93, 1.87
Tasks: 179 total, 1 running, 178 sleeping, 0 stopped, 0 zombie
%Cpu(s): 13.8 us, 1.3 sy, 0.0 ni, 84.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32897860 total, 3283832 free, 25072700 used, 4541328 buff/cache
KiB Swap: 8388604 total, 8216988 free, 171616 used. 7211120 avail Mem

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26493 root 20 0 22.772g 0.013t 27008 S 100.7 43.9 2030:19 mongod
26635 root 20 0 14.134g 9.535g 55200 S 23.6 30.4 2429:46 jujud
11454 ubuntu 20 0 40524 3680 3052 R 0.7 0.0 0:00.24 top
31172 root 20 0 0 0 0 S 0.3 0.0 0:00.79 kworker/u64:2
    1 root 20 0 185400 6120 4084 S 0.0 0.0 0:17.95 systemd
    2 root 20 0 0 0 0 S 0.0 0.0 0:00.04 kthreadd
    3 root 20 0 0 0 0 S 0.0 0.0 0:12.08 ksoftirqd/0
    5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:0H
    8 root 20 0 0 0 0 S 0.0 0.0 2:10.66 rcu_sched
    9 root 20 0 0 0 0 S 0.0 0.0 0:00.80 rcu_bh
   10 root rt 0 0 0 0 S 0.0 0.0 0:19.74 migration/0
   11 root rt 0 0 0 0 S 0.0 0.0 0:00.93 watchdog/0
   12 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/1
   13 root rt 0 0 0 0 S 0.0 0.0 0:24.46 migration/1
   14 root 20 0 0 0 0 S 0.0 0.0 0:35.85 ksoftirqd/1
   16 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:0H
   18 root rt 0 0 0 0 S 0.0 0.0 0:00.88 watchdog/2
   19 root rt 0 0 0 0 S 0.0 0.0 0:22.77 migration/2
   20 root 20 0 0 0 0 S 0.0 0.0 0:46.84 ksoftirqd/2
   22 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/2:0H
   23 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/3
   24 root rt 0 0 0 0 S 0.0 0.0 0:28.39 migration/3
   25 root 20 0 0 0 0 S 0.0 0.0 0:32.87 ksoftirqd/3
   27 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/3:0H
   28 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/4
   29 root rt 0 0 0 0 S 0.0 0.0 0:21.02 migration/4
   30 root 20 0 0 0 0 S 0.0 0.0 0:50.89 ksoftirqd/4
   32 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/4:0H
   33 root rt 0 0 0 0 S 0.0 0.0 0:00.91 watchdog/5
   34 root rt 0 0 0 0 S 0.0 0.0 0:26.53 migration/5
   35 root 20 0 0 0 0 S 0.0 0.0 0:31.28 ksoftirqd/5
   37 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/5:0H
   38 root rt 0 0 0 0 S 0.0 0.0 0:00.87 watchdog/6

I've collected logs now, while it is happening, and I also have the previous set of logs, which would have been collected around the time of the last successful juju status.

They can be accessed at https://drive.google.com/a/canonical.com/file/d/0BxsyAMSSDhWcMlNHZmtlSTRYTG8/view?usp=sharing

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.0.3
assignee: nobody → Richard Harding (rharding)
Revision history for this message
Michael Foord (mfoord) wrote :

Attached to issue #1632362 is a custom build of juju with changed locking semantics. It uses juju/mutex instead of file-based locking, so stale lock files are not possible. It also has a longer timeout and no exponential back-off, so it should be less liable to lock contention.

    https://bugs.launchpad.net/juju-wait/+bug/1632362

Please try that version of juju if possible and see whether it fixes the issue. If there's no feedback by Monday (I'm off Friday) I'll snap it up to make it easier to try. This version is based off the 2.1 branch. Porting the fix to 2.0.x is trivial, as it's just a matter of changing the dependency to the newer version of juju/persistent-cookiejar where the change is made.
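To illustrate why stale locks become impossible with this style of locking, here is a generic sketch (not Juju's actual juju/mutex code) using a POSIX flock-style advisory lock: the kernel drops the lock automatically when the holding process exits, so a crash can never leave a lock behind, unlike a lock file that must be explicitly deleted.

```python
import fcntl
import os

def try_lock(path):
    """Try to take an exclusive advisory lock; return the fd, or None if busy.

    The kernel releases the lock when the fd is closed or when the
    process exits, so a crashed holder cannot leave a stale lock.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        # LOCK_NB makes this fail immediately instead of blocking.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd

fd = try_lock("/tmp/demo.lock")
print("acquired" if fd is not None else "busy")
```

By contrast, a plain lock *file* left behind by a crashed process looks identical to one held by a live process, which is exactly the stale-lock failure mode the custom build avoids.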

Revision history for this message
Anastasia (anastasia-macmood) wrote :

As the fix for bug #1632362 is committed to 2.0.x and 2.1.x, this issue should also be resolved.

Changed in juju:
status: Triaged → Fix Committed
assignee: Richard Harding (rharding) → Michael Foord (mfoord)
Revision history for this message
Larry Michel (lmic) wrote :

This one is Fix Committed, but we can't test with the snap and won't be able to test until bug 1642609 is resolved. In 2.0.1, we're hitting what may or may not be a recreate, with all of the models stuck in destroy-model (December 5th):

+ juju destroy-model --yes maas:ci-oil-slave13
<hangs>
+ juju destroy-model --yes maas:ci-oil-slave8
<hangs>
+ juju destroy-model --yes maas:ci-oil-slave3
<hangs>
+ juju destroy-model --yes maas:ci-oil-slave9
<hangs>
+ juju destroy-model --yes maas:ci-oil-slave5
<hangs>
+ juju destroy-model --yes maas:ci-oil-slave14
<hangs>

Saved off the logs before rebooting the controller: https://drive.google.com/open?id=0BxsyAMSSDhWcLV85Q3BhaEJhdWs

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@Larry,
2.1-beta2 (in the devel PPA, released last week) has the fix too \o/ Is there any chance you could test it?

Revision history for this message
Larry Michel (lmic) wrote :

@anastasia, per conversation with @mfoord, there's a good chance we could hit bug 1642609 with 2.1-beta2, so we're waiting for a resolution on that one before we're able to test it.

tags: added: cdo-qa-blocker
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released