environment unstable after 1.25.8 upgrade

Bug #1645729 reported by Jacek Nykis
This bug affects 7 people
Affects             Status        Importance  Assigned to      Milestone
Canonical Juju      Fix Released  High        Andrew Wilkins
  2.1               Fix Released  High        Andrew Wilkins
juju-core           Fix Released  Critical    Andrew Wilkins
  1.25              Fix Released  Critical    Andrew Wilkins

Bug Description

We recently upgraded a few environments from juju 1.25.6 to 1.25.8 and we started experiencing problems on some of them.

What we know so far:
* the problem only affects bigger environments (8-10 machines and larger); smaller environments look stable
* on the problematic environments, jujud uses a lot of memory on machine 0, for example nearly 1GB RES on a bootstrap node with 2GB RAM
* we see "lost" agents occasionally; it's intermittent, and sometimes environments are fine for hours
* occasionally hooks end up in an error state, and we see errors like this in the logs:
2016-11-29 09:30:29 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2016-11-29 09:30:29 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2016-11-29 09:30:29 ERROR juju.worker.uniter.filter filter.go:137 watcher iteration error: Closed explicitly
2016-11-29 09:30:29 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down
2016-11-29 09:30:29 ERROR juju.api.watcher watcher.go:84 error trying to stop watcher: connection is shut down

juju version is 1.25.8, running on amd64 trusty guests.

I uploaded logs from the bootstrap node here:
https://private-fileshare.canonical.com/~jacek/lp1645729.tgz

Tags: canonical-is
Jacek Nykis (jacekn)
description: updated
Changed in juju-core:
importance: Undecided → Critical
Jacek Nykis (jacekn)
description: updated
Revision history for this message
Richard Harding (rharding) wrote :

Looking through the logs, this looks like a memory issue, but we'll have one of the Juju Core folks dive into this ASAP. If there's any way to move the controller to a machine with more memory and see if things settle down, that would be appreciated while we investigate.

Changed in juju-core:
status: New → Triaged
milestone: none → 1.25.9
Revision history for this message
Tim Penhey (thumper) wrote :

How big is this model? Can we get a count of machines/applications/units?

Revision history for this message
Tim Penhey (thumper) wrote :

I have audited all the code changes between 1.25.6 and 1.25.8, and the only thing that seems even remotely likely to have any impact is the change in the TLS ciphers used to secure the API connections.

It is possible that the problematic code was in existence in 1.25.6 but we weren't triggering it.

@jacekn were there any changes in any of the charms around the time of the 1.25.8 upgrade?

Revision history for this message
Paul Gear (paulgear) wrote :

@thumper: 42 machines, 42 services (coincidental - it's not a 1:1 mapping), 165 services.

Full status available at https://private-fileshare.canonical.com/~paulgear/lp:1645729/status.txt.gz

We generally do not upgrade charms at the same time as juju, but the logs should be the definitive indicator of this. I'll have a quick run through them to see what I can find.

Revision history for this message
Paul Gear (paulgear) wrote :

Extract from machine-0.log showing dates of juju upgrades:

2016-08-17 22:51:15 INFO juju.cmd.jujud upgrade.go:343 starting upgrade from 1.24.7 to 1.25.6 for "machine-0"
2016-11-23 13:29:24 INFO juju.cmd.jujud upgrade.go:343 starting upgrade from 1.25.6 to 1.25.8 for "machine-0"

The only charm upgrades within 1 week of the 1.25.8 upgrade were:

2016-11-21 jc-jujugui
2016-11-21 jc-blues-browser
2016-11-23 jc-jujugui
2016-11-23 jc-blues-browser
2016-11-28 jc-omnibus2
2016-11-29 jc-blues-identity

I've placed logs for the above services in https://private-fileshare.canonical.com/~paulgear/lp:1645729/ also.

Revision history for this message
Paul Gear (paulgear) wrote :

Correction to figures above: 42 machines, 42 services, 165 *UNITS*

Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

When I took down the unit for maintenance once, the mongodb logs showed nearly 800 open connections. Not sure if it's related, but that does sound high even for an environment with 42 machines.

Revision history for this message
Paul Gear (paulgear) wrote :

Profiling data for the 42-machine environment available at https://private-fileshare.canonical.com/~paulgear/lp:1645729/ now.

Revision history for this message
Haw Loeung (hloeung) wrote :

Here's what I've found:

| $ while date; do ps afxuw | grep -E "[[:space:]]$(status jujud-machine-0 | awk '{ print $4 }')|RS\\S"; sudo pmap -XX $(status jujud-machine-0 | awk '{ print $4 }') | grep -E 'Rss|[[:space:]][1-9][[:digit:]]{5}'; sleep 600; done
| Wed Nov 30 23:51:47 UTC 2016
| USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
| root 13713 29.0 25.7 1156072 528852 ? Ssl 22:03 31:35 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
| Address Perm Offset Device Inode Size Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous AnonHugePages Swap KernelPageSize MMUPageSize Locked VmFlagsMapping
| 034a6000 rw-p 030a6000 fd:01 1071550 208 112 95 20 0 4 88 112 88 0 0 4 4 0 rd wr mr mw me dw ac sd jujud
| c81f158000 rw-p 00000000 00:00 0 242336 241052 241052 0 0 0 241052 239684 241052 6144 0 4 4 0 rd wr mr mw me ac sd [stack:13716]
| c82ea00000 rw-p 00000000 00:00 0 141312 141256 141256 0 0 0 141256 140672 141256 47104 0 4 4 0 rd wr mr mw me ac sd
| 1156076 528908 523262 7008 0 10608 511292 526164 511292 59392 0 392 392 0 KB
| ...
| Thu Dec 1 01:31:48 UTC 2016
| USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
| root 13713 29.9 44.9 1593580 922328 ? Ssl Nov30 62:23 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
| Address Perm Offset Device Inode Size Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous AnonHugePages Swap KernelPageSize MMUPageSize Locked VmFlagsMapping
| 034a6000 rw-p 030a6000 fd:01 1071550 208 108 95 16 0 4 88 108 88 0 0 4 4 0 rd wr mr mw me dw ac sd jujud
| c81e49a000 rw-p 00000000 00:00 0 255384 254112 254112 0 0 0 254112 251616 254112 0 0 4 4 0 rd wr mr mw me ac sd [stack:13716]
| c838000000 rw-p 00000000 00:00 0 223232 223232 223232 0 0 0 223232 218328 223232 0 0 4 4 0 rd wr mr mw me ac sd
| c846a10000 rw-p 00000000 00:00 0 264832 227964 227964 0 0 0 227964 227264 227964 0 0 4 4 0 rd wr mr mw me ac sd
| 1593584 922828 917828 6172 0 4340 912316 907352 912316 ...


Revision history for this message
Haw Loeung (hloeung) wrote :

That'll be when it OOMs and something happens that frees up memory again.

| $ free -m
| -bash: fork: Cannot allocate memory
| $ free
| -bash: fork: Cannot allocate memory
| $ dmesg
| -bash: fork: Cannot allocate memory
| $ dmesg
| -bash: fork: Cannot allocate memory
| $ ps afx
| -bash: fork: Cannot allocate memory
| $ uptime
| -bash: fork: Cannot allocate memory

| Thu Dec 1 01:52:45 UTC 2016
| Fetching profile from http://localhost:10000/debug/pprof/heap
| http fetch http://localhost:10000/debug/pprof/heap: Get http://localhost:10000/debug/pprof/heap: EOF

| Thu Dec 1 01:56:42 UTC 2016
| USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
| root 13713 29.7 14.9 2186152 306676 ? Ssl Nov30 69:27 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug
| Address Perm Offset Device Inode Size Rss Pss Shared_Clean Shared_Dirty Private_Clean Private_Dirty Referenced Anonymous AnonHugePages Swap KernelPageSize MMUPageSize Locked VmFlagsMapping
| 00400000 r-xp 00000000 fd:01 1071550 20664 5316 3528 2240 0 3076 0 5316 0 0 0 4 4 0 rd ex mr mw me dw sd jujud
| 0182e000 r--p 0142e000 fd:01 1071550 29152 6720 4203 3188 0 3532 0 6648 0 0 0 4 4 0 rd mr mw me dw sd jujud
| 034a6000 rw-p 030a6000 fd:01 1071550 208 124 98 32 0 4 88 124 88 0 0 4 4 0 rd wr mr mw me dw ac sd jujud
| c820200000 rw-p 00000000 00:00 0 491520 136536 136536 0 0 0 136536 136504 136536 0 0 4 4 0 rd wr mr mw me ac sd nh [stack:13721]
| c846c00000 rw-p 00000000 00:00 0 206848 49488 49488 0 0 0 49488 49488 49488 0 0 4 4 0 rd wr mr mw me ac sd nh [stack:13719]
| 2186156 306736 302146 5732 0 6612 294392 306476 294392 2048 0 576 576 0 KB

| Thu Dec 1 01:56:27 UTC 2016
| Fetching profile from http://localhost:10000/debug/pprof/heap
| Saved profile in /home/hloeung/pprof/pprof.localhost:10000.inuse_objects.inuse_space.021.pb.gz
| Entering interactive mode (type "help" for commands)
| (pprof) 53.93MB of 66.78MB total (80.76%)
| Dropped 754 nodes (cum <= 0.33MB)
| Showing top 10 nodes out of 230 (cum >= 8MB)
| flat flat% sum% cum cum%
| 16.02MB 23.99% 23.99% 16.52MB 24.74% reflect.Value.call
| 9.50MB 14.23% 38.22% 9.50MB 14.23% launchpad.net/tomb.(*Tomb).init
| 8.82MB 13.20% 51.42% 12.83MB 19.22% github.com/juju/juju/apiserver.(*apiRoot).FindMethod.func1
| 7.50MB 11.23% 62.65% 7.50MB 11.23% runtime.malg
| 5.06MB 7.58% 70.23% 5.06MB 7.58% cry...


Revision history for this message
Andrew Wilkins (axwalk) wrote :

Tim and I have been poring over data, and we have found the probable cause of memory growth. There is a cache in the apiserver that accumulates objects, and never frees them for the lifetime of an API connection. Each time an agent asks for a watcher, a new object will be allocated and added to the cache. If the unit agent is erroring and (internally) restarting workers, it will continuously ask for new watchers, which will lead to that cache growing.
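
For illustration, here is a minimal Go sketch of the growth pattern described above; the type and method names are hypothetical and this is not the actual Juju apiserver code. A per-connection registry that only ever adds watcher entries means that an agent repeatedly re-requesting watchers grows memory for the life of the connection.

package main

import (
	"fmt"
	"sync"
)

// watcher stands in for a real watcher's per-object state.
type watcher struct{ buf []byte }

// connResources stands in for a per-connection cache in the API server.
type connResources struct {
	mu      sync.Mutex
	nextID  int
	entries map[int]*watcher
}

// register is called each time an agent asks for a new watcher.
// Nothing ever deletes from entries, so a unit agent that keeps
// restarting its workers makes this map grow for the lifetime of
// the API connection.
func (r *connResources) register(w *watcher) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	r.entries[r.nextID] = w
	return r.nextID
}

func main() {
	r := &connResources{entries: make(map[int]*watcher)}
	// Simulate an erroring agent repeatedly re-requesting watchers.
	for i := 0; i < 1000; i++ {
		r.register(&watcher{buf: make([]byte, 64<<10)})
	}
	fmt.Println("cached watchers:", len(r.entries)) // keeps growing until the connection drops
}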

Revision history for this message
Andrew Wilkins (axwalk) wrote :
Andrew Wilkins (axwalk)
Changed in juju-core:
status: Triaged → In Progress
assignee: nobody → Andrew Wilkins (axwalk)
milestone: 1.25.9 → none
importance: Critical → High
Andrew Wilkins (axwalk)
Changed in juju-core:
status: In Progress → Fix Committed
Changed in juju-core:
milestone: none → 1.25.9
Changed in juju:
status: New → Fix Committed
importance: Undecided → High
milestone: none → 2.0.3
assignee: nobody → Andrew Wilkins (axwalk)
no longer affects: juju-core/2.0
Changed in juju-core:
importance: High → Critical
Curtis Hovey (sinzui)
Changed in juju-core:
milestone: 1.25.9 → none
Revision history for this message
Andrew Wilkins (axwalk) wrote :

The change has been tested in an affected environment, and shown to stop the memory from growing. As described in Comment #11, the issue is that we were accumulating memory each time an agent acquired a watcher, and never released that memory until the API connection was severed. We now release the memory when the watchers are closed.

There may be additional issues causing agents to repeatedly acquire watchers; there is no change for this as of yet. We have not observed any other problems in the affected environment since the fix was tested.
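
Continuing the hypothetical sketch from the earlier comment (assumed names, not the actual Juju patch), the shape of the fix is to delete the registry entry when a watcher is stopped, so memory is reclaimed without waiting for the API connection to be severed.

package main

import (
	"fmt"
	"sync"
)

type watcher struct{ buf []byte }

type connResources struct {
	mu      sync.Mutex
	nextID  int
	entries map[int]*watcher
}

func (r *connResources) register(w *watcher) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.nextID++
	r.entries[r.nextID] = w
	return r.nextID
}

// stopWatcher releases the cached entry as soon as the watcher is
// closed, instead of holding it until the API connection is severed.
func (r *connResources) stopWatcher(id int) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.entries, id)
}

func main() {
	r := &connResources{entries: make(map[int]*watcher)}
	for i := 0; i < 1000; i++ {
		id := r.register(&watcher{buf: make([]byte, 64<<10)})
		r.stopWatcher(id) // memory no longer accumulates per watcher request
	}
	fmt.Println("cached watchers after stops:", len(r.entries)) // 0
}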

Revision history for this message
Nicholas Skaggs (nskaggs) wrote :

These changes are in the proposed PPA / stream. Confirmation that this doesn't occur with the proposed agents would be appreciated. See https://launchpad.net/~juju/+archive/ubuntu/1.25-proposed, and the PPA description for details on setting your environment to use the proposed agents that go with the binary.

Revision history for this message
Paul Gear (paulgear) wrote :

I've upgraded two Canonistack environments successfully to the proposed version and will see how they go in terms of resource usage over this week.

Revision history for this message
Paul Gear (paulgear) wrote :

I've just checked my Canonistack LCY01 environment, and the memory usage seems to climb more slowly, but I still had lost agents this morning (although that's really an issue for lp:1587644). Here are graphs of the last week of memory & network traffic from the controller for that environment.

Revision history for this message
Paul Gear (paulgear) wrote :
Revision history for this message
Nick Moffitt (nick-moffitt) wrote :

We have upgraded the environment to 1.25.9, and will check back tomorrow to see if the error has disappeared.

Changed in juju-core:
status: Fix Committed → Fix Released
Revision history for this message
Paul Gear (paulgear) wrote :

For the record, I've still been getting occasional instability in both of my Canonistack environments, which have been upgraded to 1.25.9, but an equivalent 2.0.2 environment + controller has been fine since Dec 27 05:51:53 2016.

The production environment with 42 machines mentioned earlier has been affected by bug 1484105 in the meantime, but it seems that despite constant network traffic, it has been reclaiming memory appropriately. (I'll attach graph screenshots shortly to show this.)

Revision history for this message
Paul Gear (paulgear) wrote :
Revision history for this message
Paul Gear (paulgear) wrote :
Paul Gear (paulgear)
tags: added: canonical-is
Revision history for this message
Paul Gear (paulgear) wrote :

This afternoon we had what appears to be a recurrence of this bug in one of our Azure 2.0.2 environments. I gathered profiling data per https://wiki.canonical.com/InformationInfrastructure/WebOps/Juju#Gathering_debug_info_with_pprof and placed it in https://private-fileshare.canonical.com/~paulgear/lp:1645729/2017-01-10/

Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released