azure controller becomes unusable after a few days

Bug #1636634 reported by Kevin W Monroe
This bug affects 1 person

Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Alexis Bruemmer
Milestone: 2.1-beta3

Bug Description

I first noticed a problem where juju commands took forever on azure models after a few days (bug 1628206). I suspected the controller size was too small, but later thought it was related to a lease manager problem.

Now I'm on juju2 GA with the lease manager problem fixed, but azure models are crawling again. As in, it takes > 30 minutes for 'juju status' to return. I was able to ssh directly to the controller and noticed that the load was > 10, free memory was < 100MB, and there were ~40K entries like this in /var/log/juju/*:

logsink.log:ea1a4b16-8baf-4622-8f23-657a7ad19da3 machine-0: 2016-10-25 03:16:01 ERROR juju.worker.dependency engine.go:539 "disk-manager" manifold worker returned unexpected error: cannot list block devices: lsblk failed: fork/exec /bin/lsblk: cannot allocate memory
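
For scale, a quick way to count and bracket those errors from an ssh session on the controller (a rough sketch; adjust the filename if the errors are spread across more than one log under /var/log/juju):

sudo grep -c 'cannot allocate memory' /var/log/juju/logsink.log           # total count
sudo grep 'cannot allocate memory' /var/log/juju/logsink.log | head -n 1  # first occurrence
sudo grep 'cannot allocate memory' /var/log/juju/logsink.log | tail -n 1  # most recent occurrence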

Soooooo, I'm back to wondering if the default azure controller instance size is too small. AWS gives me a 3.5G instance; can we bump azure to something similar (it currently gives me a 1.7G instance)?

Reproduce with: juju deploy spark-processing; check on it in 5 days.
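
Spelled out a bit more, the repro is roughly (a sketch, assuming an Azure credential is already set up and the default, unconstrained controller instance):

juju bootstrap azure/southcentralus
juju deploy spark-processing
# wait ~5 days, then:
juju status    # on the affected controller this took > 30 minutes to return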

Logs coming...

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

Controller logs after 7 days of uptime (I had to reboot on 10/25 to get the machine usable enough to pull the logs off).

Maybe useful info: I was testing a very noisy rtm charm during the first 3 days of uptime. Maybe all the log entries from that charm pushed it over the edge?

At 2016-10-23 16:15:26, you'll start seeing the "cannot allocate memory" errors.

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I should mention this happened in azure/southcentralus. I can repro in other regions if those logs would be helpful.

Changed in juju:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 2.1.0
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I bootstrapped azure/southcentralus with --constraints mem=3G and deployed a couple heavy hitting bundles (hadoop-processing and hadoop-kafka) in 2 different models.

I'm 5 days in, load average is < 0.1, and memory seems to be holding steady at 2G used, 1G free:

ubuntu@machine-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.4G        2.0G        1.0G         35M        344M        1.0G
Swap:            0B          0B          0B

There may be a slow leak somewhere, so I'll continue to let this run for another week. However, in the short term, bumping the controller memory from 1.7 to 3.5G has been working well for me.
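
For reference, the workaround bootstrap above was roughly (a sketch; the second model name is hypothetical, and Juju picks a default controller name when none is given):

juju bootstrap azure/southcentralus --constraints mem=3G
juju deploy hadoop-processing
juju add-model extra                  # hypothetical name for the second model
juju deploy -m extra hadoop-kafka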

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

I'm 10 more days in (15 total), and I've been regularly adding/removing decently large bundles (5-10 units, 10-15 apps). So far, so good. Load is still way down (< 0.1), though memory usage has crept up:

ubuntu@machine-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.4G        2.6G        164M         35M        606M        426M
Swap:            0B          0B          0B

Resident mem for mongod (1.9G) and jujud (580MB) seems high to me, and I'm pessimistic about what will happen when I hit the 3.5G ceiling:

ubuntu@machine-0:~$ ps aux --sort -rss | head -3
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5635 3.0 53.2 5479312 1875076 ? Ssl Oct26 656:26 /usr/lib/juju/mongo3.2/bin/mongod --dbpath /var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword=xxxxxxx --port 37017 --syslog --journal --replSet juju --quiet --oplogSize 1024 --ipv6 --auth --keyFile /var/lib/juju/shared-secret --storageEngine wiredTiger
root 5783 3.2 16.4 1700204 579812 ? Sl Oct26 708:08 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug

Let me know if you'd like me to run any debug on this. Not sure how much my l33t ps and free data helps, but I'll keep watching.
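
One rough way to keep watching the trend (a sketch; the log path and hourly interval below are arbitrary):

# append a timestamped snapshot of the top memory consumers once an hour
while true; do
    date
    ps aux --sort -rss | head -3
    sleep 3600
done >> /tmp/juju-rss.log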

Changed in juju:
assignee: nobody → Alexis Bruemmer (alexis-bruemmer)
Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

It's 18 days later (33 total), and the controller is starting to struggle:

Load is hovering just above 1:

ubuntu@machine-0:~$ w
 19:04:41 up 33 days, 3:49, 1 user, load average: 1.15, 1.15, 1.18
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
ubuntu pts/0 208.91.64.48 19:02 0.00s 0.07s 0.00s w

It's down to 161M of available RAM (182M free):

ubuntu@machine-0:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           3.4G        2.9G        182M         35M        317M        161M
Swap:            0B          0B          0B

Both mongod and jujud have crept up another ~100MB in resident mem usage:

ubuntu@machine-0:~$ ps aux --sort -rss | head -3
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 5635 2.2 56.5 7177656 1991216 ? Ssl Oct26 1067:49 /usr/lib/juju/mongo3.2/bin/mongod --dbpath /var/lib/juju/db --sslOnNormalPorts --sslPEMKeyFile /var/lib/juju/server.pem --sslPEMKeyPassword=xxxxxxx --port 37017 --syslog --journal --replSet juju --quiet --oplogSize 1024 --ipv6 --auth --keyFile /var/lib/juju/shared-secret --storageEngine wiredTiger
root 5783 2.4 18.4 1767276 651068 ? Sl Oct26 1146:23 /var/lib/juju/tools/machine-0/jujud machine --data-dir /var/lib/juju --machine-id 0 --debug

And commands like 'juju status' and 'juju models' are taking ~15s to return. So the larger controller size has helped, but even with 3.5G RAM, it's in bad shape after about a month.

Note this is with 2.0.0. I'm not sure if memory leak fixes made it into 2.1.0, but I'd be happy to upgrade and try to repro. Otherwise, I can leave this env around to help debug if needed.

Revision history for this message
Anastasia (anastasia-macmood) wrote :

@kwmonroe

Several fixes went into the next 2.0.x release and should also be available in 2.1-beta1.
Please upgrade and re-open the bug if the issue continues :D
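
For reference, upgrading the controller model would look roughly like this (a sketch; 'controller' is the default controller model name in 2.x, and the version shown is only illustrative):

juju upgrade-juju -m controller
# or pin a specific version:
juju upgrade-juju -m controller --agent-version 2.0.2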

Changed in juju:
status: Triaged → Fix Committed
Curtis Hovey (sinzui)
Changed in juju:
milestone: 2.1.0 → 2.1-beta3
Curtis Hovey (sinzui)
Changed in juju:
status: Fix Committed → Fix Released