Azure controller becomes unusable after a few days
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Canonical Juju | Fix Released | Critical | Alexis Bruemmer | 2.1-beta3
Bug Description
I first noticed a problem where juju commands took forever on Azure models after a few days (bug 1628206). I suspected the controller size was too small, but later thought it was related to a lease-manager problem.
Now I'm on Juju 2 GA with the lease-manager problem fixed, but Azure models are crawling again. As in, it takes > 30 minutes for 'juju status' to return. I was able to ssh directly to the controller and noticed the load was > 10, free memory was < 100MB, and there are 40K entries like this in /var/log/juju/*:

logsink.
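For reference, the checks above can be reproduced roughly like this (a sketch assuming a Juju 2.x client and the default 'controller' model name; the log path is the one from this report):

    # Open a shell on the controller machine (machine 0 of the controller model).
    juju ssh -m controller 0

    # On the controller: check load average and available memory.
    uptime
    free -m

    # Count the offending entries per file in the Juju logs.
    grep -c logsink /var/log/juju/*.log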
So, I'm back to wondering if the default Azure controller instance size is too small. AWS gives me a 3.5GB instance; can we bump Azure to something similar (it currently gives me a 1.7GB instance)?
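Until the default is bumped, a possible workaround is to request more memory explicitly at bootstrap time; a sketch assuming the Juju 2.x --bootstrap-constraints flag ('azure-test' is a placeholder controller name):

    # Bootstrap an Azure controller with at least 4 GB of RAM,
    # roughly matching the AWS default.
    juju bootstrap azure azure-test --bootstrap-constraints mem=4G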
Reproduce with: juju deploy spark-processing; check on it in 5 days.
Logs coming...
Changed in juju:
  status: New → Triaged
  importance: Undecided → Critical
  milestone: none → 2.1.0

Changed in juju:
  assignee: nobody → Alexis Bruemmer (alexis-bruemmer)

Changed in juju:
  milestone: 2.1.0 → 2.1-beta3

Changed in juju:
  status: Fix Committed → Fix Released
Controller logs after 7 days of uptime (I had to reboot on 10/25 to get the machine usable enough to pull the logs off).
Maybe useful info: I was testing a very noisy RTM charm during the first 3 days of uptime. Maybe all the log entries from that charm pushed it over the edge?
At 2016-10-23 16:15:26, you'll start seeing the "cannot allocate memory" errors.
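For anyone skimming the attached logs, something like this should jump straight to those entries (assuming the standard machine-0 log path; the timestamp above is from the report):

    # Find the first "cannot allocate memory" errors in the controller log.
    grep -n 'cannot allocate memory' /var/log/juju/machine-0.log | head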