[2.4.3] API server unresponsive in HA env
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Canonical Juju |
Fix Released
|
High
|
Christian Muirhead | ||
2.4 |
Fix Released
|
High
|
Christian Muirhead | ||
2.5 |
Fix Released
|
High
|
Christian Muirhead |
Bug Description
Today when checking connectivity for adding a new machine, I discovered one of our main production controller machines was refusing connections, and it and another were reporting as down in juju status. @thumper assisted and we found that:
- API servers on both machines were not listening on :17070
- juju_engine_report was unresponsive; strace showed it waiting on a mutex.
- juju_presence_
- there were panics in the machine log of the remaining functional controller: https:/
I performed a rolling restart; during initial reconnect of agents, a large number of 'too many open files' messages were seen on at least 2 of the controller machines. This quieted down after all the remote agents reconnected and the load settled.
Changed in juju: | |
assignee: | nobody → Christian Muirhead (2-xtian) |
Changed in juju: | |
milestone: | 2.5-beta1 → 2.5-beta2 |
Changed in juju: | |
status: | Triaged → Fix Committed |
milestone: | 2.5-beta2 → 2.5-beta1 |
Changed in juju: | |
status: | Fix Committed → Fix Released |
Changed in juju: | |
milestone: | 2.5-beta1 → none |
status: | Fix Released → Confirmed |
https:/ /private- fileshare. canonical. com/~paulgear/ lp1792299/ has controller logs which hopefully cover the period of non-functionality and initial startup after the rolling restart, along with an engine report for each controller after rolling restart.