Agents lost/unresponsive due to too many open files

Bug #1733469 reported by Paul Gear
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Ian Booth
Milestone: 2.3-rc2

Bug Description

While investigating downrev/lost model agents after a controller upgrade to 2.2.6, I came across a model with a machine that was running out of file descriptors, with this traffic showing up constantly in the logs: https://pastebin.canonical.com/203713/plain/ There were 48201 open files on the system, 44684 of which were opened by jujud processes, and 43722 of those were unix domain sockets.

@jsing believes these are opened by: https://github.com/juju/juju/blob/develop/juju/sockets/sockets_nix.go#L47
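
For context, each leaked socket is a listener that was opened but never closed, and every un-closed listener pins a file descriptor for the life of the process. A minimal Go sketch of the pattern (plain stdlib usage with an illustrative socket path, not the actual Juju code):

  // Each net.Listen("unix", ...) consumes a file descriptor until the
  // returned listener is closed.
  package main

  import (
      "log"
      "net"
  )

  func main() {
      l, err := net.Listen("unix", "/tmp/demo.sock")
      if err != nil {
          log.Fatal(err)
      }
      // Without this Close, both the descriptor and the socket stay
      // open until the process exits.
      defer l.Close()
      log.Println("listening on", l.Addr())
  }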

Please advise what diagnostics you'd like gathered before we restart these agents.

Tags: canonical-is
Paul Gear (paulgear) wrote:

The agents in question are version 2.2.2.

tags: added: canonical-is
Andrew Wilkins (axwalk) wrote:

There's a bug in https://github.com/juju/juju/blob/develop/worker/metrics/spool/listener.go#L76, which says "NewPeriodicWorker returns a periodic worker, that will call a stop function when it is killed."

That is only true if the worker is explicitly killed from the outside, and not if the worker dies due to an error from the inside. In the latter case, the stop function (which closes the Unix socket) is not called.
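
To illustrate (a minimal sketch with made-up names, not the actual listener code): cleanup that only runs on the explicit-kill path is skipped when the loop exits with an internal error, whereas deferring the stop function makes it run on every exit path:

  package main

  import (
      "errors"
      "fmt"
  )

  type periodicWorker struct {
      work func() error
      stop func() // closes the unix socket in the real code
      kill chan struct{}
  }

  // loop mirrors the buggy shape: stop only runs on the kill path,
  // so an internal error leaks the socket.
  func (w *periodicWorker) loop() error {
      for {
          select {
          case <-w.kill:
              w.stop()
              return nil
          default:
              if err := w.work(); err != nil {
                  return err // stop is never called here
              }
          }
      }
  }

  // fixedLoop defers stop so it runs on every exit path.
  func (w *periodicWorker) fixedLoop() error {
      defer w.stop()
      for {
          select {
          case <-w.kill:
              return nil
          default:
              if err := w.work(); err != nil {
                  return err
              }
          }
      }
  }

  func main() {
      w := &periodicWorker{
          work: func() error { return errors.New("failed to open the metric reader: EOF") },
          stop: func() { fmt.Println("socket closed") },
          kill: make(chan struct{}),
      }
      fmt.Println(w.loop())      // error returned; "socket closed" never prints
      fmt.Println(w.fixedLoop()) // same error, but cleanup still runs
  }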

Ian Booth (wallyworld) wrote:

It would be good if you could look back through the logs to identify the original root-cause error in the metrics functionality. That will still need to be fixed. The leak in response to that error is obviously wrong, but there will still be an issue to address even once the leak is plugged.

Changed in juju:
milestone: none → 2.3-rc2
Ian Booth (wallyworld)
Changed in juju:
importance: Undecided → High
assignee: nobody → Ian Booth (wallyworld)
status: New → Triaged
Joel Sing (jsing) wrote:

The full logs between the agent restart and the first "failed to open metric reader" are:

https://pastebin.canonical.com/203715/

Paul Gear (paulgear) wrote:

Confirmed: restarting the agents when zero-length files are present in /var/lib/juju/metricspool/ reproduces the log error shown, and the agents do not recover. The workaround is to remove those zero-length files and restart the Juju agents.

Joel Sing (jsing) wrote:

This is readily reproducible as follows:

 $ sudo touch /var/lib/juju/metricspool/metrics.meta
 <restart unit agent>

At which point the unit agent will start logging:

  2017-11-21 06:49:31 ERROR juju.worker.dependency engine.go:551 "metric-sender" manifold worker returned unexpected error: failed to open the metric reader: EOF

And leaking file handles:

  $ while sleep 5; do sudo lsof | grep \/var\/lib\/juju\/agents | wc -l; done
  36
  54
  63
  81
  99

A similar failure also occurs if a metrics metadata file is corrupt, for example if it contains only zero bytes after a power outage or file system corruption:

  $ sudo dd if=/dev/zero of=metrics.meta bs=512 count=1
  <restart unit agent>

  2017-11-21 06:58:49 ERROR juju.worker.dependency engine.go:551 "metric-sender" manifold worker returned unexpected error: failed to open the metric reader: invalid character '\x00' looking for beginning of value

The zero-length metadata case can be caused by a bug in recordMetaData: the code that closes the file and moves it into place is called unconditionally from a defer, so if the encode fails (e.g. on a full disk), the metadata file is still put into place:

  https://github.com/juju/juju/blob/develop/worker/metrics/spool/metrics.go#L265
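
A hypothetical reconstruction of that shape and the obvious fix (illustrative code, not the exact Juju source): in the buggy version the deferred close-and-rename runs even when the encode fails, so a truncated or empty metrics.meta is still installed; the fix renames only after the encode and close both succeed:

  package main

  import (
      "encoding/json"
      "fmt"
      "os"
  )

  // Buggy shape: the deferred cleanup moves the temp file into place
  // regardless of whether Encode succeeded (e.g. on a full disk).
  func recordMetaDataBuggy(path string, meta interface{}) error {
      f, err := os.Create(path + ".tmp")
      if err != nil {
          return err
      }
      defer func() {
          f.Close()
          os.Rename(path+".tmp", path) // runs even on encode failure
      }()
      return json.NewEncoder(f).Encode(meta)
  }

  // Safer shape: only rename once the encode and close have succeeded;
  // otherwise discard the temp file.
  func recordMetaDataFixed(path string, meta interface{}) error {
      tmp := path + ".tmp"
      f, err := os.Create(tmp)
      if err != nil {
          return err
      }
      if err := json.NewEncoder(f).Encode(meta); err != nil {
          f.Close()
          os.Remove(tmp)
          return err
      }
      if err := f.Close(); err != nil {
          os.Remove(tmp)
          return err
      }
      return os.Rename(tmp, path)
  }

  func main() {
      if err := recordMetaDataFixed("/tmp/metrics.meta", map[string]string{"charm": "demo"}); err != nil {
          fmt.Println(err)
      }
  }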

Ian Booth (wallyworld)
Changed in juju:
status: Triaged → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Joel Sing (jsing) wrote:
Changed in juju:
status: Fix Committed → Fix Released