mongo outage causes huge logging output

Bug #1744374 reported by Roger Peppe
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

I just noticed this in a machine agent log:

     ERROR juju.state.presence pruner.go:93 error removing old entries from "presence.pings": read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout

There were about 1.6 million such lines, produced at a rate of about 133 messages per second.

This was after a mongo outage, but I don't think Juju should be this verbose when that happens - at that rate it will fill up disk space very quickly (and this was on a controller with only the WARNING log level enabled).

I've attached a summary of the log lines: timestamps removed, sorted, deduplicated with counts, and sorted by descending frequency.

The logs spanned about 3h30m.
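
A minimal sketch of the kind of per-message rate limiting that would keep this bounded - the wrapper type below is hypothetical, not Juju's actual loggo API:

    package main

    import (
        "log"
        "sync"
        "time"
    )

    // rateLimitedLogger emits at most one line per message key per
    // interval, plus a count of how many repeats it suppressed.
    type rateLimitedLogger struct {
        mu       sync.Mutex
        interval time.Duration
        last     map[string]time.Time
        dropped  map[string]int
    }

    func newRateLimitedLogger(interval time.Duration) *rateLimitedLogger {
        return &rateLimitedLogger{
            interval: interval,
            last:     make(map[string]time.Time),
            dropped:  make(map[string]int),
        }
    }

    // Errorf logs the message for key if the interval has elapsed,
    // otherwise it only counts the suppressed repeat.
    func (l *rateLimitedLogger) Errorf(key, format string, args ...interface{}) {
        l.mu.Lock()
        defer l.mu.Unlock()
        now := time.Now()
        if t, ok := l.last[key]; ok && now.Sub(t) < l.interval {
            l.dropped[key]++
            return
        }
        if n := l.dropped[key]; n > 0 {
            log.Printf("(%d similar messages suppressed)", n)
            l.dropped[key] = 0
        }
        l.last[key] = now
        log.Printf("ERROR "+format, args...)
    }

    func main() {
        logger := newRateLimitedLogger(time.Minute)
        // All of these land inside the same interval, so one line is
        // emitted instead of a thousand.
        for i := 0; i < 1000; i++ {
            logger.Errorf("presence-prune",
                "juju.state.presence error removing old entries from %q: i/o timeout",
                "presence.pings")
        }
    }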

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1744374] [NEW] mongo outage causes huge logging output

So, we shouldn't actually be filling up disk very much, because
a) we can't talk to the database, so that isn't filling up
b) we should be rotating our log files to a maximum size.
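
On (b), size-capped rotation looks roughly like the following sketch. It uses the third-party lumberjack package, and the path and limits are illustrative rather than Juju's actual configuration:

    package main

    import (
        "log"

        "gopkg.in/natefinch/lumberjack.v2"
    )

    func main() {
        // Route the standard logger through a size-capped rotating
        // writer. With these limits the log can never exceed roughly
        // (MaxBackups+1) * MaxSize = 900MB on disk, which is why (b)
        // bounds the damage even during a message flood.
        log.SetOutput(&lumberjack.Logger{
            Filename:   "/var/log/juju/machine-0.log", // illustrative path
            MaxSize:    300, // megabytes per file before rotation
            MaxBackups: 2,   // rotated files to keep
        })
        log.Println("output is now size-capped")
    }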

That said, we don't want to be spamming the log file, but I'm less
concerned about that particular issue than about the fact that it
looks like we're trying to re-use a closed connection.
Namely things like:

12096 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout

12,000 lines using exactly the same source port. Now, there are 3,000
different lines, so maybe it's just a small pool that we pull from? I'm not
really sure.
We could throttle how often we try to reconnect 'txnlog', which is
actually the only place with 10,000+ entries. Everything else seems to
have <30.
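
A sketch of that throttling, assuming a hypothetical runWithBackoff helper rather than the real juju.worker runner: restart failed workers with exponential backoff so reconnect attempts, and their log lines, are spaced out.

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // runWithBackoff restarts work after each failure, doubling the
    // delay up to a cap, so a dead mongo yields on the order of a
    // dozen restart lines rather than tens of thousands.
    func runWithBackoff(work func() error) {
        delay := time.Second
        const maxDelay = 2 * time.Minute
        for {
            err := work()
            if err == nil {
                delay = time.Second // healthy run: reset the backoff
                continue
            }
            log.Printf("worker exited: %v; restarting in %s", err, delay)
            time.Sleep(delay)
            if delay *= 2; delay > maxDelay {
                delay = maxDelay
            }
        }
    }

    func main() {
        // Hypothetical stand-in for the "txnlog" worker body.
        runWithBackoff(func() error {
            return errors.New("read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout")
        })
    }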

John
=:->

On Fri, Jan 19, 2018 at 10:58 PM, Roger Peppe <email address hidden>
wrote:

> ** Attachment added: "io-timeout-log-messages.txt"
>    https://bugs.launchpad.net/bugs/1744374/+attachment/5039688/+files/io-timeout-log-messages.txt

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot