mongo outage causes huge logging output

Bug #1744374 reported by Roger Peppe
This bug affects 1 person
Affects: Canonical Juju
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

I just noticed this in a machine agent log:

     ERROR juju.state.presence pruner.go:93 error removing old entries from "presence.pings": read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout

There were about 1.6 million such lines, produced at a rate of about 133 messages per second.

This was after a mongo outage, but I don't think Juju should be this verbose when that happens - at that rate it will fill up disk space very quickly (and this was on a controller with only the WARNING log level enabled).

I've attached a summary of the log lines: timestamps removed, sorted, deduplicated with counts, and sorted by descending frequency.

The logs spanned about 3h30m.
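
A minimal sketch of the kind of per-message rate limiting that would keep this bounded - the wrapper type below is hypothetical, not Juju's actual loggo API:

    package main

    import (
        "log"
        "sync"
        "time"
    )

    // rateLimitedLogger emits at most one line per message key per
    // interval, plus a count of how many repeats it suppressed.
    type rateLimitedLogger struct {
        mu       sync.Mutex
        interval time.Duration
        last     map[string]time.Time
        dropped  map[string]int
    }

    func newRateLimitedLogger(interval time.Duration) *rateLimitedLogger {
        return &rateLimitedLogger{
            interval: interval,
            last:     make(map[string]time.Time),
            dropped:  make(map[string]int),
        }
    }

    // Errorf logs the message for key if the interval has elapsed,
    // otherwise it only counts the suppressed repeat.
    func (l *rateLimitedLogger) Errorf(key, format string, args ...interface{}) {
        l.mu.Lock()
        defer l.mu.Unlock()
        now := time.Now()
        if t, ok := l.last[key]; ok && now.Sub(t) < l.interval {
            l.dropped[key]++
            return
        }
        if n := l.dropped[key]; n > 0 {
            log.Printf("(%d similar messages suppressed)", n)
            l.dropped[key] = 0
        }
        l.last[key] = now
        log.Printf("ERROR "+format, args...)
    }

    func main() {
        logger := newRateLimitedLogger(time.Minute)
        // All of these land inside the same interval, so one line is
        // emitted instead of a thousand.
        for i := 0; i < 1000; i++ {
            logger.Errorf("presence-prune",
                "juju.state.presence error removing old entries from %q: i/o timeout",
                "presence.pings")
        }
    }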

Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1744374] [NEW] mongo outage causes huge logging output

So, we shouldn't actually be filling up disk very much, because
a) we can't talk to the database, so that isn't filling up
b) we should be rotating our log files to a maximum size.
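
On (b), size-capped rotation looks roughly like the following sketch. It uses the third-party lumberjack package, and the path and limits are illustrative rather than Juju's actual configuration:

    package main

    import (
        "log"

        "gopkg.in/natefinch/lumberjack.v2"
    )

    func main() {
        // Route the standard logger through a size-capped rotating
        // writer. With these limits the log can never exceed roughly
        // (MaxBackups+1) * MaxSize = 900MB on disk, which is why (b)
        // bounds the damage even during a message flood.
        log.SetOutput(&lumberjack.Logger{
            Filename:   "/var/log/juju/machine-0.log", // illustrative path
            MaxSize:    300, // megabytes per file before rotation
            MaxBackups: 2,   // rotated files to keep
        })
        log.Println("output is now size-capped")
    }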

That said, we don't want to be spamming the log file, but I'm less
concerned about that particular issue than about the fact that it
looks like we're trying to re-use a closed connection.
Namely things like:

12096 ERROR juju.worker runner.go:392 exited "txnlog": read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout

12,000 lines using exactly the same source port. Now, there are 3,000
different lines, so maybe it's just a small pool that we pull from? I'm not
really sure.
We could throttle how often we try to reconnect 'txnlog', which is
actually the only place with 10,000+ entries. Everything else seems to
have <30.
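
A sketch of that throttling, assuming a hypothetical runWithBackoff helper rather than the real juju.worker runner: restart failed workers with exponential backoff so reconnect attempts, and their log lines, are spaced out.

    package main

    import (
        "errors"
        "log"
        "time"
    )

    // runWithBackoff restarts work after each failure, doubling the
    // delay up to a cap, so a dead mongo yields on the order of a
    // dozen restart lines rather than tens of thousands.
    func runWithBackoff(work func() error) {
        delay := time.Second
        const maxDelay = 2 * time.Minute
        for {
            err := work()
            if err == nil {
                delay = time.Second // healthy run: reset the backoff
                continue
            }
            log.Printf("worker exited: %v; restarting in %s", err, delay)
            time.Sleep(delay)
            if delay *= 2; delay > maxDelay {
                delay = maxDelay
            }
        }
    }

    func main() {
        // Hypothetical stand-in for the "txnlog" worker body.
        runWithBackoff(func() error {
            return errors.New("read tcp 172.31.23.250:56654->172.31.23.250:37017: i/o timeout")
        })
    }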

John
=:->

On Fri, Jan 19, 2018 at 10:58 PM, Roger Peppe <email address hidden>
wrote:

> ** Attachment added: "io-timeout-log-messages.txt"
>    https://bugs.launchpad.net/bugs/1744374/+attachment/5039688/+files/io-timeout-log-messages.txt

Changed in juju:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: Medium → Low
tags: added: expirebugs-bot