juju-core

/var/spool/rsyslog grows without bound

Bug #1453801 reported by John A Meinel on 2015-05-11

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
juju-core	Fix Released	High	Andrew Wilkins	juju-core 1.25-alpha1
1.22	Fix Released	Critical	Andrew Wilkins	juju-core 1.22.4
1.23	Fix Released	High	Andrew Wilkins	juju-core 1.23.4
1.24	Fix Released	High	Andrew Wilkins	juju-core 1.24-beta3

Bug Description

We are currently using rsyslog with Disk-Assisted Memory Queues:
ActionQueueFileName machine-X_X

However, we are not passing a value for
ActionQueueMaxDiskSpace

which means that the on-disk part of each rsyslog queue is unbounded.
On one production site we currently have a 12GB disk queue which is causing us to run out of disk space (which then makes mongo go crazy).
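
To check whether a site is affected, the spool directory can simply be measured. The following is a minimal sketch, not part of juju: the helper name dir_size_bytes is made up for illustration, and the path is the one from this bug's title.

```python
import os


def dir_size_bytes(path):
    """Return the total size in bytes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                # lstat counts a symlink's own size, not its target's.
                total += os.lstat(full).st_size
            except OSError:
                # A queue file may be removed while we are walking.
                pass
    return total


if __name__ == "__main__":
    spool = "/var/spool/rsyslog"  # path taken from the bug title
    size_mb = dir_size_bytes(spool) / (1024.0 * 1024.0)
    print("%s: %.1f MB" % (spool, size_mb))
```

Running it a few times while the other state servers are unreachable should show whether the disk queues are growing.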

This problem is probably exacerbated when you go into HA mode, because then we end up with a forward rule (and associated queue) for each other API server, which means that instead of just 1 disk-assisted queue, we end up with 3 or even 5 of them.

I'm not sure if we are properly sending our messages (it is possible that we forward every message we get, including the ones that are sent to us from someone else, which would mean we overbroadcast by a factor of at least N, if not N^2).

But at the very least we should bound our QueueSize to be no larger than our maximum all-machines.log size.

Revision history for this message

John A Meinel (jameinel) wrote on 2015-05-11:

In current trunk: https://github.com/juju/juju/blob/master/utils/syslog/config.go#L75 says that we will rotate all-machines.log when it grows beyond 512MB, thus it is more than safe enough to set ActionQueueSize to 512MB. That is probably still too big, but at least it is bounded.
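
As a rough worked example of what "bounded" would mean here: assuming the disk cap applies to each forwarding queue separately (an assumption, not something verified in this thread), the worst-case spool usage is the number of queues times the cap. The function name below is made up for illustration.

```python
MB = 1024 * 1024


def worst_case_spool_bytes(num_queues, cap_bytes):
    """Upper bound on spool usage if every queue fills up to its cap."""
    return num_queues * cap_bytes


# 1 queue without HA; 3 or 5 queues are the HA figures from the description.
for queues in (1, 3, 5):
    bound = worst_case_spool_bytes(queues, 512 * MB)
    print("%d queue(s): at most %d MB" % (queues, bound // MB))
```

With a 512MB cap that is 512MB, 1536MB and 2560MB respectively, which is still a lot of disk but well short of the 12GB seen in production.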

Andrew Wilkins (axwalk) wrote on 2015-05-13:

According to http://www.rsyslog.com/doc/queues.html (under "Limiting the Queue Size"), we should be setting "$<object>QueueMaxDiskSpace", and not ActionQueueSize.
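
For illustration, a bounded forwarding rule in rsyslog's legacy directive format might look like what the sketch below renders. This is not juju's actual template: the function name, the queue name, the target address and the "512m" size spelling are all assumptions made for the example, and the real juju forward rules carry more settings than shown here.

```python
def bounded_forward_rule(queue_name, target, max_disk_space="512m"):
    """Render a legacy-format rsyslog forward rule with a capped disk queue."""
    lines = [
        "$ActionQueueType LinkedList",
        "$ActionQueueFileName %s" % queue_name,
        # The directive this bug is about: caps the on-disk queue size.
        "$ActionQueueMaxDiskSpace %s" % max_disk_space,
        "$ActionQueueSaveOnShutdown on",
        # Forward everything over TCP to the other server.
        "*.* @@%s" % target,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical queue name and address, for illustration only.
    print(bounded_forward_rule("fwd-example", "10.0.0.2:6514"))
```

The queue directives apply to the next action that follows them, so each forward rule would need its own copy of the cap.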

Andrew Wilkins (axwalk) wrote on 2015-05-13:

I'm not sure how to repro this, so I haven't been able to verify my fix beyond ensuring that the rsyslog config is updated after upgrading. I tried creating an HA env, creating a giant machine-0.log and all-machines.log on machine 0, and then disabling rsyslog on the 2nd and 3rd state servers; this didn't cause the spool to grow by more than a couple of MB.

I'll mark this Fix Committed once I've merged, but it would be good if someone could provide steps to repro.

Andrew Wilkins (axwalk) wrote on 2015-05-13:

Seems that long log lines get silently chucked away. When I created the "giant all-machines.log", I was creating log lines that were ~1MB long. John provided a snippet of Python to create log lines in all-machines.log:

python -c "import syslog; syslog.openlog('juju-test'); syslog.syslog(syslog.LOG_WARNING, 'test this out')"

(which can be tweaked to generate longer lines, and so on)
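
One possible way to tweak it, as a sketch: the same syslog calls as the one-liner above, wrapped so that the number and length of lines can be varied. The helper names make_line and send_lines are made up here. Given the observation above that ~1MB lines get dropped, the line length is best kept modest.

```python
import syslog


def make_line(index, length):
    """Build a test message padded with 'x' up to roughly `length` chars."""
    prefix = "test line %d " % index
    return prefix + "x" * max(0, length - len(prefix))


def send_lines(count, length, ident="juju-test"):
    """Send `count` warning-level messages of about `length` chars each."""
    syslog.openlog(ident)
    for i in range(count):
        syslog.syslog(syslog.LOG_WARNING, make_line(i, length))
    syslog.closelog()
```

For example, send_lines(1000, 200) would emit a thousand 200-character lines tagged juju-test.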

Andrew Wilkins (axwalk) on 2015-05-13

Changed in juju-core:
status:	Triaged → In Progress
assignee:	nobody → Andrew Wilkins (axwalk)
milestone:	none → 1.25.0

John A Meinel (jameinel) on 2015-05-13

description:	updated

Andrew Wilkins (axwalk) on 2015-05-13

Changed in juju-core:
status:	In Progress → Fix Committed

Curtis Hovey (sinzui) on 2015-05-20

Changed in juju-core:
status:	Fix Committed → Fix Released
