heka generating huge json log with read permission errors
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla-ansible | Invalid | Undecided | Unassigned |
Bug Description
The heka container is creating a huge logfile which is currently 63GB and growing quickly.
I identified the problem container via the UUID of the directory in /var/lib/
docker ps -a | grep 367f6556c
367f6556cfa7 kolla4echo.
The huge log file:
du -sh 367f6556cfa7472
63G 367f6556cfa7472
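The du-based hunt above can be generalized. Here is a small sketch of the same "rank log directories by size" pipeline, run in a throwaway sandbox so it is safe to try anywhere; in practice you would point `du` at the real Docker containers directory instead. The directory names `aaa` and `bbb` are made up for illustration.

```shell
# Sandbox demo of finding the largest per-container log directory.
# In production, replace "$root"/* with the real Docker containers path.
root=$(mktemp -d)
mkdir -p "$root/aaa" "$root/bbb"
head -c 1048576 /dev/zero > "$root/aaa/aaa-json.log"   # 1 MiB "huge" log
head -c 1024    /dev/zero > "$root/bbb/bbb-json.log"   # small log
# Rank directories by disk usage, largest first, keep the top one.
biggest=$(du -s "$root"/* | sort -rn | head -n 1 | awk '{print $2}')
echo "largest log dir: $biggest"
rm -rf "$root"
```

From there, `docker ps -a | grep <hash>` maps the directory name back to a container, exactly as in the report.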
The cause of the problem was that these two log files were not world-readable, so heka could not read them:
/var/lib/
-rw-r----- 1 polkitd systemd-bus-proxy 2031937 Mar 8 05:54 /var/lib/
-rw-r----- 1 nobody polkitd 183429 Mar 7 15:16 /var/lib/
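The `-rw-r-----` mode above is the whole story: the "other" read bit is missing, and heka runs as neither the owner nor the group. A minimal sketch of checking that bit from a script (the temp file stands in for the real log files, and mode 640 mimics the listing above):

```shell
# Create a stand-in file with the same mode as the unreadable logs (-rw-r-----).
tmp=$(mktemp)
chmod 640 "$tmp"
perms=$(stat -c '%a' "$tmp")        # octal mode, e.g. 640
other_bits=$(( perms % 10 ))        # last octal digit = "other" permissions
world_readable=$(( other_bits & 4 ))  # read bit for "other" is 4
if [ "$world_readable" -eq 0 ]; then
  echo "not world-readable: $tmp"
fi
rm -f "$tmp"
```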
The following two entries were being appended to the json.log file at a rate of about 12 per second:
{"log":"2017/03/07 17:19:11 Input 'mariadb_
{"log":"2017/03/07 17:19:11 Input 'openstack_
I was able to stop the logging with the following three commands:
setfacl -R -m g:1000:r /var/lib/
setfacl -R -m d:g:1000:r /var/lib/
docker restart heka
The first command sets an ACL giving group 1000, the group that kolla uses in the context of the heka container, read access to all logs. The second command sets the default ACL on all the log directories so that any newly created logs will inherit the read ACL. Maybe I should have set the default ACL one level up, on _data, so that any newly created log directories would also inherit it.
In my opinion it would be better for kolla to use a unique group ID across containers that also exists on the bare-metal server, so that the group ACL, or just a plain POSIX group, would work more reliably.
These huge logfiles are written to the root filesystem of the server, so when the root filesystem eventually fills, this will also crash the bare-metal server.
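One way to contain the blast radius, a mitigation sketch rather than part of the fix above, is to cap Docker's json-file log driver, which supports `max-size` and `max-file` options in /etc/docker/daemon.json:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "3"
  }
}
```

After editing, restart the Docker daemon; the rotation settings only apply to containers created after the change. This does not fix the permission bug, but it prevents a runaway logger from filling the root filesystem.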
There is another side effect: once Elasticsearch hogs the RabbitMQ connection, it sprays errors into its own log file, again on the root filesystem, with these two entries:
{"log":"2017/03/08 12:29:18 Plugin 'elasticsearch_
{"log":"2017/03/08 12:29:18 Plugin 'elasticsearch_
ending in:
output' error: can't deliver matched message: Queue is full\r\n","stream":"stdout","time":"2017-03-08T20:29:18.914034353Z"}
output' error: can't deliver matched message: Queue is full\r\n","stream":"stdout","time":"2017-03-08T20:29:18.914052368Z"}
but at a much faster rate of hundreds of log records per second...
I assume this comes from Elasticsearch indexing the gigabytes of data generated by heka.
With the message bus overwhelmed by logging, the whole OpenStack cluster stops working... kind of a major problem.