heka generating huge json log with read permission errors

Bug #1671177 reported by James McEvoy
This bug affects 1 person
Affects        Status   Importance  Assigned to  Milestone
kolla-ansible  Invalid  Undecided   Unassigned

Bug Description

The heka container is creating a huge JSON log file, which is currently 63GB and growing quickly.

I identified the problem container from the ID of the directory under /var/lib/docker/containers that was huge:
docker ps -a | grep 367f6556c
367f6556cfa7 kolla4echo.scm.penguincomputing.com:4000/kolla/centos-binary-heka:3.0.2 "kolla_start" 5 weeks ago Up 18 hours

The huge log file:
du -sh 367f6556cfa747271a655a4254f37f5ceba40f91014e4a3383433a23b31a66c2-json.log
63G 367f6556cfa747271a655a4254f37f5ceba40f91014e4a3383433a23b31a66c2-json.log
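
A quick way to spot such a file without walking the directories by hand is sketched below. It assumes the default json-file logging driver and the default /var/lib/docker/containers location. The name large_container_logs is a made-up helper, and the threshold is given in KiB.

```shell
# Hypothetical helper: list container json logs above a size threshold,
# largest first. Assumes the json-file log driver and default data root.
large_container_logs() {
    root="${1:-/var/lib/docker/containers}"
    min_kib="${2:-1048576}"   # default threshold: 1 GiB, expressed in KiB
    # du -k prints "<size in KiB> <path>"; sort -rn puts the biggest on top.
    find "$root" -type f -name '*-json.log' -size "+${min_kib}k" \
        -exec du -k {} + | sort -rn
}

# Usage (as root): large_container_logs
# The first 12 characters of the directory name are the short container ID
# shown by `docker ps -a`.
```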

The cause of the problem was that these two log files were not world-readable, so heka could not read them:
/var/lib/docker/volumes/kolla_logs/_data/neutron/dnsmasq.log and /var/lib/docker/volumes/kolla_logs/_data/mariadb/mariadb.log

-rw-r----- 1 polkitd systemd-bus-proxy 2031937 Mar 8 05:54 /var/lib/docker/volumes/kolla_logs/_data/mariadb/mariadb.log
-rw-r----- 1 nobody polkitd 183429 Mar 7 15:16 /var/lib/docker/volumes/kolla_logs/_data/neutron/dnsmasq.log
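
To check whether other logs in the volume have the same problem, something like the following could be used. It only looks at the "other" read bit, which matches the diagnosis above but ignores ACLs and group membership. The function name is illustrative.

```shell
# Hypothetical helper: print regular files that lack the world-read bit.
# Only the mode bits are inspected; ACL entries are not considered.
not_world_readable() {
    dir="${1:-/var/lib/docker/volumes/kolla_logs/_data}"
    find "$dir" -type f ! -perm -004
}

# Usage (as root): not_world_readable
```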

The following two entries were being appended to the json.log file at a rate of about 12 per second:

{"log":"2017/03/07 17:19:11 Input 'mariadb_logstreamer_input' error: open /var/log/kolla/mariadb/mariadb.log: permission denied\r\n","stream":"stdout","time":"2017-03-08T01:19:11.076243455Z"}
{"log":"2017/03/07 17:19:11 Input 'openstack_logstreamer_input' error: open /var/log/kolla/neutron/dnsmasq.log: permission denied\r\n","stream":"stdout","time":"2017-03-08T01:19:11.12596695Z"}
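
To see which files heka is failing on, and how often, the permission errors can be tallied straight from the json log. This is a rough sketch that assumes the message format shown above; denied_paths is a hypothetical name.

```shell
# Hypothetical helper: count "permission denied" errors per file path
# in a Docker json log, most frequent path first.
denied_paths() {
    # Keep only the path between "open " and ": permission denied",
    # then count occurrences per path.
    sed -n 's/.*error: open \(.*\): permission denied.*/\1/p' "$1" \
        | sort | uniq -c | sort -rn
}

# Usage: denied_paths <container-id>-json.log
```

The paths printed are the ones seen inside the container (/var/log/kolla/...), which in this report correspond to /var/lib/docker/volumes/kolla_logs/_data/... on the host.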

I was able to stop the logging with the following three commands:
setfacl -R -m g:1000:r /var/lib/docker/volumes/kolla_logs/_data/*
setfacl -R -m d:g:1000:r /var/lib/docker/volumes/kolla_logs/_data/*
docker restart heka

The first command sets an ACL that gives group 1000, which is the group that kolla had in the context of the heka container, read access to all logs. The second command sets the default ACL on all the log directories so that any newly created logs will inherit the read ACL. Maybe I should have set the default ACL one level up, on _data itself, so that any new log directories created would also inherit it.
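
The "one level up" variant could look roughly like the sketch below. It is untested against a kolla deployment. The function name and the DRY_RUN switch are invented for illustration, and it uses rX instead of r so that directories stay traversable for the group, which is a choice of this sketch rather than something from the commands above.

```shell
# Hypothetical wrapper around the two setfacl calls, applied to the
# volume root itself instead of _data/*.
# DRY_RUN defaults to 1, so the commands are only printed.
grant_group_read() {
    gid="$1"
    data_dir="$2"
    # First spec: access ACL on what exists now. Second spec: default ACL,
    # inherited by files and directories created later.
    # X grants execute only on directories (or files already executable).
    for spec in "g:${gid}:rX" "d:g:${gid}:rX"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "setfacl -R -m $spec $data_dir"
        else
            setfacl -R -m "$spec" "$data_dir"
        fi
    done
}

# Preview: grant_group_read 1000 /var/lib/docker/volumes/kolla_logs/_data
# Apply:   DRY_RUN=0 grant_group_read 1000 /var/lib/docker/volumes/kolla_logs/_data
```

As with the original commands, heka would still need a restart afterwards (docker restart heka).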

My opinion is that it would be better for kolla to use a single group ID that is the same across containers and also exists on the bare metal server, so that a group ACL, or just a plain POSIX group, would work more reliably.

James McEvoy (jmcevoy) wrote :

Those huge log files are written to the root filesystem of the server, so when the root filesystem eventually fills up, this will also crash the bare metal server.

There is another side effect: elastic search hogs the rabbitmq connection, spraying errors into its log file, again on the root filesystem, with the message:
{"log":"2017/03/08 12:29:18 Plugin 'elasticsearch_output' error: can't deliver matched message: Queue is full\r\n","stream":"stdout","time":"2017-03-08T20:29:18.914034353Z"}
{"log":"2017/03/08 12:29:18 Plugin 'elasticsearch_output' error: can't deliver matched message: Queue is full\r\n","stream":"stdout","time":"2017-03-08T20:29:18.914052368Z"}
but at a much faster rate of hundreds of log records per second...
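
Rates like this can be read off the json log by bucketing records on the "time" field, truncated to whole seconds. This is a small sketch that assumes the record layout shown above; records_per_second is a hypothetical name.

```shell
# Hypothetical helper: count json log records per second (UTC),
# based on the "time" field that Docker adds to every record.
records_per_second() {
    # Capture "YYYY-MM-DDTHH:MM:SS" and drop the fractional part,
    # then count how many records share each second.
    sed -n 's/.*"time":"\([0-9-]*T[0-9:]*\).*/\1/p' "$1" | sort | uniq -c
}

# Usage: records_per_second <container-id>-json.log | tail
```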

I assume this is from elastic search indexing the gigabytes of data generated by heka.

With the message bus overwhelmed by logging, the whole OpenStack cluster stops working... Kind of a major problem.

Eduardo Gonzalez (egonzalez90) wrote :

Heka is no longer used in kolla-ansible

Changed in kolla-ansible:
status: New → Invalid