logrotate_crond removes files and sends too many SIGHUPs

Bug #1776533 reported by Michele Baldessari
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Bogdan Dobrelya

Bug Description

So currently the logrotate_crond container has a few issues issues:
A) In the postrotate it matches pids multiple times and sends SIGHUPs multiple time to processes:
    ======== /var/log/messages =====
    Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
    Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
    Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
    Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
    Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
    Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd
...

B) The logrotate_crond container should not rotate logs for services running on the host outside of containers (i.e. rsyslog has its own /etc/logrotate.d/rsyslog rules). Doing so violates the principle of least surprise.

C) The find command matches older files to be deleted but the SIGHUP is never sent so we actually can end up in a situation where we remove a file but the new one never gets created because the service does not get a SIGHUP signal:

ls -la /var/log/containers/httpd/*/*
-rw-r--r--. 1 root root 52046652 May 29 14:10 /var/log/containers/httpd/aodh-api/aodh_wsgi_access.log.1
-rw-r--r--. 1 root root 0 May 24 19:14 /var/log/containers/httpd/aodh-api/aodh_wsgi_error.log
-rw-r--r--. 1 root root 5894 May 24 19:14 /var/log/containers/httpd/aodh-api/error_log
-rw-r--r--. 1 root root 50755274 May 29 14:10 /var/log/containers/httpd/cinder-api/cinder_wsgi_access.log.1
-rw-r--r--. 1 root root 4138 May 25 11:58 /var/log/containers/httpd/cinder-api/cinder_wsgi_error.log
-rw-r--r--. 1 root root 5894 May 24 19:13 /var/log/containers/httpd/cinder-api/error_log

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/574835

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)
Download full text (4.0 KiB)

Reviewed: https://review.openstack.org/574835
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=77d93f32873e740b0f04f323465ad14c044c896b
Submitter: Zuul
Branch: master

commit 77d93f32873e740b0f04f323465ad14c044c896b
Author: Michele Baldessari <email address hidden>
Date: Tue Jun 12 19:34:37 2018 +0200

    Fix logrotate_crond issues

    So currently the logrotate_crond container has a few issues issues:
    A) In the postrotate it matches pids multiple times and sends SIGHUPs multiple time to processes:
        ======== /var/log/messages =====
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd
    ...

    Adding sort -u in the pipeline of the postrotate script takes care of
    that.

    B) The logrotate_crond container should not rotate logs for services
    running on the host outside of containers (i.e. rsyslog has its own
    /etc/logrotate.d/rsyslog rules). Doing so violates the principle of
    least surprise.

    Using 'lsof ..+D /var/log/containers' takes care of this as we won't
    match any non containerized processes

    C) The find command matches older files to be deleted but the SIGHUP is
    never sent so we actually can end up in a situation where we remove a
    file but the new one never gets created because the service does not get
    a SIGHUP signal:

    ls -la /var/log/containers/httpd/*/*
    -rw-r--r--. 1 root root 52046652 May 29 14:10 /var/log/containers/httpd/aodh-api/aodh_wsgi_access.log.1
    -rw-r--r--. 1 root root 0 May 24 19:14 /var/log/containers/httpd/aodh-api/aodh_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:14 /var/log/containers/httpd/aodh-api/error_log
    -rw-r--r--. 1 root root 50755274 May 29 14:10 /var/log/containers/httpd/cinder-api/cinder_wsgi_access.log.1
    -rw-r--r--. 1 root root 4138 May 25 11:58 /var/log/containers/httpd/cinder-api/cinder_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:13 /var/log/containers/httpd/cinder-api/error_log

    Using 'lsof ..+D /var/log/containers' fixes this case as well because
    now we correctly match the processes that have a deleted file that is
    open and we send a proper SIGHUP to them.

    Tested by doing the following:
    1) Logging rotation of containerized services (B, C)
    1.1) Stopped the keystone container
    1.2) Made the /var/log/container/keystone/keystone.log file 21M large
    1.3) Started the keystone container and observed that it was logging
         correctly to /var/log/container/keystone/keystone.log
    1.4) Inside the logrotate_crond container we...

Read more...

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/575525

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/575526

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/queens)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/575525
Reason: we have gate problems again, please do not restore or recheck, I'll take care of this one when gate is back stable.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (stable/pike)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/575526
Reason: we have gate problems again, please do not restore or recheck, I'll take care of this one when gate is back stable.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/pike)
Download full text (4.2 KiB)

Reviewed: https://review.openstack.org/575526
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=3cd751de2fc533780c7218e2a923dd969724566c
Submitter: Zuul
Branch: stable/pike

commit 3cd751de2fc533780c7218e2a923dd969724566c
Author: Michele Baldessari <email address hidden>
Date: Tue Jun 12 19:34:37 2018 +0200

    Fix logrotate_crond issues

    So currently the logrotate_crond container has a few issues issues:
    A) In the postrotate it matches pids multiple times and sends SIGHUPs multiple time to processes:
        ======== /var/log/messages =====
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd
    ...

    Adding sort -u in the pipeline of the postrotate script takes care of
    that.

    B) The logrotate_crond container should not rotate logs for services
    running on the host outside of containers (i.e. rsyslog has its own
    /etc/logrotate.d/rsyslog rules). Doing so violates the principle of
    least surprise.

    Using 'lsof ..+D /var/log/containers' takes care of this as we won't
    match any non containerized processes

    C) The find command matches older files to be deleted but the SIGHUP is
    never sent so we actually can end up in a situation where we remove a
    file but the new one never gets created because the service does not get
    a SIGHUP signal:

    ls -la /var/log/containers/httpd/*/*
    -rw-r--r--. 1 root root 52046652 May 29 14:10 /var/log/containers/httpd/aodh-api/aodh_wsgi_access.log.1
    -rw-r--r--. 1 root root 0 May 24 19:14 /var/log/containers/httpd/aodh-api/aodh_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:14 /var/log/containers/httpd/aodh-api/error_log
    -rw-r--r--. 1 root root 50755274 May 29 14:10 /var/log/containers/httpd/cinder-api/cinder_wsgi_access.log.1
    -rw-r--r--. 1 root root 4138 May 25 11:58 /var/log/containers/httpd/cinder-api/cinder_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:13 /var/log/containers/httpd/cinder-api/error_log

    Using 'lsof ..+D /var/log/containers' fixes this case as well because
    now we correctly match the processes that have a deleted file that is
    open and we send a proper SIGHUP to them.

    Tested by doing the following:
    1) Logging rotation of containerized services (B, C)
    1.1) Stopped the keystone container
    1.2) Made the /var/log/container/keystone/keystone.log file 21M large
    1.3) Started the keystone container and observed that it was logging
         correctly to /var/log/container/keystone/keystone.log
    1.4) Inside the logrotate_crond contain...

Read more...

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)
Download full text (4.1 KiB)

Reviewed: https://review.openstack.org/575525
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=823540143f4074a1b7f8adef9626c25aad1cd9b3
Submitter: Zuul
Branch: stable/queens

commit 823540143f4074a1b7f8adef9626c25aad1cd9b3
Author: Michele Baldessari <email address hidden>
Date: Tue Jun 12 19:34:37 2018 +0200

    Fix logrotate_crond issues

    So currently the logrotate_crond container has a few issues issues:
    A) In the postrotate it matches pids multiple times and sends SIGHUPs multiple time to processes:
        ======== /var/log/messages =====
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd" swVersion="8.24.0" x-pid="1575" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
        Jun 3 09:01:15 overcloud-controller-0 logrotate-crond: kill -HUP 1575
        Jun 3 09:01:15 overcloud-controller-0 rsyslogd: [origin software="rsyslogd
    ...

    Adding sort -u in the pipeline of the postrotate script takes care of
    that.

    B) The logrotate_crond container should not rotate logs for services
    running on the host outside of containers (i.e. rsyslog has its own
    /etc/logrotate.d/rsyslog rules). Doing so violates the principle of
    least surprise.

    Using 'lsof ..+D /var/log/containers' takes care of this as we won't
    match any non containerized processes

    C) The find command matches older files to be deleted but the SIGHUP is
    never sent so we actually can end up in a situation where we remove a
    file but the new one never gets created because the service does not get
    a SIGHUP signal:

    ls -la /var/log/containers/httpd/*/*
    -rw-r--r--. 1 root root 52046652 May 29 14:10 /var/log/containers/httpd/aodh-api/aodh_wsgi_access.log.1
    -rw-r--r--. 1 root root 0 May 24 19:14 /var/log/containers/httpd/aodh-api/aodh_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:14 /var/log/containers/httpd/aodh-api/error_log
    -rw-r--r--. 1 root root 50755274 May 29 14:10 /var/log/containers/httpd/cinder-api/cinder_wsgi_access.log.1
    -rw-r--r--. 1 root root 4138 May 25 11:58 /var/log/containers/httpd/cinder-api/cinder_wsgi_error.log
    -rw-r--r--. 1 root root 5894 May 24 19:13 /var/log/containers/httpd/cinder-api/error_log

    Using 'lsof ..+D /var/log/containers' fixes this case as well because
    now we correctly match the processes that have a deleted file that is
    open and we send a proper SIGHUP to them.

    Tested by doing the following:
    1) Logging rotation of containerized services (B, C)
    1.1) Stopped the keystone container
    1.2) Made the /var/log/container/keystone/keystone.log file 21M large
    1.3) Started the keystone container and observed that it was logging
         correctly to /var/log/container/keystone/keystone.log
    1.4) Inside the logrotate_crond conta...

Read more...

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.4

This issue was fixed in the openstack/puppet-tripleo 8.3.4 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.4.14

This issue was fixed in the openstack/puppet-tripleo 7.4.14 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.2.0

This issue was fixed in the openstack/puppet-tripleo 9.2.0 release.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The change I5029a4b9c76268455812696290aaf82f1a0c2c23 had caused a regression so that
the filter is stopped working and matches nothing.

The postscript command is executed inside of the logrotate-crond container, like this way:

()[root@5c78303fb8c2 /]# /sbin/lsof -nPs +L1 +D /var/log/containers 2>&1 | awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print}'
neutron-s 2572 neutron 5w REG 252,1 234 0 8419523 /var/log/neutron/server.log.1 (deleted)
httpd 2862 root 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
glance-ap 3026 glance 3w REG 252,1 131 0 29510206 /var/log/glance/api.log.1 (deleted)
httpd 3192 keystone 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3193 keystone 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3194 keystone 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3194 keystone 11w REG 252,1 2853 0 96570982 /var/log/keystone/keystone.log.1 (deleted)
httpd 3195 keystone 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3195 keystone 11w REG 252,1 2853 0 96570982 /var/log/keystone/keystone.log.1 (deleted)
httpd 3197 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3198 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3199 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3200 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3201 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3202 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3203 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
httpd 3204 48 10w REG 252,1 3287 0 549003 /var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)
nova-api- 3810 nova 3w REG 252,1 129 0 26281737 /var/log/nova/nova-api-metadata.log.1 (deleted)
neutron-s 5129 neutron 5w REG 252,1 234 0 8419523 /var/log/neutron/server.log.1 (deleted)
neutron-s 5130 neutron 5w REG 252,1 234 0 8419523 /var/log/neutron/server.log.1 (deleted)
neutron-s 5131 neutron 5w REG 252,1 234 0 8419523 /var/log/neutron/server.log.1 (deleted)
neutron-s 5132 neutron 5w REG 252,1 234 0 8419523 /var/log/neutron/server.log.1 (deleted)

So you can see that the real path should be examined for deleted (open and unlined) logs is /var/log/ and not /var/log/containers. The latter is only used to apply the logrotation over the bind-mounted host path. It cannot affect any host logs outside of /var/log/containers, so the change I5029a4b9c76268455812696290aaf82f1a0c2c23 needs to be partially discarded.

Changed in tripleo:
status: Fix Released → In Progress
assignee: Michele Baldessari (michele) → Bogdan Dobrelya (bogdando)
milestone: rocky-3 → rocky-rc1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.openstack.org/591616

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The logrotation configuration for containerized services at best can be illustrated https://pastebin.com/axmvnJVp

The first container started as the service producing foo.log. The 2nd one is observer which filters by the "container view" of /var/log/foo/foo.log, but can apply actions, like logrotate, by the real "host path view" /var/log/containers/foo/foo.log

The fix just follows that path

Changed in tripleo:
milestone: rocky-rc1 → stein-1
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/591616
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=25060acf284ac24ee3b4d9fdfcfb81e8391d9355
Submitter: Zuul
Branch: master

commit 25060acf284ac24ee3b4d9fdfcfb81e8391d9355
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Aug 14 12:17:17 2018 +0200

    Fix postrotate filter and httpd reloading signal

    The change I5029a4b9c76268455812696290aaf82f1a0c2c23 had caused a
    regression so that the filter is stopped working and matches nothing.

    The postscript command is executed inside of the logrotate-crond
    container, like this way:

    ()[root@5c78303fb8c2 /]# /sbin/lsof -nPs +L1 +D /var/log/containers
    2>&1 | awk '/\S+\s+[0-9]+\s.*\/var\/log\/.*\(deleted\)/ {print}'

    neutron-s ... /var/log/neutron/server.log.1 (deleted)
    httpd .../var/log/httpd/keystone_wsgi_admin_access.log.1 (deleted)

    So you can see that the real path should be examined for deleted
    (open and unlined) logs is /var/log/ and not /var/log/containers. The
    latter is only used to apply the logrotation over the bind-mounted
    host path. It cannot affect any host logs outside of
    /var/log/containers, so the change
    I5029a4b9c76268455812696290aaf82f1a0c2c23 needs to be partially
    discarded.

    Additionally, send USR1 instead of HUP to reload httpd in containers
    gracefully. That part follows-up the reverted
    I15fa0eab1625ac63fd57b6a6d5cd22a6ac85f221 as we want to keep that
    change and it fits the subject bug scope as well.

    Change-Id: Ibb017463b0fbbccda035aeb1fff5f6998bbf2d1e
    Closes-Bug: #1776533
    Related-Bug: #1785659

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.3.0

This issue was fixed in the openstack/puppet-tripleo 9.3.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.