logrotate script for apache leads to restarting keystone service

Bug #1491576 reported by Leontiy Istomin on 2015-09-02
Affects              Importance   Assigned to
Mirantis OpenStack   Critical     Sergii Golovatiuk
7.0.x                Critical     Sergii Golovatiuk
8.0.x                Medium       MOS Keystone

Bug Description

During the boot_server_from_volume_and_delete Rally scenario we found that a volume can't be created. From rally.log:
http://paste.openstack.org/show/442202/
Cinder can't download the image from Glance. From cinder-volume.log:
http://paste.openstack.org/show/442066/
Determined the glance server from haproxy:
http://paste.openstack.org/show/442071/
Glance gets a 500 from Keystone when it tries to validate the token. From the glance log:
http://paste.openstack.org/show/442207/
Keystone is restarting at that moment. From the keystone log:
http://paste.openstack.org/show/442208/
It happens every day at ~07:35. From /var/log/apache2/error.log:
http://paste.openstack.org/show/442220/

It seems log rotation is the root cause of this behavior. /etc/logrotate.d/apache2:
http://paste.openstack.org/show/442323/
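For readers without access to the paste above, the stock Debian /etc/logrotate.d/apache2 looks roughly like this (an abridged sketch; exact directives and retention count may differ by release, but the postrotate block is the part that matters here):

```
/var/log/apache2/*.log {
        daily
        missingok
        compress
        delaycompress
        notifempty
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /etc/init.d/apache2 reload > /dev/null; \
                fi;
        endscript
}
```

Since every controller installs the identical file and cron.daily fires at the same time everywhere, the apache reload (and with it the keystone restart) hits all controllers in the same minute.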

If we really need to restart the keystone service, then we should at least not do it at the same time on all controller nodes.

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vlan,DVR,Ceph-all,Nova-debug,Nova-quotas,7.0-182
Controllers: 3, Computes: 180, Computes+Ceph: 20

api: '1.0'
astute_sha: e24ca066bf6160bc1e419aaa5d486cad1aaa937d
auth_required: true
build_id: 2015-08-17_03-04-59
build_number: '182'
feature_groups:
- mirantis
fuel-agent_sha: 57145b1d8804389304cd04322ba0fb3dc9d30327
fuel-library_sha: 9de2625d26c3b88d22082baecd789b6bd5ddf3fa
fuel-nailgun-agent_sha: e01693992d7a0304d926b922b43f3b747c35964c
fuel-ostf_sha: 17786b86b78e5b66d2b1c15500186648df10c63d
fuelmain_sha: d8c726645be087bc67e2eeca134f0f9747cfeacd
nailgun_sha: 4710801a2f4a6d61d652f8f1e64215d9dde37d2e
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 4c74a60aa60c06c136d9197c7d09fa4f8c8e2863
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-02_05-59-17.tar.xz

This is not a bug. If you are running keystone under apache, each apache restart will lead to a keystone restart. If you want to avoid that, either run keystone as a separate process, or modify the environment so the logrotate script is executed at a random time.

If you still think this error is purely an apache problem, please provide some arguments.

Changed in mos:
status: New → Incomplete
assignee: MOS Linux (mos-linux) → Leontiy Istomin (listomin)
Boris Bobrov (bbobrov) wrote :
Mike Scherbakov (mihgen) wrote :

There is enough data in the bug to consider it confirmed. The problem may not be in Apache, but it's clear that there is a defect somewhere which leads to unavailability of the control plane for some time (please clarify for how long?).

Do we clean up tokens in memcached when we do an apache reload? If so, then OpenStack services will suffer as well...

Please investigate whether it can be fixed by executing logrotate at a random time.

Changed in mos:
status: Incomplete → Confirmed
importance: Undecided → High
Vitaly Sedelnik (vsedelnik) wrote :

Per a conversation with Leontiy, this could be fixed by executing the logrotate script on a single controller at a time. Reassigning to the Fuel Library team.

Changed in mos:
assignee: Leontiy Istomin (listomin) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

This is the file the apache2 package ships. We should not fix this bad logrotate file with puppet:

# dpkg -L apache2 | grep logrotate
/etc/logrotate.d
/etc/logrotate.d/apache2

Changed in mos:
assignee: Fuel Library Team (fuel-library) → MOS Packaging Team (mos-packaging)
Bogdan Dobrelya (bogdando) wrote :

I believe the fix should be, instead of:

        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /etc/init.d/apache2 reload > /dev/null; \
                fi;

use:

        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /usr/sbin/apachectl graceful > /dev/null; \
                fi;

Although it looks like the reload is fine as well: the master process PID never changes, so there should be no keystone backend downtime.

Changed in mos:
assignee: MOS Packaging Team (mos-packaging) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

Returning this to the Fuel Library team, as it does not seem to be a logrotate packaging issue. Logrotate is not a distributed tool; it can't and shouldn't be made one, and I'm not sure we could fix this issue in a distributed manner with logrotate.

Changed in mos:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)

As Bogdan said, the only reasonable fix for now is using graceful instead of restart. We are already using copytruncate, so any log entries written during rotation will land in the appropriate file. We should also report this upstream.
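Putting the graceful reload together with copytruncate, the patched stanza would look roughly like this (a sketch assuming the stock file layout; note that with copytruncate the log file is truncated in place, so apache keeps writing to the same inode and the reload is only needed for the wsgi side):

```
/var/log/apache2/*.log {
        daily
        copytruncate
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /usr/sbin/apachectl graceful > /dev/null; \
                fi;
        endscript
}
```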

Alex Schultz (alex-schultz) wrote :

By the way, if you look at the init script, reload is the same as graceful:

  reload|force-reload|graceful)

Boris Bobrov (bbobrov) wrote :

There is an upstream issue about this: https://github.com/GrahamDumpleton/mod_wsgi/issues/81

I've contacted the reporter and am waiting for a reply. Maybe a workaround already exists.

Changed in mos:
status: Confirmed → Fix Committed

We applied a workaround which reduces the occurrence to almost zero.

First, logrotate runs on the controllers will be slightly staggered by adding a random delay on every controller, so it won't run at exactly the same time on all of them.
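A minimal sketch of that staggering, assuming bash; the 60-second step and the /etc/logrotate.d/httpd-prerotate/ path are taken from the verification output later in this thread, while the helper itself is illustrative:

```shell
#!/bin/bash
# Pick a per-controller random delay of 60-300 seconds in 60-second steps.
# Writing "sleep <delay>" into a logrotate prerotate hook means the
# controllers never reload apache at the same moment.
delay=$(( (RANDOM % 5 + 1) * 60 ))
hook_line="sleep ${delay}"
# In the real deployment this line would be written to
# /etc/logrotate.d/httpd-prerotate/apache2 on each node.
printf '%s\n' "$hook_line"
```

Because the delay is generated independently on each node at deployment time, the chance of two controllers sleeping for the same interval is small, and even a collision only affects two of the three controllers.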

However, when apache (and keystone under it) dies, there is still a side effect for a couple of connections/tokens: there is a small delay in `apache2ctl graceful || apache2ctl restart`, apache loses all its connections, and it takes time to start all the workers again.

I am assigning this bug to 8.0, to the keystone team, for further investigation of https://github.com/GrahamDumpleton/mod_wsgi/issues/81

The issue was reproduced again with the following scenario:

1. Deploy Neutron tun cluster with 1 controller
2. After deployment add 2 controllers and re-deploy
3. After deployment add 2 controllers and 1 compute and re-deploy

Actual result: after the third step, deployment failed on the very first controller (node-5).

Here apache mod_wsgi fails due to segfaults:

2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493726 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20360 exit signal Segmentation fault (11)

Logs are attached

Changed in mos:
status: Fix Committed → Confirmed
Changed in mos:
importance: High → Critical
Vladimir Kuklin (vkuklin) wrote :

I think the following patch helps: it is cherry-picked from http://koji.fedoraproject.org/koji/buildinfo?buildID=602307, which in turn cherry-picks https://github.com/GrahamDumpleton/mod_wsgi/commit/808e9667fdddad16f94927b9f8ad947d56ea0071.diff. We need another iteration of tests.

Alex Schultz (alex-schultz) wrote :

There is an update to the previously committed workaround for logrotate, as it seems it wasn't random enough: https://review.openstack.org/#/q/I76f106d85ac363708e9fe35b04e56b189de1407f,n,z

Vova, Alex, Igor, was anyone able to look at the core dump to figure out where it failed?

Vladimir Kuklin (vkuklin) wrote :

Davanum, it seems that mod_wsgi has a bug where it does not generate core dumps. We need to backport the patch https://github.com/GrahamDumpleton/mod_wsgi/commit/027e45804ed0af0dfe9b359225e0ee12aedb7c03 to obtain one.
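Independent of that patch, apache itself must also be allowed to write core files; a minimal config sketch (CoreDumpDirectory is a stock Apache directive, the directory path here is an assumption):

```
# /etc/apache2/apache2.conf fragment (sketch): let crashed workers dump core
# into a directory writable by the apache user.
CoreDumpDirectory /var/cache/apache2
```

The process core size limit must also be raised in apache's init environment (e.g. `ulimit -c unlimited` before the daemon starts), otherwise the kernel still suppresses the dump.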

Boris Bobrov (bbobrov) wrote :

On a call with Igor Marnat, Sergii Golovatiuk and Alexander Makarov, we decided to check whether the issue has anything to do with logrotate. I am checking whether a segfault happens under load when `apache2ctl graceful` is called, which is what logrotate does.

https://review.openstack.org/221327
https://review.openstack.org/221337

These implement the workaround: apache is restarted at a different time on every controller, so the 3 apache servers are never restarted simultaneously. It's a workaround that works. However, the RCA shows an issue in mod_wsgi's handling of the USR1 signal, which leads to the segfault.

Denis Meltsaykin (dmeltsaykin) wrote :

Replacing mpm_worker with mpm_prefork should also help.

Boris Bobrov (bbobrov) wrote :

Not a single segfault happened when apache2ctl restart was called every 10 minutes for 2.5 hours.

Artem Panchenko (apanchenko-8) wrote :

Verified on the 7.0 ISO #288. Steps:

1. deployed a cluster with 3 controllers
2. started the Rally task 'KeystoneBasic.create_delete_user'
3. scheduled log rotation at the same time on all controllers (crontab -e; add the line '15 9 * * * /etc/cron.daily/logrotate')
4. prepared the httpd access log for rotation: `mv /var/log/apache2/access.log{,.bak}; fallocate -l 500M /var/log/apache2/access.log`

Result: no errors in the Rally task (100% success) and the access log was successfully rotated. The sleep time before reloading the apache configuration differs between controllers:

[root@nailgun ~]# runc 'grep sleep /etc/logrotate.d/httpd-prerotate/apache2'
node-6.test.domain.local
sleep 60
node-8.test.domain.local
sleep 240
node-7.test.domain.local
sleep 60

Ivan Lozgachev (ilozgachev) wrote :

Verified on ENV-13, MOS 8.0 build 518.
