logrotate script for apache leads to restarting keystone service

Bug #1491576 reported by Leontiy Istomin on 2015-09-02
Affects              Importance   Assigned to
Mirantis OpenStack   Critical     Sergii Golovatiuk
7.0.x                Critical     Sergii Golovatiuk
8.0.x                Medium       MOS Keystone

Bug Description

During the boot_server_from_volume_and_delete Rally scenario we found that a volume can't be created. From rally.log:
http://paste.openstack.org/show/442202/
Cinder can't download the image from Glance. From cinder-volume.log:
http://paste.openstack.org/show/442066/
Determined the glance server from haproxy:
http://paste.openstack.org/show/442071/
Glance gets a 500 from Keystone when it tries to validate the token. From the glance log:
http://paste.openstack.org/show/442207/
Keystone is restarting at that moment. From the keystone log:
http://paste.openstack.org/show/442208/
It happens every day at ~07:35. From /var/log/apache2/error.log:
http://paste.openstack.org/show/442220/

It seems log rotation is the root cause of this behavior. /etc/logrotate.d/apache2:
http://paste.openstack.org/show/442323/
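For readers without access to the paste above, the stock Debian /etc/logrotate.d/apache2 looks roughly like this (an abridged sketch; exact directives and retention count may differ by release, but the postrotate block is the part that matters here):

```
/var/log/apache2/*.log {
        daily
        missingok
        compress
        delaycompress
        notifempty
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /etc/init.d/apache2 reload > /dev/null; \
                fi;
        endscript
}
```

Since every controller installs the identical file and cron.daily fires at the same time everywhere, the apache reload (and with it the keystone restart) hits all controllers in the same minute.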

If we really need to restart the keystone service, then we should at least not do it at the same time on all controller nodes.

Cluster configuration:
Baremetal,Ubuntu,IBP,HA,Neutron-vlan,DVR,Ceph-all,Nova-debug,Nova-quotas,7.0-182
Controllers: 3, Computes: 180, Computes+Ceph: 20

api: '1.0'
astute_sha: e24ca066bf6160bc1e419aaa5d486cad1aaa937d
auth_required: true
build_id: 2015-08-17_03-04-59
build_number: '182'
feature_groups:
- mirantis
fuel-agent_sha: 57145b1d8804389304cd04322ba0fb3dc9d30327
fuel-library_sha: 9de2625d26c3b88d22082baecd789b6bd5ddf3fa
fuel-nailgun-agent_sha: e01693992d7a0304d926b922b43f3b747c35964c
fuel-ostf_sha: 17786b86b78e5b66d2b1c15500186648df10c63d
fuelmain_sha: d8c726645be087bc67e2eeca134f0f9747cfeacd
nailgun_sha: 4710801a2f4a6d61d652f8f1e64215d9dde37d2e
openstack_version: 2015.1.0-7.0
production: docker
python-fuelclient_sha: 4c74a60aa60c06c136d9197c7d09fa4f8c8e2863
release: '7.0'

Diagnostic Snapshot: http://mos-scale-share.mirantis.com/fuel-snapshot-2015-09-02_05-59-17.tar.xz

This is not a bug. If you are running keystone under apache, each apache restart will lead to a keystone restart. If you want to avoid that, either run keystone as a separate process, or modify the environment so the logrotate script is executed at a random time.

If you still think this error is purely an apache problem, please provide some arguments.

Changed in mos:
status: New → Incomplete
assignee: MOS Linux (mos-linux) → Leontiy Istomin (listomin)
Boris Bobrov (bbobrov) wrote :
Mike Scherbakov (mihgen) wrote :

There is enough data in the bug to consider it confirmed. The problem may not be in Apache, but it's clear that there is a defect somewhere which leads to unavailability of the control plane for some time (please clarify for how long?).

Do we clean up tokens in memcached when we do an apache reload? If so, then OpenStack services will suffer as well...

Please investigate whether it can be fixed by executing logrotate at a random time.

Changed in mos:
status: Incomplete → Confirmed
importance: Undecided → High
Vitaly Sedelnik (vsedelnik) wrote :

Per a conversation with Leontiy, this could be fixed by executing the logrotate script on a single controller at a time. Reassigning to the Fuel Library team.

Changed in mos:
assignee: Leontiy Istomin (listomin) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

This is the file the apache2 package ships. We should not fix this bad logrotate file with puppet:

# dpkg -L apache2 | grep logrotate
/etc/logrotate.d
/etc/logrotate.d/apache2

Changed in mos:
assignee: Fuel Library Team (fuel-library) → MOS Packaging Team (mos-packaging)
Bogdan Dobrelya (bogdando) wrote :

I believe the fix should be, instead of:

        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /etc/init.d/apache2 reload > /dev/null; \
                fi;

use:

        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /usr/sbin/apachectl graceful > /dev/null; \
                fi;

Although it looks like the reload is fine as well: the master process PID never changes, so there should be no keystone backend downtime.

Changed in mos:
assignee: MOS Packaging Team (mos-packaging) → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

Returning this to the Fuel Library team, as it does not seem to be a logrotate packaging issue. Logrotate is not a distributed tool; it can't and shouldn't be made one, and I'm not sure we could fix this issue in a distributed manner with logrotate.

Changed in mos:
assignee: Fuel Library Team (fuel-library) → Sergii Golovatiuk (sgolovatiuk)

As Bogdan said, the only reasonable fix for now is using graceful instead of restart. We are already using copytruncate, so any log entries written during rotation will land in the appropriate file. We should also report this upstream.
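Putting the graceful reload together with copytruncate, the patched stanza would look roughly like this (a sketch assuming the stock file layout; note that with copytruncate the log file is truncated in place, so apache keeps writing to the same inode and the reload is only needed for the wsgi side):

```
/var/log/apache2/*.log {
        daily
        copytruncate
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null ; then \
                    /usr/sbin/apachectl graceful > /dev/null; \
                fi;
        endscript
}
```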

Alex Schultz (alex-schultz) wrote :

By the way, if you look at the init script, reload is the same as graceful:

  reload|force-reload|graceful)

Boris Bobrov (bbobrov) wrote :

There is an upstream issue about this: https://github.com/GrahamDumpleton/mod_wsgi/issues/81

I've contacted the reporter and am waiting for a reply. Maybe a workaround already exists.

Changed in mos:
status: Confirmed → Fix Committed

We applied a workaround which reduces the occurrence to almost zero.

First, logrotate runs on the controllers will be slightly staggered by adding a random delay on every controller, so it won't run at exactly the same time on all of them.
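A minimal sketch of that staggering, assuming bash; the 60-second step and the /etc/logrotate.d/httpd-prerotate/ path are taken from the verification output later in this thread, while the helper itself is illustrative:

```shell
#!/bin/bash
# Pick a per-controller random delay of 60-300 seconds in 60-second steps.
# Writing "sleep <delay>" into a logrotate prerotate hook means the
# controllers never reload apache at the same moment.
delay=$(( (RANDOM % 5 + 1) * 60 ))
hook_line="sleep ${delay}"
# In the real deployment this line would be written to
# /etc/logrotate.d/httpd-prerotate/apache2 on each node.
printf '%s\n' "$hook_line"
```

Because the delay is generated independently on each node at deployment time, the chance of two controllers sleeping for the same interval is small, and even a collision only affects two of the three controllers.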

However, when apache (and keystone under it) dies, there is still a side effect for a couple of connections/tokens: there is a small delay in `apache2ctl graceful || apache2ctl restart`, apache loses all its connections, and it takes time to start all the workers again.

I am assigning this bug to 8.0, to the keystone team, for further investigation of https://github.com/GrahamDumpleton/mod_wsgi/issues/81

The issue was reproduced again with the following scenario:

1. Deploy Neutron tun cluster with 1 controller
2. After deployment add 2 controllers and re-deploy
3. After deployment add 2 controllers and 1 compute and re-deploy

Actual result: after the third step, deployment failed on the very first controller (node-5).

Here apache mod_wsgi fails due to segfaults:

2015-09-07T18:05:20.858706+00:00 err: [Mon Sep 07 18:05:14.493726 2015] [core:notice] [pid 4163:tid 140140617865088] AH00052: child pid 20360 exit signal Segmentation fault (11)

Logs are attached

Changed in mos:
status: Fix Committed → Confirmed
Changed in mos:
importance: High → Critical
Vladimir Kuklin (vkuklin) wrote :

I think the following patch helps: it is cherry-picked from http://koji.fedoraproject.org/koji/buildinfo?buildID=602307, which in turn cherry-picks https://github.com/GrahamDumpleton/mod_wsgi/commit/808e9667fdddad16f94927b9f8ad947d56ea0071.diff. We need another iteration of tests.

Alex Schultz (alex-schultz) wrote :

There is an update to the previously committed workaround for logrotate, as it seems it wasn't random enough: https://review.openstack.org/#/q/I76f106d85ac363708e9fe35b04e56b189de1407f,n,z

Vova, Alex, Igor, was anyone able to look at the core dump to figure out where it failed?

Vladimir Kuklin (vkuklin) wrote :

Davanum, it seems that mod_wsgi has a bug where it does not generate core dumps. We need to backport the patch https://github.com/GrahamDumpleton/mod_wsgi/commit/027e45804ed0af0dfe9b359225e0ee12aedb7c03 to obtain one.
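Independent of that patch, apache itself must also be allowed to write core files; a minimal config sketch (CoreDumpDirectory is a stock Apache directive, the directory path here is an assumption):

```
# /etc/apache2/apache2.conf fragment (sketch): let crashed workers dump core
# into a directory writable by the apache user.
CoreDumpDirectory /var/cache/apache2
```

The process core size limit must also be raised in apache's init environment (e.g. `ulimit -c unlimited` before the daemon starts), otherwise the kernel still suppresses the dump.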

Boris Bobrov (bbobrov) wrote :

On a call with Igor Marnat, Sergii Golovatiuk and Alexander Makarov, we decided to check whether the issue has anything to do with logrotate. I am checking whether a segfault happens under load when `apache2ctl graceful` is called, which is what logrotate does.

https://review.openstack.org/221327
https://review.openstack.org/221337

These implement the workaround: apache is restarted at a different time on every controller, so the 3 apache servers are never restarted simultaneously. It's a workaround that works. However, the RCA shows an issue in mod_wsgi's handling of the USR1 signal, which leads to the segfault.

Denis Meltsaykin (dmeltsaykin) wrote :

Replacing mpm_worker with mpm_prefork should also help.

Boris Bobrov (bbobrov) wrote :

Not a single segfault happened when apache2ctl restart was called every 10 minutes for 2.5 hours.

Artem Panchenko (apanchenko-8) wrote :

Verified on the 7.0 ISO #288. Steps:

1. deployed a cluster with 3 controllers
2. started the Rally task 'KeystoneBasic.create_delete_user'
3. scheduled log rotation at the same time on all controllers (crontab -e; add the line '15 9 * * * /etc/cron.daily/logrotate')
4. prepared the httpd access log for rotation: `mv /var/log/apache2/access.log{,.bak}; fallocate -l 500M /var/log/apache2/access.log`

Result: no errors in the Rally task (100% success) and the access log was successfully rotated. The sleep time before reloading the apache configuration differs between controllers:

[root@nailgun ~]# runc 'grep sleep /etc/logrotate.d/httpd-prerotate/apache2'
node-6.test.domain.local
sleep 60
node-8.test.domain.local
sleep 240
node-7.test.domain.local
sleep 60

Ivan Lozgachev (ilozgachev) wrote :

Verified on ENV-13, MOS 8.0 build 518.
