RabbitMQ OCF timeout should be used without 'su' childs

Bug #1446241 reported by Bogdan Dobrelya on 2015-04-20
This bug affects 2 people
Affects               Importance  Assigned to
Fuel for OpenStack    Critical    Alexander Nevenchannyy
5.1.x                 Critical    Bogdan Dobrelya
6.0.x                 Critical    Bogdan Dobrelya

Bug Description

This issue was discovered at the scale lab, when rabbit nodes were running under load.

The timeout command wraps the rabbitmqctl stop, start and wait calls, and rabbitmqctl in turn uses 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' moves its child out of the original process group. When the timeout expires, only the processes still in the original process group are killed, and the commands in the new process group keep running as orphans.

Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/

Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
 root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
 root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
 rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000

Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
 root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
 root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)

The solution is to issue all timeout-wrapped rabbitmqctl commands as the
rabbitmq user, so that rabbitmqctl does not have to use 'su'.

This issue appears only when the timeout specified for the stop or wait commands is exceeded. That is a common case under load, hence the critical impact.
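The proposed solution can be sketched as a small shell helper. This is only an illustration under stated assumptions, not the code from the Fuel OCF script: the function name run_as_rabbit and the RABBIT_USER variable are hypothetical, the account name rabbitmq is assumed, and GNU coreutils timeout is assumed to be installed. The point it shows is the nesting order: 'su' runs on the outside and timeout on the inside, so the supervised command never has to switch users itself.

```shell
# Hypothetical helper, not taken from the Fuel OCF script.
# Usage: run_as_rabbit SECONDS 'command string'
run_as_rabbit() {
    limit="$1"
    cmd="$2"
    user="${RABBIT_USER:-rabbitmq}"   # assumed account name

    if [ "$(id -un)" = "$user" ]; then
        # Already the target user: no 'su' is needed at all.
        sh -c "timeout -s TERM $limit $cmd"
    else
        # 'su' is the outer process; timeout and the command it
        # supervises are both started underneath it.
        su -s /bin/sh -c "timeout -s TERM $limit $cmd" "$user"
    fi
}

# Example (needs root and an existing rabbitmq account):
#   run_as_rabbit 60 'rabbitmqctl stop'
```

With GNU timeout, the helper returns exit status 124 when the limit is exceeded, which a caller can use to tell a timeout apart from an ordinary failure of the command.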

Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
description: updated
Bogdan Dobrelya (bogdando) wrote :

I believe the proper fix is to file a bug against timeout (it should be able to kill the whole process tree) and to patch the timeout package internally for Fuel, so that we do not have to wait for the upstream fix.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Linux (mos-linux)
status: Confirmed → New
Changed in fuel:
assignee: MOS Linux (mos-linux) → Bogdan Dobrelya (bogdando)
status: New → In Progress
Bogdan Dobrelya (bogdando) wrote :

The timeout works as it should: it sends the kill signal to the process group. The issue is that we use timeout with the OCF_RUN wrapper, which uses 'su'. But 'su' changes the process group, hence we end up with orphaned processes after timeout kills the original group.
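One way to check the process-group explanation on a Linux host is to compare the process group IDs of the processes involved. The following is a minimal sketch, assuming /proc is mounted; pgid_of and same_group are hypothetical helper names introduced here for illustration.

```shell
# Hypothetical helpers for inspecting process groups on Linux.

# Print the process group ID (PGID) of a PID. In /proc/PID/stat the
# fields after the parenthesised command name are: state, ppid, pgrp.
pgid_of() {
    sed 's/^.*) //' "/proc/$1/stat" | cut -d' ' -f3
}

# Succeed when both PIDs belong to the same process group.
same_group() {
    [ "$(pgid_of "$1")" = "$(pgid_of "$2")" ]
}

# Example, with the PIDs of timeout and sleep taken from 'ps auxf':
#   same_group "$timeout_pid" "$sleep_pid" || echo "sleep left the group"
```

If same_group fails for the timeout process and a command running under 'su', that command would not receive the signal timeout sends to its own group.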

description: updated

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/175460
Reason: The issue is not related to ocf_run

Dina Belova (dbelova) on 2015-04-21
tags: added: scale
description: updated
summary: - RabbitMQ OCF timeout does not kill child processes
+ RabbitMQ OCF timeout should be used without 'su' childs
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Alexander Nevenchannyy (anevenchannyy)

Reviewed: https://review.openstack.org/175460
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b725527d92c31373c84cd780dcf1ad10933f4955
Submitter: Jenkins
Branch: master

commit b725527d92c31373c84cd780dcf1ad10933f4955
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    W/o this fix, the timeout command is used with rabbitmqctl
    wrapper, which uses a 'su' when invoked as not a rabbitmq user.
    This is an issue as the 'su' changes the original
    process group. And if the timeout command expired, it would kill
    only processes in this original process group, leaving orphaned
    commands that do not belong to this group anymore.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script.

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/179747
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=85fdf34ec59cc1a7000f98449ad26b2925491b74
Submitter: Jenkins
Branch: stable/6.0

commit 85fdf34ec59cc1a7000f98449ad26b2925491b74
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    W/o this fix, the timeout command is used with rabbitmqctl
    wrapper, which uses a 'su' when invoked as not a rabbitmq user.
    This is an issue as the 'su' changes the original
    process group. And if the timeout command expired, it would kill
    only processes in this original process group, leaving orphaned
    commands that do not belong to this group anymore.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script.

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit b725527d92c31373c84cd780dcf1ad10933f4955)

Reviewed: https://review.openstack.org/179748
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=6826b0cd7c9e48888f92b8bcac94a733ff609f29
Submitter: Jenkins
Branch: stable/5.1

commit 6826b0cd7c9e48888f92b8bcac94a733ff609f29
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    W/o this fix, the timeout command is used with rabbitmqctl
    wrapper, which uses a 'su' when invoked as not a rabbitmq user.
    This is an issue as the 'su' changes the original
    process group. And if the timeout command expired, it would kill
    only processes in this original process group, leaving orphaned
    commands that do not belong to this group anymore.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit b725527d92c31373c84cd780dcf1ad10933f4955)
