RabbitMQ OCF timeout should be used without 'su' childs

Bug #1446241 reported by Bogdan Dobrelya
This bug affects 2 people
Affects             Status         Importance  Assigned to             Milestone
Fuel for OpenStack  Fix Committed  Critical    Alexander Nevenchannyy
5.1.x               Fix Committed  Critical    Bogdan Dobrelya
6.0.x               Fix Committed  Critical    Bogdan Dobrelya

Bug Description

This issue was discovered at the scale lab, when rabbit nodes were running under load.

The timeout command wraps rabbitmqctl stop, start and wait, and rabbitmqctl in turn uses 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. When the timeout expires, it kills only the child processes in the original process group, leaving the commands in the new process group running as orphans.

Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/

Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
 root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
 root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
 rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000

Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
 root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
 root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
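
The two cases above need root and a rabbitmq account. A rough non-root variant is sketched below, assuming GNU timeout and the util-linux setsid command are available. Here setsid is only a stand-in for the session change that 'su' causes in case a); the helper name survives_timeout and the 1-second and 30-second durations are made up for illustration.

```sh
#!/bin/sh
# Sketch: reproduce the orphaning without root. 'setsid' is an assumption
# standing in for the session/process-group change that 'su' causes.

# survives_timeout LAUNCHER
# Starts "LAUNCHER sh -c '... sleep 30'" as a grandchild of a 1-second
# timeout and returns 0 if that sleep still exists after the timeout fired.
survives_timeout() {
    launcher=$1
    pidfile=$(mktemp)
    # The trailing '; true' keeps the outer sh from exec'ing the command
    # directly, so the sleep is a grandchild of timeout, as in the report.
    timeout -s TERM 1 sh -c "$launcher sh -c 'echo \$\$ > $pidfile; exec sleep 30'; true"
    sleep 1                      # give the signal time to be delivered
    pid=$(cat "$pidfile")
    rm -f "$pidfile"
    if kill -0 "$pid" 2>/dev/null; then
        kill "$pid"              # clean up the orphan
        return 0
    fi
    return 1
}

# First run without a launcher (like case b), then with setsid (like case a).
for launcher in "" "setsid"; do
    if survives_timeout "$launcher"; then
        echo "launcher='$launcher': sleep outlived the timeout"
    else
        echo "launcher='$launcher': sleep was killed with the group"
    fi
done
```

The check uses kill -0, so on a system where PID 1 does not reap zombies a killed sleep may still be reported as present.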

The solution is to issue all timeout-wrapped rabbitmqctl commands as the
rabbitmq user, so that rabbitmqctl does not have to use 'su'.
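
One possible shape of that approach is sketched below. It is not the merged patch: the helper name run_as_rabbit and the RABBIT_USER variable are hypothetical. The idea is to put 'su' outside of timeout, so that timeout and everything it starts already run as the rabbitmq user and stay in one process group.

```sh
#!/bin/sh
# Sketch only: run_as_rabbit and RABBIT_USER are hypothetical names.
RABBIT_USER=${RABBIT_USER:-rabbitmq}

# run_as_rabbit SECONDS COMMAND [ARGS...]
# Runs COMMAND under timeout as $RABBIT_USER. When a user switch is needed,
# 'su' wraps timeout rather than the other way round.
run_as_rabbit() {
    secs=$1
    shift
    if [ "$(id -un)" = "$RABBIT_USER" ]; then
        # Already the right user: no 'su' involved at all.
        timeout -s TERM "$secs" "$@"
    else
        # Needs root. "$*" flattens the arguments into one string, so
        # arguments containing spaces are not preserved in this sketch.
        su -s /bin/sh -c "timeout -s TERM $secs $*" "$RABBIT_USER"
    fi
}
```

With this helper, a call such as run_as_rabbit 60 rabbitmqctl stop would be issued by root, and rabbitmqctl itself would already be running as the rabbitmq user.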

This issue appears only when the timeout specified for the stop or wait commands is exceeded. That is a common case under load, hence the impact is critical.

Tags: scale
Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
assignee: nobody → Fuel Library Team (fuel-library)
status: New → Confirmed
description: updated
Bogdan Dobrelya (bogdando) wrote :

I believe the proper fix is to file a bug against timeout, since it should be able to kill the whole process tree, and to patch the timeout package internally for Fuel so that we do not have to wait for the upstream fix.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Linux (mos-linux)
status: Confirmed → New
Changed in fuel:
assignee: MOS Linux (mos-linux) → Bogdan Dobrelya (bogdando)
status: New → In Progress
Bogdan Dobrelya (bogdando) wrote :

The timeout works as it should: it sends the kill signal to the process group. The issue is that we use timeout with the OCF_RUN wrapper, which uses 'su'. But su changes the process group, so we end up with orphaned processes after timeout kills the original group.

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/175460

description: updated
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/175460
Reason: The issue is not related to ocf_run

Dina Belova (dbelova)
tags: added: scale
description: updated
summary: - RabbitMQ OCF timeout does not kill child processes
+ RabbitMQ OCF timeout should be used without 'su' childs
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Sergii Golovatiuk (sgolovatiuk)
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Bogdan Dobrelya (bogdando)
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Alexander Nevenchannyy (anevenchannyy)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/175460
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=b725527d92c31373c84cd780dcf1ad10933f4955
Submitter: Jenkins
Branch: master

commit b725527d92c31373c84cd780dcf1ad10933f4955
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    Without this fix, the timeout command is used with the rabbitmqctl
    wrapper, which uses 'su' when invoked by a user other than rabbitmq.
    This is an issue because 'su' changes the original
    process group. If the timeout command expires, it kills
    only the processes in the original process group, leaving orphaned
    commands that no longer belong to this group.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script.

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/179747

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/179748

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.0)

Reviewed: https://review.openstack.org/179747
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=85fdf34ec59cc1a7000f98449ad26b2925491b74
Submitter: Jenkins
Branch: stable/6.0

commit 85fdf34ec59cc1a7000f98449ad26b2925491b74
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    Without this fix, the timeout command is used with the rabbitmqctl
    wrapper, which uses 'su' when invoked by a user other than rabbitmq.
    This is an issue because 'su' changes the original
    process group. If the timeout command expires, it kills
    only the processes in the original process group, leaving orphaned
    commands that no longer belong to this group.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script.

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit b725527d92c31373c84cd780dcf1ad10933f4955)

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/179748
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=6826b0cd7c9e48888f92b8bcac94a733ff609f29
Submitter: Jenkins
Branch: stable/5.1

commit 6826b0cd7c9e48888f92b8bcac94a733ff609f29
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:42:41 2015 +0200

    Fix RabbitMQ ocf_run with the timeout command.

    Without this fix, the timeout command is used with the rabbitmqctl
    wrapper, which uses 'su' when invoked by a user other than rabbitmq.
    This is an issue because 'su' changes the original
    process group. If the timeout command expires, it kills
    only the processes in the original process group, leaving orphaned
    commands that no longer belong to this group.

    The solution is to issue all timeout wrapped rabbitmqctl commands
    as the rabbitmq user in the OCF script.

    Closes-bug: #1446241

    Change-Id: I139255237fd34b555f248cb826deb13b7e134e8d
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit b725527d92c31373c84cd780dcf1ad10933f4955)
