2015-04-20 13:51:03 |
Bogdan Dobrelya |
bug |
|
|
added bug |
2015-04-20 13:51:14 |
Bogdan Dobrelya |
nominated for series |
|
fuel/5.1.x |
|
2015-04-20 13:51:14 |
Bogdan Dobrelya |
bug task added |
|
fuel/5.1.x |
|
2015-04-20 13:51:14 |
Bogdan Dobrelya |
nominated for series |
|
fuel/6.0.x |
|
2015-04-20 13:51:14 |
Bogdan Dobrelya |
bug task added |
|
fuel/6.0.x |
|
2015-04-20 13:51:42 |
Bogdan Dobrelya |
fuel: milestone |
|
6.1 |
|
2015-04-20 13:51:44 |
Bogdan Dobrelya |
fuel: importance |
Undecided |
Critical |
|
2015-04-20 13:51:50 |
Bogdan Dobrelya |
fuel: assignee |
|
Fuel Library Team (fuel-library) |
|
2015-04-20 13:51:55 |
Bogdan Dobrelya |
fuel: status |
New |
Confirmed |
|
2015-04-20 13:52:06 |
Bogdan Dobrelya |
fuel/5.1.x: milestone |
|
5.1.2 |
|
2015-04-20 13:52:09 |
Bogdan Dobrelya |
fuel/6.0.x: milestone |
|
6.0.1 |
|
2015-04-20 13:52:16 |
Bogdan Dobrelya |
fuel/5.1.x: assignee |
|
Fuel Library Team (fuel-library) |
|
2015-04-20 13:52:22 |
Bogdan Dobrelya |
fuel/6.0.x: assignee |
|
Fuel Library Team (fuel-library) |
|
2015-04-20 13:52:26 |
Bogdan Dobrelya |
fuel/5.1.x: importance |
Undecided |
Critical |
|
2015-04-20 13:52:30 |
Bogdan Dobrelya |
fuel/6.0.x: importance |
Undecided |
Critical |
|
2015-04-20 13:52:34 |
Bogdan Dobrelya |
fuel/5.1.x: status |
New |
Confirmed |
|
2015-04-20 13:52:36 |
Bogdan Dobrelya |
fuel/6.0.x: status |
New |
Confirmed |
|
2015-04-20 13:54:09 |
Bogdan Dobrelya |
description |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout should kill all child processes when it expires.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
These issues may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout should kill all child processes when it expires. Otherwise orphaned commands, such as start or stop, may hang forever, detached from the main process after it has been killed.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
These issues may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
|
2015-04-20 14:15:58 |
Bogdan Dobrelya |
fuel: assignee |
Fuel Library Team (fuel-library) |
MOS Linux (mos-linux) |
|
2015-04-20 14:16:05 |
Bogdan Dobrelya |
fuel: status |
Confirmed |
New |
|
2015-04-20 14:18:43 |
Bogdan Dobrelya |
bug |
|
|
added subscriber Vladimir Kuklin |
2015-04-20 14:18:51 |
Bogdan Dobrelya |
bug |
|
|
added subscriber Sergii Golovatiuk |
2015-04-20 14:49:53 |
Bogdan Dobrelya |
fuel: assignee |
MOS Linux (mos-linux) |
Bogdan Dobrelya (bogdando) |
|
2015-04-20 14:49:56 |
Bogdan Dobrelya |
fuel: status |
New |
In Progress |
|
2015-04-20 14:58:23 |
Bogdan Dobrelya |
fuel/5.1.x: status |
Confirmed |
Triaged |
|
2015-04-20 14:58:25 |
Bogdan Dobrelya |
fuel/6.0.x: status |
Confirmed |
Triaged |
|
2015-04-20 15:03:52 |
Bogdan Dobrelya |
description |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout should kill all child processes when it expires. Otherwise orphaned commands, such as start or stop, may hang forever, detached from the main process after it has been killed.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
These issues may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout is used with the ocf_run wrapper, which uses 'su'. The 'su' changes the original process group, so when the timeout kills all child processes in the original process group, orphaned commands are left behind.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
This issue may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
|
2015-04-20 21:39:00 |
Bogdan Dobrelya |
description |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout is used with the ocf_run wrapper, which uses 'su'. The 'su' changes the original process group, so when the timeout kills all child processes in the original process group, orphaned commands are left behind.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
This issue may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout is used for rabbitmq-server stop, start and wait, which use 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group, so when the timeout kills all child processes in the original process group, orphaned commands are left behind.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
This issue may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
|
2015-04-20 21:39:07 |
Bogdan Dobrelya |
fuel/5.1.x: status |
Triaged |
Confirmed |
|
2015-04-20 21:39:10 |
Bogdan Dobrelya |
fuel/6.0.x: status |
Triaged |
Confirmed |
|
2015-04-21 08:24:22 |
Dina Belova |
tags |
|
scale |
|
2015-04-21 08:28:08 |
Bogdan Dobrelya |
description |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout is used for rabbitmq-server stop, start and wait, which use 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group, so when the timeout kills all child processes in the original process group, orphaned commands are left behind.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
This issue may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
This issue was discovered at the scale lab, when rabbit nodes were running under load.
Timeout is used for rabbitmqctl stop, start and wait, which use 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. If the timeout expires, it kills only the child processes in the original process group, leaving the commands in the new process group running orphaned.
Here is an example flow (from atop binary logs):
http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/
Here is how to test it:
Case a) The 'sleep' should detach to init and run orphaned:
# timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' &
# ps auxf
root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"
root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000
rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000
(killed)
# ps aux
rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000
Case b) The 'sleep' should terminate as well:
# timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' &
# ps auxf
root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"
root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000
root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000
(killed)
# ps aux
(now is OK!)
The solution is to issue all timeout-wrapped rabbitmqctl commands as the
rabbitmq user, so that rabbitmqctl does not have to use 'su'.
This issue may appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, hence the critical impact. |
|
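The two test cases in the description above can be reproduced without root or a rabbitmq user. The following is only a rough sketch: it assumes GNU coreutils 'timeout', util-linux 'setsid' and a Linux /proc, and it uses 'setsid' as a stand-in for the session change that 'su' performs in the report. The helper is_running, the temporary PID files and the shortened durations are all illustrative and are not part of the OCF script.

```shell
#!/bin/sh
# Sketch only: 'setsid' is an assumed stand-in for the session change done by 'su'.
# Assumes GNU coreutils 'timeout', util-linux 'setsid' and a Linux /proc.

# Succeeds if the PID exists and is not a zombie (hypothetical helper).
is_running() {
    state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status" 2>/dev/null)
    case "$state" in
        ""|Z) return 1 ;;
        *) return 0 ;;
    esac
}

pidfile_a=$(mktemp)
pidfile_b=$(mktemp)

# Case a) the inner command moves to a new session before sleeping.
# 'timeout' exits with 124 on expiry, hence the '|| true'.
timeout -s TERM 1 sh -c "setsid sh -c 'echo \$\$ > $pidfile_a; exec sleep 30'; :" || true

# Case b) the inner command stays in the process group that 'timeout' signals.
timeout -s TERM 1 sh -c "sh -c 'echo \$\$ > $pidfile_b; exec sleep 30'; :" || true

sleep 1  # give the signalled processes a moment to exit

pid_a=$(cat "$pidfile_a")
pid_b=$(cat "$pidfile_b")

if is_running "$pid_a"; then survived_a=yes; else survived_a=no; fi
if is_running "$pid_b"; then survived_b=yes; else survived_b=no; fi

echo "case a (new session): sleep survived = $survived_a"
echo "case b (same group):  sleep survived = $survived_b"

# Clean up the orphan left by case a, if any.
kill "$pid_a" 2>/dev/null || true
rm -f "$pidfile_a" "$pidfile_b"
```

If the behaviour matches the report, the 'sleep' from case a is still running after the timeout fires, while the one from case b is gone. Running the wrapped command as the target user from the start, as the proposed solution suggests, avoids the session change inside the timeout and so keeps every child within reach of the signal.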
2015-04-21 08:28:12 |
Bogdan Dobrelya |
fuel/5.1.x: status |
Confirmed |
Triaged |
|
2015-04-21 08:28:14 |
Bogdan Dobrelya |
fuel/6.0.x: status |
Confirmed |
Triaged |
|
2015-04-21 09:48:15 |
Bogdan Dobrelya |
summary |
RabbitMQ OCF timeout does not kill child processes |
RabbitMQ OCF timeout should be used without 'su' childs |
|
2015-04-21 16:39:38 |
OpenStack Infra |
fuel: assignee |
Bogdan Dobrelya (bogdando) |
Sergii Golovatiuk (sgolovatiuk) |
|
2015-04-22 12:19:26 |
OpenStack Infra |
fuel: assignee |
Sergii Golovatiuk (sgolovatiuk) |
Bogdan Dobrelya (bogdando) |
|
2015-04-22 17:43:11 |
OpenStack Infra |
fuel: assignee |
Bogdan Dobrelya (bogdando) |
Alexander Nevenchannyy (anevenchannyy) |
|
2015-04-22 21:07:19 |
OpenStack Infra |
fuel: status |
In Progress |
Fix Committed |
|
2015-05-04 10:24:35 |
Bogdan Dobrelya |
fuel/5.1.x: assignee |
Fuel Library Team (fuel-library) |
Bogdan Dobrelya (bogdando) |
|
2015-05-04 10:24:39 |
Bogdan Dobrelya |
fuel/6.0.x: assignee |
Fuel Library Team (fuel-library) |
Bogdan Dobrelya (bogdando) |
|
2015-05-04 10:24:43 |
Bogdan Dobrelya |
fuel/6.0.x: status |
Triaged |
In Progress |
|
2015-05-04 10:24:47 |
Bogdan Dobrelya |
fuel/5.1.x: status |
Triaged |
In Progress |
|
2015-05-08 09:23:02 |
Bogdan Dobrelya |
fuel/6.0.x: status |
In Progress |
Fix Committed |
|
2015-05-08 09:23:06 |
Bogdan Dobrelya |
fuel/5.1.x: status |
In Progress |
Fix Committed |
|