Activity log for bug #1446241

Date Who What changed Old value New value Message
2015-04-20 13:51:03 Bogdan Dobrelya bug added bug
2015-04-20 13:51:14 Bogdan Dobrelya nominated for series fuel/5.1.x
2015-04-20 13:51:14 Bogdan Dobrelya bug task added fuel/5.1.x
2015-04-20 13:51:14 Bogdan Dobrelya nominated for series fuel/6.0.x
2015-04-20 13:51:14 Bogdan Dobrelya bug task added fuel/6.0.x
2015-04-20 13:51:42 Bogdan Dobrelya fuel: milestone 6.1
2015-04-20 13:51:44 Bogdan Dobrelya fuel: importance Undecided Critical
2015-04-20 13:51:50 Bogdan Dobrelya fuel: assignee Fuel Library Team (fuel-library)
2015-04-20 13:51:55 Bogdan Dobrelya fuel: status New Confirmed
2015-04-20 13:52:06 Bogdan Dobrelya fuel/5.1.x: milestone 5.1.2
2015-04-20 13:52:09 Bogdan Dobrelya fuel/6.0.x: milestone 6.0.1
2015-04-20 13:52:16 Bogdan Dobrelya fuel/5.1.x: assignee Fuel Library Team (fuel-library)
2015-04-20 13:52:22 Bogdan Dobrelya fuel/6.0.x: assignee Fuel Library Team (fuel-library)
2015-04-20 13:52:26 Bogdan Dobrelya fuel/5.1.x: importance Undecided Critical
2015-04-20 13:52:30 Bogdan Dobrelya fuel/6.0.x: importance Undecided Critical
2015-04-20 13:52:34 Bogdan Dobrelya fuel/5.1.x: status New Confirmed
2015-04-20 13:52:36 Bogdan Dobrelya fuel/6.0.x: status New Confirmed
2015-04-20 13:54:09 Bogdan Dobrelya description This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact. This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired. Otherwise there are may orphaned commands - such as start, stop - hang for ever detached from the main process after it has been killed Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 14:15:58 Bogdan Dobrelya fuel: assignee Fuel Library Team (fuel-library) MOS Linux (mos-linux)
2015-04-20 14:16:05 Bogdan Dobrelya fuel: status Confirmed New
2015-04-20 14:18:43 Bogdan Dobrelya bug added subscriber Vladimir Kuklin
2015-04-20 14:18:51 Bogdan Dobrelya bug added subscriber Sergii Golovatiuk
2015-04-20 14:49:53 Bogdan Dobrelya fuel: assignee MOS Linux (mos-linux) Bogdan Dobrelya (bogdando)
2015-04-20 14:49:56 Bogdan Dobrelya fuel: status New In Progress
2015-04-20 14:58:23 Bogdan Dobrelya fuel/5.1.x: status Confirmed Triaged
2015-04-20 14:58:25 Bogdan Dobrelya fuel/6.0.x: status Confirmed Triaged
2015-04-20 15:03:52 Bogdan Dobrelya description This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired. Otherwise there are may orphaned commands - such as start, stop - hang for ever detached from the main process after it has been killed Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact. This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is used with ocf_run wrapper, which uses a 'su'. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000 (killed) # ps aux (now is OK!) 
This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 21:39:00 Bogdan Dobrelya description This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is used with ocf_run wrapper, which uses a 'su'. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact. This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmq-server stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. 
Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf  root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"  root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000  rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh  rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf  root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"  root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000  root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 21:39:07 Bogdan Dobrelya fuel/5.1.x: status Triaged Confirmed
2015-04-20 21:39:10 Bogdan Dobrelya fuel/6.0.x: status Triaged Confirmed
2015-04-21 08:24:22 Dina Belova tags scale
2015-04-21 08:28:08 Bogdan Dobrelya description This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmq-server stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf  root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"  root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000  rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh  rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf  root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"  root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000  root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact. This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmqctl stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. 
If the timeout expires, it kills only the child processes in the original process group, leaving the commands in the new process group running orphaned. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orphaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf  root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000"  root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 | \_ su rabbitmq sh -c whoami; sleep 1000  rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 | \_ bash -c whoami; sleep 1000 sh  rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 | \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf  root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000"  root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 | \_ sh -c whoami; sleep 1000  root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 | \_ sleep 1000 (killed) # ps aux (now is OK!) The solution is to issue all timeout-wrapped rabbitmqctl commands as the rabbitmq user, so that rabbitmqctl does not have to use 'su'. This issue can appear only when the specified timeout for the stop or wait commands has been exceeded. That is a usual case under load, hence the issue is critical by its impact.
2015-04-21 08:28:12 Bogdan Dobrelya fuel/5.1.x: status Confirmed Triaged
2015-04-21 08:28:14 Bogdan Dobrelya fuel/6.0.x: status Confirmed Triaged
2015-04-21 09:48:15 Bogdan Dobrelya summary RabbitMQ OCF timeout does not kill child processes RabbitMQ OCF timeout should be used without 'su' childs
2015-04-21 16:39:38 OpenStack Infra fuel: assignee Bogdan Dobrelya (bogdando) Sergii Golovatiuk (sgolovatiuk)
2015-04-22 12:19:26 OpenStack Infra fuel: assignee Sergii Golovatiuk (sgolovatiuk) Bogdan Dobrelya (bogdando)
2015-04-22 17:43:11 OpenStack Infra fuel: assignee Bogdan Dobrelya (bogdando) Alexander Nevenchannyy (anevenchannyy)
2015-04-22 21:07:19 OpenStack Infra fuel: status In Progress Fix Committed
2015-05-04 10:24:35 Bogdan Dobrelya fuel/5.1.x: assignee Fuel Library Team (fuel-library) Bogdan Dobrelya (bogdando)
2015-05-04 10:24:39 Bogdan Dobrelya fuel/6.0.x: assignee Fuel Library Team (fuel-library) Bogdan Dobrelya (bogdando)
2015-05-04 10:24:43 Bogdan Dobrelya fuel/6.0.x: status Triaged In Progress
2015-05-04 10:24:47 Bogdan Dobrelya fuel/5.1.x: status Triaged In Progress
2015-05-08 09:23:02 Bogdan Dobrelya fuel/6.0.x: status In Progress Fix Committed
2015-05-08 09:23:06 Bogdan Dobrelya fuel/5.1.x: status In Progress Fix Committed
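
The two test cases from the description can be reproduced without a rabbitmq account. The sketch below is not the fix that was committed. It assumes a Linux host with GNU coreutils 'timeout', the util-linux 'setsid' tool and a mounted /proc, and it uses 'setsid' as a stand-in for 'su', on the assumption that what matters is the command moving into a new session and process group. The helper names (is_running, run_case) and the shortened timings are hypothetical.

```shell
#!/bin/sh
# Sketch only: 'setsid' stands in for 'su' (an assumption). Both can place
# the wrapped command in a new session, outside the process group that
# 'timeout' signals when the time limit expires.

tmp=$(mktemp -d)

# Succeeds if the PID exists and is not a zombie (Linux /proc assumed).
is_running() {
    state=$(sed -n 's/^State:[[:space:]]*\(.\).*/\1/p' "/proc/$1/status" 2>/dev/null)
    [ -n "$state" ] && [ "$state" != "Z" ]
}

# run_case NAME PREFIX
# Runs "PREFIX sh -c '<record pid>; exec sleep 30'" under a 2-second timeout
# and stores "orphaned" or "terminated" in the global variable 'result'.
run_case() {
    pidfile="$tmp/$1.pid"
    # The intermediate 'sh -c' plays the role of the parent that 'timeout'
    # monitors directly; the sleep is its grandchild.
    timeout -s TERM 2 sh -c "$2 sh -c 'echo \$\$ > $pidfile; exec sleep 30' & wait"
    sleep 1   # give the signal time to be delivered
    pid=$(cat "$pidfile")
    if is_running "$pid"; then
        kill "$pid"   # clean up the orphaned sleep
        result=orphaned
    else
        result=terminated
    fi
}

# Case a) the command is moved to a new session, as with 'su' in the report.
run_case a setsid
result_a=$result

# Case b) the command stays in the process group created by 'timeout'.
run_case b ""
result_b=$result

echo "case a (new session): $result_a"
echo "case b (same group): $result_b"

rm -rf "$tmp"
```

Case b corresponds to the proposed solution: when the wrapped command does not pass through 'su', it stays in the process group that 'timeout' signals. Other 'timeout' implementations (for example the BusyBox one) may signal only the direct child and could behave differently in case b.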