Fuel for OpenStack

Activity log for bug #1446241

Date	Who	What changed	Old value	New value	Message
2015-04-20 13:51:03	Bogdan Dobrelya	bug			added bug
2015-04-20 13:51:14	Bogdan Dobrelya	nominated for series		fuel/5.1.x
2015-04-20 13:51:14	Bogdan Dobrelya	bug task added		fuel/5.1.x
2015-04-20 13:51:14	Bogdan Dobrelya	nominated for series		fuel/6.0.x
2015-04-20 13:51:14	Bogdan Dobrelya	bug task added		fuel/6.0.x
2015-04-20 13:51:42	Bogdan Dobrelya	fuel: milestone		6.1
2015-04-20 13:51:44	Bogdan Dobrelya	fuel: importance	Undecided	Critical
2015-04-20 13:51:50	Bogdan Dobrelya	fuel: assignee		Fuel Library Team (fuel-library)
2015-04-20 13:51:55	Bogdan Dobrelya	fuel: status	New	Confirmed
2015-04-20 13:52:06	Bogdan Dobrelya	fuel/5.1.x: milestone		5.1.2
2015-04-20 13:52:09	Bogdan Dobrelya	fuel/6.0.x: milestone		6.0.1
2015-04-20 13:52:16	Bogdan Dobrelya	fuel/5.1.x: assignee		Fuel Library Team (fuel-library)
2015-04-20 13:52:22	Bogdan Dobrelya	fuel/6.0.x: assignee		Fuel Library Team (fuel-library)
2015-04-20 13:52:26	Bogdan Dobrelya	fuel/5.1.x: importance	Undecided	Critical
2015-04-20 13:52:30	Bogdan Dobrelya	fuel/6.0.x: importance	Undecided	Critical
2015-04-20 13:52:34	Bogdan Dobrelya	fuel/5.1.x: status	New	Confirmed
2015-04-20 13:52:36	Bogdan Dobrelya	fuel/6.0.x: status	New	Confirmed
2015-04-20 13:54:09	Bogdan Dobrelya	description	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired. Otherwise there are may orphaned commands - such as start, stop - hang for ever detached from the main process after it has been killed Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 14:15:58	Bogdan Dobrelya	fuel: assignee	Fuel Library Team (fuel-library)	MOS Linux (mos-linux)
2015-04-20 14:16:05	Bogdan Dobrelya	fuel: status	Confirmed	New
2015-04-20 14:18:43	Bogdan Dobrelya	bug			added subscriber Vladimir Kuklin
2015-04-20 14:18:51	Bogdan Dobrelya	bug			added subscriber Sergii Golovatiuk
2015-04-20 14:49:53	Bogdan Dobrelya	fuel: assignee	MOS Linux (mos-linux)	Bogdan Dobrelya (bogdando)
2015-04-20 14:49:56	Bogdan Dobrelya	fuel: status	New	In Progress
2015-04-20 14:58:23	Bogdan Dobrelya	fuel/5.1.x: status	Confirmed	Triaged
2015-04-20 14:58:25	Bogdan Dobrelya	fuel/6.0.x: status	Confirmed	Triaged
2015-04-20 15:03:52	Bogdan Dobrelya	description	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout should kill all child processes, when expired. Otherwise there are may orphaned commands - such as start, stop - hang for ever detached from the main process after it has been killed Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ These issues may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is used with ocf_run wrapper, which uses a 'su'. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 \| \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 \| \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 \| \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 \| \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 \| \_ sleep 1000 (killed) # ps aux (now is OK!) 
This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 21:39:00	Bogdan Dobrelya	description	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is used with ocf_run wrapper, which uses a 'su'. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 \| \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 \| \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 \| \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 \| \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 \| \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmq-server stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. 
Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 \| \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 \| \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 \| \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 \| \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 \| \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-20 21:39:07	Bogdan Dobrelya	fuel/5.1.x: status	Triaged	Confirmed
2015-04-20 21:39:10	Bogdan Dobrelya	fuel/6.0.x: status	Triaged	Confirmed
2015-04-21 08:24:22	Dina Belova	tags		scale
2015-04-21 08:28:08	Bogdan Dobrelya	description	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmq-server stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. And if the timeout kills all child processes in the original process group, there will be orphaned commands left. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 \| \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 \| \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 \| \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 \| \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 \| \_ sleep 1000 (killed) # ps aux (now is OK!) This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.	This issue was discovered at the scale lab, when rabbit nodes were running under load. Timeout is being used for rabbitmqctl stop, start and wait, which uses a 'su': sh -x /usr/sbin/rabbitmq-server <...>. The 'su' changes the original process group. 
And if the timeout expired, it would kill only the child processes in the original process group leaving the commands in the new process group running orphaned. Here is an example flow (from atop binary logs): http://paste.openstack.org/show/cyyI2H5Ih1oT0Q4fgMh5/ Here is how to test it: Case a) The 'sleep' should detach to init and run orhaned: # timeout -s TERM 60 sh -c 'su rabbitmq sh -c "whoami; sleep 1000"' & # ps auxf root 32066 0.0 0.0 100932 708 pts/0 S 14:47 0:00 \_ timeout -s TERM 60 sh -c su rabbitmq sh -c "whoami; sleep 1000" root 32067 0.0 0.0 141316 1564 pts/0 S 14:47 0:00 \| \_ su rabbitmq sh -c whoami; sleep 1000 rabbitmq 32068 0.0 0.0 106060 1304 ? Ss 14:47 0:00 \| \_ bash -c whoami; sleep 1000 sh rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 \| \_ sleep 1000 (killed) # ps aux rabbitmq 32070 0.0 0.0 100904 592 ? S 14:47 0:00 sleep 1000 Case b) The 'sleep' should terminate as well: # timeout -s TERM 60 sh -c 'sh -c "whoami; sleep 1000"' & # ps auxf root 13586 0.0 0.0 100932 708 pts/0 S 14:51 0:00 \_ timeout -s TERM 60 sh -c sh -c "whoami; sleep 1000" root 13587 0.0 0.0 106056 1292 pts/0 S 14:51 0:00 \| \_ sh -c whoami; sleep 1000 root 13589 0.0 0.0 100904 596 pts/0 S 14:51 0:00 \| \_ sleep 1000 (killed) # ps aux (now is OK!) The solution is to issue all timeout wrapped rabbitmqctl commands as a rabbitmq user, so the rabbitmqctl would not have to use the 'su'. This issue may appear only when the specified timeout for commands to stop or wait have exceeded. That is a usual case under load, hence is critical by its impact.
2015-04-21 08:28:12	Bogdan Dobrelya	fuel/5.1.x: status	Confirmed	Triaged
2015-04-21 08:28:14	Bogdan Dobrelya	fuel/6.0.x: status	Confirmed	Triaged
2015-04-21 09:48:15	Bogdan Dobrelya	summary	RabbitMQ OCF timeout does not kill child processes	RabbitMQ OCF timeout should be used without 'su' childs
2015-04-21 16:39:38	OpenStack Infra	fuel: assignee	Bogdan Dobrelya (bogdando)	Sergii Golovatiuk (sgolovatiuk)
2015-04-22 12:19:26	OpenStack Infra	fuel: assignee	Sergii Golovatiuk (sgolovatiuk)	Bogdan Dobrelya (bogdando)
2015-04-22 17:43:11	OpenStack Infra	fuel: assignee	Bogdan Dobrelya (bogdando)	Alexander Nevenchannyy (anevenchannyy)
2015-04-22 21:07:19	OpenStack Infra	fuel: status	In Progress	Fix Committed
2015-05-04 10:24:35	Bogdan Dobrelya	fuel/5.1.x: assignee	Fuel Library Team (fuel-library)	Bogdan Dobrelya (bogdando)
2015-05-04 10:24:39	Bogdan Dobrelya	fuel/6.0.x: assignee	Fuel Library Team (fuel-library)	Bogdan Dobrelya (bogdando)
2015-05-04 10:24:43	Bogdan Dobrelya	fuel/6.0.x: status	Triaged	In Progress
2015-05-04 10:24:47	Bogdan Dobrelya	fuel/5.1.x: status	Triaged	In Progress
2015-05-08 09:23:02	Bogdan Dobrelya	fuel/6.0.x: status	In Progress	Fix Committed
2015-05-08 09:23:06	Bogdan Dobrelya	fuel/5.1.x: status	In Progress	Fix Committed
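
The process-group behaviour from the description (cases a and b) can be reproduced without RabbitMQ or 'su'. The following is a minimal sketch under stated assumptions, not code from the actual fix. The name demo_group_kill and the Python sleeper command are hypothetical. Using start_new_session=True to stand in for what 'su' does is an assumption, inferred from the 'Ss' state and '?' TTY of the child in the ps output above. The sketch only shows that a signal sent to one process group does not reach a child that has moved to a different session and group.

```python
import os
import signal
import subprocess
import sys

# Hypothetical stand-in for a long-running command such as "sleep 1000".
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def demo_group_kill():
    """Signal one process group and report which children were terminated."""
    # Stand-in for the timeout wrapper: becomes leader of a new process group.
    leader = subprocess.Popen(SLEEPER, preexec_fn=os.setpgrp)
    group = leader.pid  # a group leader's pgid equals its pid

    def join_group():
        # Runs in the child before exec: join the wrapper's process group.
        os.setpgid(0, group)

    # Case b: the child stays in the wrapper's process group.
    same_group = subprocess.Popen(SLEEPER, preexec_fn=join_group)

    # Case a (assumed model of the 'su' child): new session and new group.
    detached = subprocess.Popen(SLEEPER, start_new_session=True)

    try:
        # Signal the whole wrapper group, as a group-wide kill would.
        os.killpg(group, signal.SIGTERM)
        leader.wait(timeout=10)
        same_group.wait(timeout=10)
        return {
            "same_group_killed": same_group.returncode == -signal.SIGTERM,
            "detached_survived": detached.poll() is None,
        }
    finally:
        # Clean up anything still running, including the detached child.
        for proc in (leader, same_group, detached):
            if proc.poll() is None:
                proc.kill()
            proc.wait()


if __name__ == "__main__":
    print(demo_group_kill())
```

On a Linux host this is expected to report that the child in the same group was terminated while the detached child kept running, which matches the orphaned 'sleep 1000' in case a. Running the wrapped command directly as the target user, as the description proposes, avoids creating the detached child in the first place.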