heat-engine doesn't do shutdown gracefully when SIGTERM received

Bug #1304244 reported by Mitsuru Kanabuchi
This bug affects 3 people
Affects        | Status       | Importance | Assigned to       | Milestone
OpenStack Heat | Fix Released | Medium     | Mitsuru Kanabuchi |
oslo-incubator | Fix Released | Medium     | Mitsuru Kanabuchi |

Bug Description

[Issue]

heat-engine shuts down immediately when it receives SIGTERM in the middle of stack creation.

I expected heat-engine to exit only after the stack creation had finished, because the graceful shutdown functionality has already been merged from oslo.
As I understand it, this functionality prevents the process from exiting in the middle of processing.

  https://blueprints.launchpad.net/oslo/+spec/graceful-shutdown

I'm not sure whether my expectation is correct.
However, this behavior differs from nova-compute.
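
The expectation described above can be sketched with the standard library alone. This is only an illustration of "finish the in-flight job, accept nothing new"; it is not how heat-engine or oslo implement it, and the class and method names (GracefulWorker, handle_sigterm, install) are hypothetical.

```python
import signal
import threading


class GracefulWorker:
    """Finish the current job after SIGTERM instead of exiting mid-job."""

    def __init__(self):
        self._stop_requested = threading.Event()
        self.completed = []

    def handle_sigterm(self, signum, frame):
        # Only record the request; the run loop decides when to exit.
        self._stop_requested.set()

    def install(self):
        # signal.signal() may only be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, jobs):
        for job in jobs:
            if self._stop_requested.is_set():
                break  # no new work once shutdown has been requested
            # A job that has already started always runs to completion.
            self.completed.append(job())
        return self.completed
```

With this shape, a SIGTERM that arrives while a job is running lets that job finish, and only the jobs that have not started yet are skipped.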

[How to reproduce]

commit id: 043590a62fa576ba339cdad3047a37c4ad583f6c

1) Configure heat-engine as a daemon process.

$ cat /etc/init/heat-engine.conf
description "heat-engine"
author "openstack"

start on (local-filesystems and net-device-up IFACE!=lo)
stop on runlevel [016]

exec su -s /bin/sh -c "exec /opt/stack/heat/bin/heat-engine --config-file=/etc/heat/heat.conf --log-file /home/devstack/log/heat-engine.log > /dev/null 2>&1" devstack

2) Start heat-engine as a daemon.

$ sudo service heat-engine start
heat-engine start/running, process 31555

3) Create a stack that takes a long time to build.

$ heat stack-create -f a-lot-of-vm a-lot-of-vm
+--------------------------------------+-------------+--------------------+----------------------+
| id                                   | stack_name  | stack_status       | creation_time        |
+--------------------------------------+-------------+--------------------+----------------------+
| b1ada312-f8d5-4afc-af09-2ed9a1fbe4b3 | a-lot-of-vm | CREATE_IN_PROGRESS | 2014-04-08T07:47:02Z |
+--------------------------------------+-------------+--------------------+----------------------+

4) Send SIGTERM in the middle of stack creation.

$ ps aux|grep heat-engine|grep -v grep
root 31555 0.0 0.0 4052 1548 ? Ss 16:45 0:00 su -s /bin/sh -c exec /opt/stack/heat/bin/heat-engine --config-file=/etc/heat/heat.conf --log-file /home/devstack/log/heat-engine.log > /dev/null 2>&1 devstack
devstack 31557 1.5 1.9 64868 40480 ? Ss 16:45 0:01 python /opt/stack/heat/bin/heat-engine --config-file=/etc/heat/heat.conf --log-file /home/devstack/log/heat-engine.log
$ kill -SIGTERM 31557

5) Check heat-engine.log; it shows that heat-engine shuts down immediately.

    :
2014-04-08 16:47:26.611 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "a-lot-of-vm" [b1ada312-f8d5-4afc-af09-2ed9a1fbe4b3] sleeping from (pid=31557) _sleep /opt/stack/heat/heat/engine/scheduler.py:130
2014-04-08 16:47:27.612 DEBUG heat.engine.scheduler [-] Task stack_task from Stack "a-lot-of-vm" [b1ada312-f8d5-4afc-af09-2ed9a1fbe4b3] running from (pid=31557) step /opt/stack/heat/heat/engine/scheduler.py:186
2014-04-08 16:47:27.612 DEBUG heat.engine.scheduler [-] Task resource_action running from (pid=31557) step /opt/stack/heat/heat/engine/scheduler.py:186
2014-04-08 16:47:27.613 INFO urllib3.connectionpool [-] Starting new HTTP connection (1): 127.0.0.1
2014-04-08 16:47:27.718 INFO heat.openstack.common.service [-] Caught SIGTERM, exiting
2014-04-08 16:47:27.721 DEBUG amqp [-] Closed channel #1 from (pid=31557) _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88
2014-04-08 16:47:27.722 DEBUG amqp [-] Closed channel #1 from (pid=31557) _do_close /usr/lib/python2.7/dist-packages/amqp/channel.py:88

6) The stack status is still "CREATE_IN_PROGRESS" after heat-engine is restarted.

$ sudo service heat-engine start
heat-engine start/running, process 32161
$ heat stack-list
+--------------------------------------+-------------+--------------------+----------------------+
| id                                   | stack_name  | stack_status       | creation_time        |
+--------------------------------------+-------------+--------------------+----------------------+
| b1ada312-f8d5-4afc-af09-2ed9a1fbe4b3 | a-lot-of-vm | CREATE_IN_PROGRESS | 2014-04-08T07:47:02Z |
+--------------------------------------+-------------+--------------------+----------------------+
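
To check for stacks left stranded like this after a restart, the table printed by `heat stack-list` can be parsed directly. The helper below is a hypothetical convenience, not part of the heat client; it assumes only the column names shown in the output above.

```python
def stacks_in_progress(table_text):
    """Return the names of stacks whose status ends in _IN_PROGRESS."""
    header = None
    names = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +----+ border lines and blank lines
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first data-like row is the header
            continue
        row = dict(zip(header, cells))
        if row.get("stack_status", "").endswith("_IN_PROGRESS"):
            names.append(row["stack_name"])
    return names
```

A stack that still appears in this list long after heat-engine was restarted is a likely victim of the forced shutdown.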

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

heat-engine would need to stop processing new requests and remain running until all current IN_PROGRESS jobs were complete.

Changed in heat:
status: New → Triaged
importance: Undecided → Medium
milestone: none → juno-1
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: [Bug 1304244] Re: heat-engine doesn't do shutdown gracefully when SIGTERM received

IMO this isn't really fixable until Heat records all of its state in
the db and doesn't need to keep a stack action in RAM. That seems like
something we'll need for reliable update failure recovery anyway, so
perhaps we should just make sure this works once that is in place.

Excerpts from Steve Baker's message of 2014-04-08 22:13:20 UTC:
> heat-engine would need to stop processing new requests and remain
> running until all current IN_PROGRESS jobs were complete.

Revision history for this message
Mitsuru Kanabuchi (kanabuchi) wrote :

I think this behavior occurs because EngineService doesn't implement a stop function.
The implementation in Service (rpc/service.py) closes the connection and kills tasks immediately.
I guess that implementing EngineService.stop and waiting in it for stack processing to finish (by checking ThreadGroupManager?) would resolve this problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/86497

Changed in heat:
assignee: nobody → Mitsuru Kanabuchi (kanabuchi)
status: Triaged → In Progress
Changed in oslo:
assignee: nobody → Mitsuru Kanabuchi (kanabuchi)
Revision history for this message
Mitsuru Kanabuchi (kanabuchi) wrote :

To implement graceful shutdown in heat-engine, heat-engine should stop ThreadGroup.timers before ThreadGroup.threads have finished.
Currently oslo's ThreadGroup.stop() stops both timers and threads.
We need stop_timers() and stop_threads() for heat-engine's graceful shutdown.
So I am proposing this change to oslo.
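
The split between timers and threads can be sketched as follows. This is a simplified model using standard-library threads, not the oslo code (which is built on eventlet); SketchThreadGroup and its internals are hypothetical, and because standard threads cannot be killed, the non-graceful branch here simply stops tracking them.

```python
import threading
import time


class SketchThreadGroup:
    """Toy thread group with separate timer and thread shutdown."""

    def __init__(self):
        self.threads = []
        self.timers = []  # list of (thread, stop_event) pairs

    def add_thread(self, func, *args):
        t = threading.Thread(target=func, args=args, daemon=True)
        self.threads.append(t)
        t.start()
        return t

    def add_timer(self, interval, func):
        stop = threading.Event()

        def loop():
            # Event.wait() returns False on timeout, True once stop is set.
            while not stop.wait(interval):
                func()

        t = threading.Thread(target=loop, daemon=True)
        self.timers.append((t, stop))
        t.start()

    def stop_timers(self):
        # Stop timers first so they cannot spawn new job threads.
        for _, stop in self.timers:
            stop.set()
        for t, _ in self.timers:
            t.join()
        self.timers = []

    def wait(self):
        for t in list(self.threads):
            t.join()
        self.threads = []

    def stop(self, graceful=False):
        self.stop_timers()
        if graceful:
            self.wait()  # let running jobs finish
        else:
            self.threads = []  # sketch only: abandon, cannot kill


results, ticks = [], []
group = SketchThreadGroup()
group.add_thread(lambda: (time.sleep(0.05), results.append("done")))
group.add_timer(0.01, lambda: ticks.append(1))
group.stop(graceful=True)
```

After `stop(graceful=True)` returns, the slow job has completed and the timer no longer fires.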

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (master)

Fix proposed to branch: master
Review: https://review.openstack.org/87180

Changed in oslo:
status: New → In Progress
Ben Nemec (bnemec)
Changed in oslo:
importance: Undecided → Medium
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to oslo-incubator (master)

Reviewed: https://review.openstack.org/87180
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=fdc88831e272dec25ed77a176e6bdfb9cddb6c3d
Submitter: Jenkins
Branch: master

commit fdc88831e272dec25ed77a176e6bdfb9cddb6c3d
Author: Mitsuru Kanabuchi <email address hidden>
Date: Thu Apr 17 11:31:46 2014 +0900

    Add graceful stop function to ThreadGroup.stop

    Currently ThreadGroup.stop() would stop both timers and threads
    immediately.
    However, heat-engine should stop timers before threads finished for
    the purpose of graceful shutdown.

    The graceful shutdown for Heat is "do process exit after stack
    processing finished". Heat implemented the stack processing as threads.
    We should wait its finishing for graceful shutdown's purpose.

    On the one hand, Heat is using timers.
    The timers have the function of make another job threads.
    It means, we should stop timers before waiting threads for preventing
    another thread occur by timers.

    From the above, the appropriate order of Heat's graceful shutdown is:

      * stop timers for preventing new thread occur
      * wait for all threads to be finished
      * process exit

    However, currently ThreadGroup class doesn't have the function of
    graceful stop. So I propose the function of graceful stop.

    Change-Id: Id575674af95ae7ad88c00a2ac5d629ab0d0a9b46
    Closes-bug: #1304244

Changed in oslo:
status: In Progress → Fix Committed
Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/89484

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/89484
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ea911b0210c4b4317de6bd371c25f5cb9c255655
Submitter: Jenkins
Branch: master

commit ea911b0210c4b4317de6bd371c25f5cb9c255655
Author: Mitsuru Kanabuchi <email address hidden>
Date: Fri Apr 25 20:56:24 2014 +0900

    Update openstack-common in prep for graceful stop

    Now at oslo-incubator version 95c7598917c7cd16f68a253a2048c21b3df9dfda

    Change-Id: I2655ba2b9a0fb5048576f2b738d2af52c76c1d05
    Partial-Bug: #1304244

Revision history for this message
Openstack Gerrit (openstack-gerrit) wrote :

Reviewed: https://review.openstack.org/86497
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3e3f9a9a4daf3b3fa6465ea6bed97e318ed04b93
Submitter: Jenkins
Branch: master

commit 3e3f9a9a4daf3b3fa6465ea6bed97e318ed04b93
Author: Mitsuru Kanabuchi <email address hidden>
Date: Fri Apr 25 21:07:31 2014 +0900

    Shut the heat-engine after all threads finished

    Currently heat-engine do shutdown immediately when SIGTERM received.
    At that time, a stacks in middle of processing would terminate forcefully.
    It's not graceful.

    This patch aims to implement graceful shutdown with following methods.

      * Close rpc connection at first for preventing new requests arrived
        after SIGTERM received.
      * Stop stack processing with graceful option.
        The graceful stop functionality is provided by oslo-incubator.
      * Then terminating process.

    Change-Id: I8689b830774f7916febb59aca00979d92c0448b5
    Closes-bug: #1304244
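
The three steps listed in this commit message can be sketched at the service level as shown below. This is a minimal model with standard-library threads, not the merged Heat patch; SketchEngineService, handle_request and the `finished` list are hypothetical names, and "closing the rpc connection" is reduced to a flag that rejects new requests.

```python
import threading
import time


class SketchEngineService:
    """Toy service: refuse new requests, then wait for running ones."""

    def __init__(self):
        self._lock = threading.Lock()
        self._accepting = True
        self._workers = []
        self.finished = []

    def handle_request(self, name, work):
        """Start work for a request; return False once shutdown began."""
        with self._lock:
            if not self._accepting:
                return False
            t = threading.Thread(target=self._run, args=(name, work))
            self._workers.append(t)
            t.start()
        return True

    def _run(self, name, work):
        work()
        self.finished.append(name)

    def stop(self):
        # Step 1: stop taking new requests (stands in for closing rpc).
        with self._lock:
            self._accepting = False
            workers = list(self._workers)
        # Step 2: graceful stop, i.e. wait for in-flight processing.
        for t in workers:
            t.join()
        self._workers = []
        # Step 3: the caller may now terminate the process.


svc = SketchEngineService()
accepted = svc.handle_request("a-lot-of-vm", lambda: time.sleep(0.05))
svc.stop()
late = svc.handle_request("late-stack", lambda: None)
```

The ordering matters: if the service waited for threads before rejecting new requests, work arriving during the wait could keep the process alive indefinitely.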

Changed in heat:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo-incubator (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/96494

Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in oslo:
milestone: none → juno-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo-incubator (stable/icehouse)

Reviewed: https://review.openstack.org/96494
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=3f830bfd8c70a237aa5c38abfa27c18e286289e2
Submitter: Jenkins
Branch: stable/icehouse

commit 3f830bfd8c70a237aa5c38abfa27c18e286289e2
Author: Mitsuru Kanabuchi <email address hidden>
Date: Thu Apr 17 11:31:46 2014 +0900

    Add graceful stop function to ThreadGroup.stop

    Currently ThreadGroup.stop() would stop both timers and threads
    immediately.
    However, heat-engine should stop timers before threads finished for
    the purpose of graceful shutdown.

    The graceful shutdown for Heat is "do process exit after stack
    processing finished". Heat implemented the stack processing as threads.
    We should wait its finishing for graceful shutdown's purpose.

    On the one hand, Heat is using timers.
    The timers have the function of make another job threads.
    It means, we should stop timers before waiting threads for preventing
    another thread occur by timers.

    From the above, the appropriate order of Heat's graceful shutdown is:

      * stop timers for preventing new thread occur
      * wait for all threads to be finished
      * process exit

    However, currently ThreadGroup class doesn't have the function of
    graceful stop. So I propose the function of graceful stop.

    Conflicts:
     tests/unit/test_threadgroup.py

    Change-Id: Id575674af95ae7ad88c00a2ac5d629ab0d0a9b46
    Closes-bug: #1304244
    (cherry picked from commit fdc88831e272dec25ed77a176e6bdfb9cddb6c3d)

tags: added: in-stable-icehouse
Thierry Carrez (ttx)
Changed in heat:
milestone: juno-1 → 2014.2