OpenStack Heat

engine broken with multiple workers w/qpid

Bug #1321303 reported by Steven Hardy on 2014-05-20

This bug affects 1 person

Affects         Status        Importance  Assigned to   Milestone
OpenStack Heat  Fix Released  High        Steven Hardy  OpenStack Heat 2014.2 "juno"

Bug Description

Since https://github.com/openstack/heat/commit/56041687844c7c04bfcad974a07ac3ff35f99e5c I'm seeing heat-engine deadlock, or rather stop responding to API requests, so that every API request times out.

Setting num_engine_workers = 1 in heat.conf resolves the issue for me.
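For reference, the workaround corresponds to a heat.conf fragment along these lines. Placing the option under [DEFAULT] is an assumption based on where Heat's engine options normally live; check the section layout of your own heat.conf before copying it.

```ini
# heat.conf (sketch; the [DEFAULT] section placement is an assumption)
[DEFAULT]
# Run a single engine worker to avoid the multi-worker problem described here.
num_engine_workers = 1
```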

Not quite sure why this is happening yet...

Steven Hardy (shardy) wrote on 2014-05-20:

Seems this issue is specific to qpid and those using rabbit are not affected.

summary:

- engine deadlock with multiple workers
+ engine broken with multiple workers w/qpid

Jeff Peeler (jpeeler-z) on 2014-05-20

Changed in heat:
status:	New → Confirmed

OpenStack Infra (hudson-openstack) wrote on 2014-05-20: Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94428

Changed in heat:
assignee:	nobody → Clint Byrum (clint-fewbar)
status:	Confirmed → In Progress

Clint Byrum (clint-fewbar) on 2014-05-20

Changed in heat:
importance:	Undecided → High

OpenStack Infra (hudson-openstack) wrote on 2014-05-20: Fix merged to heat (master)

Reviewed: https://review.openstack.org/94428
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6d308d53e88c3add5ea381bbf8a2d88ecf6b79a1
Submitter: Jenkins
Branch: master

commit 6d308d53e88c3add5ea381bbf8a2d88ecf6b79a1
Author: Clint Byrum <email address hidden>
Date: Tue May 20 10:14:34 2014 -0700

Revert "Default engine workers to the number of CPUs"

This reverts commit 56041687844c7c04bfcad974a07ac3ff35f99e5c.

Problems have been reported on systems using QPID. The gate only tests
RabbitMQ.

Change-Id: Iaf83dc7582835ce5bf4534116c918640da7373aa
Partial-Bug: #1321303

Steve Baker (steve-stevebaker) wrote on 2014-05-21:

I wonder if this is affected by only multiple workers, or also multiple instances of heat-engine

Steven Hardy (shardy) wrote on 2014-05-21:

> I wonder if this is affected by only multiple workers, or also multiple instances of heat-engine

My testing indicates it's only multiple workers. Running two heat-engine processes on the same box works OK AFAICT: each request gets round-robin'd as expected and, e.g., "heat stack-list" works fine.

I'm digging into the worker issue trying to work out a fix.

Changed in heat:
assignee:	Clint Byrum (clint-fewbar) → Steven Hardy (shardy)
milestone:	none → juno-1

Steven Hardy (shardy) wrote on 2014-05-21:

Initial investigation indicates it's related to the workers sharing an EngineListener object: if you create the listener in start() instead of in the constructor, the problem goes away.

That also highlights the problem that the workers are all creating StackWatch objects and starting periodic tasks on the parent ThreadGroupManager object. I think that will result in a periodic watcher task being fired by a timer for every stack on every worker, which is not what we want.

Working on patches for both issues.
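To illustrate the constructor-versus-start() distinction, here is a minimal sketch; it is not Heat's code. Listener, EagerService, LazyService and launch_workers are hypothetical names, and the sketch calls os.fork() directly (POSIX only) instead of going through the oslo launcher. It only shows which process ends up creating the per-worker object, not the qpid failure itself.

```python
import os


class Listener:
    """Hypothetical stand-in for a per-process resource such as an RPC listener."""

    def __init__(self):
        # Remember which process created this object.
        self.created_in_pid = os.getpid()


class EagerService:
    """Creates the listener in the constructor, i.e. in the parent before the fork."""

    def __init__(self):
        self.listener = Listener()

    def start(self):
        pass


class LazyService:
    """Defers listener creation to start(), which runs inside each worker."""

    def __init__(self):
        self.listener = None

    def start(self):
        self.listener = Listener()


def launch_workers(service, count):
    """Fork `count` workers and call service.start() in each one.

    Returns one boolean per worker: True if that worker's listener was
    created inside the worker process itself.
    """
    results = []
    for _ in range(count):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child (worker) process: report back through the pipe, then exit.
            status = 1
            try:
                os.close(read_fd)
                service.start()
                own = service.listener.created_in_pid == os.getpid()
                os.write(write_fd, b"1" if own else b"0")
                status = 0
            finally:
                os._exit(status)
        # Parent process: collect the worker's answer and reap it.
        os.close(write_fd)
        data = os.read(read_fd, 1)
        os.close(read_fd)
        os.waitpid(pid, 0)
        results.append(data == b"1")
    return results


if __name__ == "__main__":
    print("constructor:", launch_workers(EagerService(), 2))
    print("start():    ", launch_workers(LazyService(), 2))
```

With the constructor variant every worker inherits an object created by the parent, whereas with the start() variant each worker builds its own. Anything holding a broker connection or socket is generally safer in the second form.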

OpenStack Infra (hudson-openstack) wrote on 2014-05-21: Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94737

OpenStack Infra (hudson-openstack) wrote on 2014-06-11: Fix merged to heat (master)

Reviewed: https://review.openstack.org/94737
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=507555a585d22a6dc276344b7b97fa9a015e81e5
Submitter: Jenkins
Branch: master

commit 507555a585d22a6dc276344b7b97fa9a015e81e5
Author: Steven Hardy <email address hidden>
Date: Tue May 27 16:18:06 2014 +0100

Move Engine initialization into service start()

Currently we create the ThreadGroupManager and EngineListener
objects in the service constructor, which is not necessarily
going to work if multiple worker processes are specified in
the config file (which fork multiple workers after the constructor).

The ThreadGroupManager appears to work, even when it is created before
the fork, but this is due to some magic in the oslo ProcessLauncher
implementation which decouples parent/child use of the eventlet hub.

So instead, we move all service startup code into the start() method,
which is the entry point for services, triggered via the oslo Services
class run_service method:

- Don't create anything common to the workers in the constructor
- Move ThreadGroup and EngineListener creation into start()
- Create the periodic tasks from bin/heat-engine, which means
  the periodic tasks will only be created by the parent, not
  duplicated in every worker process.

These changes should mean we work correctly with both the ServiceLauncher
(num_engine_workers==1) and ProcessLauncher(num_engine_workers>1) oslo
abstractions, and solves the issues observed when running multiple
workers with the impl_qpid rpc_backend.

Change-Id: If3a11050a03660560a364dec871f85c4b56c1c25
Closes-Bug: #1321303

Changed in heat:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2014-06-11

Changed in heat:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2014-10-16

Changed in heat:
milestone:	juno-1 → 2014.2
