engine broken with multiple workers w/qpid

Bug #1321303 reported by Steven Hardy on 2014-05-20
This bug affects 1 person
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Steven Hardy

Bug Description

Since https://github.com/openstack/heat/commit/56041687844c7c04bfcad974a07ac3ff35f99e5c I'm seeing heat-engine deadlock, or rather stop responding to API requests, so that every API request times out.

Setting num_engine_workers = 1 in heat.conf resolves the issue for me.

Not quite sure why this is happening yet...

Steven Hardy (shardy) wrote :

Seems this issue is specific to qpid and those using rabbit are not affected.

summary: - engine deadlock with multiple workers
+ engine broken with multiple workers w/qpid
Jeff Peeler (jpeeler-z) on 2014-05-20
Changed in heat:
status: New → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/94428

Changed in heat:
assignee: nobody → Clint Byrum (clint-fewbar)
status: Confirmed → In Progress
Changed in heat:
importance: Undecided → High

Reviewed: https://review.openstack.org/94428
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6d308d53e88c3add5ea381bbf8a2d88ecf6b79a1
Submitter: Jenkins
Branch: master

commit 6d308d53e88c3add5ea381bbf8a2d88ecf6b79a1
Author: Clint Byrum <email address hidden>
Date: Tue May 20 10:14:34 2014 -0700

    Revert "Default engine workers to the number of CPUs"

    This reverts commit 56041687844c7c04bfcad974a07ac3ff35f99e5c.

    Problems have been reported on systems using QPID. The gate only tests
    RabbitMQ.

    Change-Id: Iaf83dc7582835ce5bf4534116c918640da7373aa
    Partial-Bug: #1321303

Steve Baker (steve-stevebaker) wrote :

I wonder if this is affected by only multiple workers, or also multiple instances of heat-engine

Steven Hardy (shardy) wrote :

> I wonder if this is affected by only multiple workers, or also multiple instances of heat-engine

My testing indicates it's only multiple workers. Running two heat-engine processes on the same box works OK AFAICT: each request gets round-robin'd as expected, and e.g. "heat stack-list" works fine.

I'm digging into the worker issue trying to work out a fix.

Changed in heat:
assignee: Clint Byrum (clint-fewbar) → Steven Hardy (shardy)
milestone: none → juno-1
Steven Hardy (shardy) wrote :

So initial investigation indicates it's related to the workers sharing an EngineListener object. If you create the listener in start() instead of the constructor, the problem is solved.

That also highlights the problem that the workers are all creating StackWatch objects and starting periodic tasks on the parent ThreadGroupManager object. I think that will result in a periodic watcher task being fired by a timer for every stack on every worker, which is not what we want.

Working on patches for both issues.
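
The constructor-versus-start() difference described above can be illustrated with a minimal sketch. This is not Heat or oslo code: Listener, EagerService, LazyService and run_workers are hypothetical names, and the "listener" only records the PID it was created in. It assumes a POSIX system where os.fork() is available.

```python
import os


class Listener:
    """Hypothetical stand-in for a per-process messaging resource."""

    def __init__(self):
        # Remember which process built this object.
        self.created_in_pid = os.getpid()


class EagerService:
    """Builds the listener in the constructor, i.e. before any fork."""

    def __init__(self):
        self.listener = Listener()

    def start(self):
        pass


class LazyService:
    """Builds the listener in start(), i.e. inside each worker."""

    def __init__(self):
        self.listener = None

    def start(self):
        self.listener = Listener()


def run_workers(service, num_workers):
    """Fork workers one at a time and call service.start() in each.

    Returns a list of (worker_pid, listener_created_in_pid) tuples.
    """
    results = []
    for _ in range(num_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: report back over the pipe, then exit without
            # running any of the parent's cleanup handlers.
            try:
                os.close(read_fd)
                service.start()
                msg = "%d %d" % (os.getpid(), service.listener.created_in_pid)
                os.write(write_fd, msg.encode())
                os.close(write_fd)
            finally:
                os._exit(0)
        # Parent: read the child's report and reap the child.
        os.close(write_fd)
        with os.fdopen(read_fd) as reader:
            worker_pid, created_pid = map(int, reader.read().split())
        os.waitpid(pid, 0)
        results.append((worker_pid, created_pid))
    return results
```

With EagerService every worker reports a listener that was built in the parent, so all workers start from copies of the same pre-fork object. With LazyService each worker reports a listener built under its own PID.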

Reviewed: https://review.openstack.org/94737
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=507555a585d22a6dc276344b7b97fa9a015e81e5
Submitter: Jenkins
Branch: master

commit 507555a585d22a6dc276344b7b97fa9a015e81e5
Author: Steven Hardy <email address hidden>
Date: Tue May 27 16:18:06 2014 +0100

    Move Engine initialization into service start()

    Currently we create the ThreadGroupManager and EngineListener
    objects in the service constructor, which is not necessarily
    going to work if multiple worker processes are specified in
    the config file (which fork multiple workers after the constructor).

    The ThreadGroupManager appears to work, even when it is created before
    the fork, but this is due to some magic in the oslo ProcessLauncher
    implementation which decouples parent/child use of the eventlet hub.

    So instead, we move all service startup code into the start() method,
    which is the entry point for services, triggered via the oslo Services
    class run_service method:

    - Don't create anything common to the workers in the constructor
    - Move ThreadGroup and EngineListener creation into start()
    - Create the periodic tasks from bin/heat-engine, which means
      the periodic tasks will only be created by the parent, not
      duplicated in every worker process.

    These changes should mean we work correctly with both the ServiceLauncher
    (num_engine_workers==1) and ProcessLauncher(num_engine_workers>1) oslo
    abstractions, and solves the issues observed when running multiple
    workers with the impl_qpid rpc_backend.

    Change-Id: If3a11050a03660560a364dec871f85c4b56c1c25
    Closes-Bug: #1321303
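
As a rough illustration of the periodic-task point in the commit message above, the following sketch counts how many watcher timers exist under each arrangement. It is a counting model only, under the assumption that one timer is registered per stack. ToyTimerRegistry, choose_launcher and the two timers_* functions are hypothetical and are not part of Heat or oslo; only the launcher names come from the commit message.

```python
class ToyTimerRegistry:
    """Hypothetical stand-in that only records the timers added to it."""

    def __init__(self):
        self.timers = []

    def add_timer(self, stack_id):
        self.timers.append(stack_id)


def choose_launcher(num_engine_workers):
    """Name the launcher the commit message associates with each setting."""
    if num_engine_workers > 1:
        return "ProcessLauncher"
    return "ServiceLauncher"


def timers_when_workers_register(stack_ids, num_workers):
    """Every worker registers one watcher timer per stack."""
    registry = ToyTimerRegistry()
    for _ in range(num_workers):
        for stack_id in stack_ids:
            registry.add_timer(stack_id)
    return len(registry.timers)


def timers_when_parent_registers(stack_ids):
    """Only the parent registers timers, so the worker count is irrelevant."""
    registry = ToyTimerRegistry()
    for stack_id in stack_ids:
        registry.add_timer(stack_id)
    return len(registry.timers)
```

For three stacks and four workers, the first arrangement yields 12 timers and the second yields 3, which is the duplication the change is meant to avoid.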

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2014-06-11
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-10-16
Changed in heat:
milestone: juno-1 → 2014.2