Periodic tasks are performed on all workers

Bug #1702349 reported by Mark Goddard
This bug affects 1 person
Affects: Magnum
Status: Fix Released
Importance: Undecided
Assigned to: Spyros Trigazis

Bug Description

Magnum conductor performs periodic tasks to synchronise magnum's state with that of other services, such as Heat. These tasks run at an interval set by [DEFAULT] periodic_interval_max, which defaults to 60 seconds.

The magnum conductor service forks a number of worker processes to handle RPC requests. The number of workers is determined by [conductor] workers, or the number of CPU cores if unset.
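The two options described above look like this in magnum.conf (the values shown are illustrative, not recommendations):

```ini
[DEFAULT]
# Maximum interval, in seconds, between periodic task runs (default 60).
periodic_interval_max = 60

[conductor]
# Number of conductor worker processes; defaults to the CPU core count.
workers = 4
```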

Based on the magnum conductor logs, each worker runs the periodic tasks, and their runs appear to be temporally aligned. For example, if a cluster fails during creation due to the familiar 'No valid host was found' failure, the logs will typically contain many copies of this message:

2017-07-04 19:41:04.454 53 ERROR magnum.drivers.heat.driver [req-a9a2942d-ef71-41b3-83b0-f8a4ac5f90f5 - - - - -] Cluster error, stack status: CREATE_FAILED, stack_id: 6549d2e2-d4a8-47ad-b07e-bb28a0b5a0a1, reason: Resource CREATE failed: ResourceInError: resources.kube_masters.resources[0].resources.kube-master: Went to status ERROR due to "Message: No valid host was found. There are not enough hosts available., Code: 500"

Clearly this is not ideal: many workers perform the same task at the same time, all with the same outcome, which includes spamming the logs with identical error messages.

The periodic task execution framework should be updated to distribute tasks among conductor services on multiple hosts, and multiple worker processes within each conductor service.

Revision history for this message
Mark Goddard (mgoddard) wrote :

This does not occur to the same degree on ocata, as multi-worker support [1] was added during the pike development cycle. In ocata, conductor services running on multiple hosts still all run the same periodic update jobs, which is inefficient, but not nearly as bad as having every worker process run them.

A simple solution that would at least restore the ocata behaviour would be to execute periodic updates only in the parent process, not in the workers.

[1] https://blueprints.launchpad.net/magnum/+spec/magnum-multiple-process-workers

Changed in magnum:
assignee: nobody → Spyros Trigazis (strigazi)
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to magnum (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/505130

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (master)

Reviewed: https://review.openstack.org/499739
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=8ce15c4510b34c1cb659afc6a454561c21b7b8a8
Submitter: Jenkins
Branch: master

commit 8ce15c4510b34c1cb659afc6a454561c21b7b8a8
Author: Mohammed Naser <email address hidden>
Date: Thu Aug 31 13:15:26 2017 -0400

    Avoid running periodic processes inside each worker process

    The periodic jobs are currently getting registered per each worker
    which means that in cases with large number of workers, the APIs
    for services such as Heat and Keystone will be hit very hard.

    This patch resolves this issue by registering the jobs only to the
    main process, ensuring that they run once per instance (or group
    of workers).

    Closes-Bug: #1702349

    Change-Id: If9e13effc14fd35e646d02bb4f568e79786aa958

Changed in magnum:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to magnum (stable/pike)

Reviewed: https://review.openstack.org/505130
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=9c7a0c4b8ad8a8780f8188a1f8601e7b63f18fa9
Submitter: Jenkins
Branch: stable/pike

commit 9c7a0c4b8ad8a8780f8188a1f8601e7b63f18fa9
Author: Mohammed Naser <email address hidden>
Date: Thu Aug 31 13:15:26 2017 -0400

    Avoid running periodic processes inside each worker process

    The periodic jobs are currently getting registered per each worker
    which means that in cases with large number of workers, the APIs
    for services such as Heat and Keystone will be hit very hard.

    This patch resolves this issue by registering the jobs only to the
    main process, ensuring that they run once per instance (or group
    of workers).

    Closes-Bug: #1702349

    Change-Id: If9e13effc14fd35e646d02bb4f568e79786aa958
    (cherry-pick from 867369f85869813545c71e256ee44500ad63fb75)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to magnum (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/539240

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to magnum (master)

Reviewed: https://review.openstack.org/539240
Committed: https://git.openstack.org/cgit/openstack/magnum/commit/?id=d11f87d0ca42cb9dc3cb0923d02c9049b9b5b8bd
Submitter: Zuul
Branch: master

commit d11f87d0ca42cb9dc3cb0923d02c9049b9b5b8bd
Author: Spyros Trigazis <email address hidden>
Date: Tue Jan 30 15:31:04 2018 +0000

    Start RPC service before waiting

    Stopping magnum-cond without having invoked start()
    results in "WARNING oslo_messaging.server Possible
    hang: stop is waiting for start to complete".

    A magnum instance with 16 workers was taking one minute to
    stop; with this change it takes 1 to 10 seconds. This change
    doesn't break the fix in [1].

    [1] If9e13effc14fd35e646d02bb4f568e79786aa958

    Related-Bug: #1702349
    Related issue in sahara:
    Related-Bug: #1546119

    Change-Id: Ied7ab43398d4e499514fa0bd5dba64971d1956bf
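The hang the commit describes can be modelled with a toy server in which stop() blocks until start() has completed (an illustration of the described behaviour, not oslo.messaging's actual implementation):

```python
import threading

class ToyServer:
    # Toy model of the behaviour the commit message describes: stop()
    # blocks until start() has completed. Not oslo.messaging's real code.
    def __init__(self):
        self._started = threading.Event()

    def start(self):
        self._started.set()

    def stop(self, timeout):
        # Returns False on timeout, i.e. the "stop is waiting for
        # start to complete" hang.
        return self._started.wait(timeout)

server = ToyServer()
stopped_without_start = server.stop(timeout=0.05)  # False: would hang
server.start()
stopped_after_start = server.stop(timeout=0.05)    # True: returns at once
```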

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 6.0.0

This issue was fixed in the openstack/magnum 6.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/magnum 5.0.2

This issue was fixed in the openstack/magnum 5.0.2 release.
