networking-odl

Journal multi threading causes neutron server to become unresponsive

Bug #1683797 reported by Mike Kolesnik on 2017-04-18


Affects	Status	Importance	Assigned to	Milestone
networking-odl	Fix Released	Critical	Mike Kolesnik

Bug Description

Scale testing showed that the number of journal threads is unbounded. On a node with many cores, the API workers and RPC workers each default to the number of cores, so a 50-core machine runs 100 neutron server processes. Each such process has at least one journal thread, and possibly several, depending on how many V2 drivers are in use, since each driver instantiates its own thread.
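
To make the arithmetic concrete, here is a minimal sketch of the estimate described above. The function name and the `v2_drivers` parameter are hypothetical and are not part of neutron or networking-odl; the sketch only assumes what the report states, namely that API and RPC workers each default to the core count and that each V2 driver in a process starts one journal thread.

```python
def estimate_journal_threads(cores, v2_drivers=1):
    """Estimate neutron-server processes and journal threads on one node.

    Assumption (taken from the report): API workers and RPC workers each
    default to the core count, and every V2 driver loaded in a process
    starts its own journal thread.
    """
    api_workers = cores
    rpc_workers = cores
    processes = api_workers + rpc_workers
    threads = processes * v2_drivers
    return processes, threads


# The 50-core machine from the report, with a single V2 driver per process.
processes, threads = estimate_journal_threads(cores=50)
print(processes, threads)
```

With more than one V2 driver per process, the thread count grows by the same factor while the process count stays the same.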

This is already suboptimal, and the sheer number of journal threads also causes the DB to misbehave: the threads either query the journal table constantly or exhaust all available DB connections (on the server side).

Each time an operation occurs, it is written to the DB and the thread is woken to handle it right afterwards. If there is a post-commit hook, the journal entry is processed immediately. If there is no such hook point, there can be a race in which the journal "misses" the entry.
Every 5 seconds (the default configuration), a timer wakes the thread to handle any such missed entries, as well as entries that were not handled because of conditions such as loss of network connectivity (which halts journal processing to avoid a busy loop).
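
The wake-up behaviour described above can be sketched roughly as follows. This is not the networking-odl implementation: the class `JournalThread` and the callables `fetch_pending` and `process_entry` are hypothetical stand-ins, and the sketch only assumes a thread that is woken either by an operation callback or by a periodic timer.

```python
import threading


class JournalThread:
    """Hypothetical journal thread woken by a callback or by a timer."""

    def __init__(self, fetch_pending, process_entry, interval=5.0):
        self._fetch_pending = fetch_pending  # returns a list of pending entries
        self._process_entry = process_entry  # handles a single entry
        self._interval = interval            # timer period in seconds
        self._wakeup = threading.Event()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def notify(self):
        """Called by the operation callback right after a journal write."""
        self._wakeup.set()

    def stop(self):
        self._stopped.set()
        self._wakeup.set()
        self._thread.join()

    def _run(self):
        while not self._stopped.is_set():
            # Returns early on notify(), otherwise when the interval elapses,
            # which is what picks up entries missed by a callback.
            self._wakeup.wait(timeout=self._interval)
            self._wakeup.clear()
            if self._stopped.is_set():
                break
            for entry in self._fetch_pending():
                self._process_entry(entry)
```

Because the event is cleared before the pending entries are fetched, an entry written just before `notify()` is still picked up on that pass or the next one.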

There is no need for more than one thread per process: a single thread is woken either by the operation callback or by the timer, and processes every journal entry it can.
Also, since Python threads do not run in true parallel, virtual multi-threading makes little sense in this context.
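
One way to picture "one journal thread per process" is a process-wide registry that every driver shares. This is only an illustration of the idea and not the fix that was eventually merged (which moved periodic processing to a worker); `JournalThreadRegistry` and `factory` are hypothetical names.

```python
import threading


class JournalThreadRegistry:
    """Hand out a single shared journal thread object per process."""

    def __init__(self):
        self._lock = threading.Lock()
        self._instance = None

    def get(self, factory):
        """Return the shared instance, creating it on first use.

        `factory` is any zero-argument callable that builds the journal
        thread; it is called at most once per registry.
        """
        with self._lock:
            if self._instance is None:
                self._instance = factory()
            return self._instance


# Module-level registry: each process gets its own copy.
registry = JournalThreadRegistry()
```

Every V2 driver would then call `registry.get(...)` instead of creating its own thread, so the thread count per process stays at one regardless of how many drivers are loaded.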


OpenStack Infra (hudson-openstack) wrote on 2017-07-25: Change abandoned on networking-odl (master)

Change abandoned by Mike Kolesnik (<email address hidden>) on branch: master
Review: https://review.openstack.org/444648
Reason: Alternative solution was agreed upon


OpenStack Infra (hudson-openstack) wrote on 2017-07-31: Fix merged to networking-odl (master)

Reviewed: https://review.openstack.org/486606
Committed: https://git.openstack.org/cgit/openstack/networking-odl/commit/?id=f0708d838cdee61c5469002bad00e6bfa826c433
Submitter: Jenkins
Branch: master

commit f0708d838cdee61c5469002bad00e6bfa826c433
Author: Mike Kolesnik <email address hidden>
Date: Mon Jul 24 16:14:08 2017 +0300

Move journal periodic processing to a worker

    Since it's not necessary to run a plethora of timers all the time,
    moving them out to a worker which will be spawned in its own process.
    This solves the bug and will allow in the future, should the need arise,
    to spawn multiple such workers.

Closes-Bug: #1683797
Change-Id: Ia5f9607c28e8c446f3260d9367b3df970fb1a19c

Changed in networking-odl:
status:	In Progress → Fix Released


OpenStack Infra (hudson-openstack) wrote on 2017-08-11: Fix included in openstack/networking-odl 11.0.0.0rc1

This issue was fixed in the openstack/networking-odl 11.0.0.0rc1 release candidate.
