nailgund bogs down on large number of nodes

Bug #1274614 reported by Andrew Woodward
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Nikolay Markov
Milestone: (not listed)

Bug Description

nailgund bogs down at around 50 nodes; beyond 90 nodes the process is always at 100% CPU and sometimes doesn't update nodes in a timely manner.

We need to reconfigure nailgund to run multiple worker threads. The number of workers should be configurable and scale up to roughly 4x the number of CPU cores.
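
As a rough illustration of the sizing rule above, the worker count could be derived as follows. This is only a sketch: `worker_count`, its `configured` override, and the `multiplier` parameter are hypothetical names, not actual nailgun settings.

```python
import os


def worker_count(multiplier=4, configured=None, cores=None):
    """Return how many workers to start.

    `configured` is a hypothetical explicit setting; when it is None the
    count falls back to `multiplier` times the number of CPU cores.
    """
    if configured is not None:
        if configured < 1:
            raise ValueError("worker count must be at least 1")
        return configured
    if cores is None:
        # os.cpu_count() can return None when the count is undetermined.
        cores = os.cpu_count() or 1
    return max(1, multiplier * cores)
```

On a 2-core master node with no override, `worker_count(cores=2)` yields 8 workers, which matches the "8 instances" figure mentioned later in the thread.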

Changed in fuel:
milestone: none → 4.1
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/70264

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/70270

Evgeniy L (rustyrobot) wrote : Re: nailgund needs to be running multiple threads

I don't think it's OK to run several instances of nailgun; it may cause several problems:
1. Each instance of nailgun has a keep_alive thread; we don't need 8 instances of this thread.
2. Each instance of nailgun has an RPC thread that listens on RabbitMQ and receives messages from the orchestrator.

So configuring several instances of nailgun is OK only as a quick hack.

How to solve it
1. Refactor the nodes collection handler, e.g. make a separate handler for the agent and don't update the DB state if the node hasn't changed.
2. To reduce the overhead I described above, we can use this patch from services (but we need to test it and fix the Puppet manifests): https://review.openstack.org/#/c/54930/

Our new engineer has already started working on the first item; you can track the status here: https://blueprints.launchpad.net/fuel/+spec/nailgun-agent-handler
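
To make the first item concrete, here is a minimal sketch of skipping the DB write when an agent reports unchanged data. It is not the code from the blueprint: `AgentDataCache` and `handle_agent_update` are hypothetical names, and a plain dict stands in for the database.

```python
import hashlib
import json


class AgentDataCache:
    """Remember a digest of the last payload seen per node (hypothetical helper)."""

    def __init__(self):
        self._digests = {}

    def changed(self, node_id, payload):
        """Return True and record the digest if payload differs from the last one."""
        # sort_keys makes the digest independent of dict key order.
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(encoded).hexdigest()
        if self._digests.get(node_id) == digest:
            return False
        self._digests[node_id] = digest
        return True


def handle_agent_update(cache, db, node_id, payload):
    """Write to `db` (a plain dict standing in for the database) only on change."""
    if not cache.changed(node_id, payload):
        return False  # identical data: skip the write
    db[node_id] = payload
    return True
```

Note that an in-memory cache like this is per process, so with several nailgun instances each one would keep its own copy.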

Ryan Moe (rmoe) wrote :

I agree completely. This is just a short-term workaround that has been successfully used on two large deployments. At least now the workaround is documented somewhere.

tags: added: customer-found
Changed in fuel:
importance: Undecided → Medium
summary: - nailgund needs to be running multiple threads
+ nailgund bogs down on large number of nodes
Mike Scherbakov (mihgen) wrote :

Assigning this to Dmitry, as he is doing the "right" implementation for this issue now. I hope we can get it into 4.1. Dmitry, please make sure your patch contains "Closes-Bug: #1274614" in the git commit message.

Changed in fuel:
assignee: Ryan Moe (rmoe) → Dmitry Sokolov (demon-mhm)
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: 4.1 → 5.0
Andrew Woodward (xarses) wrote :

The patch (even if fixed) is too massive to merge this late in the cycle. Since the handlers were separated, which caused https://review.openstack.org/#/c/70270/ to be -1'd, I've asked Ryan to look into whether we can add this back in so that we have a usable workaround for the load in 5.0.

tags: removed: multi-l3
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/76831

Dmitry Sokolov (demon-mhm) wrote :

I've performed some stress tests for my upcoming patch https://review.openstack.org/#/c/76831/, which introduces a dedicated agent handler for Nailgun and some improvements to the agent code. Sad to say, the new handler didn't show any speedup. On the contrary, even with caching, speed dropped by 10% compared with the old handler. The patch brings only one improvement: agents will try to update the node state first, then register only if the update attempt returns 404. This will cut the number of requests to the master node roughly in half. Also, a separate handler for agents will let us optimize agent request processing without fear of harming other handlers.
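
The update-first logic described above can be sketched like this. It is an illustration rather than the agent code from the patch: `report_node` is a hypothetical name, and `update` and `register` are assumed to be callables that each send one request and return the HTTP status code.

```python
def report_node(update, register, node_data):
    """Try to update first; register only if the server says the node is unknown.

    `update` and `register` are hypothetical callables that send one request
    each and return the HTTP status code.
    """
    status = update(node_data)
    if status == 404:
        # Node not known to the master yet: fall back to registration.
        return "registered", register(node_data)
    return "updated", status
```

Once a node is registered, every later report costs a single update request; the second request happens only on the first contact.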

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/76831
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=59ed8c081d5065046ee638ebfc74e4dab2ff0677
Submitter: Jenkins
Branch: master

commit 59ed8c081d5065046ee638ebfc74e4dab2ff0677
Author: demon.mhm <email address hidden>
Date: Thu Feb 27 15:38:21 2014 +0400

    Reduced database overhead from agents

     - added dedicated handler for node agents update only requests
     - caching data from agents to avoid db update with same data
     - nailgun responses with appropriate http statuses
     - changed agent update logic. Now it tries to update first and respects
       nailgun response statuses

    Change-Id: I2658cf7561cd8c9116acced2443d072d471f3bdb
    Implements: blueprint nailgun-agent-handler
    Closes-Bug: #1274614

Dmitry Pyzhov (dpyzhov)
tags: added: backports-4.1.1
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 5.0 → 4.1.1
status: Fix Committed → Triaged
assignee: Dmitry Sokolov (demon-mhm) → Fuel Python Team (fuel-python)
Ihor Kalnytskyi (ikalnytskyi) wrote :

Has anyone profiled our WSGI instance? Do we know the bottleneck of our app?
The Werkzeug project has some tools for profiling it. I can run some tests if we don't know the slowest part of the node handler.
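
For reference, per-request profiling of a WSGI app can be sketched with the standard library alone. This uses `cProfile` as a stand-in for the Werkzeug tooling mentioned above, not Werkzeug itself; `ProfilingMiddleware` is a hypothetical name.

```python
import cProfile
import io
import pstats


class ProfilingMiddleware:
    """Wrap a WSGI app and collect a cProfile report for every request."""

    def __init__(self, app, top=10):
        self.app = app
        self.top = top  # number of entries to keep in each report
        self.reports = []  # one text report per request

    def __call__(self, environ, start_response):
        def run():
            result = self.app(environ, start_response)
            try:
                # list() forces the body to be produced while profiling.
                return list(result)
            finally:
                close = getattr(result, "close", None)
                if close is not None:
                    close()

        profiler = cProfile.Profile()
        body = profiler.runcall(run)
        out = io.StringIO()
        stats = pstats.Stats(profiler, stream=out)
        stats.sort_stats("cumulative").print_stats(self.top)
        self.reports.append(out.getvalue())
        return body
```

Wrapping the app object with `ProfilingMiddleware(app)` and sending a batch of agent-style requests would show which calls dominate cumulative time in the node handler.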

Dmitry Pyzhov (dpyzhov) wrote :

Postponed till 4.1.2

Mike Scherbakov (mihgen)
tags: added: release-notes
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Nikolay Markov (nmarkov)
Meg McRoberts (dreidellhasa) wrote :

Marked as Fixed Issue in 5.0 Release Notes

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (stable/4.1)

Fix proposed to branch: stable/4.1
Review: https://review.openstack.org/94178

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (stable/4.1)

Reviewed: https://review.openstack.org/94178
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=318af37b9d9b1b14fc2e938029bb2cdc1421c4ef
Submitter: Jenkins
Branch: stable/4.1

commit 318af37b9d9b1b14fc2e938029bb2cdc1421c4ef
Author: demon.mhm <email address hidden>
Date: Thu Feb 27 15:38:21 2014 +0400

    Reduced database overhead from agents

     - added dedicated handler for node agents update only requests
     - caching data from agents to avoid db update with same data
     - nailgun responses with appropriate http statuses
     - changed agent update logic. Now it tries to update first and respects
       nailgun response statuses

    Implements: blueprint nailgun-agent-handler
    Closes-Bug: #1274614

    Conflicts:
     nailgun/nailgun/test/unit/test_node_nic_handler.py

    Change-Id: I2658cf7561cd8c9116acced2443d072d471f3bdb

Dmitry Pyzhov (dpyzhov)
Changed in fuel:
status: Triaged → Fix Committed