on restart, contrail-control hogs CPU for hours on scale setup

Bug #1463958 reported by Vedamurthy Joshi
Affects              Status          Importance   Assigned to   Milestone
Juniper Openstack    Status tracked in Trunk
  R2.20              Fix Committed   Medium       Tapan Karwa
  Trunk              Fix Committed   Medium       Tapan Karwa

Bug Description

R2.20 Build 45 Ubuntu 14.04 Juno multi-node setup

In this scale setup, there are 64 tor-agents on 4 tor-agent nodes.
After contrail-control was restarted on the three control nodes, one of the control nodes, which serves XMPP config to a large number of tor-agents, hogged CPU for hours.

If no agents/tor-agents are subscribed, the control node stabilizes in < 10 min.
Starting the tor-agents afterwards also stabilizes the setup quickly (< 10 min).

Per Nischal, most of the time is being spent in the graph walk in the control node.

env.roledefs = {
    'all': [host2, host3, host4, host5, host6, host7, host8, host9],
    'cfgm': [host2, host3, host4],
    'openstack': [host2, host3, host4],
    'webui': [host3],
    'control': [host2, host3, host4],
    'compute': [host5, host6, host7, host8, host9],
    'collector': [host2, host3, host4],
    'database': [host2, host3, host4],
    'toragent': [host5, host6, host7, host9],
    'tsn': [host5, host6, host7, host9],
    'build': [host_build],
}

env.hostnames = {
    'all': ['nodei34', 'nodei35', 'nodei36', 'nodei37', 'nodei38', 'nodei28', 'nodei27', 'nodei30']
}

tags: added: quench
Nischal Sheth (nsheth) wrote :

A partial fix is to modify control node behavior to not accept xmpp
connections till it has received all nodes and links from irond. This
heuristic/behavior is tracked by bug 1446869. This should prevent
multiple repeated graph walks.

Further improvements will require optimizing the graph walker to
reduce the amount of CPU used.

Both of these are non-trivial changes and will be addressed in the
next release i.e. 2.21.
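
The deferral heuristic described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not contrail-control code: the class `ToyControlNode`, its methods, and the `end_of_config` signal are hypothetical names, and the model assumes that every config update triggers one graph walk per connected agent. Real walker behavior is likely more nuanced.

```python
# Toy model only. ToyControlNode, connect(), add_link() and end_of_config()
# are hypothetical names for illustration, not contrail-control APIs.
from collections import deque


class ToyControlNode:
    def __init__(self, defer_until_config_complete):
        self.defer = defer_until_config_complete
        self.graph = {}               # adjacency sets for config nodes
        self.config_complete = False  # set once all nodes/links have arrived
        self.pending = []             # agents waiting to be accepted
        self.agents = []              # accepted agents (graph walk roots)
        self.walks = 0                # number of graph walks performed

    def add_link(self, a, b):
        """Receive one link from the config source, then re-walk per agent."""
        self.graph.setdefault(a, set()).add(b)
        self.graph.setdefault(b, set()).add(a)
        # Assumption: each update re-walks the graph for every connected agent.
        for agent in self.agents:
            self._walk(agent)

    def connect(self, agent):
        """Agent tries to connect; deferred if config is still incomplete."""
        if self.defer and not self.config_complete:
            self.pending.append(agent)
            return False
        self.agents.append(agent)
        self._walk(agent)
        return True

    def end_of_config(self):
        """All config received: accept the deferred agents, one walk each."""
        self.config_complete = True
        for agent in self.pending:
            self.agents.append(agent)
            self._walk(agent)
        self.pending.clear()

    def _walk(self, root):
        """Breadth-first walk from an agent's node over the config graph."""
        self.walks += 1
        seen = {root}
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nbr in self.graph.get(node, ()):
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        return seen


def simulate(defer, num_agents=4, num_links=10):
    """Agents reconnect right after restart, before config has arrived."""
    cn = ToyControlNode(defer)
    agents = [f"agent-{i}" for i in range(num_agents)]
    for agent in agents:
        cn.connect(agent)
    for i in range(num_links):
        cn.add_link(agents[i % num_agents], f"vn-{i}")
    cn.end_of_config()
    return cn.walks


print("walks without deferral:", simulate(defer=False))
print("walks with deferral:   ", simulate(defer=True))
```

In this model the number of walks without deferral grows with (agents x config updates), while with deferral it is one walk per agent. That matches the intent stated in this comment, namely avoiding repeated graph walks, but it does not reduce the cost of an individual walk.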

Vedamurthy Joshi (vedujoshi) wrote :

As of 2.20 Build 84, on config restart with 2 control nodes active, one of the nodes took about 7 hrs to settle down; the other took 10 mins.

Nischal Sheth (nsheth) wrote :

We've implemented fixes for bug 1446869, bug 1488278, bug 1488636,
bug 1474165 and bug 1483849. The net result should be that CN will
not accept xmpp connections till it has received the entire IFMap database.

This ought to bring down the time required for CN to go into quiescent
state from many hours to tens of minutes. Some of the fixes have gone
in over the last 2-3 days, so can you retry with the latest build? Let's test
restart of 1 CN (which serves config for lots of agents) and get repeatable
good results for this scenario.

We haven't been able to implement any graph walker optimizations yet.
Those will most likely not make it for 2.22 either.

Vedamurthy Joshi (vedujoshi) wrote :

With the latest 2.20 Build 95 image, the control node is seen to stabilize in ~20 mins.

tags: added: contrail-control
removed: bgp
Nischal Sheth (nsheth) wrote :

Will update status to Fix Committed.
Any further improvements will be tracked via a new bug.
