On CentOS, api-server hogs CPU during tests, affecting sanity runs

Bug #1385658 reported by Vedamurthy Joshi
This bug affects 1 person
Affects                 Status          Importance  Assigned to    Milestone
Juniper Openstack       (status tracked in Trunk)
  R2.0                  Fix Committed   High        Hampapur Ajay
  R2.1                  Won't Fix       High        Hampapur Ajay
  Trunk                 Fix Committed   High        Hampapur Ajay

Bug Description

Build 2.0 2413, CentOS 6.5, single-node Icehouse setup

During parallel sanity runs, contrail-api keeps hogging 100% CPU, which affects most of the test cases. Ajay debugged the issue. The thread is below:

From: Hampapur Ajay <email address hidden>
Date: Saturday, October 25, 2014 at 12:28 AM
To: Nagabhushana R <email address hidden>, Vedamurthy Joshi <email address hidden>
Cc: Contrail Systems Configuration Team <email address hidden>, Sandip Dey <email address hidden>, Ganesha H V <email address hidden>
Subject: Re: Is anyone available to debug an API Server issue now ?

Another issue on CentOS 6.5 that leads to the api-server CPU hog (as on nodec27, which we just debugged, Vedu) is that all HTTP headers are read 1 byte at a time from python/gevent instead of within libevent (see "Backport evbuffer_readln()" in [1]). We are using libevent 1.4.13-4, which is the latest available on CentOS 6.5. The issue doesn't seem to occur on Ubuntu.

thanks
ajay

[1] https://raw.githubusercontent.com/libevent/libevent/release-1.4.14b-stable/ChangeLog

recvfrom(111, "H", 1, 0, NULL, NULL) = 1
recvfrom(111, "T", 1, 0, NULL, NULL) = 1
recvfrom(111, "T", 1, 0, NULL, NULL) = 1
recvfrom(111, "P", 1, 0, NULL, NULL) = 1
recvfrom(111, "/", 1, 0, NULL, NULL) = 1
recvfrom(111, "1", 1, 0, NULL, NULL) = 1
recvfrom(111, ".", 1, 0, NULL, NULL) = 1
recvfrom(111, "1", 1, 0, NULL, NULL) = 1
recvfrom(111, " ", 1, 0, NULL, NULL) = 1
recvfrom(111, "2", 1, 0, NULL, NULL) = 1
recvfrom(111, "0", 1, 0, NULL, NULL) = 1
recvfrom(111, "0", 1, 0, NULL, NULL) = 1
recvfrom(111, " ", 1, 0, NULL, NULL) = 1
recvfrom(111, "O", 1, 0, NULL, NULL) = 1
recvfrom(111, "K", 1, 0, NULL, NULL) = 1
recvfrom(111, "\r", 1, 0, NULL, NULL) = 1
recvfrom(111, "\n", 1, 0, NULL, NULL) = 1
recvfrom(111, "C", 1, 0, NULL, NULL) = 1
recvfrom(111, "o", 1, 0, NULL, NULL) = 1
recvfrom(111, "n", 1, 0, NULL, NULL) = 1
recvfrom(111, "t", 1, 0, NULL, NULL) = 1
recvfrom(111, "e", 1, 0, NULL, NULL) = 1
recvfrom(111, "n", 1, 0, NULL, NULL) = 1
recvfrom(111, "t", 1, 0, NULL, NULL) = 1
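
The trace above shows one recvfrom() system call per header byte. As a rough illustration only (this is not the gevent or libevent code), the Python 3 sketch below counts recv() calls for a byte-at-a-time header reader versus a buffered one over a local socket pair. The names `CountingSocket`, `read_headers_bytewise`, `read_headers_buffered`, `compare_readers`, and the sample response are hypothetical and exist only for this example.

```python
import socket

HEADER_END = b"\r\n\r\n"


class CountingSocket:
    """Wraps a socket and counts recv() calls (a stand-in for what strace shows)."""

    def __init__(self, sock):
        self._sock = sock
        self.calls = 0

    def recv(self, bufsize):
        self.calls += 1
        return self._sock.recv(bufsize)


def read_headers_bytewise(sock):
    """Read up to the blank line that ends the headers, one byte per recv()."""
    data = b""
    while not data.endswith(HEADER_END):
        chunk = sock.recv(1)
        if not chunk:  # peer closed the connection
            break
        data += chunk
    return data


def read_headers_buffered(sock, bufsize=4096):
    """Read in larger chunks and split off the header block afterwards."""
    data = b""
    while HEADER_END not in data:
        chunk = sock.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        data += chunk
    end = data.find(HEADER_END)
    return data if end < 0 else data[: end + len(HEADER_END)]


def compare_readers():
    """Return {reader name: (recv call count, header bytes)} for both readers."""
    # Made-up response, used only to have something to parse.
    response = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/json\r\n"
        b"Content-Length: 2\r\n"
        b"\r\n"
        b"{}"
    )
    results = {}
    readers = (("bytewise", read_headers_bytewise),
               ("buffered", read_headers_buffered))
    for name, reader in readers:
        sender, receiver = socket.socketpair()
        try:
            sender.sendall(response)
            counting = CountingSocket(receiver)
            headers = reader(counting)
            results[name] = (counting.calls, headers)
        finally:
            sender.close()
            receiver.close()
    return results


if __name__ == "__main__":
    for name, (calls, headers) in compare_readers().items():
        print("%s: %d recv() calls for %d header bytes" % (name, calls, len(headers)))
```

The byte-at-a-time reader needs as many recv() calls as there are header bytes, which is the pattern visible in the strace output; the buffered reader needs far fewer.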

On Oct 20, 2014, at 10:57 AM, Nagabhushana R wrote:

Also, some bugs that point to similar behavior are:

#1372858 Mainline 2338:VM interface taking almost 40 sec to come up after vmi was created - causing some intermittent vm verification failure in sanity
#1373831 Main line 2340 Centos havana: Its taking more than 20 secs to update ipam objects in if map

On Oct 18, 2014, at 9:55 PM, Hampapur Ajay <email address hidden> wrote:

One difference between the 2 systems is that the slower one (nodec27) runs CentOS and therefore Python 2.6, while the faster one (nodec3) runs Ubuntu and therefore Python 2.7. This seems to make quite a bit of difference, see below (to validate further, we can test the 2 images on the same distribution).

python 2.6
[root@nodec27 ~]# python -m timeit "import json; json.dumps({'a': 1, 'b': 2})"
100000 loops, best of 3: 13.4 usec per loop
[root@nodec27 ~]#

python 2.7
root@nodec3:~# python -m timeit "import json; json.dumps({'a': 1, 'b': 2})"
100000 loops, best of 3: 4.7 usec per loop
root@nodec3:~#

thanks
ajay
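
The json.dumps comparison above can also be run as a script, so the same measurement is repeatable on both nodes. This is a minimal sketch, assuming only the standard `json` and `timeit` modules; the function name `time_json_dumps` is hypothetical. Unlike the one-liner in the thread, it keeps the `import json` outside the timed statement, so absolute numbers may differ slightly from those quoted.

```python
import json
import sys
import timeit


def time_json_dumps(number=100000, repeat=3):
    """Return the best time per json.dumps call, in microseconds."""
    payload = {'a': 1, 'b': 2}  # same dictionary as in the thread
    timer = timeit.Timer(lambda: json.dumps(payload))
    best_total = min(timer.repeat(repeat=repeat, number=number))
    return best_total / number * 1e6


if __name__ == "__main__":
    usec = time_json_dumps()
    print("python %d.%d: %.2f usec per loop"
          % (sys.version_info[0], sys.version_info[1], usec))
```

Running the same script on both distributions (or both Python versions on one distribution) isolates the interpreter from the other differences between the two nodes.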

On Oct 18, 2014, at 8:38 AM, Vedamurthy Ananth Joshi wrote:

Ajay,
It's a little bit better now, with ‘net-list’ taking 1s.
I set up another single-node setup (nodec3, with the same HW config as nodec27) running 1.20 Build 59. ‘net-list’ took ~0.4 sec.
The CRUD tests took 5 mins on that node. CPU usage of contrail-api never went above 60% during this time.

On nodec27 (2.0 build), the same CRUD tests took 21 mins. CPU usage of contrail-api was always above 90% during this time.

From: Hampapur Ajay <email address hidden>
Date: Saturday, October 18, 2014 at 4:59 AM
To: Vedamurthy Joshi <email address hidden>
Cc: Contrail Systems Configuration Team <email address hidden>, Sandip Dey <email address hidden>, Ganesha H V <email address hidden>
Subject: Re: Is anyone available to debug an API Server issue now ?

Still looking at it and running the CRUD test but so far:

I think the 2s is due to memory pressure + swapping. I reduced the number of workers for nova-api and nova-conductor from 40 to 4 and restarted them, and net-list took ~1s (similar to what I see on my system).

While running the CRUD test I saw *lots* of get_port APIs being issued. Will try to sync up with you on Skype to understand and debug further.

thanks
ajay

On Oct 17, 2014, at 9:04 AM, Hampapur Ajay wrote:

Will look in a few minutes and let you know once done.

Thanks
Ajay

On Oct 17, 2014, at 8:06 AM, "Vedamurthy Ananth Joshi" <email address hidden> wrote:

nodec27 is in same state

Let me know if you are using the setup.

[root@nodec27 ~]# time neutron net-list > /dev/null

real 0m2.411s << On a normal setup, it is ~0.5s
user 0m0.238s
sys 0m0.034s
[root@nodec27 ~]#

The router CRUD tests, which usually take 6 mins, are taking more than 30 mins.
To run them:

cd ~/github/mine4/contrail-test
export PYTHONPATH=$PATH:$PWD:$PWD/scripts:$PWD/fixtures
python -m testtools.run scripts.neutron.test_crud.TestCRUD

While running these tests, contrail-api is always close to 100% CPU.
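
A single `time neutron net-list` run can be noisy when the server spikes only every few seconds, so it may help to time the command several times. The sketch below is one way to do that with the standard library; the helper name `time_command` is hypothetical, and the `neutron net-list` invocation in the usage block assumes the neutron CLI is on PATH as it is on nodec27.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run argv `runs` times and return the wall-clock duration of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded, like "> /dev/null" in the thread.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        samples.append(time.perf_counter() - start)
    return samples


if __name__ == "__main__":
    # Assumption: the neutron CLI is installed; pass another command to override.
    command = sys.argv[1:] or ["neutron", "net-list"]
    samples = time_command(command)
    print("min %.3fs  max %.3fs  over %d runs"
          % (min(samples), max(samples), len(samples)))
```

Comparing the minimum and maximum across runs shows whether the slowdown is constant or tied to the periodic CPU spikes.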

From: Vedamurthy Joshi <email address hidden>
Date: Friday, October 17, 2014 at 11:51 AM
To: Contrail Systems Configuration Team <email address hidden>
Cc: Sandip Dey <email address hidden>
Subject: Is anyone available to debug an API Server issue now ?

Is anyone available to debug an API Server issue now? After running some parallel tests, API Server responses are very slow.

Just a net-list command with 10 VNs and 10 projects is taking ~3 seconds, and API Server hogs close to 100% CPU every few seconds.

OpenContrail Admin (ci-admin-f) wrote :

Sandip to try increasing the HAProxy timeout by a factor of 10.
Ajay will send a 1-line patch as well.

tags: added: releasenote
removed: blocker
OpenContrail Admin (ci-admin-f) wrote :

This has been improved via many commits. Please retest.

Changed in juniperopenstack:
status: New → Fix Committed