On Centos, api-server hogs CPU during tests affecting sanity runs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
R2.0 | Fix Committed | High | Hampapur Ajay |
R2.1 | Won't Fix | High | Hampapur Ajay |
Trunk | Fix Committed | High | Hampapur Ajay |
Bug Description
Build 2.0 2413, CentOS 6.5, single-node Icehouse setup
During parallel sanity runs, contrail-api keeps using 100% CPU, which affects most of the test cases. Ajay debugged the issue. The email thread is below:
From: Hampapur Ajay <email address hidden>
Date: Saturday, October 25, 2014 at 12:28 AM
To: Nagabhushana R <email address hidden>, Vedamurthy Joshi <email address hidden>
Cc: Contrail Systems Configuration Team <email address hidden>, Sandip Dey <email address hidden>, Ganesha H V <email address hidden>
Subject: Re: Is anyone available to debug an API Server issue now ?
Another issue on CentOS 6.5 that leads to the api-server CPU hog (as on nodec27, which we just debugged, Vedu) is that all HTTP headers are read 1 byte at a time from python/gevent instead of within libevent [1] (Backport evbuffer_readln()). We are using libevent 1.4.13-4, which is the latest available on CentOS 6.5. The issue doesn't seem to occur on Ubuntu.
thanks
ajay
[1] https:/
recvfrom(111, "H", 1, 0, NULL, NULL) = 1
recvfrom(111, "T", 1, 0, NULL, NULL) = 1
recvfrom(111, "T", 1, 0, NULL, NULL) = 1
recvfrom(111, "P", 1, 0, NULL, NULL) = 1
recvfrom(111, "/", 1, 0, NULL, NULL) = 1
recvfrom(111, "1", 1, 0, NULL, NULL) = 1
recvfrom(111, ".", 1, 0, NULL, NULL) = 1
recvfrom(111, "1", 1, 0, NULL, NULL) = 1
recvfrom(111, " ", 1, 0, NULL, NULL) = 1
recvfrom(111, "2", 1, 0, NULL, NULL) = 1
recvfrom(111, "0", 1, 0, NULL, NULL) = 1
recvfrom(111, "0", 1, 0, NULL, NULL) = 1
recvfrom(111, " ", 1, 0, NULL, NULL) = 1
recvfrom(111, "O", 1, 0, NULL, NULL) = 1
recvfrom(111, "K", 1, 0, NULL, NULL) = 1
recvfrom(111, "\r", 1, 0, NULL, NULL) = 1
recvfrom(111, "\n", 1, 0, NULL, NULL) = 1
recvfrom(111, "C", 1, 0, NULL, NULL) = 1
recvfrom(111, "o", 1, 0, NULL, NULL) = 1
recvfrom(111, "n", 1, 0, NULL, NULL) = 1
recvfrom(111, "t", 1, 0, NULL, NULL) = 1
recvfrom(111, "e", 1, 0, NULL, NULL) = 1
recvfrom(111, "n", 1, 0, NULL, NULL) = 1
recvfrom(111, "t", 1, 0, NULL, NULL) = 1
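The strace output above shows one `recvfrom` system call per header byte. The following is a minimal sketch, not gevent's or libevent's actual code, that contrasts a byte-at-a-time header reader with a buffered one and counts the `recv()` calls each makes. The names `CountingSocket`, `read_headers_bytewise`, `read_headers_buffered`, `demo`, and the sample response are all hypothetical and exist only for illustration.

```python
import socket

# Hypothetical sample response, used only for this illustration.
RESPONSE = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: application/json\r\n"
            b"Content-Length: 2\r\n"
            b"\r\n"
            b"{}")


class CountingSocket(object):
    """Wraps a socket and counts how many recv() calls are made on it."""

    def __init__(self, sock):
        self._sock = sock
        self.recv_calls = 0

    def recv(self, bufsize):
        self.recv_calls += 1
        return self._sock.recv(bufsize)


def read_headers_bytewise(sock):
    """Read up to the blank line that ends the headers, 1 byte per recv()."""
    data = b""
    while not data.endswith(b"\r\n\r\n"):
        chunk = sock.recv(1)
        if not chunk:
            break
        data += chunk
    return data


def read_headers_buffered(sock, bufsize=4096):
    """Read in large chunks, then split off the header block in memory."""
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = sock.recv(bufsize)
        if not chunk:
            break
        data += chunk
    head, _, _body = data.partition(b"\r\n\r\n")
    return head + b"\r\n\r\n"


def demo(reader):
    """Send RESPONSE over a local socket pair; return (headers, recv count)."""
    sender, receiver = socket.socketpair()
    try:
        sender.sendall(RESPONSE)
        sender.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        counted = CountingSocket(receiver)
        headers = reader(counted)
        return headers, counted.recv_calls
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    for reader in (read_headers_bytewise, read_headers_buffered):
        headers, calls = demo(reader)
        print("%s: %d header bytes, %d recv() calls"
              % (reader.__name__, len(headers), calls))
```

The bytewise reader makes one call per header byte, so its cost grows with header size and request rate, which is consistent with the CPU usage described in this thread.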
On Oct 20, 2014, at 10:57 AM, Nagabhushana R wrote:
Some other bugs that point to similar behavior are:
#1372858 Mainline 2338:VM interface taking almost 40 sec to come up after vmi was created - causing some intermittent vm verification failure in sanity
#1373831 Main line 2340 Centos havana: Its taking more than 20 secs to update ipam objects in if map
On Oct 18, 2014, at 9:55 PM, Hampapur Ajay <email address hidden> wrote:
One difference between the 2 systems is that the slower one (nodec27) is running CentOS, and therefore Python 2.6, while the faster one (nodec3) is running Ubuntu, and therefore Python 2.7. This seems to make quite a bit of difference, see below (to validate further, we can test the 2 images on the same distribution).
python 2.6
[root@nodec27 ~]# python -m timeit "import json; json.dumps({'a': 1, 'b': 2})"
100000 loops, best of 3: 13.4 usec per loop
[root@nodec27 ~]#
python 2.7
root@nodec3:~# python -m timeit "import json; json.dumps({'a': 1, 'b': 2})"
100000 loops, best of 3: 4.7 usec per loop
root@nodec3:~#
thanks
ajay
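The same comparison can be scripted so it runs identically on both nodes. This is a minimal sketch using the standard `timeit` module; the helper name `best_usec_per_loop` is hypothetical. One difference from the command lines above is an assumption worth noting: here `import json` is moved into the setup step, so the per-loop figure covers only `json.dumps` and not the repeated (cached) import statement.

```python
import sys
import timeit


def best_usec_per_loop(stmt, setup="pass", number=100000, repeat=3):
    """Return the best-of-`repeat` time per loop, in microseconds."""
    timings = timeit.repeat(stmt=stmt, setup=setup,
                            number=number, repeat=repeat)
    # Each entry is the total time for `number` loops, in seconds.
    return min(timings) / number * 1e6


if __name__ == "__main__":
    usec = best_usec_per_loop("json.dumps({'a': 1, 'b': 2})",
                              setup="import json")
    print("python %d.%d: %.2f usec per loop"
          % (sys.version_info[0], sys.version_info[1], usec))
```

Running the script under each interpreter on the same machine would separate the effect of the Python version from the effect of the hardware or distribution.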
On Oct 18, 2014, at 8:38 AM, Vedamurthy Ananth Joshi wrote:
Ajay,
It's a little bit better now, with ‘net-list’ taking 1s.
I set up another single-node setup (nodec3, with the same HW config as nodec27) running 1.20 Build 59. ‘net-list’ took ~0.4 sec.
The CRUD tests took 5 mins on that node. CPU usage of contrail-api never went above 60% during this time.
On nodec27 (2.0 build), the same CRUD tests took 21 mins. CPU usage of contrail-api was always above 90% during this time.
From: Hampapur Ajay <email address hidden>
Date: Saturday, October 18, 2014 at 4:59 AM
To: Vedamurthy Joshi <email address hidden>
Cc: Contrail Systems Configuration Team <email address hidden>, Sandip Dey <email address hidden>, Ganesha H V <email address hidden>
Subject: Re: Is anyone available to debug an API Server issue now ?
Still looking at it and running the CRUD test but so far:
I think the 2s is due to memory pressure + swapping. I reduced the # of workers of nova-api and nova-conductor to 4 instead of 40 and restarted them, and net-list took ~1s (similar to what I see in my system).
While running the CRUD test I saw *lots* of get_port APIs being issued. Will try to sync up with you on Skype to understand and debug further.
thanks
ajay
On Oct 17, 2014, at 9:04 AM, Hampapur Ajay wrote:
will look in few mins and let u know once done.
Thanks
Ajay
On Oct 17, 2014, at 8:06 AM, "Vedamurthy Ananth Joshi" <email address hidden> wrote:
nodec27 is in same state
Let me know if you are using the setup.
[root@nodec27 ~]# time neutron net-list > /dev/null
real 0m2.411s << On a normal setup, it is ~0.5s
user 0m0.238s
sys 0m0.034s
[root@nodec27 ~]#
The router CRUD tests, which usually take 6 mins, are taking more than 30 mins …
To run it:
cd ~/github/
export PYTHONPATH=
python -m testtools.run scripts.
While running these tests, contrail-api is always close to 100% CPU.
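A single `time neutron net-list` sample, as shown earlier in this message, can vary from run to run. The following is a minimal sketch for collecting several wall-clock samples of any command; the function name `time_command` is hypothetical, and the command to measure is passed on the command line (for example `neutron net-list`). With no arguments it times a trivial Python invocation as a placeholder.

```python
import subprocess
import sys
import time


def time_command(argv, runs=5):
    """Run `argv` `runs` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Discard output, as the original measurement did with > /dev/null.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Placeholder default; pass the real command as arguments instead.
    argv = sys.argv[1:] or [sys.executable, "-c", "pass"]
    samples = time_command(argv)
    print("min %.3fs  max %.3fs" % (min(samples), max(samples)))
```

Comparing the minimum and maximum over a few runs helps distinguish a consistently slow API server from occasional stalls.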
From: Vedamurthy Joshi <email address hidden>
Date: Friday, October 17, 2014 at 11:51 AM
To: Contrail Systems Configuration Team <email address hidden>
Cc: Sandip Dey <email address hidden>
Subject: Is anyone available to debug an API Server issue now ?
Is anyone available to debug an API Server issue now ? After running some parallel tests, API Server responses are very slow
Just a net-list cmd with 10 vns and 10 projects is taking ~3 seconds and API Server hogs close to 100% CPU every few seconds..
tags: added: releasenote; removed: blocker
Sandip to try increasing the HAProxy timeout by a factor of 10.
Ajay will send a one-line patch as well.