Job gate-rally-dsvm-zaqar-zaqar fails since the recent Rally patch

Bug #1544839 reported by Eva Balycheva
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Rally
Fix Released
Undecided
Unassigned
python-keystoneclient
Invalid
Undecided
Unassigned

Bug Description

Hello.

Recently gate-rally-dsvm-zaqar-zaqar job ("rally-jobs/zaqar-zaqar.yaml" inside Zaqar repository) started to fail because of connection errors. See Rally log here, for example: https://review.openstack.org/#/c/276129/.

I'm able to reproduce the problem locally.
And I found out that the problem only occurs since this patch to Rally: https://review.openstack.org/#/c/274977/. The Rally job runs fine, if I checkout any earlier patch.

Other zaqar-related jobs ("rally-jobs/rally-zaqar.yaml" inside Rally repository) do not fail, because they test keystone and Zaqar under much smaller load.

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Can you debug it and find out what is in the credentials it's trying to use? There might be some extra parameters that are not being created in the Rally gate jobs for some reason. If you can dump the args that are passed to _create_keystone_client that would help.

Revision history for this message
Eva Balycheva (ubershy) wrote :

Here are args passed to "_create_keystone_client" on successful iterations:

{
   "username":"admin",
   "region_name":"RegionOne",
   "https_insecure":false,
   "tenant_name":"admin",
   "user_domain_name":"Default",
   "admin_domain_name":"Default",
   "endpoint_type":"public",
   "auth_url":"http://127.0.0.1:5000/v2.0",
   "cacert":null,
   "password":"tastysausagerolls",
   "endpoint":null,
   "https_cacert":null,
   "domain_name":null,
   "timeout":180.0,
   "insecure":false,
   "project_domain_name":"Default"
}

Sometimes "client.Client(**args)" inside "_create_keystone_client" throws an exception:
"ConnectFailure: Unable to establish connection to http://127.0.0.1:5000/v2.0/tokens"
I wrapped it inside try block to find which args were passed to "_create_keystone_client". Here are the args when exception happens:

{
   "username":"c_rally_cc4e52f4_zT07okxp",
   "region_name":"RegionOne",
   "https_insecure":false,
   "tenant_name":"c_rally_cc4e52f4_bO7B3oZq",
   "user_domain_name":"default",
   "admin_domain_name":"Default",
   "endpoint_type":"public",
   "auth_url":"http://127.0.0.1:5000/v2.0",
   "cacert":null,
   "password":"d3ee53aa-93ea-4c22-9357-510c0ed7167b",
   "endpoint":null,
   "https_cacert":null,
   "domain_name":null,
   "timeout":180.0,
   "insecure":false,
   "project_domain_name":"default"
}

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok one more thing, can you see what the value of auth_args is when it gets passed to identity.Password()

Revision history for this message
Eva Balycheva (ubershy) wrote :

Here are auth_args passed to identity.Password() on successful iterations:

{
   'username':'admin',
   'tenant_name':'admin',
   'password':'tastysausagerolls',
   'project_name':None,
   'auth_url':'http://127.0.0.1:5000/v2.0'
}

Here are auth_args on failed iterations (when "client.Client(**args)" throws an exception):

{
   'username':u'c_rally_e8b6dd35_oxkfKzAL',
   'tenant_name':u'c_rally_e8b6dd35_DGgd28L9',
   'password':'d75aaa07-394b-4176-9792-3c24692425ac',
   'project_name':None,
   'auth_url':u'http://127.0.0.1:5000/v2.0'
}

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Ok thanks for the info. Unless the credentials getting passed are wrong, I cannot see anything wrong with this. Can you double check the credentials are correct? Also check which version of keystoneclient you're using and compare with Rally's.

This obviously passes the gate in Rally and is the recommended way to construct clients now.

Revision history for this message
Eva Balycheva (ubershy) wrote :

I use python-keystoneclient==2.1.2. This is the latest version available in pypi.

> This obviously passes the gate in Rally and is the recommended way to construct clients now.
As I said Zaqar-related jobs in Rally project pass because they don't use high load like in "zaqar-zaqar.yaml". Once you increase the numbers in them, they will fail too.
Any of them will not fail, if Rally is downgraded to patch older than https://review.openstack.org/#/c/274977/.
It's easy not to trust me. But it's also easy to reproduce the problem locally in DevStack:
1. Enable Rally and Zaqar plugins in your "local.conf".
2. Also to download fresh versions of software, add "RECLONE=yes" and "PIP_UPGRADE=True" lines to your "local.conf".
2. Start DevStack.
3. In the terminal run: "rally task start /opt/stack/zaqar/rally-jobs/zaqar-zaqar.yaml". The job will fail.
4. cd to "/opt/stack/rally" and run: "git review -d 274700". This will download Rally just before https://review.openstack.org/#/c/274977/ patch.
5. In the terminal run: "rally task start /opt/stack/zaqar/rally-jobs/zaqar-zaqar.yaml". The job will pass.

If the problem occurs only on high load, I guess, there might be a race condition involved somewhere.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 1544839] Re: Job gate-rally-dsvm-zaqar-zaqar fails since the recent Rally patch

On Sun, 14 Feb 2016 07:32:36 you wrote:
> If the problem occurs only on high load, I guess, there might be a race
> condition involved somewhere.

Ok this is a great clue, thanks. I'll see if there might be a problem there.

Revision history for this message
Eva Balycheva (ubershy) wrote :

Thank you

Changed in rally:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to rally (master)

Reviewed: https://review.openstack.org/281935
Committed: https://git.openstack.org/cgit/openstack/rally/commit/?id=4bac4bdfe055c8f60eff87fe9682463ae626496a
Submitter: Jenkins
Branch: master

commit 4bac4bdfe055c8f60eff87fe9682463ae626496a
Author: Andrey Kurilin <email address hidden>
Date: Thu Feb 18 18:22:46 2016 +0200

    Revert "Make Rally cope with unversioned keystone URL"

    This reverts commit 43028c9bbae60b4fbdf31272e420e1096af363e6, which broke
    several features:
     - existing deployemnt with given endpoint
     - insecure mode

    Closes-Bug: #1545889
    Closes-Bug: #1547092
    Closes-Bug: #1544839

    Change-Id: I12123a021fcff6a48d46fa14d5148d3e7e98e207

Changed in rally:
status: In Progress → Fix Released
Revision history for this message
Julian Edwards (julian-edwards) wrote :

I've recreated it and debugged it a bit. The server is returning ConnectionRefused, so it's not obvious how the change I made caused this problem. I'll take it up with the keystone devs.

Revision history for this message
Julian Edwards (julian-edwards) wrote :
Download full text (7.3 KiB)

From the Rally log:

2016-02-19 07:03:14.358 31008 DEBUG keystoneclient.session [-] REQ: curl -g -i --insecure -X GET http://10.0.3.17:5000/v2.0 -H "Accept: application/json" -H "User
-Agent: python-keystoneclient" _http_log_request /usr/local/lib/python2.7/dist-packages/keystoneclient/session.py:198
2016-02-19 07:03:14.359 31007 DEBUG keystoneclient.session [-] REQ: curl -g -i --insecure -X GET http://10.0.3.17:5000/v2.0 -H "User-Agent: python-keystoneclient"
 -H "Accept: application/json" -H "X-Auth-Token: {SHA1}6d22f944767649edc309d56bdf1a5df952a63bdb" _http_log_request /usr/local/lib/python2.7/dist-packages/keystone
client/session.py:198
2016-02-19 07:03:14.361 31012 DEBUG keystoneclient.session [-] RESP: [200] Content-Length: 335 Vary: X-Auth-Token Keep-Alive: timeout=5, max=100 Server: Apache/2.
4.7 (Ubuntu) Connection: Keep-Alive Date: Fri, 19 Feb 2016 07:03:14 GMT Content-Type: application/json x-openstack-request-id: req-b89b39fd-d9cb-42bf-94b0-49d8a65
02238
RESP BODY: {"version": {"status": "stable", "updated": "2014-04-17T00:00:00Z", "media-types": [{"base": "application/json", "type": "application/vnd.openstack.ide
ntity-v2.0+json"}], "id": "v2.0", "links": [{"href": "http://10.0.3.17:5000/v2.0/", "rel": "self"}, {"href": "http://docs.openstack.org/", "type": "text/html", "r
el": "describedby"}]}}
 _http_log_response /usr/local/lib/python2.7/dist-packages/keystoneclient/session.py:216
2016-02-19 07:03:14.361 31012 DEBUG keystoneclient.auth.identity.v2 [-] Making authentication request to http://10.0.3.17:5000/v2.0/tokens get_auth_ref /usr/local/lib/python2.7/dist-packages/keystoneclient/auth/identity/v2.py:86
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner [-] Unable to establish connection to http://10.0.3.17:5000/v2.0
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner Traceback (most recent call last):
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/task/runner.py", line 66, in _run_scenario_once
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner deprecated_output = getattr(scenario_inst, method_name)(**kwargs)
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/common/logging.py", line 266, in wrapper
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner result = f(*args, **kwargs)
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/plugins/openstack/scenarios/zaqar/basic.py", line 35, in create_queue
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner self._queue_create(**kwargs)
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/task/atomic.py", line 84, in func_atomic_actions
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner f = func(self, *args, **kwargs)
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/plugins/openstack/scenarios/zaqar/utils.py", line 31, in _queue_create
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner return self.clients("zaqar").queue(name, **kwargs)
2016-02-19 07:03:14.362 31007 ERROR rally.task.runner File "/opt/stack/rally/rally/plugins/openstack/scenario.py", line 71, in clients
2016-...

Read more...

Revision history for this message
Julian Edwards (julian-edwards) wrote :

Keystone guys, can you comment on the above please? I changed the Rally client code to use sessions with the generic Client and it starts coming up with about a 60% failure rate compared to no problems for the old approach of passing user/password into a v2/v3 client. Rally is highly threaded so perhaps there is a threadsafe issue?

Revision history for this message
Julian Edwards (julian-edwards) wrote :

OK so I think this is a problem with the apache config. The new client side seems to be quite a lot faster at sending requests so Apache ends up with:

[Fri Feb 19 06:51:29.325370 2016] [mpm_event:error] [pid 19917:tid 139958145763200] AH00484: server reached MaxRequestWorkers setting, consider raising the MaxRequestWorkers setting

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

this looks like a case where the apache config needs adjusting as you said Julian. I'm not surprised rally is able to crush the devstack apache basic config when using session (creating session quickly).

This also may be a case of connections being held open longer than needed. The two places I would look is

* see if request pipelining is being used (not sure) with the sessions and the session objects are lingering around after it is no longer needed.

* the apache config can be tuned for rally when used at higher concurrency.

Revision history for this message
Julian Edwards (julian-edwards) wrote : Re: [Bug 1544839] Re: Job gate-rally-dsvm-zaqar-zaqar fails since the recent Rally patch

On Thu, 3 Mar 2016 07:34:18 you wrote:
> this looks like a case where the apache config needs adjusting as you
> said Julian. I'm not surprised rally is able to crush the devstack
> apache basic config when using session (creating session quickly).

I think the old code was caching clients but the new code needs to cache
sessions instead. Just a guess but worth checking out to see if sessions are
being reused properly where necessary.

Revision history for this message
Morgan Fainberg (mdrnstm) wrote :

I'm going to mark this as "incomplete" since it seems like this is not a keystoneclient issue but either a devstack or rally issue (apache config needs tuning, etc) see the last few comments for more details

Changed in python-keystoneclient:
status: New → Incomplete
Revision history for this message
Steve Martinelli (stevemar) wrote :

looks like morgan's assessment was correct, marking as invalid

Changed in python-keystoneclient:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.