timed out OCF monitor action because of high CPU load

Bug #1464131 reported by Artem Hrechanychenko
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Invalid
High
Bogdan Dobrelya
5.1.x
Won't Fix
High
Fuel Library (Deprecated)
6.0.x
Won't Fix
High
Fuel Library (Deprecated)

Bug Description

{"build_id": "2015-06-08_06-13-27", "build_number": "521", "release_versions": {"2014.2.2-6.1": {"VERSION": {"build_id": "2015-06-08_06-13-27", "build_number": "521", "api": "1.0", "fuel-library_sha": "f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d", "nailgun_sha": "4340d55c19029394cd5610b0e0f56d6cb8cb661b", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "7766818f079881e2dbeedb34e
1f67e517ed7d479", "fuel-ostf_sha": "7c938648a246e0311d05e2372ff43ef1eb2e2761", "release": "6.1", "fuelmain_sha": "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d", "nailgun_sha": "4340d55c19029394cd5610b0e0f56d6cb8cb661b", "feature_groups": ["mirantis"], "openstack_version": "2014.2.2-6.1", "production": "docker", "python-fuelclient_sha": "4fc55db0265bbf39c369df398b9dc7d6469ba13b", "astute_sha": "7766818f079881e2dbeedb34e1f67e517ed7d479", "fuel-ostf_sha": "7c938648a246e0311d05e2372ff43ef1eb2e2761", "release": "6.1", "fuelmain_sha": "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"}

Testrail test - https://mirantis.testrail.com/index.php?/cases/view/20853
Steps to reproduce:
1. Deployment with 5 controllers, NeutronGRE, with both Ceph, with Ceilometer
    Add 5 controllers + mongo
   Add 3 compute + ceph
2. Run OSTF tests

Actual Result:

All Functional test fails with "Keystone client is not available" error

(nose_storage_plugin) fuel_health.tests.smoke.test_user_create.TestUserTenantRole.test_create_user
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 333, in run
    self.setUp()
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 1051, in setUp
    self.check_clients_state()
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 411, in check_clients_state
    self.fail("Keystone client is not available. Please, refer"
  File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
    raise self.failureException(msg)
AssertionError: Keystone client is not available. Please, refer to OpenStack logs to fix this problem
-------------------- >> begin captured logging << --------------------
fuel_health.config: INFO: INSTANCE {693: <fuel_health.config.NailgunConfig object at 0x5261810>}
fuel_health.config: INFO: INSTANCE {693: <fuel_health.config.NailgunConfig object at 0x5261810>}
keystoneclient.auth.identity.v2: DEBUG: Making authentication request to http://10.109.1.3:5000/v2.0/tokens
urllib3.connectionpool: INFO: Starting new HTTP connection (1): 10.109.0.10
urllib3.connectionpool: DEBUG: Setting read timeout to None
urllib3.connectionpool: DEBUG: "POST http://10.109.1.3:5000/v2.0/tokens HTTP/1.1" 500 143
keystoneclient.session: DEBUG: Request returned failure status: 500
fuel_health.nmanager: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 82, in __init__
    self.identity_client = self._get_identity_client()
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 217, in _get_identity_client
    insecure=dscv)
  File "/usr/lib/python2.6/site-packages/keystoneclient/v2_0/client.py", line 147, in __init__
    self.authenticate()
  File "/usr/lib/python2.6/site-packages/keystoneclient/utils.py", line 318, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/keystoneclient/httpclient.py", line 425, in authenticate
    resp = self.get_raw_token_from_identity_service(**kwargs)
  File "/usr/lib/python2.6/site-packages/keystoneclient/v2_0/client.py", line 190, in get_raw_token_from_identity_service
    "%s" % e)
AuthorizationFailure: Authorization Failed: An unexpected error prevented the server from fulfilling your request. (HTTP 500)
fuel_health.nmanager: DEBUG: Unable to initialize Keystone client: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 82, in __init__
    self.identity_client = self._get_identity_client()
  File "/usr/lib/python2.6/site-packages/fuel_health/nmanager.py", line 217, in _get_identity_client
    insecure=dscv)
  File "/usr/lib/python2.6/site-packages/keystoneclient/v2_0/client.py", line 147, in __init__
    self.authenticate()
  File "/usr/lib/python2.6/site-packages/keystoneclient/utils.py", line 318, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/keystoneclient/httpclient.py", line 425, in authenticate
    resp = self.get_raw_token_from_identity_service(**kwargs)
  File "/usr/lib/python2.6/site-packages/keystoneclient/v2_0/client.py", line 190, in get_raw_token_from_identity_service
    "%s" % e)
AuthorizationFailure: Authorization Failed: An unexpected error prevented the server from fulfilling your request. (HTTP 500)

Many of platform test fails with "Time limit exceeded while waiting for stack creation to finish. Please refer to OpenStack logs for more details." error

(nose_storage_plugin) fuel_health.tests.platform_tests.test_heat.HeatSmokeTests.test_update
Traceback (most recent call last):
File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 340, in run
testMethod()
File "/usr/lib/python2.6/site-packages/fuel_health/tests/platform_tests/test_heat.py", line 310, in test_update
template, parameters=parameters)
File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 182, in verify
" Please refer to OpenStack logs for more details.")
File "/usr/lib/python2.6/site-packages/unittest2/case.py", line 415, in fail
raise self.failureException(msg)
AssertionError: Step 1 failed: Time limit exceeded while waiting for stack creation to finish. Please refer to OpenStack logs for more details.

Tags: rabbitmq
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :
Revision history for this message
Artem Hrechanychenko (agrechanichenko) wrote :

KVM+Ubuntu+6Gb per slave

Changed in fuel:
importance: Undecided → High
description: updated
Revision history for this message
Boris Bobrov (bbobrov) wrote :

I see http://paste.openstack.org/show/283640/ in keystone logs. Looks like some oslo.messaging issue.

ruhe (ruhe)
Changed in fuel:
milestone: none → 6.1
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

The errors in ostf.log start and end at these moments:
2015-06-11 06:17:34
2015-06-11 06:17:42

All the tests failed in 8 seconds due to the same reason:
AssertionError: Keystone client is not available. Please, refer to OpenStack logs to fix this problem

If we look through the keystone-all.log of various nodes, we see errors like
DBConnectionError: (OperationalError) (1047, 'Unknown command') 'SELECT 1' ()

They happened in exactly these 8 seconds, you can see full list here:
http://paste.openstack.org/show/283715/

It means that all the requests to keystone failed due to the same reason: MySQL was not operational at the moment.

Changed in fuel:
assignee: nobody → Sergii Golovatiuk (sgolovatiuk)
tags: added: galera
Changed in fuel:
status: New → Invalid
Revision history for this message
Sergii Golovatiuk (sgolovatiuk) wrote :
Download full text (7.1 KiB)

There are no any indications of Galera failures at all. However, I see that rabbit cluster was misbehaving at that time.

2015-06-11T06:17:16.005336+00:00 info: INFO: p_rabbitmq-server: get_monitor(): CHECK LEVEL IS: 30
2015-06-11T06:17:28.161196+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
2015-06-11T06:17:28.618981+00:00 info: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
2015-06-11T06:17:38.949560+00:00 info: INFO: p_rabbitmq-server: get_monitor(): master attribute is 1
2015-06-11T06:17:50.592003+00:00 info: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
2015-06-11T06:17:51.816700+00:00 info: INFO: p_rabbitmq-server: get_monitor(): rabbit app is running. checking if we are the part of healthy cluster
2015-06-11T06:18:26.888498+00:00 info: INFO: p_rabbitmq-server: unjoin_nodes_from_cluster(): Execute forget_cluster_node with timeout: 60
2015-06-11T06:18:31.636681+00:00 info: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 2: /usr/sbin/rabbitmqctl forget_cluster_node rabbit@node-3
2015-06-11T06:18:31.910259+00:00 warning: WARNING: p_rabbitmq-server: unjoin_nodes_from_cluster(): unjoining node 'rabbit@node-3' failed.
2015-06-11T06:18:35.973519+00:00 info: INFO: false ...done.

Here is a log of Galera cluster at that time

sgolovatiuk:fuel-snapshot-2015-06-11_06-23-42 sgolovatiuk$ grep -h '06:1[6-9]' ./10.109.0.2/var/log/docker-logs/remote/node-*.test.domain.local/mysql-wss.log
2015-06-10T22:06:18.402083+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:21105
2015-06-10T22:06:18.413264+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:21105
2015-06-11T03:06:17.390311+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:135742
2015-06-11T03:06:17.616029+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:135742
2015-06-11T04:06:19.293258+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:158316
2015-06-11T04:06:19.306775+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:158316
2015-06-11T06:17:18.259035+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:206926
2015-06-11T06:17:19.076541+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:206926
2015-06-11T06:18:34.366866+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:207115
2015-06-11T06:18:34.855082+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:207115
2015-06-11T06:19:50.984036+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:207478
2015-06-11T06:19:51.728955+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:207478
2015-06-11T02:06:17.689214+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:112981
2015-06-11T02:06:17.719697+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:112981
2015-06-11T06:06:18.562617+00:00 info: INFO: GTID OK: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:203271
2015-06-11T06:06:18.632529+00:00 info: INFO: Galera GTID: c9a73ce1-0fad-11e5-a35d-c2dea09a6aa5:203271
2015-06-11T06:16:40.764993+00:00 info: INFO: GTID OK: c9a73ce1-0fad-...

Read more...

tags: added: rabbitmq
removed: galera
Changed in fuel:
assignee: Sergii Golovatiuk (sgolovatiuk) → Bogdan Dobrelya (bogdando)
status: Invalid → In Progress
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

RC was a timed out OCF monitor action because of high CPU load:
2015-06-11T06:00:14.263188+00:00 err: error: process_lrm_event: Operation p_rabbitmq-server_monitor_27000: Timed Out (node=node-1.test.domain.loca
l, call=178, timeout=60000ms)
2015-06-11T06:00:14.724839+00:00 warning: warning: update_failcount: Updating failcount for p_rabbitmq-server on node-1.test.domain.local after fail
ed monitor: rc=1 (update=value++, time=1434002414)
2015-06-11T06:00:14.730029+00:00 notice: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTE
RNAL origin=abort_transition_graph ]
2015-06-11T06:00:18.338047+00:00 notice: notice: throttle_handle_load: High CPU load detected: 15.580000
2015-06-11T06:00:19.430717+00:00 notice: notice: te_rsc_command: Initiating action 394: notify p_rabbitmq-server_pre_notify_demote_0 on node-1.test
.domain.local (local)
2015-06-11T06:00:19.445908+00:00 notice: notice: te_rsc_command: Initiating action 396: notify p_rabbitmq-server_pre_notify_demote_0 on node-2.test.domain.local
2015-06-11T06:00:20.213930+00:00 notice: notice: te_rsc_command: Initiating action 398: notify p_rabbitmq-server_pre_notify_demote_0 on node-3.test.domain.local
2015-06-11T06:00:21.464895+00:00 notice: notice: te_rsc_command: Initiating action 400: notify p_rabbitmq-server_pre_notify_demote_0 on node-4.test.domain.local
2015-06-11T06:00:22.113360+00:00 notice: notice: te_rsc_command: Initiating action 402: notify p_rabbitmq-server_pre_notify_demote_0 on node-5.test.domain.local

summary: - Keystone client is not available with Neutron GRE, HA, Celio and Ceph
+ timed out OCF monitor action because of high CPU load
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/191715

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

the reported LA 16 for 1 core makes this bug invalid

Changed in fuel:
status: In Progress → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: master
Review: https://review.openstack.org/191715
Reason: related bug is invalid

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

According to the logs Load average is almost 16 for one CPU node. This type of load cannot be tolerated easily by cluster. So I support Bogdan setting this bug to invalid. We need either to provide more resources to the nodes or investigate why these resources is not enough. It could also happen to wrong scheduling made by test runner - we should have enough memory to run such tests and some of slaves may have less than sufficient memory. Here I see that total should be something like 42 Gigs of RAM. Do we have this amount of memory on the node running the test?

Changed in fuel:
status: Invalid → Incomplete
Revision history for this message
Alexander Kurenyshev (akurenyshev) wrote :

Another reproduce of the same test with exactly the same configuration on the RC3 ISO

Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

As I told you, the LA is almost 10 for single cpu machine there.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Incomplete for 4 weeks without updates, marking as invalid.

Changed in fuel:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.