[10.0][BVT] object-storage service is down on one of the nodes

Bug #1625227 reported by Roman Podoliaka
Affects: Mirantis OpenStack
Status: Confirmed
Importance: High
Assigned to: MOS Ceph
Milestone: 10.0

Bug Description

One of the recent builds of 10.0 community BVT (https://ci.fuel-infra.org/job/10.0-community.main.ubuntu.bvt_2/618/) failed with:

======================================================================
FAIL: Deploy ceph HA with RadosGW for objects
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 296, in testng_method_mistake_capture_func
    compatability.capture_type_error(s_func)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/compatability/exceptions_2_6.py", line 27, in capture_type_error
    func()
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/proboscis/case.py", line 350, in func
    func(test_case.state.get_state())
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 120, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/tests/test_ceph.py", line 511, in ceph_rados_gw
    self.fuel_web.deploy_cluster_wait(cluster_id)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 462, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 447, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 498, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 505, in wrapper
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/helpers/decorators.py", line 389, in wrapper
    return func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 953, in deploy_cluster_wait
    self.check_deploy_state(cluster_id, check_services, check_tasks)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 903, in check_deploy_state
    self.assert_ha_services_ready(cluster_id)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/core/helpers/log_helpers.py", line 32, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 205, in assert_ha_services_ready
    interval=20, timeout=timeout)
  File "/home/jenkins/venv-nailgun-tests-2.9/local/lib/python2.7/site-packages/devops/helpers/helpers.py", line 126, in wait_pass
    return raising_predicate()
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 204, in <lambda>
    should_fail=should_fail),
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/core/helpers/log_helpers.py", line 32, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 1351, in run_ostf
    failed_test_name=failed_test_name, test_sets=test_sets)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/core/helpers/log_helpers.py", line 32, in wrapped
    result = func(*args, **kwargs)
  File "/home/jenkins/workspace/10.0-community.main.ubuntu.bvt_2/fuelweb_test/models/fuel_web_client.py", line 305, in assert_ostf_run
    indent=1)))
AssertionError: Failed 1 OSTF tests; should fail 0 tests. Names of failed tests:
  - Check state of haproxy backends on controllers (failure) Dead backends ['object-storage node-3 Status: DOWN/L7STS Sessions: 0 Rate: 0 ']. Please refer to OpenStack logs for more details.

According to the haproxy logs, the backend eventually came back UP on two of the three controllers:

<133>Sep 17 18:32:32 node-5 haproxy[26861]: Proxy object-storage started.
<132>Sep 17 18:32:32 node-5 haproxy[26089]: Stopping proxy object-storage in 0 ms.
<132>Sep 17 18:32:32 node-5 haproxy[26089]: Proxy object-storage stopped (FE: 0 conns, BE: 0 conns).
<129>Sep 17 18:32:33 node-5 haproxy[26865]: Server object-storage/node-5 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Sep 17 18:32:35 node-5 haproxy[26865]: Server object-storage/node-2 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<129>Sep 17 18:32:35 node-5 haproxy[26865]: Server object-storage/node-3 is DOWN, reason: Layer4 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
<128>Sep 17 18:32:35 node-5 haproxy[26865]: proxy object-storage has no server available!
<133>Sep 17 19:12:37 node-5 haproxy[26865]: Server object-storage/node-5 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 4ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
<133>Sep 17 19:15:21 node-5 haproxy[26865]: Server object-storage/node-2 is UP, reason: Layer7 check passed, code: 200, info: "OK", check duration: 8ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
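For reference, the dead-backend condition the OSTF check reports can be reproduced from the syslog lines above alone. The sketch below is not the actual OSTF code, just a minimal parser over haproxy state-change messages; the sample lines are abbreviated copies of the log above:

```python
import re

# Matches haproxy server state-change lines such as:
#   "Server object-storage/node-5 is DOWN, reason: Layer4 timeout, ..."
STATE_RE = re.compile(r"Server (?P<backend>\S+)/(?P<server>\S+) is (?P<state>UP|DOWN)")

def last_states(log_lines):
    """Return the most recent UP/DOWN state seen for each backend server."""
    states = {}
    for line in log_lines:
        m = STATE_RE.search(line)
        if m:
            states[(m.group("backend"), m.group("server"))] = m.group("state")
    return states

log = [
    'Server object-storage/node-5 is DOWN, reason: Layer4 connection problem',
    'Server object-storage/node-2 is DOWN, reason: Layer4 timeout',
    'Server object-storage/node-3 is DOWN, reason: Layer4 timeout',
    'Server object-storage/node-5 is UP, reason: Layer7 check passed',
    'Server object-storage/node-2 is UP, reason: Layer7 check passed',
]
dead = [s for (_, s), state in last_states(log).items() if state == "DOWN"]
print(dead)  # → ['node-3']
```

node-3 is the only server with no later UP transition, which matches the dead backend named in the OSTF failure.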

In syslog there is just one related entry:

<27>Sep 17 19:11:58 node-3 systemd[1]: Failed to start Ceph rados gateway.

diagnostic snapshot: https://drive.google.com/open?id=0B2db-pBC_yblYVI1MFZhUjlUYkk

Tags: area-ceph
Roman Podoliaka (rpodolyaka) wrote:
tags: added: area-ceph
Changed in mos:
milestone: none → 10.0
importance: Undecided → High
status: New → Confirmed
Elena Ezhova (eezhova) wrote:

Another failure https://ci.fuel-infra.org/job/10.0-community.main.ubuntu.bvt_2/924

The symptoms are the same:

* Only two out of three object-storage servers went UP according to HAProxy logs
  https://paste.mirantis.net/show/2842/
* `<27>Dec 6 02:51:08 node-2 systemd[1]: Failed to start Ceph rados gateway.` in syslog
* In radosgw.log there is the following error: https://paste.mirantis.net/show/2841/

Alexei Sheplyakov (asheplyakov) wrote:

> 2016-12-06T02:51:05.316334+00:00 debug: 2016-12-06 02:51:05.274612 7f3f5614da00 0 librados: client.radosgw.gateway authentication error (1) Operation not permitted

Please check whether the keyring (/etc/ceph/ceph.client.radosgw.gateway) is present and has the correct permissions (the pseudo-user running radosgw should be able to read it).
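A quick way to script that check (a sketch, not part of the deployment code; the keyring path is taken from the comment above, and the uid/gid of the radosgw service user must be supplied by the caller, since the user name varies between deployments — on Ubuntu systemd installs it is often "ceph"):

```python
import os
import stat

# Path from the comment above; the service user's uid/gid are passed in
# explicitly because the user name differs between deployments.
KEYRING = "/etc/ceph/ceph.client.radosgw.gateway"

def keyring_readable(path, uid, gid):
    """Return True if the file at `path` exists and its permission bits
    allow the given uid/gid to read it (owner, group, or other).
    Simplification: supplementary groups and ACLs are not considered."""
    try:
        st = os.stat(path)
    except OSError:
        return False  # missing keyring: radosgw cannot authenticate at all
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == gid:
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)
```

If this returns False for the radosgw user, the `Operation not permitted` in radosgw.log is explained before looking at cephx itself.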

Alexander Ignatov (aignatov) wrote:

One more failure is here:

https://product-ci.infra.mirantis.net/view/10.0/job/10.0.main.ubuntu.bvt_2/1238/console

In the haproxy logs node-5 was constantly in the DOWN state.

Ivan Udovichenko (iudovichenko) wrote:

Another failure https://ci.fuel-infra.org/job/11.0-community.main.ubuntu.bvt_2/659/

node-4-10.109.5.6/var/log/haproxy.log
"""
2017-04-11T20:14:43.095944+00:00 node-4 haproxy[4265]: Server object-storage/node-4 is DOWN, reason: Layer7 wrong status, code: 503, info: "Service Unavailable", check duration: 0ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
"""

The backend never came back UP.

fuel/var/log/remote/10.109.5.6/radosgw.log

Shows the same symptoms:
"""
2017-04-11T20:09:42.151654+00:00 info: 2017-04-11 20:09:42.093482 7f4b656cca00 0 librados: client.radosgw.gateway authentication error (1) Operation not permitted
2017-04-11T20:09:42.151654+00:00 info: 2017-04-11 20:09:42.094659 7f4b656cca00 -1 Couldn't init storage provider (RADOS)
2017-04-11T20:09:42.151654+00:00 info: 2017-04-11 20:09:42.094659 7f4b656cca00 -1 Couldn't init storage provider (RADOS)
...
2017-04-11T20:14:38.907272+00:00 info: 2017-04-11 20:14:38.906112 7f9e83fdf700 1 handle_sigterm
2017-04-11T20:14:38.907272+00:00 info: 2017-04-11 20:14:38.906179 7f9e83fdf700 1 handle_sigterm set alarm for 120
2017-04-11T20:14:38.923333+00:00 info: 2017-04-11 20:14:38.919726 7f9efa287a00 -1 shutting down
2017-04-11T20:14:38.932722+00:00 info: 2017-04-11 20:14:38.919726 7f9efa287a00 -1 shutting down
2017-04-11T20:14:38.932722+00:00 info: 2017-04-11 20:14:38.926414 7f9e83fdf700 1 handle_sigterm
2017-04-11T20:14:38.932722+00:00 info: 2017-04-11 20:14:38.931703 7f9e837de700 0 ERROR: FCGX_Accept_r returned -4
2017-04-11T20:14:39.172599+00:00 info: 2017-04-11 20:14:39.169010 7f9efa287a00 1 final shutdown
"""

http://paste.openstack.org/show/606555/
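The repeated `authentication error (1) Operation not permitted` usually means the cephx key in the local keyring does not match what the cluster has registered; comparing the keyring's key against `ceph auth print-key client.radosgw.gateway` on a monitor node would confirm that. Below is a minimal sketch of pulling the key out of a keyring file for such a comparison (the sample keyring and its key are made up for illustration):

```python
import re

def extract_key(keyring_text, entity="client.radosgw.gateway"):
    """Pull the base64 cephx key for `entity` out of Ceph keyring text
    (a [section] header followed by indented `key = ...` lines)."""
    in_section = False
    for line in keyring_text.splitlines():
        line = line.strip()
        if line.startswith("["):
            in_section = (line == "[%s]" % entity)
        elif in_section:
            m = re.match(r"key\s*=\s*(\S+)", line)
            if m:
                return m.group(1)
    return None

# Hypothetical keyring contents, for illustration only
sample = """\
[client.radosgw.gateway]
\tkey = AQDExampleNotARealKey==
\tcaps mon = "allow rw"
"""
print(extract_key(sample))  # → AQDExampleNotARealKey==
```

If the extracted key differs from the cluster's, regenerating the keyring from `ceph auth get client.radosgw.gateway` would be the next step.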
