Ceilometer service stops working in HA Ubuntu environment after failover of primary controller

Bug #1399724 reported by Tatyanka
This bug affects 1 person
Affects             Status         Importance  Assigned to     Milestone
Fuel for OpenStack  Fix Released   High        MOS Ceilometer
5.1.x               Fix Committed  High        MOS Ceilometer
6.0.x               Fix Committed  High        MOS Ceilometer
6.1.x               Fix Released   High        MOS Ceilometer

Bug Description

Steps:
1. Create a cluster with the following configuration:
OS: Ubuntu
net: nova vlan
3 controllers with mongo roles, 1 compute + 1 cinder, ceilometer service enabled
2. When the cluster is ready, run the OSTF tests (all tests, including the ceilometer ones, pass)
3. Shut down the primary controller
4. Wait for 15 minutes, then run the OSTF HA tests (passed)
5. Run the sanity/smoke and platform tests

Actual result:
All ceilometer tests fail
http://paste.openstack.org/show/145603/

Manual operations end with the same result:
3 Nodes configured, 3 expected votes
19 Resources configured

Online: [ node-17 node-18 ]
OFFLINE: [ node-16 ]

 vip__management_old (ocf::mirantis:ns_IPaddr2): Started node-17
 vip__public_old (ocf::mirantis:ns_IPaddr2): Started node-17
 Clone Set: clone_ping_vip__public_old [ping_vip__public_old]
     Started: [ node-17 node-18 ]
 p_ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Started node-18
 p_ceilometer-agent-central (ocf::mirantis:ceilometer-agent-central): Started node-18
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-17 ]
     Slaves: [ node-18 ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-17 node-18 ]
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-17 node-18 ]
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-17 node-18 ]
root@node-17:~#
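
To confirm which controllers Pacemaker considers down after the failover, the node lines of the status output above can be checked mechanically. The following is a minimal sketch, assuming the `Online: [ ... ]` / `OFFLINE: [ ... ]` line layout shown in this report; the function name `parse_node_states` and the sample text are illustrative and not part of any Fuel or Pacemaker tooling.

```python
import re


def parse_node_states(status_text):
    """Return (online, offline) lists of node names from crm status-style text."""
    online, offline = [], []
    for line in status_text.splitlines():
        # Matches lines such as "Online: [ node-17 node-18 ]"
        match = re.match(r"\s*(Online|OFFLINE):\s*\[(.*?)\]", line)
        if not match:
            continue
        names = match.group(2).split()
        if match.group(1) == "Online":
            online.extend(names)
        else:
            offline.extend(names)
    return online, offline


# Sample copied from the status output in this report.
NODE_STATUS_SAMPLE = """\
Online: [ node-17 node-18 ]
OFFLINE: [ node-16 ]
"""

if __name__ == "__main__":
    online, offline = parse_node_states(NODE_STATUS_SAMPLE)
    print("online:", online)
    print("offline:", offline)
```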

From node-18:
text/html
cache-control: no-cache

<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>

WARNING (http:173) Request returned failure status.
Traceback (most recent call last):
  File "/usr/bin/ceilometer", line 10, in <module>
    sys.exit(main())
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/shell.py", line 335, in main
    CeilometerShell().main(args)
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/shell.py", line 289, in main
    args.func(client, args)
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/v2/shell.py", line 691, in do_event_list
    events = cc.events.list(q=options.cli_to_array(args.query))
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/v2/events.py", line 30, in list
    return self._list(options.build_url(path, q))
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/common/base.py", line 58, in _list
    resp, body = self.api.json_request('GET', url)
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/common/http.py", line 191, in json_request
    resp, body_iter = self._http_request(url, method, **kwargs)
  File "/usr/lib/python2.7/dist-packages/ceilometerclient/common/http.py", line 174, in _http_request
    raise exc.from_response(resp, ''.join(body_iter))
ceilometerclient.exc.HTTPBadGateway: HTTPBadGateway (HTTP 502)

Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:58:48.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:59:00.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:59:12.201 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:59:25.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:59:37.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 16:59:49.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 17:00:01.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 17:00:13.190 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 17:00:26.188 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
2014-12-05 17:00:38.187 19327 WARNING ceilometer.storage.pymongo_base [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.
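
The warnings above all point at the same unreachable address, which suggests the service keeps retrying the old primary. A small sketch for summarizing such warnings from a log is shown here; it assumes the exact message wording quoted in this report, and the helper name `summarize_reconnect_failures` is hypothetical.

```python
import re
from collections import Counter

# Pattern based on the warning text quoted in this report.
RECONNECT_RE = re.compile(
    r"Unable to reconnect to the primary mongodb: "
    r"could not connect to (?P<host>[\d.]+):(?P<port>\d+): "
    r"\[Errno (?P<errno>\d+)\] (?P<name>\w+)"
)


def summarize_reconnect_failures(lines):
    """Count reconnect warnings per (host, port, error name)."""
    counts = Counter()
    for line in lines:
        match = RECONNECT_RE.search(line)
        if match:
            key = (match.group("host"), int(match.group("port")), match.group("name"))
            counts[key] += 1
    return counts


# Two warnings copied from the log above, plus one unrelated line.
LOG_SAMPLE = [
    "2014-12-05 16:58:48.187 19327 WARNING ceilometer.storage.pymongo_base [-] "
    "Unable to reconnect to the primary mongodb: could not connect to "
    "10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.",
    "2014-12-05 16:59:00.187 19327 WARNING ceilometer.storage.pymongo_base [-] "
    "Unable to reconnect to the primary mongodb: could not connect to "
    "10.108.3.3:27017: [Errno 113] EHOSTUNREACH. Trying again in 10 seconds.",
    "WARNING (http:173) Request returned failure status.",
]

if __name__ == "__main__":
    for key, count in summarize_reconnect_failures(LOG_SAMPLE).items():
        print(key, count)
```

If every counted entry names a single host, as in this log, the client is not moving on to the surviving replica set members.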

Tatyanka (tatyana-leontovich) wrote :
Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Ceilometer (mos-ceilometer)
status: New → Confirmed
tags: added: on-verification
Anastasia Palkina (apalkina) wrote :

Cannot reproduce on ISO #49 for 6.0

"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "auth_required": true, "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-09_22-41-06", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "49", "api": "1.0", "nailgun_sha": "22bd43b89a17843f9199f92d61fc86cb0f8772f1", "production": "docker", "fuelmain_sha": "3aab16667f47dd8384904e27f70f7a87ba15f4ee", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"}}}, "fuellib_sha": "2c99931072d951301d395ebd5bf45c8d401301bb"

1. Create new environment (Ubuntu, HA mode)
2. Choose nova-network, vlan manager
3. Choose ceilometer
4. Add 3 controller+mongo, 1 compute+cinder
5. Start deployment. It was successful
6. Start OSTF tests. It was successful
7. Power off primary controller
8. Wait 15 minutes
9. Start OSTF tests. It was successful

Tatyanka (tatyana-leontovich) wrote :

Reproduced on 6.0-58
{"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}}}, "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}
urllib3.connectionpool: DEBUG: "GET http://10.108.1.3:8777/v2/meters HTTP/1.1" 500 128
ceilometerclient.openstack.common.apiclient.client: DEBUG: Request returned failure status: 500
fuel_health.common.test_mixins: DEBUG: Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/fuel_health/common/test_mixins.py", line 186, in verify
    result = func(*args, **kwargs)
  File "/usr/lib/python2.6/site-packages/ceilometerclient/v2/meters.py", line 30, in list
    return self._list(options.build_url(path, q))
  File "/usr/lib/python2.6/site-packages/ceilometerclient/common/base.py", line 59, in _list
    resp = self.api.get(url)
  File "/usr/lib/python2.6/site-packages/ceilometerclient/openstack/common/apiclient/client.py", line 328, in get
    return self.client_request("GET", url, **kwargs)
  File "/usr/lib/python2.6/site-packages/ceilometerclient/openstack/common/apiclient/client.py", line 322, in client_request
    self, method, url, **kwargs)
  File "/usr/lib/python2.6/site-packages/ceilometerclient/openstack/common/apiclient/client.py", line 242, in client_request
    method, self.concat_url(endpoint, url), **kwargs)
  File "/usr/lib/python2.6/site-packages/ceilometerclient/openstack/common/apiclient/client.py", line 182, in request
    raise exceptions.from_response(resp, method, url)
InternalServerError: Internal Server Error (HTTP 500)

The same happens if I request the event list manually:
http://paste.openstack.org/show/155159/

Steps:
1. Create new environment (Ubuntu, HA mode)
2. Choose neutron gre manager
3. Choose ceilometer
4. Add 3 controller+mongo, 1 compute, 1 cinder
5. Start deployment. It was successful
6. Start OSTF tests. It was successful
7. Power off primary controller
8. Wait 15 minutes
9. Start OSTF tests.

Tatyanka (tatyana-leontovich) wrote :

There are also a lot of messages like the following in ceilometer-collector.log:
2014-12-30 12:34:50.799 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.
2014-12-30 12:35:00.810 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.
2014-12-30 12:35:10.824 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.
2014-12-30 12:35:20.834 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.
2014-12-30 12:35:30.846 17898 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server 10.108.4.7:5673 closed the connection. Check login credentials: Socket closed
2014-12-30 12:35:41.857 17898 INFO oslo.messaging._drivers.impl_rabbit [-] Delaying reconnect for 1.0 seconds ...
2014-12-30 12:35:42.858 17898 INFO oslo.messaging._drivers.impl_rabbit [-] Connecting to AMQP server on 127.0.0.1:5673
2014-12-30 12:35:42.912 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.
2014-12-30 12:35:52.923 17898 WARNING ceilometer.storage.mongo.utils [-] Unable to reconnect to the primary mongodb: could not connect to 10.108.4.3:27017: [Errno 110] ETIMEDOUT. Trying again in 10 seconds.

Tatyanka (tatyana-leontovich) wrote :
Ivan Berezovskiy (iberezovskiy) wrote :

Can't reproduce.
{"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "auth_required": true, "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "release_versions": {"2014.2-6.0": {"VERSION": {"build_id": "2014-12-26_14-25-46", "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4", "build_number": "58", "api": "1.0", "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90", "production": "docker", "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8", "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91", "feature_groups": ["mirantis"], "release": "6.0", "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}}}, "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"}

We have fixes for mongo reconnect:
https://review.fuel-infra.org/#/c/2168/
https://review.fuel-infra.org/#/c/2166/

So this problem shouldn't happen anymore
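
For readers unfamiliar with the failure mode, the general idea behind a reconnect fix can be sketched as trying each replica set member in turn instead of retrying one dead address. The sketch below is not the content of the linked reviews; it is a standard-library illustration under that assumption, and the function name `first_reachable` is hypothetical. It only checks TCP reachability and does not determine which member is the MongoDB primary.

```python
import socket


def first_reachable(hosts, timeout=1.0):
    """Return the first (host, port) that accepts a TCP connection, else None.

    `hosts` is an iterable of (host, port) pairs, tried in order.
    """
    for host, port in hosts:
        try:
            # The connection is opened only as a reachability probe
            # and closed again as soon as the with-block exits.
            with socket.create_connection((host, port), timeout=timeout):
                return (host, port)
        except OSError:
            # Covers refused connections, timeouts and unreachable hosts
            # (for example EHOSTUNREACH or ETIMEDOUT as seen in the logs).
            continue
    return None


# Example call (addresses are placeholders, not taken from a real deployment):
# first_reachable([("192.0.2.10", 27017), ("192.0.2.11", 27017)])
```

In a real deployment the driver itself normally handles this when it is given all replica set members rather than a single address.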

Changed in fuel:
status: Confirmed → Invalid
Ivan Berezovskiy (iberezovskiy) wrote :
Changed in fuel:
status: Invalid → Fix Committed
status: Fix Committed → Won't Fix
status: Won't Fix → Fix Committed
Dmitry Mescheryakov (dmitrymex) wrote :

Fix was actually committed to both 6.0.1 and 6.1

Dave Johnston (dave-johnston) wrote :

Are the patches available?
I am currently running Fuel 6.0 and appear to be seeing this issue.

The alarm-evaluator and agent-central resources are marked as stopped when I run crm resource list.

crm resource list
 vip__public (ocf::mirantis:ns_IPaddr2): Started
 Clone Set: clone_ping_vip__public [ping_vip__public]
     Started: [ node-38 node-39 node-46 ]
 vip__management (ocf::mirantis:ns_IPaddr2): Started
 Clone Set: clone_p_heat-engine [p_heat-engine]
     Started: [ node-38 node-39 node-46 ]
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-38 ]
     Slaves: [ node-39 node-46 ]
 Clone Set: clone_p_neutron-plugin-openvswitch-agent [p_neutron-plugin-openvswitch-agent]
     Started: [ node-38 node-39 node-46 ]
 p_neutron-dhcp-agent (ocf::mirantis:neutron-agent-dhcp): Started
 Clone Set: clone_p_neutron-metadata-agent [p_neutron-metadata-agent]
     Started: [ node-38 node-39 node-46 ]
 Clone Set: clone_p_neutron-l3-agent [p_neutron-l3-agent]
     Started: [ node-38 node-39 node-46 ]
 Clone Set: clone_p_mysql [p_mysql]
     Started: [ node-38 node-39 node-46 ]
 p_ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Stopped
 p_ceilometer-agent-central (ocf::mirantis:ceilometer-agent-central): Stopped
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-38 node-39 node-46 ]
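
Stopped primitives are easy to miss in a long listing like the one above. A minimal sketch for extracting them follows; it assumes the `name (agent): State` line layout shown in this comment, and the function name `stopped_resources` is illustrative rather than part of crmsh.

```python
import re

# Matches primitive lines such as
# " p_ceilometer-agent-central (ocf::mirantis:ceilometer-agent-central): Stopped"
RESOURCE_RE = re.compile(
    r"^\s*(?P<name>\S+)\s+\((?P<agent>[^)]+)\):\s+(?P<state>\w+)"
)


def stopped_resources(listing):
    """Return the names of primitive resources reported as Stopped."""
    stopped = []
    for line in listing.splitlines():
        match = RESOURCE_RE.match(line)
        if match and match.group("state") == "Stopped":
            stopped.append(match.group("name"))
    return stopped


# Sample lines copied from the listing in this comment.
RESOURCE_LIST_SAMPLE = """\
 vip__public (ocf::mirantis:ns_IPaddr2): Started
 p_ceilometer-alarm-evaluator (ocf::mirantis:ceilometer-alarm-evaluator): Stopped
 p_ceilometer-agent-central (ocf::mirantis:ceilometer-agent-central): Stopped
 Clone Set: clone_p_haproxy [p_haproxy]
     Started: [ node-38 node-39 node-46 ]
"""

if __name__ == "__main__":
    print(stopped_resources(RESOURCE_LIST_SAMPLE))
```

Clone and master/slave sets are deliberately ignored here, since their state is reported on separate per-node lines.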

Ivan Berezovskiy (iberezovskiy) wrote :

Dave Johnston, you have found another issue with pacemaker resources:
https://bugs.launchpad.net/fuel/+bug/1418984

Fixes are:
https://review.openstack.org/#/c/154906/
https://review.openstack.org/#/c/154361/

Anastasia Palkina (apalkina) wrote :

Verified on ISO #202

"build_id": "2015-03-16_22-54-44", "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476", "build_number": "202", "release_versions": {"2014.2-6.1": {"VERSION": {"build_id": "2015-03-16_22-54-44", "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476", "build_number": "202", "api": "1.0", "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c", "production": "docker", "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be", "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a", "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"}}}, "auth_required": true, "api": "1.0", "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c", "production": "docker", "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be", "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c", "feature_groups": ["mirantis"], "release": "6.1", "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a", "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"

tags: removed: on-verification
Maksym Strukov (unbelll) wrote :

{
  "build_id": "2015-03-16_22-54-44",
  "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476",
  "build_number": "202",
  "release_versions": {
    "2014.2-6.1": {
      "VERSION": {
        "build_id": "2015-03-16_22-54-44",
        "ostf_sha": "e86c961ceacfa5a8398b6cbda7b70a5f06afb476",
        "build_number": "202",
        "api": "1.0",
        "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c",
        "production": "docker",
        "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be",
        "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c",
        "feature_groups": [
          "mirantis"
        ],
        "release": "6.1",
        "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a",
        "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"
      }
    }
  },
  "auth_required": true,
  "api": "1.0",
  "nailgun_sha": "874df0d06e32f14db77746cfeb2dd74d4a6e528c",
  "production": "docker",
  "python-fuelclient_sha": "2509c9b72cdcdbe46c141685a99b03cd934803be",
  "astute_sha": "93e427ac49109fa3fd8b0e1d0bb3d14092be2e8c",
  "feature_groups": [
    "mirantis"
  ],
  "release": "6.1",
  "fuelmain_sha": "608b72a6f79a719cf01c35a19d0091fe20c8288a",
  "fuellib_sha": "924d73ae4766646e1c3a44d7b59c4120985e45f0"
}

Can't reproduce with the original repro steps.

Roman Rufanov (rrufanov) wrote :

Customer found on MOS 6.0

tags: added: customer-found support