os::aodh::alarm via heat stable/pike ubuntu & centos http 503

Bug #1750236 reported by Marek Grudzinski
This bug affects 1 person
Affects: OpenStack-Ansible
Status: Fix Released
Importance: Undecided
Assigned to: Mohammed Naser
Milestone: (none)

Bug Description

#when using Heat to deploy an autoscaling group with OS::Aodh::Alarm to trigger scaling, Aodh responds to Heat with a 503 Service Unavailable error.

#from heat
Resource Create Failed: Clientexception: Resources.Cpu Alarm Low: <Html><Body><H1>503 Service Unavailable</H1> No Server Is Available To Handle This Request. </Body></Html> (Http 503)

#from heatengine.log
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource Traceback (most recent call last):
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/heat/engine/resource.py", line 831, in _action_recorder
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource yield
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/heat/engine/resource.py", line 939, in _do_action
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource yield self.action_handler_task(action, args=handler_args)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/heat/engine/scheduler.py", line 334, in wrapper
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource step = next(subtask)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/heat/engine/resource.py", line 884, in action_handler_task
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource handler_data = handler(*args)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/heat/engine/resources/openstack/aodh/alarm.py", line 178, in handle_create
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource alarm = self.client().alarm.create(props)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/aodhclient/v2/alarm.py", line 100, in create
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource data=jsonutils.dumps(alarm)).json()
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/aodhclient/v2/base.py", line 41, in _post
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource return self.client.api.post(*args, **kwargs)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/keystoneauth1/adapter.py", line 294, in post
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource return self.request(url, 'POST', **kwargs)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource File "/openstack/venvs/heat-16.0.6/lib/python2.7/site-packages/aodhclient/client.py", line 38, in request
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource raise exceptions.from_response(resp, url, method)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource ClientException: <html><body><h1>503 Service Unavailable</h1>
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource No server is available to handle this request.
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource </body></html>
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource (HTTP 503)
Feb 18 13:30:25 infra1-heat-engine-container-e9de764c heat-engine: 2018-02-18 13:30:23.738 361 ERROR heat.engine.resource

#aodh logs are mostly empty
#gnocchi logs don't call out any errors
#the ceilometer agent-notification log is riddled with these types of errors; the polling logs seem fine:
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.122 393 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource instance-00000a25-32898cd2-ee2d-46ec-8984-2646e1c37b21-tapab06918d-41: Resource type instance_network_interface does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance_network_interface does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.145 393 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource instance-000008ab-2e1a6fda-ac99-4714-9b5d-0077e8ae1188-tap0b0000d1-1c: Resource type instance_network_interface does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance_network_interface does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.158 396 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource a0564f3b-d9fc-4710-99a7-052ed7f28896: Resource type instance does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.172 393 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource instance-000009dd-c44a3f24-41e9-4c91-816c-eb85d447553f-tap4b03c715-8a: Resource type instance_network_interface does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance_network_interface does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.190 396 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource 2e1a6fda-ac99-4714-9b5d-0077e8ae1188: Resource type instance does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.196 393 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource instance-000009e6-a0564f3b-d9fc-4710-99a7-052ed7f28896-tap782845d7-54: Resource type instance_network_interface does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance_network_interface does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.214 396 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource 32898cd2-ee2d-46ec-8984-2646e1c37b21: Resource type instance does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.225 393 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource instance-000009ef-f24de1f6-d176-4169-a38d-ff22240653eb-tapc44a28a4-84: Resource type instance_network_interface does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance_network_interface does not exist (HTTP 404)
Feb 18 13:34:33 infra1-ceilometer-central-container-276155f1 ceilometer-agent-notification: 2018-02-18 13:34:30.241 396 ERROR ceilometer.dispatcher.gnocchi [-] Error creating resource c44a3f24-41e9-4c91-816c-eb85d447553f: Resource type instance does not exist (HTTP 404): ResourceTypeNotFound: Resource type instance does not exist (HTTP 404)

#an example alarm from the template

  cpu_alarm_high:
    type: OS::Aodh::Alarm
    properties:
      meter_name: cpu_util
      description: scale up if cpu is >70% for an entire minute
      statistic: avg
      period: 60
      evaluation_periods: 1
      threshold: 70
      alarm_actions:
        - {get_attr: [ scaleup_policy, alarm_url ] }
      comparison_operator: gt
      matching_metadata: { 'metadata.user_metadata.stack': { get_param: "OS::stack_id" } }

#other observations:

# gnocchi status
Unable to establish connection to http://localhost:8041/v1/status?details=False: HTTPConnectionPool(host='localhost', port=8041): Max retries exceeded with url: /v1/status?details=False (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2eec650>: Failed to establish a new connection: [Errno 111] Connection refused',))

# aodh alarm list
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
 (HTTP 503)

(openstack) alarm list
Unable to establish connection to https://192.168.10.3:8042/v2/alarms: HTTPSConnectionPool(host='192.168.10.3', port=8042): Max retries exceeded with url: /v2/alarms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))

(openstack) metric status
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>500 Internal Server Error</title>
</head><body>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error or
misconfiguration and was unable to complete
your request.</p>
<p>Please contact the server administrator at
 root@localhost to inform them of the time this error occurred,
 and the actions you performed just before this error.</p>
<p>More information about this error may be available
in the server error log.</p>
</body></html>
 (HTTP 500)

#environments are deployed with orchestration_hosts, metering-infra_hosts, metering-alarm_hosts, metrics_hosts and metering-compute_hosts, as per some of the templates. Nothing special.
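(For reference only, a minimal sketch of how those inventory groups map to hosts in openstack_user_config.yml / conf.d; the host names and IPs below are hypothetical placeholders, not taken from this environment, and the remaining groups follow the same pattern.)

metering-infra_hosts:
  infra1:
    ip: 172.29.236.11

metering-alarm_hosts:
  infra1:
    ip: 172.29.236.11

metrics_hosts:
  infra1:
    ip: 172.29.236.11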

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

Hey Gokhan, could you have a look at this, please? Thank you.

Changed in openstack-ansible:
assignee: nobody → Gökhan (skylightcoder)
Revision history for this message
Mohammed Naser (mnaser) wrote :

Hi,

# gnocchi status
Unable to establish connection to http://localhost:8041/v1/status?details=False: HTTPConnectionPool(host='localhost', port=8041): Max retries exceeded with url: /v1/status?details=False (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x2eec650>: Failed to establish a new connection: [Errno 111] Connection refused',))

You need to make sure that you use the correct openrc file settings: https://gnocchi.xyz/gnocchiclient/shell.html#openstack-keystone-authentication

# (openstack) alarm list
Unable to establish connection to https://192.168.10.3:8042/v2/alarms: HTTPSConnectionPool(host='192.168.10.3', port=8042): Max retries exceeded with url: /v2/alarms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_server_certificate', 'certificate verify failed')],)",),))

This is because HTTPS/HTTP does not seem to be properly configured; I've pushed up a patch to clean that up.

Also, can you please share a copy of your haproxy.cfg?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/556985

Changed in openstack-ansible:
assignee: Gökhan (skylightcoder) → Mohammed Naser (mnaser)
status: New → In Progress
Revision history for this message
Marek Grudzinski (ivve) wrote :

Here is the haproxy config for gnocchi & aodh. Yell here or on IRC if you need more!
What worries me most is the "503 Service Unavailable" from aodh :(

# Ansible managed

frontend gnocchi-front-1
    bind 192.168.9.9:8041 ssl crt /etc/ssl/private/haproxy.pem ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
    option httplog
    option forwardfor except 127.0.0.0/8
    option http-server-close
    reqadd X-Forwarded-Proto:\ https
    mode http
    default_backend gnocchi-back

frontend gnocchi-front-2
    bind 192.168.9.8:8041
    option httplog
    option forwardfor except 127.0.0.0/8
    option http-server-close
    mode http
    default_backend gnocchi-back

backend gnocchi-back
    mode http
    balance leastconn
    stick store-request src
    stick-table type ip size 256k expire 30m
    option forwardfor
    option httplog
    option httpchk /healthcheck

    server infra1_gnocchi_container-03bd647c 192.168.9.106:8041 check port 8041 inter 12000 rise 3 fall 3
    server infra2_gnocchi_container-bded8c51 192.168.9.169:8041 check port 8041 inter 12000 rise 3 fall 3
    server infra3_gnocchi_container-1d6c327d 192.168.9.149:8041 check port 8041 inter 12000 rise 3 fall 3

# Ansible managed

frontend aodh_api-front-1
    bind 192.168.9.9:8042 ssl crt /etc/ssl/private/haproxy.pem ciphers ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:RSA+AESGCM:RSA+AES:!aNULL:!MD5:!DSS
    option httplog
    option forwardfor except 127.0.0.0/8
    option http-server-close
    reqadd X-Forwarded-Proto:\ https
    mode http
    default_backend aodh_api-back

frontend aodh_api-front-2
    bind 192.168.9.8:8042
    option httplog
    option forwardfor except 127.0.0.0/8
    option http-server-close
    mode http
    default_backend aodh_api-back

backend aodh_api-back
    mode http
    balance leastconn
    stick store-request src
    stick-table type ip size 256k expire 30m
    option forwardfor
    option httplog
    option httpchk HEAD /
    http-check expect status 401

    server infra1_aodh_container-88075c3d 192.168.9.186:8042 check port 8042 inter 12000 rise 3 fall 3
    server infra2_aodh_container-58f1e697 192.168.9.59:8042 check port 8042 inter 12000 rise 3 fall 3
    server infra3_aodh_container-d9d9e9f8 192.168.9.191:8042 check port 8042 inter 12000 rise 3 fall 3

Revision history for this message
Gökhan (skylightcoder) wrote :

Hi, for aodh I think the problem is the haproxy http-check. Can you remove 'http-check expect status 401' from the aodh_api-back backend and then restart haproxy? After that, run 'openstack alarm list --insecure'.
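(For clarity, that would leave the backend roughly like the sketch below, based on the config pasted above. With no explicit 'http-check expect' line, haproxy treats any 2xx/3xx answer to the health check as up, which presumably matches what the aodh API returns for HEAD / here.)

backend aodh_api-back
    mode http
    balance leastconn
    stick store-request src
    stick-table type ip size 256k expire 30m
    option forwardfor
    option httplog
    option httpchk HEAD /
    # 'http-check expect status 401' removed

    server infra1_aodh_container-88075c3d 192.168.9.186:8042 check port 8042 inter 12000 rise 3 fall 3
    server infra2_aodh_container-58f1e697 192.168.9.59:8042 check port 8042 inter 12000 rise 3 fall 3
    server infra3_aodh_container-d9d9e9f8 192.168.9.191:8042 check port 8042 inter 12000 rise 3 fall 3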

Revision history for this message
Marek Grudzinski (ivve) wrote :

Hi, yup that does it. I'll apply the fix and try heat ASAP.

Revision history for this message
Mohammed Naser (mnaser) wrote :

Can you please update us if that has fixed things?

Revision history for this message
Marek Grudzinski (ivve) wrote :

Hi, sorry for the delay (Easter).
I'll post the result tomorrow!

Revision history for this message
Marek Grudzinski (ivve) wrote :

Heya,

Out-of-the-box stable/pike + the haproxy fix + the aodh fix (https://review.openstack.org/556985).

openstack alarm list --insecure works

But when trying to submit a heat template with the aodh stanza above, I still get SSL handshake errors. It's trying to use the internal endpoint over https; should it be http?
In this test OpenStack, 192.168.80.3 is the external address and 192.168.80.2 is the internal one.

b7471d1e39174377a223b01780488f5c | RegionOne | aodh | alarming | True | public | https://192.168.80.3:8042

1bc85675b0d54633930cf7dbf0632460 | RegionOne | aodh | alarming | True | internal | https://192.168.80.2:8042

test_stack01
Create Failed

Resource Create Failed: Connectfailure: Resources.Cpu Alarm High: Unable To Establish Connection To Https://192.168.80.2:8042/V2/Alarms: Httpsconnectionpool(Host='192.168.80.2', Port=8042): Max Retries Exceeded With Url: /V2/Alarms (Caused By Sslerror(Sslerror("Bad Handshake: Error([('Ssl Routines', 'Ssl23 Get Server Hello', 'Unknown Protocol')],)",),))

Revision history for this message
Marek Grudzinski (ivve) wrote :

After extensive troubleshooting I have come to the following conclusions:

Regarding the deployment:
  - The file storage driver for gnocchi doesn't work when deploying 3x controllers in HA. I haven't been able to deduce the exact nature of the problem. Lots of things across logging, ceilometer, aodh and gnocchi failed (or were skipped during playbook runs), and all the logs contained tons of errors that didn't make sense. The containers were created and the proper components were put into them, but things simply didn't work after that. I decided to go with ceph, since I use that for everything else; then node2 & node3 started to behave properly (after some minor fixes).
  - In master the following command is run for ceilometer: ceilometer-upgrade --skip-metering-database. I had to run it manually on stable/pike in all 3 containers and restart the ceilometer services.
  - A default deployment uses self-signed certs. In that case rest_notifier_ssl_verify = false must be added to aodh.conf for aodh alarms to work properly (otherwise they fail with an SSL verify error); see the sketch after this list. I propose making this the default when self-signed certs are used, and true otherwise (the upstream default is true/commented out).
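(A minimal sketch of how that last item could be applied via user_variables.yml; the override variable name aodh_aodh_conf_overrides and the [DEFAULT] section for rest_notifier_ssl_verify are assumptions to verify against the os_aodh role defaults and the aodh sample config.)

# user_variables.yml -- only needed while self-signed certificates are in use
aodh_aodh_conf_overrides:
  DEFAULT:
    rest_notifier_ssl_verify: false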

After these modifications, and using the template type below instead, things started to work properly!
The only working threshold alarm rules are these; the rest are deprecated:

gnocchi_resources_threshold_rule, gnocchi_aggregation_by_metrics_threshold_rule, or gnocchi_aggregation_by_resources_threshold_rule

Also described here: https://docs.openstack.org/aodh/pike/admin/telemetry-alarms.html

  cpu_alarm_high:
    type: OS::Aodh::GnocchiAggregationByResourcesAlarm
    properties:
      description: Scale up if CPU > 80%
      metric: cpu_util
      aggregation_method: mean
      granularity: 300
      evaluation_periods: 1
      threshold: 0.8
      resource_type: instance
      comparison_operator: gt
      alarm_actions:
        - str_replace:
            template: trust+url
            params:
              url: {get_attr: [scaleup_policy, signal_url]}
      query:
        list_join:
          - ''
          - - {'=': {server_group: {get_param: "OS::stack_id"}}}

I also recommend highlighting https://docs.openstack.org/openstack-ansible-ceph_client/pike/configure-ceph.html#configure-os-gnocchi-with-ceph-client somehow, perhaps with another example user_variables.yml.
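(Roughly along these lines; a hypothetical sketch only. The gnocchi_conf_overrides variable name and the pool/user values are assumptions to check against the os_gnocchi role defaults and the linked ceph_client guide.)

# user_variables.yml -- placeholder pool/user names, adjust to the ceph cluster
gnocchi_conf_overrides:
  storage:
    driver: ceph
    ceph_pool: metrics
    ceph_username: gnocchi
    ceph_conffile: /etc/ceph/ceph.conf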

Just a suggestion :)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/556985
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=0e5b6cbd96054a25f89a12f8ff50033e04c7dfca
Submitter: Zuul
Branch: master

commit 0e5b6cbd96054a25f89a12f8ff50033e04c7dfca
Author: Mohammed Naser <email address hidden>
Date: Tue Mar 27 10:22:01 2018 -0700

    Add missing service URLs for AODH

    The service URLs for internal and admin were not being properly
    configured so they were using the default values which means
    that any service which was making calls on the internal or
    admin URL endpoints would fail.

    This patch adds them in order to make them accessible and
    have the correct configuration (SSL or no SSL).

    Change-Id: I851893c005fc4c91a7aa9a9a979ec315e1fc500f
    Closes-Bug: #1750236

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 18.0.0.0b3

This issue was fixed in the openstack/openstack-ansible 18.0.0.0b3 development milestone.

Revision history for this message
bel (varr) wrote :

hello

I think the problem still exists, but as an SSL issue.

I used the link below to deploy OpenStack with an LVM cinder backend:
URL: https://docs.openstack.org/openstack-ansible/rocky/user/prod/example.html

but when I check the aodh service it gives me these errors:

# openstack alarm list
SSL exception connecting to https://10.205.61.25:8042/v2/alarms: HTTPSConnectionPool(host='10.205.61.25', port=8042): Max retries exceeded with url: /v2/alarms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')],)",),))

# aodh alarm list
SSL exception connecting to https://10.192.129.173:8042/v2/alarms: HTTPSConnectionPool(host='10.192.129.173', port=8042): Max retries exceeded with url: /v2/alarms (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'ssl3_get_record', 'wrong version number')],)",),))

So please, any help?
