ceilometer fails talking to keystone during metric-service-relation-changed hook; keystone is not ready

Bug #1749280 reported by Jason Hobbs
This bug affects 3 people
Affects                     Status        Importance  Assigned to
Gnocchi Charm               Fix Released  High        David Ames
OpenStack Ceilometer Charm  Fix Released  High        David Ames
OpenStack HA Cluster Charm  Invalid       Medium      David Ames
OpenStack Keystone Charm    Invalid       Medium      David Ames

Bug Description

The ceilometer charm threw an error in the metric-service-relation-changed hook, apparently after trying to talk to keystone and failing.

log of the failure: http://paste.ubuntu.com/p/8XDj2pxNyy/

That makes sense, because keystone is not yet ready; its status is "Incomplete relations: database".

http://paste.ubuntu.com/p/yrKvYZCt8K/

juju status: http://paste.ubuntu.com/p/YzkN2nH5V9/

Revision history for this message
James Page (james-page) wrote :

I see the issue here (a cascade of ready states is needed before things will actually work): ceilometer-upgrade creates resource-types in gnocchi, which obviously requires gnocchi + mysql + ceph + memcache + keystone. Gnocchi units will only present a URL when they think they have everything they need; however, keystone appears to be listening on its ports while not actually being functional, due to its missing DB configuration.

Three potential routes here:

a) Move ceilometer-upgrade into an action to be completed post deployment (as we did for heat's domain-setup action).

b) Rework the keystone charm to disable services until it's completely up and ready, so that only units which are 'complete' will listen for connections. This has added complexity in that we'll want haproxy + VIP to be configured correctly for this scenario as well, and due to the async nature of the hacluster configuration, some of that might not happen straight away.

c) Introduce the 'don't give out credentials until I'm clustered' optimization we've done in other charms to the keystone charm as well (it does not have this atm). This will delay gnocchi from going complete until keystone is fully clustered, limiting the scope for this deployment race scenario.
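
Option (c) is essentially a readiness gate on the relation handler. A minimal sketch of the idea, assuming hypothetical helper names (`expect_ha`, `ready_to_publish`) that are illustrative and not the keystone charm's actual API:

```python
# Hypothetical sketch of option (c): withhold credentials until
# clustering completes whenever a VIP is configured. All names here
# are illustrative, not the keystone charm's real API.

def expect_ha(config):
    """HA is expected whenever a VIP has been configured."""
    return bool(config.get("vip"))

def ready_to_publish(config, clustered):
    """Only hand out credentials once any expected cluster is complete."""
    if expect_ha(config) and not clustered:
        return False  # defer: leave the relation data unset for now
    return True

# VIP configured but hacluster not yet complete: hold back credentials.
print(ready_to_publish({"vip": "10.5.150.86"}, clustered=False))  # False
# No VIP configured: single-unit deploy, publish immediately.
print(ready_to_publish({}, clustered=False))  # True
```

With this gating, clients such as gnocchi simply see an incomplete identity-service relation until keystone is genuinely usable, which is what limits the scope of the race.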

Changed in charm-ceilometer:
status: New → Triaged
Changed in charm-keystone:
status: New → Triaged
Changed in charm-ceilometer:
importance: Undecided → Medium
Changed in charm-keystone:
importance: Undecided → Medium
Revision history for this message
James Page (james-page) wrote :

NOTE - this will be resolvable using `juju resolved <unit>` once deployment completes, as the services will then be in an up state.

Revision history for this message
Chris Gregan (cgregan) wrote :

We ran into this race condition again in a deploy today. Attaching crashdump.

Revision history for this message
James Page (james-page) wrote :

Adding gnocchi charm task; both the keystone and gnocchi charms must not give out access information or URLs until clustering is complete when a vip is provided via configuration.

Changed in charm-gnocchi:
status: New → Triaged
importance: Undecided → Medium
David Ames (thedac)
Changed in charm-gnocchi:
assignee: nobody → David Ames (thedac)
Changed in charm-keystone:
assignee: nobody → David Ames (thedac)
Changed in charm-gnocchi:
milestone: none → 18.02
Changed in charm-keystone:
milestone: none → 18.02
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/545491

Changed in charm-gnocchi:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

The root cause was the stable version of the hacluster charm not handling json-serialized relation data.

James Page landed this fix.

https://review.openstack.org/#/c/545521/

My change above is still worth landing. Currently testing both.

--- Root Cause ---

root@juju-a8f1ad-mojo-26:/var/lib/juju/agents/unit-gnocchi-0/charm# relation-get -r ha:44 - gnocchi/0
corosync_bindiface: eth0
corosync_mcastport: "4440"
egress-subnets: 10.5.0.18/32
ingress-address: 10.5.0.18
json_clones: '{"cl_res_gnocchi_haproxy": "res_gnocchi_haproxy"}'
json_groups: '{"grp_gnocchi_vips": "res_gnocchi_ens3_vip"}'
json_init_services: '["haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy",
  "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy", "haproxy"]'
json_resource_params: '{"res_gnocchi_ens3_vip": " params ip=\"10.5.150.86\" nic=\"ens3\"
  cidr_netmask=\"255.255.0.0\"", "res_gnocchi_haproxy": " op monitor interval=\"5s\""}'
json_resources: '{"res_gnocchi_ens3_vip": "ocf:heartbeat:IPaddr2", "res_gnocchi_haproxy":
  "lsb:haproxy"}'
private-address: 10.5.0.18

root@juju-a8f1ad-mojo-26:/var/lib/juju/agents/unit-gnocchi-0/charm# crm status
Last updated: Sat Feb 17 00:17:12 2018 Last change: Fri Feb 16 21:45:55 2018 by root via cibadmin on juju-a8f1ad-mojo-26
Stack: corosync
Current DC: juju-a8f1ad-mojo-26 (version 1.1.14-70404b0) - partition with quorum
3 nodes and 0 resources configured

Online: [ juju-a8f1ad-mojo-26 juju-a8f1ad-mojo-27 juju-a8f1ad-mojo-28 ]

Full list of resources:
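
For reference, the deserialization the fixed hacluster charm needs to perform on the json_-prefixed keys can be sketched as follows (key names are taken from the relation data above; the helper name is illustrative, not the charm's actual code):

```python
import json

def parse_relation_data(rdata):
    """Deserialize any json_-prefixed relation keys, stripping the
    prefix; pass all other keys through untouched."""
    parsed = {}
    for key, value in rdata.items():
        if key.startswith("json_"):
            parsed[key[len("json_"):]] = json.loads(value)
        else:
            parsed[key] = value
    return parsed

rdata = {
    "corosync_bindiface": "eth0",
    "json_resources": '{"res_gnocchi_ens3_vip": "ocf:heartbeat:IPaddr2"}',
}
print(parse_relation_data(rdata)["resources"])
# {'res_gnocchi_ens3_vip': 'ocf:heartbeat:IPaddr2'}
```

The stable charm treated these values as plain strings, so no resources were ever handed to crm, which matches the "0 resources configured" in the `crm status` output above.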

Changed in charm-keystone:
status: Triaged → Fix Committed
status: Fix Committed → New
Changed in charm-hacluster:
status: New → Fix Committed
importance: Undecided → Critical
assignee: nobody → James Page (james-page)
milestone: none → 18.02
status: Fix Committed → Fix Released
Chris Gregan (cgregan)
tags: added: on-site
tags: added: cpe-onsite
removed: on-site
Revision history for this message
Chris Gregan (cgregan) wrote :

Bumped to Field Critical as it is now blocking a field deployment.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI: cs:hacluster-40 (current stable) should have the fix in it; the field deploy uses cs:hacluster-39.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Tested the following scenarios with -next charms:

1) deploy a multi-network-space cloud with 17.11 reactive openstack charms and the hacluster charm (before the cs:hacluster-40 upload mentioned in #10): aodh, gnocchi, panko (self-built version), designate;
2) observe that no resources are configured per `sudo crm status`;
3) upgrade the charms from stable to -next;
4) observe that the vip-related resources are configured.

Revision history for this message
David Ames (thedac) wrote :

The hacluster VIP issue was resolved by the stable back port of hacluster.

We still have a race condition with ceilometer -> gnocchi -> keystone. I'll be adding more gating to keystone to resolve that race.

In the meantime, there is a simple workaround. After things have settled, resolve the ceilometer unit in error state:

 juju resolved ceilometer/$N

Ryan Beisner (1chb1n)
Changed in charm-gnocchi:
importance: Medium → High
Changed in charm-ceilometer:
importance: Medium → High
Changed in charm-keystone:
importance: Medium → High
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Note that our CI and field deployments use 'juju wait --workload' to let us know when things have settled.

If ceilometer enters an error state, juju wait will bail, causing apparent errors in field runs and failures in CI. For manual runs in the field, you can just watch until things look steady in juju status I guess?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Completely understand. We created --workload in juju wait for the same purpose. Unfortunately, there are race conditions where all of the best "is my deployment really ready" forethought in the world will occasionally fail, and this bug is one of them.

Revision history for this message
Ashley Lai (alai) wrote :

Please see the bundle attached. We don't run any 'juju config' on the deployment.

Revision history for this message
Ashley Lai (alai) wrote :

juju resolved did not resolve the error.

I ran 'juju resolved ceilometer/2' twice on this deployment. The crashdump is attached.

David Ames (thedac)
Changed in charm-hacluster:
status: Fix Released → Triaged
assignee: James Page (james-page) → David Ames (thedac)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.openstack.org/546806

Changed in charm-hacluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.openstack.org/546816

Changed in charm-keystone:
status: New → In Progress
Revision history for this message
Ashley Lai (alai) wrote :

Redeployed with the -next charm and hit the same error. 'juju resolved' did not fix the error. apache2 is not running on the aodh and gnocchi units: https://bugs.launchpad.net/charm-gnocchi/+bug/1750915.

juju status: https://pastebin.canonical.com/p/MmR9nRdsrz/
bundle: https://pastebin.canonical.com/p/rC4yyKhW6j/

overlay-hostnames.yaml: https://pastebin.canonical.com/p/fszx8HTGQ6/
overlay-ssl.yaml: https://pastebin.canonical.com/p/YptYYc7XDw/
overlay-mysql-noha.yaml: https://pastebin.canonical.com/p/938fM5xn6q/

deploy command:
juju deploy -m foundations-maas:admin/ssl ./bundle.yaml --overlay ./overlay-hostnames.yaml --overlay ./overlay-ssl.yaml --overlay ./overlay-mysql-noha.yaml

Revision history for this message
David Ames (thedac) wrote :

@Ashley would you mind adding comment #19 to https://bugs.launchpad.net/charm-gnocchi/+bug/1750915. The two bugs look alike but are actually unrelated. I'd like to handle the SSL issues there.

Revision history for this message
David Ames (thedac) wrote :

Update: This bug has really been multiple race conditions.
There was a race with the Database being synced in gnocchi.
There was a race with hacluster being formed with gnocchi.
All of these present when ceilometer runs ceilometer-upgrade.

Now, in what I hope is the final race condition, there is a race between the keystone catalog and the availability of the metric catalog entry.

Now gnocchi has synced its db, has a CRM cluster and receives information from keystone. Gnocchi informs ceilometer. Ceilometer attempts to run ceilometer-upgrade but keystone has not yet put the metric service in the catalog.

CRITICAL ceilometer [-] EndpointNotFound: internalURL endpoint for metric service not found

Next areas to investigate:

The interface-keystone has multiple data_complete options. Is it possible we have some but not all of the data from keystone, i.e. identity-service.available vs identity-service.available.auth?

A cursory look at keystone suggests that it registers the catalog entry contemporaneously with setting relation data. Need to confirm this is the case.

Note: Keystone is very busy executing hooks.
Could there be a delay between running the commands to publish the catalog entry and when it is actually available?
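
If there really is a window between keystone setting relation data and the catalog entry becoming visible, one mitigation would be a bounded retry around the endpoint lookup. A rough sketch, with the real gnocchi_client call stood in by a generic lookup function:

```python
import time

class EndpointNotFound(Exception):
    pass

def wait_for_endpoint(lookup, retries=5, delay=2.0):
    """Retry an endpoint lookup while keystone finishes publishing its
    catalog entry; re-raise once the retries are exhausted."""
    for attempt in range(retries):
        try:
            return lookup()
        except EndpointNotFound:
            if attempt == retries - 1:
                raise
            time.sleep(delay)

# Simulate a catalog entry that only appears on the third attempt.
calls = {"n": 0}
def flaky_lookup():
    calls["n"] += 1
    if calls["n"] < 3:
        raise EndpointNotFound("metric service not found")
    return "http://10.5.150.86:8041"

print(wait_for_endpoint(flaky_lookup, delay=0))  # http://10.5.150.86:8041
```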

Revision history for this message
James Page (james-page) wrote :

Looking at ceilometer code @ pike:

    if conf.skip_gnocchi_resource_types:
        LOG.info("Skipping Gnocchi resource types upgrade")
    else:
        LOG.debug("Upgrading Gnocchi resource types")
        from ceilometer import gnocchi_client
        gnocchi_client.upgrade_resource_types(conf)

and what the gnocchi_client module does in terms of using the configured endpoint for gnocchi vs using the service catalog: the service catalog is used, so if the endpoint was not registered at the point in time the upgrade call was made, I would expect that error. However, I'm confused as to why this is the case; gnocchi should not give out any URL before the identity-service relation is complete, which would imply that the catalog entries have been created for gnocchi.
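
The failure mode can be reproduced with a toy catalog lookup. This is only a simplification of the catalog path; the structure of the catalog entries below is made up for illustration and is not keystone's actual wire format:

```python
# Simulated service-catalog lookup: the endpoint comes from the
# catalog, not from charm configuration, so a service that is not yet
# registered fails immediately. Catalog structure is illustrative only.
class EndpointNotFound(Exception):
    pass

def internal_url(catalog, service_type):
    for entry in catalog:
        if entry["type"] == service_type:
            return entry["endpoints"]["internal"]
    raise EndpointNotFound(
        "internalURL endpoint for %s service not found" % service_type)

catalog = [{"type": "identity",
            "endpoints": {"internal": "http://10.5.0.20:5000"}}]
# 'metric' (gnocchi) not registered yet -> the exact error seen above.
try:
    internal_url(catalog, "metric")
except EndpointNotFound as e:
    print(e)  # internalURL endpoint for metric service not found
```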

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit what I believe is a manifestation of this bug again last night, but with a different traceback:

https://pastebin.canonical.com/p/5Nwh98ny3k/

It looks like the keystone charm is trying to register a service with keystone before keystone itself is set up.

bundle and overlay: https://paste.ubuntu.com/p/BzjqmWYCb6/

David Ames (thedac)
Changed in charm-ceilometer:
assignee: nobody → David Ames (thedac)
Revision history for this message
David Ames (thedac) wrote :

Jason,

The previous comment may be yet another bug.

Can you please test with the following for a few runs and see if there is a difference in the original issue?

cs:~thedac/ceilometer-0
cs:~thedac/gnocchi-0

This is based on the work with gnocchi and ceilometer:
https://review.openstack.org/#/c/545491/
https://review.openstack.org/#/c/547513/

But not on changes to hacluster and keystone which have introduced race conditions of their own.

What I want to find out is if this reduces the number of failures.

Revision history for this message
David Ames (thedac) wrote :

See my last comment on the Gnocchi SSL bug which is relevant here:
https://bugs.launchpad.net/charm-gnocchi/+bug/1750915/comments/5

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We tested with the two charms from comment #24 about 20 times over the weekend and didn't hit this bug at all, so it looks like the patches in those are effective.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-gnocchi (master)

Reviewed: https://review.openstack.org/545491
Committed: https://git.openstack.org/cgit/openstack/charm-gnocchi/commit/?id=4666b8024ca5330d1990281b19fa30bdb53007d1
Submitter: Zuul
Branch: master

commit 4666b8024ca5330d1990281b19fa30bdb53007d1
Author: David Ames <email address hidden>
Date: Fri Feb 16 23:36:10 2018 +0000

    Do not set gnocchi URL until clustering complete

    Gnocchi was providing its URL to client charms before its VIP was
    completely setup. This change checks that an hacluster relation
    exists and if so waits to provide its URL until the hacluster setup
    is complete.

    Depends-On: I23eb5e70537a62d5b9e5e24d09f37519b63a1717
    Change-Id: I3a6991ecb4eca8659c08d5c5d00d35b8d22bf79e
    Closes-Bug: #1749280

Changed in charm-gnocchi:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-ceilometer:
status: Triaged → Fix Committed
Changed in charm-hacluster:
importance: Critical → Medium
Changed in charm-keystone:
importance: High → Medium
Changed in charm-ceilometer:
milestone: none → 18.02
Revision history for this message
Chris Gregan (cgregan) wrote :

Ran into this issue again over the weekend.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The ultimate fix for this will take time. The current fix reduces the occurrence rate. Our conversation with QA was to that effect: expect a dramatic reduction in occurrence, but it will still surface. If this is not in line with expectations or if that rate is too high for a release, please notify us today. Thank you.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceilometer (master)

Fix proposed to branch: master
Review: https://review.openstack.org/550412

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceilometer (master)

Reviewed: https://review.openstack.org/550412
Committed: https://git.openstack.org/cgit/openstack/charm-ceilometer/commit/?id=f3148b9bd74bb08b840dcb45e5ce17736dcab981
Submitter: Zuul
Branch: master

commit f3148b9bd74bb08b840dcb45e5ce17736dcab981
Author: David Ames <email address hidden>
Date: Wed Mar 7 10:33:47 2018 +0100

    Run ceilometer-upgrade as an action

    The ceilometer-upgrade command needs to be run to update back end
    ceilometer data stores. When attempting to run this command during
    deploy time due to the number of required relations many inherent
    race conditions exist leading to Bug#1749280.

    This change allows the ceilometer-upgrade command to be run as an action
    post-deploy.

    Change-Id: I64a56d9a38532476b8a01df6227231a1276c708f
    Closes-Bug: #1749280

David Ames (thedac)
Changed in charm-hacluster:
status: In Progress → Invalid
Changed in charm-keystone:
status: In Progress → Invalid
Ryan Beisner (1chb1n)
Changed in charm-ceilometer:
status: Fix Committed → Fix Released
Changed in charm-gnocchi:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-keystone (master)

Change abandoned by David Ames (<email address hidden>) on branch: master
Review: https://review.openstack.org/546816
Reason: Invalid

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-hacluster (master)

Change abandoned by David Ames (<email address hidden>) on branch: master
Review: https://review.openstack.org/546806
Reason: Invalid
