keystone is stuck repeatedly running cluster-relation-changed hook

Bug #1755897 reported by Ashley Lai
This bug affects 4 people
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Keystone Charm  Fix Released  High        Frode Nordahl

Bug Description

keystone has been running the cluster-relation-changed hook for more than 3 hours.

2018-03-14 09:43:45 INFO juju-log Rendering from template: /etc/apache2/sites-enabled/wsgi-openstack-api.conf
2018-03-14 09:43:45 DEBUG worker.uniter.jujuc server.go:180 running hook tool "juju-log"
2018-03-14 09:43:45 INFO juju-log Wrote template /etc/apache2/sites-enabled/wsgi-openstack-api.conf.
2018-03-14 09:43:46 DEBUG worker.uniter.jujuc server.go:180 running hook tool "config-get"
2018-03-14 09:43:46 DEBUG juju.worker.uniter.remotestate watcher.go:464 got a relation units change: {76 {map[designate/2:{0}] []}}
2018-03-14 09:43:46 DEBUG worker.uniter.jujuc server.go:180 running hook tool "juju-log"
2018-03-14 09:43:46 INFO juju-log CA cert is the same as installed version
2018-03-14 09:43:46 DEBUG leader-settings-changed Considering dependency setenvif for ssl:
2018-03-14 09:43:46 DEBUG leader-settings-changed Module setenvif already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Considering dependency mime for ssl:
2018-03-14 09:43:46 DEBUG leader-settings-changed Module mime already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Considering dependency socache_shmcb for ssl:
2018-03-14 09:43:46 DEBUG leader-settings-changed Module socache_shmcb already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Module ssl already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Module proxy already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Considering dependency proxy for proxy_http:
2018-03-14 09:43:46 DEBUG leader-settings-changed Module proxy already enabled
2018-03-14 09:43:46 DEBUG leader-settings-changed Module proxy_http already enabled
......

# more similar logs with different timestamps in between

......
2018-03-14 13:25:38 INFO juju-log cluster:24: Rendering from template: /etc/apache2/sites-enabled/wsgi-openstack-api.conf
2018-03-14 13:25:38 DEBUG worker.uniter.jujuc server.go:180 running hook tool "juju-log"
2018-03-14 13:25:38 INFO juju-log cluster:24: Wrote template /etc/apache2/sites-enabled/wsgi-openstack-api.conf.
2018-03-14 13:25:38 DEBUG worker.uniter.jujuc server.go:180 running hook tool "juju-log"
2018-03-14 13:25:38 INFO juju-log cluster:24: CA cert is the same as installed version
2018-03-14 13:25:38 DEBUG cluster-relation-changed Considering dependency setenvif for ssl:
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module setenvif already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Considering dependency mime for ssl:
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module mime already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Considering dependency socache_shmcb for ssl:
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module socache_shmcb already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module ssl already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module proxy already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Considering dependency proxy for proxy_http:
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module proxy already enabled
2018-03-14 13:25:38 DEBUG cluster-relation-changed Module proxy_http already enabled

      keystone/0:
        workload-status:
          current: active
          message: Unit is ready
          since: 14 Mar 2018 10:20:04Z
        juju-status:
          current: executing
          message: running cluster-relation-changed hook
          since: 14 Mar 2018 11:38:21Z
          version: 2.4-beta1
        machine: 0/lxd/5

https://solutions.qa.canonical.com/#/qa/testRun/d4b23e5b-520f-4325-9f0b-58e018d09c0e

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Saw this again last night on a Queens deploy. We've actually been seeing this for a couple of months at least, and only recently were we able to capture adequate logs for it.

bundle:
http://paste.ubuntu.com/p/GcNFpbH7V6/

summary: - keystone is stucked in running cluster-relation-changed hook
+ keystone is stuck repeatedly running cluster-relation-changed hook
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

excerpt of keystone/0 unit log: http://paste.ubuntu.com/p/Qz5Fjp3Vj4/

Why is the relation-changed (24; keystone/2) hook repeatedly being fired? This doesn't seem at all related to the optimizations mentioned in https://review.openstack.org/#/c/520047/

Revision history for this message
Ante Karamatić (ivoks) wrote :

# grep 'Loading a previously generated admin token from' /var/log/juju/unit-keystone-2.log | wc -l
1418

And the environment is still deploying...

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In my case it's been doing this for 2 days now.

juju status keystone
Model Controller Cloud/Region Version SLA
default samaas samaas 2.4-beta1 unsupported

App Version Status Scale Charm Store Rev OS Notes
hacluster-keystone active 3 hacluster jujucharms 42 ubuntu
keystone 13.0.0 active 3 keystone jujucharms 276 ubuntu

Unit Workload Agent Machine Public address Ports Message
keystone/0 active executing 0/lxd/4 10.232.4.234 5000/tcp Unit is ready
  hacluster-keystone/0* active idle 10.232.4.234 Unit is ready and clustered
keystone/1 active executing 1/lxd/4 10.232.45.108 5000/tcp Unit is ready
  hacluster-keystone/1 active idle 10.232.45.108 Unit is ready and clustered
keystone/2* active executing 2/lxd/4 10.232.45.99 5000/tcp Unit is ready
  hacluster-keystone/2 active idle 10.232.45.99 Unit is ready and clustered
...

for i in 0 1 2 ; do juju ssh keystone/$i uptime ; done
 18:32:47 up 2 days, 13:30, 1 user, load average: 2.64, 2.97, 3.20
Connection to 10.232.45.101 closed.
 18:32:48 up 2 days, 13:29, 1 user, load average: 2.83, 3.11, 3.32
Connection to 10.232.45.108 closed.
 18:32:54 up 2 days, 13:30, 1 user, load average: 6.03, 5.42, 5.36
Connection to 10.232.5.12 closed.

At least apache2 is not restarted along with the WSGI daemon, so I was able to use the deployed environment.

systemctl status apache2
● apache2.service - LSB: Apache2 web server
   Loaded: loaded (/etc/init.d/apache2; bad; vendor preset: enabled)
  Drop-In: /lib/systemd/system/apache2.service.d
           └─apache2-systemd.conf
   Active: active (running) since Thu 2018-03-29 06:07:03 UTC; 2 days ago <-- 👁👁

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

My bundle did not have any of the ops stuff, only core services:

https://paste.ubuntu.com/p/hmZvCz4pTW/

Ryan Beisner (1chb1n)
Changed in charm-keystone:
assignee: nobody → Frode Nordahl (fnordahl)
milestone: none → 18.05
importance: Undecided → High
status: New → In Progress
Frode Nordahl (fnordahl)
Changed in charm-keystone:
status: In Progress → Triaged
Frode Nordahl (fnordahl)
Changed in charm-keystone:
status: Triaged → In Progress
Revision history for this message
Frode Nordahl (fnordahl) wrote :

The symptoms are also observable with a simpler bundle like this one: https://pastebin.ubuntu.com/p/JWCvCHWy7v/

I believe there is no looping or endlessness to this as such, but the operations done by the charm are proportional to the number of identity relations it has, and the work is further amplified by having SSL enabled and a hacluster subordinate related.

Bear in mind that the keystone charm has been around for quite some time and that all of the operations it performs were added for a reason.

It is also clear that there is a high level of redundancy in how many times said operations are currently performed, and that we need to unwind and optimize this.

To set expectations: there is no single line of code we can change to fix this quickly. We will need to improve this gradually by identifying and removing redundant calls without breaking existing functionality.

We are currently working on this, and a set of changes will be proposed as a result of this bug report.
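
[Editorial note: to illustrate the proportionality described above, here is a minimal hypothetical sketch. None of the helper names (render_api_config, sync_ca_cert, notify_identity_relation) come from the charm; it only shows the shape of the work amplification visible in the logs.]

#!/usr/bin/env python3
# Illustrative sketch only, not actual keystone charm code. One
# cluster-relation-changed event triggers work proportional to the number
# of identity-service relations and related units, with the SSL/CA
# handling re-run every time, as seen in the log excerpts above.
import subprocess


def hook_tool(*args):
    """Run a juju hook tool and return its stripped output."""
    return subprocess.check_output(args, universal_newlines=True).strip()


def render_api_config():
    # Stand-in for re-rendering wsgi-openstack-api.conf (seen in the logs).
    pass


def sync_ca_cert():
    # Stand-in for the "CA cert is the same as installed version" check.
    pass


def notify_identity_relation(rid, unit):
    # Hypothetical helper: re-publish endpoint data to one related unit.
    pass


def cluster_relation_changed():
    render_api_config()
    sync_ca_cert()
    # These loops make each cluster event cost O(relations x units).
    for rid in hook_tool('relation-ids', 'identity-service').split():
        for unit in hook_tool('relation-list', '-r', rid).split():
            notify_identity_relation(rid, unit)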

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Frode,

Hmm, with previous charm versions, the bundle I have been using for a while never resulted in looping for 2 days (see https://bugs.launchpad.net/charm-keystone/+bug/1755897/comments/8).

There has to be something new that causes this.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

If you have any numbers or recent comparisons supporting that this is caused by recent changes, that would be great to have.

In the meantime I am running tracing and profiling of the Keystone charm from the stable branch. A bundle with just two glance instances takes somewhere around two hours to settle. Some of that is surely due to the profiling code itself, but there is mischief going on here for sure.

I will update as soon as I have any conclusions to draw from the profiling, along with patch sets for review.
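
[Editorial note: for reference, a generic way to obtain such a profile is to wrap the hook entry point in cProfile. This is a standard-library sketch, not necessarily the exact tooling used here; main() is a stand-in for the charm's real entry point.]

#!/usr/bin/env python3
# Generic profiling wrapper sketch; not necessarily the tooling used for
# the analysis in this thread. Slow calls, such as subprocess invocations
# of juju hook tools, show up in the cumulative timings.
import cProfile
import pstats
import sys


def main():
    # Stand-in for the charm's real hook entry point.
    pass


if __name__ == '__main__':
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        main()
    finally:
        profiler.disable()
        # Print the 20 most expensive calls by cumulative time to stderr,
        # so they end up in the unit log alongside the hook output.
        stats = pstats.Stats(profiler, stream=sys.stderr)
        stats.sort_stats('cumulative').print_stats(20)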

Revision history for this message
Frode Nordahl (fnordahl) wrote :

A thing to note about how this can play out in different scenarios:
if one of the units slated to join the cluster relation is delayed in finishing its other relations for whatever reason, the Keystone units waiting for it to join will continuously run their cluster-relation-changed hook until it is able to complete joining.
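
[Editorial note: a hypothetical mitigation for that scenario, not current charm behaviour, would be to defer the expensive reconfiguration until all expected peers have joined. The expected_peer_count parameter is an assumption; it would have to come from charm config or elsewhere.]

#!/usr/bin/env python3
# Hypothetical guard, not existing charm code: skip the expensive
# reconfiguration while the cluster relation is still missing peers.
import subprocess


def hook_tool(*args):
    return subprocess.check_output(args, universal_newlines=True).strip()


def joined_peers():
    """Return the peer units that have joined the cluster relation."""
    peers = []
    for rid in hook_tool('relation-ids', 'cluster').split():
        peers.extend(hook_tool('relation-list', '-r', rid).split())
    return peers


def cluster_relation_changed(expected_peer_count):
    # expected_peer_count is assumed to be supplied by configuration;
    # it is not an existing charm option.
    if len(joined_peers()) < expected_peer_count:
        hook_tool('juju-log', 'cluster incomplete, deferring reconfiguration')
        return
    # ... full reconfiguration would run here ...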

Revision history for this message
Frode Nordahl (fnordahl) wrote : Re: deployment time regression

The bulk of the regression in deployment time appears to come from juju itself. From observation, each execution of a juju hook tool appears to take significantly longer with recent versions of juju than with older ones (a crude timing sketch follows the data below).

Data:
The model contains a 3-node keystone cluster, mysql, and two instances of glance (bundle: https://pastebin.ubuntu.com/p/jrpcMs7Qq8/).
Deployment runtime was measured on a non-leader unit from execution of the first hook after the `install` hook.
Deployment was deemed complete when execution of the final `cluster-relation-changed` hook finished.

Baseline: juju 2.3.5, stable/18.02 charms, juju model-config logging-config=<root>=WARNING;unit=DEBUG
1. run
  * start `2018-04-10 11:01:21`
  * end `2018-04-10 11:21:38`
  * time 20m17s (1217s)
2. run
  * start `2018-04-10 11:56:07`
  * end `2018-04-10 12:22:04`
  * time 25m57s (1557s)
3. run
  * start `2018-04-10 12:35:18`
  * end `2018-04-10 12:58:47`
  * time 23m29s (1409s)

juju 2.2.6, stable/18.02 charms, juju model-config logging-config=<root>=WARNING;unit=DEBUG
1. run
  * start `2018-04-11 06:01:21`
  * end `2018-04-11 06:14:05`
  * time 12m44s (764s)
2. run
  * start `2018-04-11 06:29:06`
  * end `2018-04-11 06:40:30`
  * time 11m24s (684s)
3. run
  * start `2018-04-11 06:56:06`
  * end `2018-04-11 07:07:12`
  * time 11m6s (666s)
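
[Editorial note: as a crude cross-check of the hook-tool overhead itself, something like the following could be run in a hook context on each juju version, for example via `juju run`. This is a sketch; `config-get` is just a representative hook tool.]

#!/usr/bin/env python3
# Crude latency probe for juju hook tools; must run inside a hook context
# (e.g. via `juju run --unit keystone/0 ...`) where the tools are on PATH.
import subprocess
import time

N = 50  # number of invocations to average over

start = time.time()
for _ in range(N):
    subprocess.check_output(['config-get', '--format=json'])
elapsed = time.time() - start
print('%d x config-get: %.2fs total, %.3fs per call' % (N, elapsed, elapsed / N))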

summary: - keystone is stuck repeatedly running cluster-relation-changed hook
+ deployment time regression
Revision history for this message
Frode Nordahl (fnordahl) wrote :

Crashdump collected from deployment done with juju 2.2.6

tags: added: cdo-qa cdo-qa-blocker foundations-engine
summary: - deployment time regression
+ keystone is stuck repeatedly running cluster-relation-changed hook
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This bug is not about a regression in deployment time; it's specifically about keystone getting stuck repeatedly running the cluster-relation-changed hook. It does not complete! That's not a regression in deployment time, it's a failure to complete the deployment.

I don't doubt that there may have been regressions in hook execution time, but that is not what this bug is about. It's about large variances, where in certain deployments keystone appears to be stuck and will never complete.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Removed the juju task; I will report that part of the findings in a separate bug report.

no longer affects: juju
Revision history for this message
John A Meinel (jameinel) wrote : Re: [Bug 1755897] Re: deployment time regression

Is this saying it takes 20 minutes to execute `juju model-config`?

Or is this saying the full deployment you are doing is taking a long time, and you're just listing model-config as "this is how the system is set up"?

It sounds like, from Jason, the issue is that keystone is causing excessive load on the system due to spinning in the cluster-relation-changed hook (and thus slowing down everything else).


Revision history for this message
Frode Nordahl (fnordahl) wrote :

20 minutes was for the full deployment of the bundle specimen in #16; model-config is listed only because log levels are relevant for execution time.

The reason for going down that path was that ongoing profiling of the Keystone charm showed elevated execution times for juju hook tools, which could explain some of the symptoms.

The differences in deployment time between juju 2.2.6 and 2.3.5 are evident, but I have since found that times improved again in 2.4-beta1, so I guess we can set this aside.

Execution times alone also do not explain all of the observed and reported behavior, so I will keep digging. Sorry about the noise.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

When this occurs again, could you record the output of a couple of runs of the following command from the deployment, along with fresh unit logs:
juju run --application keystone -- sh -c 'for unit in $(relation-list -r $(relation-ids cluster)); do relation-get -r $(relation-ids cluster) - $unit;done'

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (stable/18.02)

Fix proposed to branch: stable/18.02
Review: https://review.openstack.org/560911

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.openstack.org/560915

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-keystone (stable/18.02)

Change abandoned by Frode Nordahl (<email address hidden>) on branch: stable/18.02
Review: https://review.openstack.org/560911
Reason: Posted on wrong branch, see https://review.openstack.org/#/c/560915/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Frode, when this occurs, it's in our CI setup. We don't get a chance to run extra commands like that. Is there some additional logging that needs to be added to the charms?
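
[Editorial note: one hypothetical way to capture the equivalent data in CI, not something the charm does today, would be to log the cluster relation settings from inside the hook itself. The function below is an illustration using only real juju hook tools.]

#!/usr/bin/env python3
# Hypothetical debug aid, not existing charm code: dump cluster peer
# relation settings to the unit log so CI runs capture the data without
# anyone having to run commands by hand.
import subprocess


def hook_tool(*args):
    return subprocess.check_output(args, universal_newlines=True).strip()


def log_cluster_relation_data():
    for rid in hook_tool('relation-ids', 'cluster').split():
        for unit in hook_tool('relation-list', '-r', rid).split():
            # '-' asks relation-get for all keys of the remote unit.
            data = hook_tool('relation-get', '-r', rid, '--format=json',
                             '-', unit)
            hook_tool('juju-log', 'cluster data for %s: %s' % (unit, data))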

David Ames (thedac)
Changed in charm-keystone:
milestone: 18.05 → 18.08
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.openstack.org/560915
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=17b24e7fde8e4c8c276a4f392cbae0d1d0ed2615
Submitter: Zuul
Branch: master

commit 17b24e7fde8e4c8c276a4f392cbae0d1d0ed2615
Author: Frode Nordahl <email address hidden>
Date: Thu Apr 12 12:10:58 2018 +0200

    Remove support for PKI tokens and legacy charm managed certificates

    These features are disabled by default, a majority of our
    users provide certificates through configuration.

    At present the cluster relation carries information required
    for these features even when they are not enabled. This makes
    processing of cluster relation changes unnecessarily heavy
    and vulnerable to bugs.

    Notice of deprecation and removal in next release was given
    as part of the 18.05 release notes.

    Change-Id: I8b07c7e0d5c2c623c115c83dc8aff230b554a986
    Closes-Bug: #1755897
    Related-Bug: #1744990

Changed in charm-keystone:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-keystone:
status: Fix Committed → Fix Released