Charm requests keystone data before keystone is in ready state

Bug #1818113 reported by Alexander Balderson
This bug affects 4 people
Affects                          Status         Importance  Assigned to        Milestone
OpenStack Keystone Charm         Fix Released   High        Unassigned
OpenStack Octavia Charm          Fix Released   High        David Ames
charm-interface-keystone         Fix Released   High        David Ames
charm-openstack-service-checks   Fix Released   Medium      Drew Freiberger

Bug Description

On a deploy with a clustered keystone, one keystone unit is ready and the other two are waiting. The openstack-service-checks charm requested data from keystone, resulting in an HTTP 500 error.

  keystone:
    charm: cs:keystone-292
    series: bionic
    os: ubuntu
    charm-origin: jujucharms
    charm-name: keystone
    charm-rev: 292
    exposed: false
    application-status:
      current: waiting
      message: 'Incomplete relations: database'
      since: 28 Feb 2019 04:58:47Z
    relations:
      cluster:
      - keystone
      ha:
      - hacluster-keystone
      identity-credentials:
      - ceilometer
      - openstack-service-checks
      - prometheus-openstack-exporter
      identity-service:
      - aodh
      - ceph-radosgw
      - cinder
      - designate
      - glance
      - gnocchi
      - heat
      - neutron-api
      - nova-cloud-controller
      - openstack-dashboard
      juju-info:
      - filebeat
      - telegraf
      nrpe-external-master:
      - nrpe-container
      shared-db:
      - mysql

2019-02-28 05:35:23 ERROR juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 73, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/charm/reactive/service_checks.py", line 184, in render_config
    render_checks()
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/charm/reactive/service_checks.py", line 162, in render_checks
    endpoint_checks = create_endpoint_checks()
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/charm/reactive/service_checks.py", line 236, in create_endpoint_checks
    endpoints = keystone_client.endpoints.list()
  File "/usr/lib/python3/dist-packages/keystoneclient/v3/endpoints.py", line 125, in list
    **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/base.py", line 75, in func
    return f(*args, **new_kwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/base.py", line 397, in list
    self.collection_key)
  File "/usr/lib/python3/dist-packages/keystoneclient/base.py", line 125, in _list
    resp, body = self.client.get(url, **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneauth1/adapter.py", line 304, in get
    return self.request(url, 'GET', **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneauth1/adapter.py", line 463, in request
    resp = super(LegacyJsonAdapter, self).request(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneauth1/adapter.py", line 189, in request
    return self.session.request(url, method, **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/session.py", line 343, in request
    auth_headers = self.get_auth_headers(auth)
  File "/usr/lib/python3/dist-packages/keystoneclient/session.py", line 660, in get_auth_headers
    return auth.get_headers(self, **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/auth/base.py", line 159, in get_headers
    token = self.get_token(session)
  File "/usr/lib/python3/dist-packages/keystoneclient/auth/identity/base.py", line 206, in get_token
    return self.get_access(session).auth_token
  File "/usr/lib/python3/dist-packages/keystoneclient/auth/identity/base.py", line 252, in get_access
    self.auth_ref = self.get_auth_ref(session)
  File "/usr/lib/python3/dist-packages/keystoneclient/auth/identity/v3/base.py", line 189, in get_auth_ref
    authenticated=False, log=False, **rkwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/session.py", line 543, in post
    return self.request(url, 'POST', **kwargs)
  File "/usr/lib/python3/dist-packages/keystoneclient/session.py", line 443, in request
    raise exceptions.from_response(resp, method, url)
keystoneauth1.exceptions.http.InternalServerError: An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-58a1d5ba-d15c-41ea-8f94-cb6b01a16356)

2019-02-28 05:35:23 ERROR juju.worker.uniter.operation runhook.go:132 hook "update-status" failed: exit status 1

Full crashdump is attached

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I suspect this is something to do with the recent updates to the charm.

It's not clear whether keystone is ready or not. In Alex's run it shows as not ready in juju status, but all of the units have reported being ready at some point:

http://paste.ubuntu.com/p/Pz64Yt8rBv/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Subscribed to field-high as this is causing a sudden spike in failures.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Hi Alex, could you please provide the bundles you are using so that we can chase this down? Have those bundles changed recently (i.e. are they using stable or next, etc.)?

Thanks.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This is happening in the new create_endpoint_checks() that was added here:

https://git.launchpad.net/charm-openstack-service-checks/commit/?id=7895a97e4ee052c44971576a691e4ce23d6969f2

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

@ajkavanagh here is the bundle http://paste.ubuntu.com/p/b5YXFdyPfp/

We haven't changed the bundle recently; the change that triggered this was to the openstack-service-checks charm. It's not clear to me where the root cause is.

Revision history for this message
Xav Paice (xavpaice) wrote :

Looks like openstack-service-checks is using the keystone relation before keystone is ready - shouldn't keystone be blocking the relation until it's ready?

The relation to Keystone has been there for a while, but the recent change actually uses that information to query Keystone immediately - so the relation shouldn't be made available until Keystone is actually ready to service requests.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Just after the identity-credentials-relation-changed hook ran, in which keystone passed us authentication credentials, keystone was already unavailable by the next update-status hook when the charm tried to use it.

from openstack-service-checks-0.log:
2019-02-28 05:32:06 INFO juju-log identity-credentials:220: Reactive main running for hook identity-credentials-relation-changed
2019-02-28 05:32:12 INFO juju-log identity-credentials:220: render_config: Got credentials for username=nagios
2019-02-28 05:32:28 INFO juju.worker.uniter.operation runhook.go:138 ran "identity-credentials-relation-changed" hook
2019-02-28 05:33:35 INFO juju-log Reactive main running for hook update-status
2019-02-28 05:33:41 ERROR juju-log Hook error:
keystoneauth1.exceptions.http.InternalServerError: An unexpected error prevented the server from fulfilling your request. (HTTP 500) (Request-ID: req-058d844c-4b70-42f0-8bda-a7c3b6789bde)

So, somehow the keystone charm is able to create the user account from the relation hooks, but the endpoint is not yet available to authenticate against. This looks like either an inconsistency or a race condition in keystone's clustering.

I also see a possible race condition where render_config may be called before keystone is ready if the nrpe-external-master relation happens first.

Keystone's shared-db isn't even set up until 05:35 or later, so it appears that keystone is able to provide credentials for users that "will" be created once the database is online?

from keystone/0:
2019-02-28 05:35:38 INFO juju-log identity-service:41: database relation's interface, shared-db, is related awaiting the following data from the relationship: database_host, database_password.

Long term, this should probably be handled with broker-type handling, as is done with ceph-mon:client relations. Short term, we'll catch the exception and work around it.
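
A rough illustration of that short-term workaround (not the charm's actual patch; the helper name here is made up), wrapping the endpoint listing so that a 500 from a not-yet-ready keystone no longer fails the hook:

from keystoneauth1.exceptions.http import InternalServerError

def safe_list_endpoints(keystone_client):
    """Return keystone endpoints, or an empty list while keystone is not ready."""
    try:
        return keystone_client.endpoints.list()
    except InternalServerError:
        # A non-ready cluster member answered with HTTP 500; skip the endpoint
        # checks for now and let a later hook (e.g. update-status) retry.
        return []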

Changed in charm-openstack-service-checks:
assignee: nobody → Drew Freiberger (afreiberger)
importance: Undecided → Medium
status: New → In Progress
Xav Paice (xavpaice)
Changed in charm-openstack-service-checks:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
Revision history for this message
Xav Paice (xavpaice) wrote :

Update released in cs:~canonical-bootstack/openstack-service-checks-11

Perhaps we should be altering the Keystone charm to use goal state? The fact remains that until the Keystone cluster is ready (even just one unit), it shouldn't be completing the identity-credentials relation, and non-ready units shouldn't be serving up 500s to clients.

James Page (james-page)
Changed in charm-keystone:
status: New → Triaged
importance: Undecided → Wishlist
importance: Wishlist → Low
Revision history for this message
Drew Freiberger (afreiberger) wrote :

We found a race condition between layer:apt installing packages and the reactive handlers in service_checks.py kicking off, which led to imports in service_checks.py failing because the apt packages providing those Python dist-packages were not yet installed. We have updated the code to use layer:basic's package installs at bootstrap and are working on review/release.

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Something very similar to this came up in another run, but at a slightly different place:

2019-04-23 12:39:50 ERROR juju-log identity-credentials:240: Hook error:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 440, in send
    timeout=timeout
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "/usr/lib/python3/dist-packages/urllib3/util/retry.py", line 357, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/var/lib/juju/agents/unit-openstack-service-checks-0/.venv/lib/python3.6/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 387, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 383, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.6/http/client.py", line 1331, in getresponse
    response.begin()
  File "/usr/lib/python3.6/http/client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python3.6/http/client.py", line 266, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/keystoneclient/session.py", line 462, in _send_request
    resp = self.session.request(method, url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 630, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/adapters.py", line 490, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('...


Revision history for this message
Alexander Balderson (asbalderson) wrote :

Attaching logs

Revision history for this message
Alexander Balderson (asbalderson) wrote :

Saw a very similar issue with octavia today:
It requested data from keystone but keystone was not yet available.

One keystone unit is active; the other two are waiting on the database relation and running the leader-settings-changed hook.

Attaching the crashdump

Revision history for this message
Alexander Balderson (asbalderson) wrote :

attaching bundle

Revision history for this message
Alexander Balderson (asbalderson) wrote :

versioned overlay

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The example bundles from comments 13 through 15 seem to be missing a race-condition avoidance requirement: specifying cluster_count on hacluster. Can you confirm that this problem still exists when that is set?

Changed in charm-octavia:
status: New → Incomplete
Revision history for this message
Michael Skalka (mskalka) wrote :

Ryan,

The hacluster charm defaults that value to 3: https://jaas.ai/hacluster#charm-config-cluster_count. As we do not override that value in our bundle and we deploy 3 units of each service, I don't see how that would come into play.

Changed in charm-octavia:
status: Incomplete → New
Changed in charm-keystone:
status: Triaged → New
Revision history for this message
Ryan Beisner (1chb1n) wrote :

Actually, the hacluster charm config default of 3 units happens to align with the deployed number of units, and I would expect that to squash my concern in my comment #16.

Revision history for this message
Ashley Lai (alai) wrote :

We hit it again. Crashdump is attached.

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

Happens from time to time in our deployment as well.

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

However, it affects the octavia charm, not the openstack-service-checks charm. Symptoms: https://paste.ubuntu.com/p/vjCD4vrV9z/

Revision history for this message
Nikolay Vinogradov (nikolay.vinogradov) wrote :

'juju resolved' fixes the problem

Revision history for this message
John George (jog) wrote :

Solutions QA continues to hit this bug. All occurrences are listed here:
https://solutions.qa.canonical.com/#/qa/bug/1818113

Revision history for this message
John George (jog) wrote :

SQA hit this bug 3 more times over the weekend.

Liam Young (gnuoy)
Changed in charm-keystone:
status: New → Confirmed
importance: Low → High
assignee: nobody → Liam Young (gnuoy)
milestone: none → 20.05
Revision history for this message
Liam Young (gnuoy) wrote :

The crashdump from comment #13 shows octavia trying to communicate with other services once it has received credentials from keystone. Looking at the logs, all octavia units fail running setup_hm_port at around 11:48. But looking at the keystone log at the same time, keystone/0 and keystone/2 have yet to set up their database connection and are still talking to their local sqlite db, which results in them returning a 500 error.

Looking at the keystone charm, it delays advertising client credentials until it is ready. Unfortunately it only checks its own readiness and not that of its peers. This can lead to the situation from comment #13 where only one keystone unit is able to service requests, so there is a 2 in 3 chance of getting a 500 back.
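
A minimal sketch of the peer-aware readiness check described above (hypothetical function names and data, not the charm's code): the application should only be treated as ready when the local unit is ready and every peer has also reported ready over the cluster relation.

def local_unit_ready(db_initialised, services_running):
    # Local checks only: database migrated and API services running (simplified).
    return db_initialised and services_running

def application_ready(local_ready, peer_states):
    # peer_states: readiness values the peers publish on the cluster relation,
    # e.g. {'keystone/1': 'READY', 'keystone/2': 'NOTREADY'} (illustrative).
    return local_ready and all(state == 'READY' for state in peer_states.values())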

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.opendev.org/721558

Changed in charm-keystone:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.opendev.org/721558
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=53bcfd0a064c11d90009d42f9fbe61d79f963e9c
Submitter: Zuul
Branch: master

commit 53bcfd0a064c11d90009d42f9fbe61d79f963e9c
Author: Liam Young <email address hidden>
Date: Tue Apr 21 09:26:13 2020 +0000

    Check peers before emitting identity data

    The checks that keystone was performing before emitting identity
    data were applicable to any Openstack api charm so the check
    definitions have been moved to charmhelpers so other charms can
    use them. The checks as they were are encapsulated in
    `check_api_unit_ready` *1

    Bug 1818113 was caused by keystone emitting identity data
    as soon as the leader was ready but ignoring the state of the
    peer units. This is now covered by a new check
    `check_api_application_ready` which performs all the local
    unit checks and then checks that all peers have reported
    as ready too.

    In addition `check_api_unit_ready` is now used when
    setting the units workload status and `check_api_application_ready`
    is used when setting the application workload status.

    *1 https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/utils.py#L2289
    *2 https://github.com/juju/charm-helpers/blob/master/charmhelpers/contrib/openstack/utils.py#L2330

    Change-Id: I99830ab2c2482e8beef174424820907ce96fd5d7
    Closes-Bug: #1818113
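
For orientation, a hedged usage sketch of the helper named above. check_api_application_ready does exist in charm-helpers (see the *1/*2 links), but the return value assumed here, a (ready, message) pair, is an assumption; verify against the linked source.

from charmhelpers.contrib.openstack.utils import check_api_application_ready

def maybe_emit_identity_data(emit_fn):
    # Assumption: the helper returns a (ready, message) pair.
    app_ready, msg = check_api_application_ready()
    if not app_ready:
        # Defer: at least one unit (local or a peer) is not ready yet.
        return msg
    emit_fn()  # safe: every keystone unit can service requests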

Changed in charm-keystone:
status: In Progress → Fix Committed
Revision history for this message
Ashley Lai (alai) wrote :

We are still hitting this bug using the keystone-next charm. The artifacts can be found at the bottom of the test run.

https://solutions.qa.canonical.com/#/qa/testRun/753c0272-f052-4450-8a5b-e205760774e6

Changed in charm-keystone:
status: Fix Committed → New
Revision history for this message
Ashley Lai (alai) wrote :

https://solutions.qa.canonical.com/#/qa/testRun/794aaeef-0d8d-438c-b21b-55faec2b8380

In the run above, one of the keystone units is updating the endpoint for octavia. Not sure if this helps with debugging.

keystone/2* maintenance executing 12/lxd/7 10.244.8.153 5000/tcp Updating endpoint for octavia

Revision history for this message
Ashley Lai (alai) wrote :

octavia/2 failed to connect to keystone in the run above.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Basing this on the premise that this stable run is actually using the next charms as reported:
https://solutions.qa.canonical.com/#/qa/testRun/753c0272-f052-4450-8a5b-e205760774e6

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Looking at the artifacts referenced in #33

    units:
      keystone/0:
        workload-status:
          current: active
          message: Unit is ready
          since: 07 May 2020 20:14:44Z
        juju-status:
          current: executing
          message: running identity-service-relation-joined hook
          since: 07 May 2020 20:14:55Z
          version: 2.7.6
        leader: true
      keystone/1:
        workload-status:
          current: blocked
          message: Charm and its dependencies not yet at expected scale
          since: 07 May 2020 20:04:49Z
        juju-status:
          current: executing
          message: running identity-credentials-relation-changed hook
          since: 07 May 2020 20:15:13Z
      keystone/2:
        workload-status:
          current: blocked
          message: Charm and its dependencies not yet at expected scale
          since: 07 May 2020 20:03:55Z
        juju-status:
          current: executing
          message: running identity-service-relation-joined hook
          since: 07 May 2020 20:15:17Z

octavia_0/var/log/juju/unit-octavia-0.log:2020-05-07 20:09:25 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available
octavia_0/var/log/juju/unit-octavia-0.log:2020-05-07 20:09:28 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available.auth
octavia_1/var/log/juju/unit-octavia-1.log:2020-05-07 20:09:24 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available
octavia_1/var/log/juju/unit-octavia-1.log:2020-05-07 20:09:27 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available.auth
octavia_2/var/log/juju/unit-octavia-2.log:2020-05-07 20:09:23 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available
octavia_2/var/log/juju/unit-octavia-2.log:2020-05-07 20:09:26 DEBUG juju-log identity-service:135: tracer: set flag identity-service.available.auth

So either the fix provided for Keystone does not cover all cases, or something is wrong with the detection of availability in the keystone reactive interface.

While we look for a local reproducer, would it be possible to get a dump of relation data the next time this hits:

- from the peer relation from the perspective of all the keystone units
- from the identity-service relation from the perspective of all the octavia units

Changed in charm-keystone:
status: New → In Progress
assignee: Liam Young (gnuoy) → Aurelien Lourot (aurelien-lourot)
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

debug_log.txt contains lines like:

unit-keystone-2: 19:59:30 DEBUG unit.keystone/2.juju-log cluster:30: Telling peers this unit is: NOTREADY
unit-keystone-2: 20:12:50 INFO unit.keystone/2.juju-log Keystone charm unit not ready - deferring identity-relation updates

but never contains lines like:

DEBUG unit.keystone/?.juju-log Telling peers this unit is: READY
DEBUG unit.keystone/?.juju-log Application Ready
INFO unit.keystone/?.juju-log Firing identity_credentials_changed hook for all related services.

Based on the source code, this proves that the keystone charm correctly assessed that at least one unit wasn't ready, and thus didn't mark the application as ready and didn't update the identity-* relations. Debugging the charm locally shows that these things do happen correctly later on.

However, there is a special code path [0] which sets 'service_password' on the identity-service relation with no regard for peer readiness whatsoever. This path ends up printing the following line, which has been seen early in the logs:

unit-keystone-1: 19:57:06 INFO unit.keystone/1.juju-log identity-service:115: Deferring identity_changed() to service leader.

I don't have a precise solution yet but I'm now pretty sure this is the root cause.

0: https://github.com/openstack/charm-keystone/blob/master/hooks/keystone_hooks.py#L486

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (master)

Fix proposed to branch: master
Review: https://review.opendev.org/729781

David Ames (thedac)
Changed in charm-keystone:
milestone: 20.05 → 20.08
Revision history for this message
Michael Skalka (mskalka) wrote :

David,

Has there been any progress on this issue? We are hitting this extremely often in our stable and master testing: https://solutions.qa.canonical.com/#/qa/bug/1818113

Revision history for this message
David Ames (thedac) wrote :

Michael,

I believe Aurelien [0] is working on this. I will double check with him and have him update the bug.

[0] https://review.opendev.org/#/c/729781/

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Hi, this is currently blocked on this fix [0] being packaged and published.

0: https://bugs.launchpad.net/ubuntu/+source/python-oslo.policy/+bug/1880959

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Still blocked. Temporarily published to https://jaas.ai/u/aurelien-lourot/keystone for testing.

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Still blocked as long as https://bugs.launchpad.net/ubuntu/+source/python-oslo.policy/+bug/1880959 hasn't been backported to Queens

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We've been testing with @aurelien-lourot's fix and appear to still hit this issue; here is an example test run:

https://solutions.qa.canonical.com/qa/testRun/b6233bf9-3d55-421d-a463-7e5465e29021

This uses cs:~aurelien-lourot/keystone-1

Revision history for this message
David Ames (thedac) wrote :

After some more analysis, we can also attack this bug from the Octavia side. The root cause is in Keystone, but Octavia is also not handling the error condition gracefully.

A trivial change may get this bug unblocked:

https://review.opendev.org/741304

Changed in charm-octavia:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 20.08
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

I confirm that the last failing run is running my charm and that it contains my fix, since I can see these new traces [0] in debug_log.txt.

Note that in the log I can see one unit (keystone/1) that has never been considered ready, and that keystone has never advertised the application as ready. This is also reflected in juju_status.yaml:

  keystone:
    charm: cs:~aurelien-lourot/keystone-1
    series: bionic
    os: ubuntu
    charm-origin: jujucharms
    charm-name: keystone
    charm-rev: 1
    charm-version: 07de96c
    exposed: false
    application-status:
      current: waiting
      message: Some units are not ready
      since: 13 Jul 2020 16:42:12Z

So indeed I would now start looking at keystone's consumers as potential culprits.

0: https://github.com/juju/charm-helpers/pull/476/files#diff-935a00f40074c3c59d617c88f71e42efR2246

Revision history for this message
Michael Skalka (mskalka) wrote :

@aurelien-lourot

Not sure if this is related to your fix, but in https://solutions.qa.canonical.com/qa/testRun/5589fe2d-bac5-4f35-80a1-57f4a3e454c4 it took the keystone charm 45 minutes to recognize that percona was clustered and ready before starting any endpoint updates (those updates then ran for 1.5+ hours before the run timed out). This is not behavior we had previously seen from keystone before using your fork, and it seems like an inordinate amount of time for a deployment which already takes 5+ hours.

/var/log/juju/unit-mysql-1.log
...
2020-07-15 20:29:50 DEBUG juju-log cluster:32: All 3 percona units reporting clustered
...

/var/log/juju/unit-keystone-1.log
...
2020-07-15 21:19:02 DEBUG juju-log identity-service:144: Not at expected scale, not enough units on shared-db relation
...
...
2020-07-15 21:23:12 INFO juju-log shared-db:80: Updating endpoint for gnocchi
...

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

We found the new culprit:

According to the logs, Octavia fails in `setup_hm_port()`, which is gated on the condition `@reactive.when('identity-service.available')`. We also know from the previous keystone fix [0] that the issue was keystone setting `service_password` too early on the `identity-service` interface, thus signaling application readiness too early. The implementation of the interface [1], however, shows that `service_password` is taken into account in `identity-service.available.auth` and not in `identity-service.available`. So either:

a. Octavia is waiting on the wrong condition, or
b. the definition of `identity-service.available` should be fixed to wait for `service_password`.

We prefer `b` as waiting for `{relation_name}.available` for application readiness is a widely used pattern and we would rather not have a special case for keystone.

0: https://review.opendev.org/729781
1: https://github.com/openstack/charm-interface-keystone/blob/master/requires.py#L91
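
A rough sketch of option b under charms.reactive (simplified; the real field list lives in requires.py, linked above as [1]): the `.available` flag is only set once the auth data, including `service_password`, has arrived, and `.available.auth` is still set alongside it for charms that rely on it. The key names below are an illustrative subset, not the interface's actual list.

from charms.reactive import set_flag, clear_flag

AUTH_KEYS = ('service_host', 'service_username', 'service_password', 'service_tenant')

def update_availability(relation_name, relation_data):
    # Only advertise availability once all auth data (incl. service_password) is present.
    if all(relation_data.get(key) for key in AUTH_KEYS):
        set_flag('{}.available'.format(relation_name))
        set_flag('{}.available.auth'.format(relation_name))  # kept for compatibility
    else:
        clear_flag('{}.available'.format(relation_name))
        clear_flag('{}.available.auth'.format(relation_name))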

Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

Thanks @mskalka for noticing! In the previous run with my charm [0] it took only 5 minutes (not 45):

unit-mysql-2: 16:51:01 DEBUG unit.mysql/2.juju-log All 3 percona units reporting clustered
...
unit-keystone-0: 16:55:55 INFO unit.keystone/0.juju-log identity-service:88: Updating endpoint for gnocchi

I went through my patch [1] again and I don't think it can cause this, but I might not be very objective. I'm now checking the latest mentioned run [2].

0: https://solutions.qa.canonical.com/qa/testRun/b6233bf9-3d55-421d-a463-7e5465e29021
1: https://review.opendev.org/729781
2: https://solutions.qa.canonical.com/qa/testRun/5589fe2d-bac5-4f35-80a1-57f4a3e454c4

David Ames (thedac)
Changed in charm-interface-keystone:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → David Ames (thedac)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-interface-keystone (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741547

Changed in charm-interface-keystone:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.opendev.org/729781
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=8b46dfd637908adeb5a817f85fc97fad249be8a0
Submitter: Zuul
Branch: master

commit 8b46dfd637908adeb5a817f85fc97fad249be8a0
Author: Aurelien Lourot <email address hidden>
Date: Wed May 20 10:28:59 2020 +0200

    Fix peer readiness detection

    Sharing the admin password with peers over the 'cluster' relation
    was needed in case the leader would die and the next leader would
    then need that information. This was implemented years ago when
    the leader DB didn't exist.

    This led to a race condition described in the mentioned bug and
    can now be safely removed.

    Validated by deploying several keystone and glance units, then
    removing the keystone leader, then adding a glance unit and
    checking that this new unit gets its service credentials.

    Also added useful traces, made linter happy and increased fernet
    token expiration to avoid spurious test failures.

    Closes-Bug: #1818113
    Change-Id: I004903e50f51e190467d71691982de26518d7149

Changed in charm-keystone:
status: In Progress → Fix Committed
Changed in charm-keystone:
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Revision history for this message
Aurelien Lourot (aurelien-lourot) wrote :

@mskalka I'm not sure why the latest run [0] timed out but from the debug_log.txt I see only 3 minutes (not 45):

unit-mysql-2: 22:49:36 DEBUG unit.mysql/2.juju-log All 3 percona units reporting clustered
...
unit-keystone-1: 22:52:55 INFO unit.keystone/1.juju-log identity-service:100: Updating endpoint for gnocchi

I suggest we continue this topic in a separate bug if needed because I doubt this is a regression introduced here. (I hope I'm not wrong.)

BTW you can now stop using ~aurelien-lourot/keystone as the fix [1] has now been published to ~openstack-charmers. And you were right, the issue is still not fully solved and we're working on the next fix [2].

0: https://solutions.qa.canonical.com/qa/testRun/5589fe2d-bac5-4f35-80a1-57f4a3e454c4
1: https://review.opendev.org/729781
2: https://review.opendev.org/741547

David Ames (thedac)
Changed in charm-octavia:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-interface-keystone (master)

Reviewed: https://review.opendev.org/741547
Committed: https://git.openstack.org/cgit/openstack/charm-interface-keystone/commit/?id=a48a1e826ac0a74ab05a16c441cda4e4b34b1241
Submitter: Zuul
Branch: master

commit a48a1e826ac0a74ab05a16c441cda4e4b34b1241
Author: David Ames <email address hidden>
Date: Thu Jul 16 14:14:44 2020 -0700

    Collapse available and available.auth

    Most reactive charms react on identity-service.available but the
    current interface distinguishes between available and available.auth. It
    is somewhat assumed by most charms that identity-service.available is
    equivalent to identity-service.available.auth, as what charms are
    concerned with is the ability to authenticate against the cloud.

    Collapse the difference between identity-service.available and
    identity-service.available.auth.

    Continue to set identity-service.available.auth for any charms that may
    rely on it.

    Change-Id: I494feea5f3ef8706140ce712b1e025e52b0dfbd1
    Closes-Bug: #1818113

Changed in charm-interface-keystone:
status: In Progress → Fix Released
James Page (james-page)
Changed in charm-octavia:
milestone: 20.08 → none
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-octavia (master)

Reviewed: https://review.opendev.org/741304
Committed: https://git.openstack.org/cgit/openstack/charm-octavia/commit/?id=7be811606cd9000d57ef4483b9b266e9f613f4dd
Submitter: Zuul
Branch: master

commit 7be811606cd9000d57ef4483b9b266e9f613f4dd
Author: David Ames <email address hidden>
Date: Wed Jul 15 11:37:41 2020 -0700

    Catch keystone InternalServerError in try

    Add keystoneauth1.exceptions.http.InternalServerError to exceptions
    caught gracefully when setting up Octavia during deploy time.

    Change-Id: I100cb1cfbff0b764932acccb0fdbb76ce6d1715c
    Closes-Bug: #1818113

Changed in charm-octavia:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-octavia:
milestone: none → 20.08
Changed in charm-keystone:
status: Fix Committed → Fix Released
Changed in charm-octavia:
status: Fix Committed → Fix Released