admin_domain_id in policy.json is not populated on non-leader HA units (API v3)

Bug #1637453 reported by Trent Lloyd
This bug affects 5 people
Affects: keystone (Juju Charms Collection)
Status: Fix Released
Importance: High
Assigned to: Frode Nordahl
Milestone: 17.01

Bug Description

When deploying keystone in a HA setup with multiple units and preferred-version=3, the admin_domain_id generated on the leader is not shared with the other units. This causes them to write the default string "admin_domain_id" into policy.json instead of the real value.

This value is generated on the leader and stored on disk there, but it is never transferred to peer units. It seems other information is shared via peer storage, for example in get_admin_password() and get_service_password().

Most likely the admin_domain_id and default_domain_id code should be updated to use peerstorage or leader_(set|get).
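For illustration, a minimal sketch of that approach using the leader storage helpers from charmhelpers; the helper names and the leader-storage key are my own and this is not the patch that was eventually merged:

    from charmhelpers.core.hookenv import is_leader, leader_get, leader_set

    def store_admin_domain_id(domain_id):
        # Only the leader is allowed to write to leader storage.
        if is_leader():
            leader_set({'admin_domain_id': domain_id})

    def get_admin_domain_id():
        # Available on every unit once the leader has published it;
        # returns None until then.
        return leader_get('admin_domain_id')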

lathiat@ubuntu:~/src/charms/xenial/charm-keystone$ juju run --application keystone 'grep \"cloud_admin\" /etc/keystone/policy.json'
- Stdout: |2
"cloud_admin": "rule:admin_required and domain_id:50c240d2979f48f3a73e77b81eb19c60",
UnitId: keystone/0
- Stdout: |2
"cloud_admin": "rule:admin_required and domain_id:admin_domain_id",
UnitId: keystone/1

This is populated via {{ admin_domain_id }} in templates/liberty/policy.json, which is rendered by KeystoneContext in hooks/keystone_context.py.
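The relevant template line presumably looks like this (reconstructed from the rendered output above, not quoted from the tree):

    "cloud_admin": "rule:admin_required and domain_id:{{ admin_domain_id }}",

The code that supplies the value is: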

hooks/keystone_context.py, KeystoneContext.__call__:
    ctxt['admin_domain_id'] = (
        get_admin_domain_id() or 'admin_domain_id')

hooks/keystone_hooks.py:
    def get_admin_domain_id():
        return get_file_stored_domain_id(STORED_ADMIN_DOMAIN_ID)

hooks/keystone_utils.py:
    def get_file_stored_domain_id(backing_file):
        domain_id = None
        if os.path.isfile(backing_file):
            log("Loading stored domain id from {}".format(backing_file),
                level=INFO)
            with open(backing_file, 'r') as fd:
                domain_id = fd.readline().strip('\n')
        return domain_id
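Because STORED_ADMIN_DOMAIN_ID is only ever written on the unit that created the domain (the leader), the chain above degrades on peer units roughly like this (illustrative, not captured from a real unit):

    get_admin_domain_id()        # no backing file on a non-leader -> None
    None or 'admin_domain_id'    # fallback in KeystoneContext.__call__
                                 # -> the literal placeholder string

which is exactly the value that ends up in policy.json on keystone/1 above.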

tags: added: ks-v3 openstack
Changed in keystone (Juju Charms Collection):
milestone: none → 17.01
importance: Undecided → Critical
Revision history for this message
Edward Hope-Morley (hopem) wrote :

Trent, having looked at this I think we need some more info. Specifically, what error does this actually produce on the service side? I presume that whatever the error is, it will only occur part of the time, since requests are load-balanced and those that hit the correctly configured leader will succeed. Error logs from a service endpoint that relies on these credentials would be very useful.

Changed in keystone (Juju Charms Collection):
importance: Critical → Undecided
status: New → Incomplete
Revision history for this message
Trent Lloyd (lathiat) wrote :

Tested this out on a xenial-mitaka cloud with latest charm versions and two keystone units (no hacluster subordinate).

Incidentally, I tested this after rebooting my cloud, which caused the leader unit to switch; both keystone units then ended up with a 'correct' configuration file with the domain_id populated. This happens because the code checks for an existing domain and, if one already exists, returns its ID instead of creating a new one, and writes that ID out to the file on disk. So if you test this yourself and see the value populated on both units, your leader has likely switched at some point.

Generally, on the keystone node with domain_id:admin_domain_id, the admin user is not authorised to perform most admin operations, as they all require rule:cloud_admin.

If you stop apache2 on the 'good' node (the one with the actual domain_id), you can still log in but are restricted from cloud_admin duties. For example, you cannot see the list of users in your domain (the list is blank and a permission error is shown), and the routers page loads but shows an error in the top right saying it could not retrieve the list of projects, and so on. The Users panel is the easiest thing to check and test.
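A CLI variant of the same check, assuming v3 admin credentials and that the charm's admin domain is named admin_domain (my assumption), with OS_AUTH_URL pointed at each keystone unit in turn:

    # Listing users in another domain requires rule:cloud_admin under this policy.json
    openstack user list --domain admin_domain

Against the unit rendered with the literal 'admin_domain_id' string this should be rejected with a permission error, while the correctly configured unit returns the list.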

If both nodes are running, the individual services and Horizon itself get different permission views. This appears to result in roughly every second Horizon page load giving an 'oops something went wrong' error and not rendering at all, presumably because Horizon expected an operation to work that then failed at the remote API end. I could see requests hitting both nodes in these cases, with RBAC denying them on one node and allowing them on the other.

Trent Lloyd (lathiat)
Changed in keystone (Juju Charms Collection):
status: Incomplete → Opinion
status: Opinion → Confirmed
Revision history for this message
Trent Lloyd (lathiat) wrote :

Draft patch for this issue; not quite ready to submit to Gerrit, but I will do so shortly.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/#/c/393056/

Changed in keystone (Juju Charms Collection):
assignee: nobody → Trent Lloyd (lathiat)
status: Confirmed → In Progress
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-keystone (master)

Change abandoned by Edward Hope-Morley (<email address hidden>) on branch: master
Review: https://review.openstack.org/393056
Reason: This is now replaced by https://review.openstack.org/#/c/403601/

Frode Nordahl (fnordahl)
Changed in keystone (Juju Charms Collection):
assignee: Trent Lloyd (lathiat) → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (master)

Reviewed: https://review.openstack.org/403601
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=4d2ab6668f8601a17bbda1bf89ab7633f22d8d3d
Submitter: Jenkins
Branch: master

commit 4d2ab6668f8601a17bbda1bf89ab7633f22d8d3d
Author: Frode Nordahl <email address hidden>
Date: Mon Nov 28 09:33:55 2016 +0100

    Replace local storage of domain UUIDs with leader storage

    Currently the Keystone leader charm creates new domains and stores
    the UUIDs locally on disk. This approach predates charm relation-/
    leader- storage, is error prone, and causes problems in HA setups.

    Move to leader storage and remove old interfaces. There is no need
    to migrate the on-disk stored data as it is read from the deployment
    and stored as a part of the upgrade process.

    Do not set default values for service_tenant_id, admin_domain_id and
    default_domain_id. This will cause context to be incomplete on peer
    units until the values are actually available.

    Change functional tests to run on Keystone cluster to verify contents of
    configuration and operation of services in clustered environment.

    Closes-Bug: 1637453
    Change-Id: Id0eaf7bfceead627cc691e9b52dd889d60c05fa9
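As a follow-up check once this lands, the shared values can be dumped from any unit, since juju run executes with hook tools such as leader-get available (I have not verified the exact key names used by the merged change):

    juju run --unit keystone/0 leader-get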

Changed in keystone (Juju Charms Collection):
status: In Progress → Fix Committed
Felipe Reyes (freyes)
tags: added: backport-potential sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-keystone (stable/16.10)

Fix proposed to branch: stable/16.10
Review: https://review.openstack.org/410822

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-keystone (stable/16.10)

Reviewed: https://review.openstack.org/410822
Committed: https://git.openstack.org/cgit/openstack/charm-keystone/commit/?id=f2f395f565e66708c2bc364f7cd525c4e363ea2f
Submitter: Jenkins
Branch: stable/16.10

commit f2f395f565e66708c2bc364f7cd525c4e363ea2f
Author: Frode Nordahl <email address hidden>
Date: Mon Nov 28 09:33:55 2016 +0100

    Replace local storage of domain UUIDs with leader storage

    Currently the Keystone leader charm creates new domains and stores
    the UUIDs locally on disk. This approach predates charm relation-/
    leader- storage, is error prone, and causes problems in HA setups.

    Move to leader storage and remove old interfaces. There is no need
    to migrate the on-disk stored data as it is read from the deployment
    and stored as a part of the upgrade process.

    Do not set default values for service_tenant_id, admin_domain_id and
    default_domain_id. This will cause context to be incomplete on peer
    units until the values are actually available.

    Change functional tests to run on Keystone cluster to verify contents of
    configuration and operation of services in clustered environment.

    (also fixup stable amulet tests to use python from venv).

    Closes-Bug: 1637453
    Change-Id: Id0eaf7bfceead627cc691e9b52dd889d60c05fa9
    (cherry picked from commit 4d2ab6668f8601a17bbda1bf89ab7633f22d8d3d)

Changed in keystone (Juju Charms Collection):
status: Fix Committed → Fix Released