Deployment Fails With "Fetching Ceph Keyrings"

Bug #1736692 reported by JieTang
This bug affects 2 people
Affects: kolla-ansible
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

If Ceph is enabled, a deployment can fail during the "Fetching Ceph keyrings" task with the error "No JSON object could be decoded", like this:
fatal: [control01]: FAILED! => {"failed": true, "msg": "The conditional check '{{ (ceph_files_json.stdout | from_json).changed }}' failed. The error was: No JSON object could be decoded"}
The error usually occurs after a failed deployment because the environment was not completely cleaned up.
Performing the following steps allows the deployment to succeed:
1. kolla-ansible -i /data/multinode destroy --yes-i-really-really-mean-it
2. docker volume rm ceph_mon_config (on every control node; see the sketch below)
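
If there are several control nodes, step 2 can be run in one shot. Below is a minimal sketch of a helper playbook (not part of kolla-ansible itself), assuming the inventory group for the controllers is named "control"; it is equivalent to running the docker volume rm command by hand on each node.

# Hypothetical helper playbook: removes the leftover ceph_mon_config
# volume on every host in the "control" group.
- hosts: control
  become: true
  tasks:
    - name: Remove orphaned ceph_mon_config volume
      command: docker volume rm ceph_mon_config
      register: rm_volume
      # Only fail on real errors; a missing volume is fine.
      failed_when:
        - rm_volume.rc != 0
        - "'No such volume' not in rm_volume.stderr"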

I think this is a bug in kolla-ansible destroy, which does not clean up the environment completely.

JieTang (tangjie)
Changed in kolla-ansible:
assignee: nobody → JieTang (tangjie)
Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

destroy currently removes all Kolla config files, containers, volumes, and images. Are you still having the issue?

Changed in kolla-ansible:
status: New → Incomplete
Revision history for this message
B E (bilgehan3) wrote :

This "Fetching Ceph Keyrings" has been a show stopper for us.
We found a singular combination of component versions that finally worked for us but we are now stuck with this ceph problem.
Unlike as many have mentioned in several places, this problem is not necessarily related to a failed deploy.
The problem happens every time we do kolla-ansible stop then deploy.

The config is as follows:
  OS Ubuntu 18.04
  kolla 7.0.3
  kolla-ansible 7.1.1
  OpenStack Rocky
  control01 ceph
  compute01 storage (ceph-osd)
  compute02 storage (ceph-osd)
  compute03 storage (ceph-osd)

globals.yml is attached

We urgently need some guidance on how to recover from this without repeatedly losing all of our platform configuration and data.

Revision history for this message
B E (bilgehan3) wrote :

... multinode file for the previous post

Revision history for this message
Eric Miller (erickmiller) wrote :

I ran into this when rebuilding a controller node, where we have controller001 and controller002 running fine and controller003 is being rebuilt. All 3 controllers have ceph_mon deployed.

I reviewed the Ansible file here:
https://github.com/openstack/kolla-ansible/blob/stable/rocky/ansible/roles/ceph/tasks/bootstrap_mons.yml

where the delegate_host fact gets set.

This file:
https://github.com/openstack/kolla-ansible/blob/stable/rocky/ansible/roles/ceph/tasks/distribute_keyrings.yml

relies on a node other than the newly created nodes to provide the existing keyrings. In our case, the script would have to rely on controller001 or controller002 to provide the keyring data. Keyring data is pulled by executing "docker exec ceph_mon fetch_ceph_keys.py" in the distribute_keyrings.yml script.
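
For reference, the failing task follows roughly this pattern (a paraphrase based on the error message and the linked file, not the verbatim kolla-ansible task): the keyring fetch is delegated to delegate_host and its stdout is parsed as JSON, so if the delegated node has no running ceph_mon container the output is empty and from_json fails with "No JSON object could be decoded".

# Paraphrased sketch of the failing pattern; see distribute_keyrings.yml
# in the kolla-ansible ceph role for the real task.
- name: Fetching Ceph keyrings
  command: docker exec ceph_mon fetch_ceph_keys.py
  register: ceph_files_json
  delegate_to: "{{ delegate_host }}"
  run_once: true
  # Both conditionals assume stdout is valid JSON; an empty stdout
  # (no ceph_mon container on delegate_host) makes from_json error out.
  changed_when: (ceph_files_json.stdout | from_json).changed
  failed_when: (ceph_files_json.stdout | from_json).failed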

However, the logic used to determine which node is "existing" versus "new" simply looks at whether the ceph_mon_config volume was created as shown here:
https://github.com/openstack/kolla-ansible/blob/c13b8e243937ae54d2e42244cf78def8ff413ef3/ansible/roles/ceph/tasks/bootstrap_mons.yml#L14

The "ceph_mon_config_volume" flag is set if the volume exists on the node.

The issue is that if a previous deploy made it far enough to create this volume, but not far enough to create a new ceph_mon container, future deploys will fail randomly since the script can't distinguish between new and existing nodes and delegate_host can be incorrectly set to the hostname of the node we are trying to add! In our case, delegate_host was set to "controller003", the node we are trying to rebuild, whereas we needed it set to controller001 or controller002.

The command "docker exec ceph_mon fetch_ceph_keys.py" fails since there is no ceph_mon container on the new node (controller003).

The solution is to simply delete the orphaned volume on the new MON node (controller003).

See if it exists by running this on the new node (controller003):
docker volume ls

which should show the ceph_mon_config volume. If it exists, this is the issue causing the problem. So now delete the volume by running this on the new node (controller003):
docker volume rm ceph_mon_config

Afterwards, the logic that determines the delegate_host fact will pick the correct host, as long as there isn't any other issue with the installation. So, delegate_host should be set to controller001 or controller002.

Note that this works fine with the --limit flag during deployment. Some comments I have seen incorrectly assume that the --limit flag is at fault. It can be an issue, but only if you do not include a "good" ceph_mon node in the list, since the keyring must be pulled from an existing good node. So, if you are adding a new node, be sure to include a good node along with the new node in your --limit argument. In our case, limiting the install to controller002 and controller003 should work fine.

Eric

Changed in kolla-ansible:
assignee: JieTang (tangjie) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for kolla-ansible because there has been no activity for 60 days.]

Changed in kolla-ansible:
status: Incomplete → Expired