Comment 54 for bug 1719436

Vern Hart (vern) wrote :

I have experienced my ceph-mon units seemingly getting stuck in "Bootstrapping MON cluster" even when the cluster is HEALTH_OK with the correct number of mons in the cluster. However, I don't think my issue is related since my keyrings (/etc/ceph/ceph.client.admin.keyring) exist and have appropriate content.

That said, I feel like this bug may be exacerbated by the fact that the hooks mon-relation-departed, mon-relation-changed, leader-settings-changed, and bootstrap-source-relation-departed all set the status to "Bootstrapping MON cluster" whenever we have at least "monitor-count" mon hosts -- regardless of whether the cluster is already bootstrapped. While that by itself may not be a big deal, the mon_relation function that these hooks call goes on to call notify_osds, notify_radosgws, and notify_client. The first of these, notify_osds, calls osd_relation for every osd host, and osd_relation in turn calls notify_radosgws and notify_client again. When you have a large number of osd hosts, this can take tens of minutes or even hours to complete -- and all the while the status remains "Bootstrapping MON cluster", which often is not true.

Let me draw a picture of the function tree to help drive home my point:

mon_relation
  ├─notify_osds
  | ├─osd_relation node 1
  | | ├─notify_radosgws
  | | └─notify_client
  | ├─osd_relation node 2
  | | ├─notify_radosgws
  | | └─notify_client
  | ├─osd_relation node 3
  | | ├─notify_radosgws
  | | └─notify_client
  | ...
  | └─osd_relation node n
  |   ├─notify_radosgws
  |   └─notify_client
  ├─notify_radosgws
  └─notify_client
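
To put rough numbers on that tree, here is a toy model of the call structure. It is not the charm's code: the function bodies are stand-ins that only count calls, and count_calls is a hypothetical helper added for illustration.

```python
from collections import Counter

# Toy model only: each stand-in records that it was called.
calls = Counter()

def notify_radosgws():
    calls["notify_radosgws"] += 1

def notify_client():
    calls["notify_client"] += 1

def osd_relation(node):
    calls["osd_relation"] += 1
    # Every per-host call re-notifies radosgws and clients.
    notify_radosgws()
    notify_client()

def notify_osds(osd_hosts):
    for node in osd_hosts:
        osd_relation(node)

def mon_relation(osd_hosts):
    notify_osds(osd_hosts)
    notify_radosgws()
    notify_client()

def count_calls(n):
    """Hypothetical helper: run the model for n osd hosts and return the tallies."""
    calls.clear()
    mon_relation(range(1, n + 1))
    return dict(calls)

print(count_calls(100))
```

With n osd hosts, the model runs notify_radosgws and notify_client n + 1 times each, although once per hook execution would be enough.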

If we could eliminate the redundant notify calls within a single hook execution, these hooks would execute significantly faster and the charm would not hang around in "Bootstrapping MON cluster". In my testing I commented out the notify_radosgws and notify_client lines in osd_relation, and the leader-settings-changed hook completed in just a few minutes instead of the 1+ hours it was taking before.
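
One less blunt alternative to commenting the lines out might look like the sketch below. This is only an illustration with stand-in function bodies: the notify_others parameter is hypothetical and does not exist in the charm. The idea is that osd_relation still notifies when it runs as its own hook, but skips the notifications when notify_osds calls it, because mon_relation performs them once at the end anyway.

```python
# Stand-in functions; the real charm functions take different arguments.
calls = []

def notify_radosgws():
    calls.append("notify_radosgws")

def notify_client():
    calls.append("notify_client")

def osd_relation(node, notify_others=True):
    # notify_others is a hypothetical flag, not part of the charm.
    calls.append(f"osd_relation {node}")
    if notify_others:
        notify_radosgws()
        notify_client()

def notify_osds(osd_hosts):
    for node in osd_hosts:
        # The caller (mon_relation) notifies radosgws and clients itself.
        osd_relation(node, notify_others=False)

def mon_relation(osd_hosts):
    notify_osds(osd_hosts)
    notify_radosgws()
    notify_client()

mon_relation([1, 2, 3])
print(calls)
```

In this model each notify function runs exactly once per mon_relation call, no matter how many osd hosts there are, while a direct call to osd_relation behaves as before.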

For me, this was enough to get out of the "Bootstrapping MON cluster" purgatory.