I have experienced my ceph-mon units seemingly getting stuck in "Bootstrapping MON cluster" even when the cluster is HEALTH_OK with the correct number of mons. However, I don't think my issue is related, since my keyrings (/etc/ceph/ceph.client.admin.keyring) exist and have appropriate content.
That said, this bug may be exacerbated by the fact that the mon-relation-departed, mon-relation-changed, leader-settings-changed, and bootstrap-source-relation-departed hooks all set the status to "Bootstrapping MON cluster" whenever we have at least "monitor-count" mon hosts -- regardless of whether the cluster is already bootstrapped. On its own that might not be a big deal, but the mon_relation function those hooks invoke calls notify_osds, notify_radosgws, and notify_client. The first of these, notify_osds, calls osd_relation for every OSD host, and osd_relation in turn calls notify_radosgws and notify_client again. When you have a large number of OSD hosts, this can take tens of minutes or even hours to complete -- all the while the status remains "Bootstrapping MON cluster", which is often not true.
Let me draw a picture of the function tree to help drive home my point:
mon_relation
├─notify_osds
| ├─osd_relation node 1
| | ├─notify_radosgws
| | └─notify_client
| ├─osd_relation node 2
| | ├─notify_radosgws
| | └─notify_client
| ├─osd_relation node 3
| | ├─notify_radosgws
| | └─notify_client
| ...
| └─osd_relation node n
| ├─notify_radosgws
| └─notify_client
├─notify_radosgws
└─notify_client
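To make the fan-out concrete, here is a minimal Python sketch (hypothetical stand-ins, not the charm's actual code) that mirrors the tree above and counts how many times each notify helper runs for n OSD hosts:

```python
# Hypothetical simplified model of the call tree above -- the real charm
# functions take relation arguments and do actual work; here we only count.
calls = {"notify_radosgws": 0, "notify_client": 0}

def notify_radosgws():
    calls["notify_radosgws"] += 1

def notify_client():
    calls["notify_client"] += 1

def osd_relation(host):
    # ...send relation data to this OSD host, then re-notify everyone...
    notify_radosgws()
    notify_client()

def notify_osds(hosts):
    for host in hosts:
        osd_relation(host)

def mon_relation(hosts):
    notify_osds(hosts)
    notify_radosgws()
    notify_client()

mon_relation(range(100))
print(calls)  # each helper runs 101 times for 100 OSD hosts
```

So the notify work grows linearly with the number of OSD hosts, even though one notification per hook would suffice.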
If we eliminated the redundant notify calls within a single hook execution, these hooks would complete significantly faster and the charm would not linger in "Bootstrapping MON cluster". In my testing, I commented out the notify_radosgws and notify_client lines in osd_relation, and the leader-settings-changed hook completed in just a few minutes instead of the 1+ hours it was taking before.
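One way to get the same effect without deleting the calls would be a run-at-most-once guard per hook execution. This is only a sketch of the idea, not a patch against the charm; the decorator name is mine:

```python
import functools

# Hypothetical guard: since each hook runs in a fresh process, module-level
# state on the wrapper is enough to dedupe calls within one execution.
def once_per_hook(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not wrapper.done:
            wrapper.done = True
            return fn(*args, **kwargs)
    wrapper.done = False
    return wrapper

count = {"radosgws": 0}

@once_per_hook
def notify_radosgws():
    count["radosgws"] += 1  # stand-in for the real relation updates

# Many call sites (e.g. one per osd_relation), but the work runs once:
for _ in range(100):
    notify_radosgws()
print(count)  # {'radosgws': 1}
```

With a guard like this, the existing call sites can stay in place and the per-OSD-host loop no longer multiplies the notify work.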
For me, this was enough to get out of the "Bootstrapping MON cluster" purgatory.