ceph-mon unit stays in "Bootstrapping MON cluster" workload state

Bug #1719436 reported by Jason Hobbs
This bug affects 8 people
Affects                         Status        Importance  Assigned to    Milestone
Ceph Monitor Charm              Fix Released  High        Frode Nordahl
OpenStack Ceph Charm (Retired)  Won't Fix     High        Unassigned

Bug Description

Deploys of openstack-base occasionally fail with ceph-mon/0 ending in this state:

      ceph-mon/0:
        workload-status:
          current: maintenance
          message: Bootstrapping MON cluster
          since: 24 Sep 2017 01:15:49Z
        juju-status:
          current: executing
          message: running mon-relation-changed hook
          since: 24 Sep 2017 01:15:44Z
          version: 2.2.4
        machine: 1/lxd/0
        public-address: 10.245.214.8

There are some errors like this in the unit's log:
2017-09-24 01:16:31 DEBUG mon-relation-changed INFO:ceph-create-keys:Talking to monitor...
2017-09-24 01:16:31 DEBUG mon-relation-changed 2017-09-24 01:16:31.492799 7f0b43818700 0 librados: client.admin authentication error (22) Invalid argument
2017-09-24 01:16:31 DEBUG mon-relation-changed InvalidArgumentError does not take keyword arguments
2017-09-24 01:16:31 DEBUG mon-relation-changed INFO:ceph-create-keys:Cannot get or create bootstrap key for rbd, permission denied

I've attached a juju crashdump tarball with full logs.

This was run with cs:openstack/base on September 24.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Ryan Beisner (1chb1n)
Changed in charm-ceph-mon:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The unit stuck in bootstrapping appears to have failed to create its admin keyring correctly: the file `/etc/ceph/ceph.client.admin.keyring` is empty rather than containing the keyring as it should. This file is created by the ceph command `ceph-authtool $keyring --create-keyring --name=mon. --add-key=$secret --cap 'mon' 'allow *'`
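
For illustration, a minimal Python sketch of that creation step, mirroring the quoted ceph-authtool invocation (the function name and arguments are placeholders, not the charm's actual code):

import subprocess

def create_mon_keyring(keyring_path, secret):
    # Equivalent to the ceph-authtool command quoted above; a failure
    # or interruption here can leave an empty keyring file behind.
    subprocess.check_call([
        'ceph-authtool', keyring_path,
        '--create-keyring', '--name=mon.',
        '--add-key={}'.format(secret),
        '--cap', 'mon', 'allow *',
    ])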

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Observation: the keyring file on disk is empty, whereas it should contain something like:

[client.admin]
 key = FOO

Revision history for this message
Ryan Beisner (1chb1n) wrote :

The bug reporter indicates this has occurred approximately 1 in 30 runs.

Ryan Beisner (1chb1n)
Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI: there is a charms.ceph change under review which, if landed, will need to be proposed into the other ceph charm(s). https://review.openstack.org/#/c/507282/

Revision history for this message
Ryan Beisner (1chb1n) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.openstack.org/507545

Changed in charm-ceph-mon:
status: Triaged → In Progress
Changed in charm-ceph:
assignee: nobody → Chris MacNaughton (chris.macnaughton)
status: New → In Progress
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph (master)

Reviewed: https://review.openstack.org/507546
Committed: https://git.openstack.org/cgit/openstack/charm-ceph/commit/?id=08ac48c436b1a6aa03c42c99451aa2039d3b526f
Submitter: Jenkins
Branch: master

commit 08ac48c436b1a6aa03c42c99451aa2039d3b526f
Author: Chris MacNaughton <email address hidden>
Date: Tue Sep 26 09:46:57 2017 -0400

    Update from charms.ceph

    This change attempts to bandaid a bug with occasional failures
    to authenticate with a ceph quorum

    Change-Id: I454861c6d138497010221eeb68a5c51cd9edaccb
    Partial-Bug: #1719436
    Depends-On: I1b44f87522a283c9e6d06064687a2330ea23e354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.openstack.org/507545
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=d628b5d24b593fae0880cd1283265c6c4ae16fe6
Submitter: Jenkins
Branch: master

commit d628b5d24b593fae0880cd1283265c6c4ae16fe6
Author: Chris MacNaughton <email address hidden>
Date: Tue Sep 26 09:43:48 2017 -0400

    Update from charms.ceph

    This change attempts to bandaid a bug with occasional failures
    to authenticate with a ceph quorum

    Change-Id: Ib6109c6201486985be12778240b02c2a0e6e4fd4
    Partial-Bug: #1719436
    Depends-On: I1b44f87522a283c9e6d06064687a2330ea23e354

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Marking this incomplete for now, as I think the provided fix will resolve the bug. If this happens again, please help us with more logs gathered after this fix, and I can continue to dig in.

Changed in charm-ceph:
status: In Progress → Incomplete
Changed in charm-ceph-mon:
status: In Progress → Incomplete
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph (stable/17.08)

Fix proposed to branch: stable/17.08
Review: https://review.openstack.org/507633

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/17.08)

Fix proposed to branch: stable/17.08
Review: https://review.openstack.org/507634

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/17.08)

Reviewed: https://review.openstack.org/507634
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=4cd888a35b0ea15879c91a2a051e4ded781d9b50
Submitter: Jenkins
Branch: stable/17.08

commit 4cd888a35b0ea15879c91a2a051e4ded781d9b50
Author: Chris MacNaughton <email address hidden>
Date: Tue Sep 26 09:43:48 2017 -0400

    Update from charms.ceph

    This change attempts to bandaid a bug with occasional failures
    to authenticate with a ceph quorum

    Change-Id: Ib6109c6201486985be12778240b02c2a0e6e4fd4
    Partial-Bug: #1719436
    Depends-On: I1b44f87522a283c9e6d06064687a2330ea23e354
    (cherry picked from commit d628b5d24b593fae0880cd1283265c6c4ae16fe6)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph (stable/17.08)

Reviewed: https://review.openstack.org/507633
Committed: https://git.openstack.org/cgit/openstack/charm-ceph/commit/?id=96b972ba16f7ee1071bb07c075307bddb99e37d5
Submitter: Jenkins
Branch: stable/17.08

commit 96b972ba16f7ee1071bb07c075307bddb99e37d5
Author: Chris MacNaughton <email address hidden>
Date: Tue Sep 26 09:46:57 2017 -0400

    Update from charms.ceph

    This change attempts to bandaid a bug with occasional failures
    to authenticate with a ceph quorum

    Change-Id: I454861c6d138497010221eeb68a5c51cd9edaccb
    Partial-Bug: #1719436
    Depends-On: I1b44f87522a283c9e6d06064687a2330ea23e354
    (cherry picked from commit 08ac48c436b1a6aa03c42c99451aa2039d3b526f)

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this issue again yesterday, using the up-to-date charm from the store.

Here's our bundle:
http://paste.ubuntu.com/25661611/

I attached a crashdump from the run.

Changed in charm-ceph-mon:
status: Incomplete → New
Ryan Beisner (1chb1n)
Changed in charm-ceph:
status: Incomplete → New
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

In this latest recurrence, I can see that our new code path is ... affecting something, as ceph-mon/1 has no /etc/ceph/ceph.client.admin.keyring file. The hook will now try to go through the steps to generate that file three times before failing, so I once again suspect that there is *some* network issue between the containers hosting the monitors. I am going to continue to dig into the logs for the various units.
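
For context, a rough sketch of that retry pattern, using charmhelpers' retry_on_exception decorator (the body and argument list of add_keyring_to_ceph are simplified here; delays grow per attempt, matching log lines like "Retrying 'add_keyring_to_ceph' 3 more times (delay=5)"):

from charmhelpers.core.decorators import retry_on_exception

@retry_on_exception(num_retries=3, base_delay=5, exc_type=Exception)
def add_keyring_to_ceph(keyring, secret, hostname, path, done, init_marker):
    # Create the mon keyring and run ceph-create-keys; raising an
    # exception here triggers another attempt (delays of 5/10/15s).
    pass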

David Ames (thedac)
Changed in charm-ceph:
status: New → In Progress
Changed in charm-ceph-mon:
status: New → In Progress
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I hit this on another set of machines on a completely different network. On the unit with the error, I did this:

sudo service ceph-mon restart

And the problem fixed itself - the unit went into a good state.

Revision history for this message
James Page (james-page) wrote :

Marking Incomplete, as we have so far failed to document a reproducer for this problem.

Changed in charm-ceph-mon:
status: In Progress → Incomplete
Changed in charm-ceph:
status: In Progress → Incomplete
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Attached are the logs from a recent run on OSCI where one of the monitors ran into what seems to be this same issue.

tags: added: serverstack
Revision history for this message
Ryan Beisner (1chb1n) wrote :

@chris

The attachment in #19 does not indicate reproduction of this bug. In that run, there was an undercloud juju machine (#3) in a perpetual pending state, causing a ceph-osd unit to never receive a jujud agent. This is from the archive zip:

Model Controller Cloud/Region Version SLA
auto-osci-sv02 auto-osci-sv02 serverstack/serverstack 2.2.4 unsupported

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 10.2.7 error 3 ceph-mon local 0 ubuntu
ceph-osd 10.2.7 blocked 2/3 ceph-osd local 15 ubuntu
ntp waiting 0 ntp local 0 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* active idle 0 172.17.102.9 Unit is ready and clustered
ceph-mon/1 active idle 1 172.17.102.8 Unit is ready and clustered
ceph-mon/2 error idle 2 172.17.102.4 hook failed: "mon-relation-changed"
ceph-osd/0 waiting allocating 3 172.17.102.24 waiting for machine
ceph-osd/1* blocked idle 4 172.17.102.11 Missing relation: monitor
ceph-osd/2 blocked idle 5 172.17.102.17 Missing relation: monitor

Machine State DNS Inst id Series AZ Message
0 started 172.17.102.9 c6137c36-423c-4fe4-8b13-cc74b35e2495 zesty nova ACTIVE
1 started 172.17.102.8 ea33ac65-3078-4254-ab3d-912b022a8e6c zesty nova ACTIVE
2 started 172.17.102.4 ba7fd29f-fb84-46da-b6ad-256f2045529f zesty nova ACTIVE
3 pending 172.17.102.24 1f4db575-dd06-4536-9d3b-fccf0ab04169 zesty nova ACTIVE
4 started 172.17.102.11 e56800c5-25fd-40d3-a87a-b6df5ae23fbd zesty nova ACTIVE
5 started 172.17.102.17 2b47c03c-12b4-4091-b792-9c5992698282 zesty nova ACTIVE

Relation provider Requirer Interface Type
ceph-mon:mon ceph-mon:mon ceph peer
ntp:ntp-peers ntp:ntp-peers ntp peer

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I agree that the ceph cluster would not have reached completion in that test run; however, it looks like the monitor cluster still had either the same or a similar issue with regard to a unit joining the cluster.

The error logs in #19 include the ceph-mon unit logs for the failing unit, in which the recently added retry code path is executed. Given that we added that retry logic to try to combat intermittent failures of a monitor node to join the cluster, it looks like something failed before the monitor cluster tried to relate to the OSD machines.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

By #19 I'm referring to the logs attached in comment #19.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

from ceph-mon/2:

2017-10-12 00:20:19 DEBUG mon-relation-changed creating /var/lib/ceph/tmp/juju-c5aebb-auto-osci-sv02-2.mon.keyring
2017-10-12 00:20:19 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQDgtN5Z1TiPARAAOO1plbio6lhXS4Yduv0AJg== with 0 caps)
2017-10-12 00:20:19 DEBUG mon-relation-changed ceph-mon: renaming mon.noname-a 172.17.102.4:6789/0 to mon.juju-c5aebb-auto-osci-sv02-2
2017-10-12 00:20:19 DEBUG mon-relation-changed ceph-mon: set fsid to e5c884d4-aee2-11e7-abb4-fa163ec0db83
2017-10-12 00:20:19 DEBUG mon-relation-changed ceph-mon: created monfs at /var/lib/ceph/mon/ceph-juju-c5aebb-auto-osci-sv02-2 for mon.juju-c5aebb-auto-osci-sv02-2
2017-10-12 00:20:22 DEBUG mon-relation-changed Created symlink /etc/systemd/system/multi-user.target.wants/ceph-mon.service → /lib/systemd/system/ceph-mon.service.
2017-10-12 00:20:23 INFO juju-log mon:0: Retrying 'add_keyring_to_ceph' 3 more times (delay=5)
2017-10-12 00:20:28 DEBUG mon-relation-changed creating /var/lib/ceph/tmp/juju-c5aebb-auto-osci-sv02-2.mon.keyring
2017-10-12 00:20:28 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQDgtN5Z1TiPARAAOO1plbio6lhXS4Yduv0AJg== with 0 caps)
2017-10-12 00:20:28 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-c5aebb-auto-osci-sv02-2' already exists and is not empty: monitor may already exist
2017-10-12 00:20:35 INFO juju-log mon:0: Retrying 'add_keyring_to_ceph' 2 more times (delay=10)
2017-10-12 00:20:45 DEBUG mon-relation-changed creating /var/lib/ceph/tmp/juju-c5aebb-auto-osci-sv02-2.mon.keyring
2017-10-12 00:20:45 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQDgtN5Z1TiPARAAOO1plbio6lhXS4Yduv0AJg== with 0 caps)
2017-10-12 00:20:45 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-c5aebb-auto-osci-sv02-2' already exists and is not empty: monitor may already exist
2017-10-12 00:20:51 INFO juju-log mon:0: Retrying 'add_keyring_to_ceph' 1 more times (delay=15)
2017-10-12 00:21:06 DEBUG mon-relation-changed creating /var/lib/ceph/tmp/juju-c5aebb-auto-osci-sv02-2.mon.keyring
2017-10-12 00:21:06 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQDgtN5Z1TiPARAAOO1plbio6lhXS4Yduv0AJg== with 0 caps)
2017-10-12 00:21:06 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-c5aebb-auto-osci-sv02-2' already exists and is not empty: monitor may already exist
2017-10-12 00:21:11 DEBUG mon-relation-changed Traceback (most recent call last):
2017-10-12 00:21:11 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-2/charm/hooks/mon-relation-changed", line 713, in <module>
2017-10-12 00:21:11 DEBUG mon-relation-changed hooks.execute(sys.argv)
2017-10-12 00:21:11 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-2/charm/hooks/charmhelpers/core/hookenv.py", line 768, in execute
2017-10-12 00:21:11 DEBUG mon-relation-changed self._hooks[hook_name]()
2017-10-12 00:21:11 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-2/charm/hooks/mon-relation-changed", line 357, in mon_relation
2017-10-12 00:21:11 DEBUG mon-relation-changed c...


Revision history for this message
Ryan Beisner (1chb1n) wrote :

ceph.conf from ceph-osd/2:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

mon host = 172.17.102.4:6789 172.17.102.8:6789 172.17.102.9:6789
fsid = e5c884d4-aee2-11e7-abb4-fa163ec0db83

log to syslog = false
err to syslog = false
clog to syslog = false
mon cluster log to syslog = false
debug mon = 1/5
debug osd = 1/5

public network =
cluster network =
public addr = 172.17.102.4
cluster addr = 172.17.102.4

[mon]
keyring = /var/lib/ceph/mon/$cluster-$id/keyring

[mds]
keyring = /var/lib/ceph/mds/$cluster-$id/keyring

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Confirmed: the file didn't exist in /etc/ceph:

2017-10-12 00:21:11 DEBUG mon-relation-changed OSError: [Errno 2] No such file or directory: '/etc/ceph/ceph.client.admin.keyring'

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

"incomplete" because we don't have a reproducer that works every time seems like the wrong approach given we can reproduce it periodically in our CI setup.

Is there no additional logging that can be added to help debug the issue next time we hit it in our setup?

Revision history for this message
Ramon Grullon (rgrullon) wrote :

I've hit this bug on juju 2.3.2 on ceph jujucharms 279.

From logs -

2018-02-12 18:01:03 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:09 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:14 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:20 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:26 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:31 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:37 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:42 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:48 INFO juju-log mon:1: Waiting for quorum to be reached
2018-02-12 18:01:54 INFO juju-log mon:1: Waiting for quorum to be reached

Revision history for this message
Ramon Grullon (rgrullon) wrote :

2018-02-12 17:12:56 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-141baa-1' already exists and is not empty: monitor may already exist
2018-02-12 17:13:09 INFO juju-log mon:1: Making dir /srv/osd ceph:ceph 755
2018-02-12 17:13:20 INFO juju-log mon:1: osdize dir cmd: ['sudo', '-u', 'ceph', 'ceph-disk', 'prepare', '--data-dir', u'/srv/osd']
2018-02-12 17:13:25 DEBUG mon-relation-changed got monmap epoch 1
2018-02-12 17:13:26 DEBUG mon-relation-changed added key for osd.2
2018-02-12 17:13:27 DEBUG mon-relation-changed Created symlink from /etc/systemd/system/ceph-osd.target.wants/ceph-osd@2.service to /lib/systemd/system/ceph-osd@.service.
2018-02-12 17:13:32 INFO juju-log mon:1: mon cluster in quorum - providing client with keys
2018-02-12 17:13:34 INFO juju-log mon:1: mon cluster in quorum - providing client with keys
2018-02-12 17:13:37 INFO juju-log mon:1: mon cluster in quorum - providing client with keys
2018-02-12 17:13:43 INFO juju.worker.uniter.operation runhook.go:119 ran "mon-relation-changed" hook
2018-02-12 17:13:44 DEBUG juju-log radosgw:50: Hardening function 'install'
2018-02-12 17:13:44 DEBUG juju-log radosgw:50: Hardening function 'config_changed'
2018-02-12 17:13:44 DEBUG juju-log radosgw:50: Hardening function 'upgrade_charm'
2018-02-12 17:13:44 DEBUG juju-log radosgw:50: Hardening function 'update_status'
2018-02-12 17:13:47 INFO juju-log radosgw:50: Installing [] with options: ['--option=Dpkg::Options::=--force-confold']
2018-02-12 17:13:47 DEBUG radosgw-relation-joined Reading package lists...
2018-02-12 17:13:47 DEBUG radosgw-relation-joined Building dependency tree...
2018-02-12 17:13:47 DEBUG radosgw-relation-joined Reading state information...
2018-02-12 17:13:47 DEBUG radosgw-relation-joined The following package was automatically installed and is no longer required:
2018-02-12 17:13:47 DEBUG radosgw-relation-joined libfreetype6
2018-02-12 17:13:47 DEBUG radosgw-relation-joined Use 'apt autoremove' to remove it.
2018-02-12 17:13:47 DEBUG radosgw-relation-joined 0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
2018-02-12 17:13:49 INFO juju-log radosgw:50: mon cluster in quorum - providing radosgw with keys
2018-02-12 17:13:52 DEBUG radosgw-relation-joined exported keyring for client.radosgw.gateway
2018-02-12 17:13:55 DEBUG juju-log radosgw:50: Not leader - ignoring broker request
2018-02-12 17:14:04 INFO juju.worker.uniter.operation runhook.go:119 ran "radosgw-relation-joined" hook
2018-02-12 17:14:04 DEBUG juju-log radosgw:50: Hardening function 'install'
2018-02-12 17:14:04 DEBUG juju-log radosgw:50: Hardening function 'config_changed'
2018-02-12 17:14:05 DEBUG juju-log radosgw:50: Hardening function 'upgrade_charm'
2018-02-12 17:14:05 DEBUG juju-log radosgw:50: Hardening function 'update_status'
2018-02-12 17:14:07 INFO juju-log radosgw:50: Installing [] with options: ['--option=Dpkg::Options::=--force-confold']
2018-02-12 17:14:07 DEBUG radosgw-relation-changed Reading package lists...
2018-02-12 17:14:07 DEBUG radosgw-relation-changed Building dependency tree...
2018-02-12 17:14:07 DEBUG radosgw-relation-changed Reading state information...
2018-02-12 17:14:08 DEBUG radosgw-relation-chan...


Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again last night.

Changed in charm-ceph-mon:
status: Incomplete → New
Changed in charm-ceph:
status: Incomplete → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hadn't hit this in a long time until last night, when we accidentally ran a bunch of tests against stable. I suspect this may be fixed in the upcoming charm release.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

I accidentally hit this with the latest openstack-base bundle from jujucharms.com using xenial/queens.

Manually re-running the commands gives the same result as in the log:
# ceph-create-keys --id juju-b275f7-2-lxd-0
INFO:ceph-create-keys:Key exists already: /etc/ceph/ceph.client.admin.keyring
INFO:ceph-create-keys:Talking to monitor...
2018-03-09 14:04:14.527572 7fce4ffec700 0 librados: client.admin authentication error (22) Invalid argument
InvalidArgumentError does not take keyword arguments
INFO:ceph-create-keys:Cannot get or create bootstrap key for osd, permission denied
INFO:ceph-create-keys:Talking to monitor...
2018-03-09 14:04:14.765891 7f19c8306700 0 librados: client.admin authentication error (22) Invalid argument
InvalidArgumentError does not take keyword arguments
INFO:ceph-create-keys:Cannot get or create bootstrap key for rgw, permission denied
INFO:ceph-create-keys:Talking to monitor...
2018-03-09 14:04:14.924680 7f5761738700 0 librados: client.admin authentication error (22) Invalid argument
InvalidArgumentError does not take keyword arguments
INFO:ceph-create-keys:Cannot get or create bootstrap key for mds, permission denied
INFO:ceph-create-keys:Talking to monitor...
2018-03-09 14:04:15.069345 7fcbfb3ff700 0 librados: client.admin authentication error (22) Invalid argument
InvalidArgumentError does not take keyword arguments
INFO:ceph-create-keys:Cannot get or create bootstrap key for rbd, permission denied

After removing the empty /etc/ceph/ceph.client.admin.keyring, the re-run succeeds.
# rm /etc/ceph/ceph.client.admin.keyring
root@juju-b275f7-2-lxd-0:~# ceph-create-keys --id juju-b275f7-2-lxd-0
INFO:ceph-create-keys:Talking to monitor...
exported keyring for client.admin
updated caps for client.admin
INFO:ceph-create-keys:Talking to monitor...
INFO:ceph-create-keys:Talking to monitor...
INFO:ceph-create-keys:Talking to monitor...
INFO:ceph-create-keys:Talking to monitor...

I would suggest two paths forward (sketched in the example below):
1) Add a -t / --timeout with a higher value to the call to ceph-create-keys; the default is 600. (https://github.com/ceph/ceph/blob/master/src/ceph-create-keys#L269)
2) Unlink the zero-sized /etc/ceph/ceph.client.admin.keyring before throwing the exception in https://github.com/openstack/charm-ceph-mon/blob/master/lib/ceph/utils.py#L1362
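
A hedged sketch of both suggestions combined (the timeout value and function name are illustrative, not the final charm code):

import os
import subprocess

ADMIN_KEYRING = '/etc/ceph/ceph.client.admin.keyring'

def create_keys(mon_id, timeout=1800):
    # 1) Pass an explicit, higher --timeout to ceph-create-keys
    #    (the upstream default is 600 seconds).
    subprocess.check_call(['ceph-create-keys', '--id', mon_id,
                           '--timeout', str(timeout)])
    # 2) If a zero-sized admin keyring was left behind, unlink it
    #    before raising so a retry is not poisoned by the empty file.
    if os.stat(ADMIN_KEYRING).st_size == 0:
        os.unlink(ADMIN_KEYRING)
        raise RuntimeError('client.admin keyring was empty; removed for retry')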

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Proposed fix to charms.ceph: https://review.openstack.org/#/c/551294/

Changed in charm-ceph-mon:
status: New → In Progress
assignee: Chris MacNaughton (chris.macnaughton) → Frode Nordahl (fnordahl)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.openstack.org/551303

Frode Nordahl (fnordahl)
tags: added: stable-backport sts
Frode Nordahl (fnordahl)
Changed in charm-ceph-mon:
milestone: none → 18.05
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.openstack.org/551303
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=8134a12bc512369dbb3dc0fe853e87097812eff8
Submitter: Zuul
Branch: master

commit 8134a12bc512369dbb3dc0fe853e87097812eff8
Author: Frode Nordahl <email address hidden>
Date: Fri Mar 9 15:51:13 2018 +0100

    Increase timeout and remove empty keyring for ceph-create-keys

    Sync relevant changes from charms.ceph.

    Default timeout of ceph-create-keys of 600 seconds is not
    adequate in all circumstances.

    When ceph-create-keys times out it leaves an empty keyring.
    The retry attempt will fail as long as the empty keyring exists.
    Remove it before retry attempt.

    Change-Id: I914be8a5a7dcd9676438de92bba7f91283232837
    Closes-Bug: #1719436

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (stable/18.02)

Fix proposed to branch: stable/18.02
Review: https://review.openstack.org/551933

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Setting charm-ceph task to Won't Fix as it is deprecated.

Please migrate to the ceph-mon charm to get this fix.

https://docs.openstack.org/charm-guide/latest/1802.html

Changed in charm-ceph:
status: New → Won't Fix
Changed in charm-ceph-mon:
milestone: 18.05 → 18.02
Frode Nordahl (fnordahl)
Changed in charm-ceph:
assignee: Chris MacNaughton (chris.macnaughton) → nobody
Revision history for this message
Frode Nordahl (fnordahl) wrote :

For extra validation I also launched a test cluster with an artificially low timeout value (10 seconds) for ceph-create-keys, successfully exercising the retry path.

Excerpt from log:
2018-03-12 12:33:28 INFO juju-log mon:0: Retrying 'add_keyring_to_ceph' 3 more times (delay=5)
2018-03-12 12:33:33 DEBUG mon-relation-changed creating /var/lib/ceph/tmp/juju-d7cf7e-1.mon.keyring
2018-03-12 12:33:33 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQBnc6ZaCYEUJRAAqiR95fWY9B8wKiHmbk9vWg== with 0 caps)
2018-03-12 12:33:33 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-d7cf7e-1' already exists and is not empty: monitor may already exist
2018-03-12 12:33:37 INFO juju-log mon:0: Waiting for quorum to be reached
2018-03-12 12:33:44 DEBUG juju-log mon:0: Hardening function 'install'
2018-03-12 12:33:44 DEBUG juju-log mon:0: Hardening function 'config_changed'
2018-03-12 12:33:44 DEBUG juju-log mon:0: Hardening function 'upgrade_charm'
2018-03-12 12:33:44 DEBUG juju-log mon:0: Hardening function 'update_status'
2018-03-12 12:33:46 INFO juju-log mon:0: Making dir /var/lib/charm/ceph-mon ceph:ceph 555
2018-03-12 12:33:48 INFO juju-log mon:0: bootstrap_monitor_cluster: mon already initialized.

Excerpt from juju status:
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* active idle 0 10.130.236.223 Unit is ready and clustered
ceph-mon/1 active idle 1 10.130.236.162 Unit is ready and clustered
ceph-mon/2 active idle 2 10.130.236.184 Unit is ready and clustered

Revision history for this message
Ashley Lai (alai) wrote :

We just hit this bug today on xenial-ocata.

charm: cs:ceph-mon-22

ceph-mon/0:
        workload-status:
          current: maintenance
          message: Bootstrapping MON cluster
          since: 12 Mar 2018 17:39:56Z
        juju-status:
          current: executing
          message: running mon-relation-changed hook
          since: 12 Mar 2018 17:40:51Z
          version: 2.4-beta1
        leader: true
        machine: 0/lxd/0

juju log:
2018-03-12 17:40:42 DEBUG mon-relation-changed added entity mon. auth auth(auid = 18446744073709551615 key=AQAfuqZacc4AIBAA7U0lTDc6JQci7g1GHOMBWQ== with 0 caps)
2018-03-12 17:40:42 DEBUG mon-relation-changed '/var/lib/ceph/mon/ceph-juju-5fc73a-0-lxd-0' already exists and is not empty: monitor may already exist
2018-03-12 17:40:46 DEBUG mon-relation-changed Traceback (most recent call last):
2018-03-12 17:40:46 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/mon-relation-changed", line 725, in <module>
2018-03-12 17:40:46 DEBUG mon-relation-changed hooks.execute(sys.argv)
2018-03-12 17:40:46 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/charmhelpers/core/hookenv.py", line 800, in execute
2018-03-12 17:40:46 DEBUG mon-relation-changed self._hooks[hook_name]()
2018-03-12 17:40:46 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/mon-relation-changed", line 356, in mon_relation
2018-03-12 17:40:46 DEBUG mon-relation-changed ceph.bootstrap_monitor_cluster(leader_get('monitor-secret'))
2018-03-12 17:40:46 DEBUG mon-relation-changed File "lib/ceph/utils.py", line 1323, in bootstrap_monitor_cluster
2018-03-12 17:40:46 DEBUG mon-relation-changed init_marker)
2018-03-12 17:40:46 DEBUG mon-relation-changed File "/var/lib/juju/agents/unit-ceph-mon-0/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
2018-03-12 17:40:46 DEBUG mon-relation-changed return f(*args, **kwargs)
2018-03-12 17:40:46 DEBUG mon-relation-changed File "lib/ceph/utils.py", line 1360, in add_keyring_to_ceph
2018-03-12 17:40:46 DEBUG mon-relation-changed osstat = os.stat("/etc/ceph/ceph.client.admin.keyring")
2018-03-12 17:40:46 DEBUG mon-relation-changed FileNotFoundError: [Errno 2] No such file or directory: '/etc/ceph/ceph.client.admin.keyring'
2018-03-12 17:40:46 ERROR juju.worker.uniter.operation runhook.go:114 hook "mon-relation-changed" failed: exit status 1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (stable/18.02)

Reviewed: https://review.openstack.org/551933
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=79e27a7b2d7388bcd0e955ffb2130f36da8d11ea
Submitter: Zuul
Branch: stable/18.02

commit 79e27a7b2d7388bcd0e955ffb2130f36da8d11ea
Author: Frode Nordahl <email address hidden>
Date: Fri Mar 9 15:51:13 2018 +0100

    Increase timeout and remove empty keyring for ceph-create-keys

    Sync relevant changes from charms.ceph.

    Default timeout of ceph-create-keys of 600 seconds is not
    adequate in all circumstances.

    When ceph-create-keys times out it leaves an empty keyring.
    The retry attempt will fail as long as the empty keyring exists.
    Remove it before retry attempt.

    Change-Id: I914be8a5a7dcd9676438de92bba7f91283232837
    Closes-Bug: #1719436
    (cherry picked from commit 8134a12bc512369dbb3dc0fe853e87097812eff8)

Frode Nordahl (fnordahl)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We are still seeing this bug. I'm confused by the proposed fix, as the traceback indicates this error occurs when '/etc/ceph/ceph.client.admin.keyring' doesn't exist, not when it's empty.

bundle:
http://paste.ubuntu.com/p/4HHCpf2TWf/

Attached is a crashdump from a reproduction on 3-20-18.

Changed in charm-ceph-mon:
status: Fix Released → New
Revision history for this message
Frode Nordahl (fnordahl) wrote :

@jason-hobbs We are seeing multiple issues in the same area of the code. The fix in this bug is for when a ceph-mon unit stays in the "Bootstrapping MON cluster" workload state.

The crash and error state you reference is not the same issue and needs a new bug report. As I am currently investigating this crash on the back of an intermittent CI failure, I have created bug 1758664.

Changed in charm-ceph-mon:
status: New → Fix Released
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Frode, in my crashdump, you'll see "Bootstrapping MON cluster" on ceph-mon/0. That's exactly like what I reported here. And the traceback is exactly the same as the original bug. I do not believe this is fixed. Perhaps you fixed something else related to this, but our original bug is still happening.

Changed in charm-ceph-mon:
status: Fix Released → New
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Your commit in #39 deals with an empty keyring; this bug is about a missing keyring. I've commented on that already, but I didn't see a response to that point.

Revision history for this message
Ashley Lai (alai) wrote :

We hit this bug again today. crashdump is attached.

      ceph-mon/0:
        workload-status:
          current: maintenance
          message: Bootstrapping MON cluster
          since: 03 Apr 2018 21:37:04Z
        juju-status:
          current: executing
          message: running mon-relation-changed hook
          since: 03 Apr 2018 21:36:57Z
          version: 2.3.5

Revision history for this message
Ashley Lai (alai) wrote :

bundle is attached.

Revision history for this message
Craig Bender (craig-bender) wrote :

Having the same issue, recreatable 100% of the time.

$ juju status ceph-mon |awk '/Unit/,/^$/'
Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* maintenance executing 1/lxd/0 172.27.20.149 Bootstrapping MON cluster
ceph-mon/1 maintenance executing 2/lxd/0 172.27.21.97 Bootstrapping MON cluster
ceph-mon/2 maintenance executing 3/lxd/0 172.27.21.77 Bootstrapping MON cluster

bundle is openstack-base (as of 27-04-2018)

All units can ping each other, and the conf has the correct IPs of the other mon units:

Sanity check: https://pastebin.canonical.com/p/yVgCRBPmkT/
Output of Sanity check: https://pastebin.canonical.com/p/pr74vkPbNC/

One unit will not create an admin keyring.

The other units create a keyring.tmp.

If I bounce jujud-ceph-mon, the .tmp files become keyring files.

output of: juju debug-log --include ceph-mon
unit-ceph-mon-1: 13:33:35 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:36 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:37 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:38 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:39 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:40 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:42 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:43 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:44 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:33:45 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:32:44 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:32:45 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:32:46 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
unit-ceph-mon-1: 13:32:47 DEBUG unit.ceph-mon/1.mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'

Happy to give access to this box; as I said, it's recreatable 100% of the time.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Escalated to Field High SLA.

tags: added: cpe-onsite
Frode Nordahl (fnordahl)
Changed in charm-ceph-mon:
status: New → In Progress
milestone: 18.02 → 18.05
Revision history for this message
Frode Nordahl (fnordahl) wrote :

@craig-bender Would you be able to attach a juju crashdump from your environment to this bug?

Revision history for this message
Frode Nordahl (fnordahl) wrote :

A note about the Ceph versions involved, as the charm code paths are a bit different between Jewel and Luminous:

ceph 10.2.7 (Jewel) in comment #15
ceph 10.2.9 (Jewel) in comment #29, #38, #40 and #44
ceph 12.2.0 (Luminous) in comment #1 (before related fix in comment #34)

The related issue observed and fixed in comment #34 was for Jewel, and part of the solution was to increase the timeout, as the charm and the Jewel ceph-create-keys command were giving up too quickly.

Ceph versions prior to Jewel would be subject to the same realities of deployment time and would need to handle this too.

Revision history for this message
Frode Nordahl (fnordahl) wrote :

However, the description of the failed deployment in comment #46 does indicate that Ceph Luminous was used. A juju crashdump from it would be very useful to have.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.openstack.org/566853

Revision history for this message
Chris Gregan (cgregan) wrote :

Field High SLA now requires that an estimated date for a fix be listed in the comments. Please provide this estimate for the open tasks.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.openstack.org/566853
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=e7d1e7e4f2bd2ec9f40ae1674dd135d7e4ea59fa
Submitter: Zuul
Branch: master

commit e7d1e7e4f2bd2ec9f40ae1674dd135d7e4ea59fa
Author: Frode Nordahl <email address hidden>
Date: Tue May 8 12:49:07 2018 +0200

    Improve Bootstrap resilience

    Sync relevant changes from charms.ceph

    Disable `ceph-create-keys` in init system and explicitly
    run it in the charms ceph-mon bootstrap process.

    Change-Id: I03cd596e6e336b75d7d108ed0acde15d9940913f
    Depends-On: I3d3c7298076730c423ca5cc059316619f415b885
    Closes-Bug: #1719436

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Revision history for this message
Vern Hart (vern) wrote :

I have experienced my ceph-mon units seemingly getting stuck in "Bootstrapping MON cluster" even when the cluster is HEALTH_OK with the correct number of mons. However, I don't think my issue is related, since my keyrings (/etc/ceph/ceph.client.admin.keyring) exist and have appropriate content.

That said, I feel this bug may be exacerbated by the fact that the hooks mon-relation-departed, mon-relation-changed, leader-settings-changed, and bootstrap-source-relation-departed all set the status to "Bootstrapping MON cluster" when we have at least "monitor-count" mon hosts, regardless of whether the cluster is already bootstrapped. While this may not be a big deal, the mon_relation function that gets called will call notify_osds, notify_radosgws, and notify_client. The first one, notify_osds, also calls osd_relation for every osd host, and that function also calls notify_radosgws and notify_client. When you have a large number of osd hosts, this can take tens of minutes or even hours to complete, all the while the status remains "Bootstrapping MON cluster", which is often not true.

Let me draw a picture of the function tree to help drive home my point:

mon_relation
  ├─notify_osds
  | ├─osd_relation node 1
  | | ├─notify_radosgws
  | | └─notify_client
  | ├─osd_relation node 2
  | | ├─notify_radosgws
  | | └─notify_client
  | ├─osd_relation node 3
  | | ├─notify_radosgws
  | | └─notify_client
  | ...
  | └─osd_relation node n
  |   ├─notify_radosgws
  |   └─notify_client
  ├─notify_radosgws
  └─notify_client

If we could eliminate the redundant notify calls within a single hook execution, these hooks would execute significantly faster, and the charm would not linger in "Bootstrapping MON cluster". In my testing I commented out the notify_radosgws and notify_client lines in osd_relation, and the leader-settings-changed hook completed in just a few minutes instead of the 1+ hours it was taking before.
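
As an illustration, a runnable toy sketch of that deduplication (all functions are stand-ins for the charm's real helpers; the notify_peers flag is hypothetical):

def notify_radosgws():
    print('notify radosgw units')

def notify_client():
    print('notify client units')

def osd_relation(node, notify_peers=True):
    print('update relation data for', node)
    if notify_peers:
        # Old behaviour: every OSD host re-triggers these notifications.
        notify_radosgws()
        notify_client()

def mon_relation(osd_nodes):
    # Deduplicated: notify radosgw/client once per hook execution,
    # not once per OSD host.
    for node in osd_nodes:
        osd_relation(node, notify_peers=False)
    notify_radosgws()
    notify_client()

mon_relation(['osd/0', 'osd/1', 'osd/2'])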

For me, this was enough to get out of the "Bootstrapping MON cluster" purgatory.

Revision history for this message
Rodrigo Ferreira dos Santos (rz6rt6) wrote :

Is the problem corrected with the fix? I've tried it and the problem continues. Any tips?

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

It sounds like this is becoming a bit of a catch-all bug regarding "Bootstrapping MON cluster". I'm going to work on a small fix regarding that particular message (we don't need to set it if the cluster is already bootstrapped), as well as propose that we start on some much-needed refactoring of the relation updates. Reducing the relation updates is something we've discussed several times recently, as it can definitely become a very painful process on the monitor units.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-mon (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/570745

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

The above fix should also dramatically reduce the time spent in the mon-relation and leader-settings hooks, simply by doing the bootstrap work only when bootstrap is necessary.
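
A minimal sketch of that guard, assuming a boolean helper akin to charms.ceph's is_bootstrapped() (the marker path and helper bodies here are hypothetical):

import os

def is_bootstrapped():
    # Hypothetical marker; the real check lives in charms.ceph.
    return os.path.exists('/var/lib/ceph/mon/.bootstrapped')

def mon_relation_hook():
    if not is_bootstrapped():
        # Expensive path: bootstrap the cluster and notify all
        # related units, exactly once.
        print('bootstrapping and notifying related units')
    else:
        print('already bootstrapped; skipping the notify fan-out')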

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-mon (master)

Reviewed: https://review.openstack.org/570745
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=89f1b798a1b6893c4867a45da3453ea3e7c65eb6
Submitter: Zuul
Branch: master

commit 89f1b798a1b6893c4867a45da3453ea3e7c65eb6
Author: Chris MacNaughton <email address hidden>
Date: Mon May 28 10:24:07 2018 +0200

    Only attempt to bootstrap the cluster if we are not already done

    This should dramatically reduce hook executions during
    mon-relation and leader-setting hook executions as
    we only need to notify all related units on bootstrap

    Change-Id: I45f4dc47f811bdc3bbe5171ca9a388f0e207f1d0
    Related-Bug: #1719436

David Ames (thedac)
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released
Revision history for this message
KingJ (kj-kingj) wrote :

I was running into this too and couldn't figure out why: juju's status output was stuck on "Bootstrapping MON cluster", and the ceph-mons themselves were logging messages about probing/electing.

Ultimately, the cause in this case was too large an MTU on the underlying VLAN interface used by the LXD container. After reducing the MTU to 9000 from 9216, I was able to bootstrap the ceph-mon cluster.

This isn't technically related to this bug, but given that this page is one of the most popular results on Google for "Bootstrapping MON cluster", it may help someone in the future.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Networking is certainly one of the more important bits when setting up the monitor cluster, as the monitors cannot fully bootstrap into a quorum if they can't talk to one another reliably!

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

Hitting the same issue on a customer site with xenial/queens and ceph-osd-25, which has the fix (committed by Chris).

One nvme-ceph-mon unit is always stuck with:
Bootstrapping MON cluster

The other 2 seem fine.

in the unit log:
2018-06-29 12:44:04 DEBUG mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'

Some logs:
https://private-fileshare.canonical.com/~alitvinov/random/logs.tar

Full juju crashdump:
https://private-fileshare.canonical.com/~alitvinov/juju_dumps/juju-crashdump-737169f7-02b8-4486-9a24-49e92a9daf31.tar.xz

Revision history for this message
Alexander Litvinov (alitvinov) wrote :

Update: the problem was caused by wrong MTU settings in our network environment

Revision history for this message
Billy Olsen (billy-olsen) wrote :

One piece of advice: check the MTU in the network settings. When basic connectivity works but sending any kind of larger packet through the network starts to fail or stall, that is an indicator that you need to verify the MTU.

The MTU can be verified using the ping command. Use the `-M do` option to make sure the packet is not fragmented, and send a packet that takes up the full frame size. Remember to account for the sizes of the ICMP (8 bytes) and IP (20 bytes) headers.

Thus, to verify that the MTU is configured everywhere for jumbo frames (say 9000 bytes), you can use (9000 - 20 - 8 = 8972):

ping -M do -s 8972 <target address>
