All three ceph-mon units not clustered

Bug #1656116 reported by Francis Ginther on 2017-01-12
This bug affects 1 person
Affects: Autopilot Log Analyser; Importance: High; Assigned to: Francis Ginther
Affects: OpenStack ceph-mon charm; Importance: Low; Assigned to: Chris Holcombe
Affects: ceph-mon (Juju Charms Collection); Importance: Undecided; Assigned to: Chris Holcombe

Bug Description

With Juju 2.1-beta4, this was hit with a Landscape autopilot deployment:

https://ci.lscape.net/job/landscape-system-tests/4981

All three ceph-mon units are showing "Unit not clustered (no quorum)"

Here's an excerpt from juju status:

  ceph-mon:
    charm: cs:xenial/ceph-mon-7
    series: xenial
    os: ubuntu
    charm-origin: jujucharms
    charm-name: ceph-mon
    charm-rev: 7
    exposed: false
    application-status:
      current: blocked
      message: Unit not clustered (no quorum)
      since: 11 Jan 2017 20:18:05Z
...
    units:
      ceph-mon/0:
        workload-status:
          current: blocked
          message: Unit not clustered (no quorum)
          since: 11 Jan 2017 20:18:05Z
...
      ceph-mon/1:
        workload-status:
          current: blocked
          message: Unit not clustered (no quorum)
          since: 11 Jan 2017 20:18:11Z
...
      ceph-mon/2:
        workload-status:
          current: blocked
          message: Unit not clustered (no quorum)
          since: 11 Jan 2017 20:18:00Z

All of the ceph-mon logs are attached. If more logs are needed from ceph-osd, ceph-radosgw or something else, let me know.

Francis Ginther (fginther) wrote :

Adding autopilot-log-analyser so that we remember to write an analyser issue for this.

Ryan Beisner (1chb1n) wrote :

@fginther - can you please provide a full juju status, including tabular? Trying to understand the placement, metal vs container, etc.

Also, a pointer to the specific code which adds the charm config to ceph-* in your model will be helpful in reproducing. That, or a bundle if this was deployed with a bundle.yaml. Thank you.

Andreas Hasenack (ahasenack) wrote :

This one is in JSON, so you can parse it with Python and print or search whatever you want.
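
For anyone who wants to do that parsing, here is a minimal sketch. It assumes the JSON status mirrors the YAML excerpt in the bug description (an "applications" map containing "units", each with a "workload-status"). The embedded sample is a hand-written stand-in for the attachment, and the helper name blocked_units is hypothetical.

```python
import json

# Hand-written stand-in for the attached status file. The layout is an
# assumption based on the YAML excerpt in the bug description.
SAMPLE_STATUS = """
{
  "applications": {
    "ceph-mon": {
      "units": {
        "ceph-mon/0": {
          "workload-status": {
            "current": "blocked",
            "message": "Unit not clustered (no quorum)"
          },
          "leader": true
        },
        "ceph-mon/1": {
          "workload-status": {
            "current": "blocked",
            "message": "Unit not clustered (no quorum)"
          }
        },
        "ceph-mon/2": {
          "workload-status": {
            "current": "blocked",
            "message": "Unit not clustered (no quorum)"
          }
        }
      }
    }
  }
}
"""


def blocked_units(status, application):
    """Map unit name -> workload message for every blocked unit of an application."""
    units = status.get("applications", {}).get(application, {}).get("units", {})
    return {
        name: unit["workload-status"].get("message", "")
        for name, unit in sorted(units.items())
        if unit.get("workload-status", {}).get("current") == "blocked"
    }


if __name__ == "__main__":
    status = json.loads(SAMPLE_STATUS)
    for name, message in blocked_units(status, "ceph-mon").items():
        print(f"{name}: {message}")
```

To run it against the real attachment, replace SAMPLE_STATUS with the contents of the file and adjust the keys if the layout differs.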

Andreas Hasenack (ahasenack) wrote :

The ceph-* charm config is trickier, as it's indeed code that does that.

The code is in https://bazaar.launchpad.net/~landscape/landscape/trunk/files/head:/canonical/landscape/model/openstack/services.py

Ryan Beisner (1chb1n) wrote :

@ahasenack - in that case can you paste the resultant charm config for ceph-*? I'm not able to derive exactly what those values are from the code.

description: updated
Ryan Beisner (1chb1n) wrote :

Please also attach the corresponding ceph-osd.tar.gz log archive. Thanks again.

Andreas Hasenack (ahasenack) wrote :

Sorry, can't show you charm config as the deployment isn't live anymore. We want to include that data in our log collection script eventually.

Andreas Hasenack (ahasenack) wrote :

ceph-{mon,osd,radosgw} logs

Ryan Beisner (1chb1n) wrote :

Thanks for the added info, I believe that will be enough for the storage charm folks to triage.

## To summarize log spelunking:
3 ceph-mons in containers
6 ceph-osds
xenial-mitaka
Juju 2.1-beta4

## ceph-mon syslog says:
Jan 11 20:28:33 juju-bd739b-5-lxd-5 systemd[1]: Started Ceph cluster monitor daemon.
Jan 11 20:28:34 juju-bd739b-5-lxd-5 ceph-mon[61101]: 2017-01-11 20:28:34.021764 7f3dc8945580 -1 did not load config file, using default settings.
Jan 11 20:28:34 juju-bd739b-5-lxd-5 ceph-mon[61101]: monitor data directory at '/var/lib/ceph/mon/ceph-juju-bd739b-5-lxd-5' does not exist: have you run 'mkfs'?
Jan 11 20:28:34 juju-bd739b-5-lxd-5 systemd[1]: ceph-mon.service: Main process exited, code=exited, status=1/FAILURE
Jan 11 20:28:34 juju-bd739b-5-lxd-5 systemd[1]: ceph-mon.service: Unit entered failed state.
Jan 11 20:28:34 juju-bd739b-5-lxd-5 systemd[1]: ceph-mon.service: Failed with result 'exit-code'.

And with that, I'll turn it over to a ceph charm SME. :-)

Chris Holcombe (xfactor973) wrote :

Was the `monitor-count` setting in ceph-mon's config.yaml set to something other than the default of 3? I don't see any log messages saying that the monitors were bootstrapping. The only thing that prevents that is setting the monitor-count.

Andreas Hasenack (ahasenack) wrote :

monitor-count in ceph-mon is set to 3:

NUMBER_OF_STORAGE_REPLICAS = 3
...
    "ceph-mon": ServiceInfo(
        u"ceph-mon",
        config={"monitor-count": str(NUMBER_OF_STORAGE_REPLICAS),
                "source": OPENSTACK_SOURCE},
        count=NUMBER_OF_STORAGE_REPLICAS),

Chris Holcombe (xfactor973) wrote :

Beisner helped me find that the mon count is indeed 3. The other thing that would prevent bootstrapping is if the mon relation failed to query the other related mons. Do you have the relation data showing from each mon's perspective how many other related units it sees? That would be really helpful in narrowing this down.

Andreas Hasenack (ahasenack) wrote :

No, this run isn't live anymore. All we have is essentially juju status and /var/log/** and some of /etc/** from all units.

Changed in autopilot-log-analyser:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Francis Ginther (fginther)
Changed in autopilot-log-analyser:
status: In Progress → Fix Committed
Ryan Beisner (1chb1n) on 2017-01-23
Changed in ceph-mon (Juju Charms Collection):
assignee: nobody → Chris Holcombe (xfactor973)
James Page (james-page) on 2017-02-23
Changed in charm-ceph-mon:
assignee: nobody → Chris Holcombe (xfactor973)
Changed in ceph-mon (Juju Charms Collection):
status: New → Invalid
James Page (james-page) wrote :

To be honest, the mons won't even try to bootstrap until they can see each other over the relation data, so I suspect this is something network-ish with regard to LXD containers in Juju 2.1.x.

James Page (james-page) wrote :

Might be pertinent:

mon:14: still waiting for leader to setup keys

James Page (james-page) wrote :

Which is odd:

      ceph-mon/0:
        juju-status: {current: idle, since: '11 Jan 2017 20:18:05Z', version: 2.1-beta4}
        leader: true

as that message is from the unit log on mon0.

James Page (james-page) wrote :

So it looks like when the config-changed hook ran on each unit, none of the units were actually the leader. I've never seen this behaviour from either the charm or Juju.

James Page (james-page) wrote :

Was this a one-off error or has this problem been seen multiple times?

Changed in charm-ceph-mon:
status: New → Incomplete
importance: Undecided → Low
James Page (james-page) wrote :

Definitely something to do with wonky leadership election:

ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:15:44 INFO juju-log still waiting for leader to setup keys
ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:20:02 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:20:19 INFO juju-log mon:14: still waiting for leader to setup keys

ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:19:30 INFO juju-log still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:19:47 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:20:01 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:20:10 INFO juju-log mon:14: still waiting for leader to setup keys

ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:28 INFO juju-log still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:42 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:54 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:20:07 INFO juju-log mon:14: still waiting for leader to setup keys

All three units were waiting for a leader to appear and bootstrap the data for the cluster.

I believe the call at 2017-01-11 17:15:44 by ceph-mon/0 should have resulted in it being declared the lead unit.
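
A check for this symptom could count the "still waiting for leader to setup keys" messages per unit and flag the case where every unit is waiting. The following is a minimal sketch, assuming log lines in the grep-style format quoted above. The regex and the function names are illustrative and are not taken from the autopilot-log-analyser code.

```python
import re
from collections import Counter

WAITING = "still waiting for leader to setup keys"

# Matches the "unit-<name>.log:" part of a grep-style line and captures the
# unit name, e.g. "ceph-mon-0".
UNIT_RE = re.compile(r"unit-(?P<unit>[a-z0-9-]+)\.log:")

# Lines copied from the excerpt above.
SAMPLE = """\
ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:15:44 INFO juju-log still waiting for leader to setup keys
ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:20:02 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-0/var/log/juju/unit-ceph-mon-0.log:2017-01-11 17:20:19 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:19:30 INFO juju-log still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:19:47 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:20:01 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-1/var/log/juju/unit-ceph-mon-1.log:2017-01-11 17:20:10 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:28 INFO juju-log still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:42 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:19:54 INFO juju-log mon:14: still waiting for leader to setup keys
ceph-mon-2/var/log/juju/unit-ceph-mon-2.log:2017-01-11 17:20:07 INFO juju-log mon:14: still waiting for leader to setup keys
"""


def waiting_counts(lines):
    """Count 'waiting for leader' messages per unit."""
    counts = Counter()
    for line in lines:
        if WAITING not in line:
            continue
        match = UNIT_RE.search(line)
        if match:
            counts[match.group("unit")] += 1
    return counts


def all_units_waiting(counts, expected_units):
    """True when every expected unit logged at least one waiting message."""
    return all(counts.get(unit, 0) > 0 for unit in expected_units)


if __name__ == "__main__":
    counts = waiting_counts(SAMPLE.splitlines())
    for unit, count in sorted(counts.items()):
        print(f"{unit}: {count} waiting message(s)")
    units = ["ceph-mon-0", "ceph-mon-1", "ceph-mon-2"]
    if all_units_waiting(counts, units):
        print("No unit acted as leader: every ceph-mon unit was waiting.")
```

The message alone is not conclusive, since a non-leader unit logs it during a healthy deployment too. What marks this failure is that all of the units, including the one reported as leader in juju status, logged it.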
