ceph-create-keys:ceph-mon is not in quorum: u'probing' spams for 2+ hours, then mon-relation-changed hook fails

Bug #1774648 reported by Jason Hobbs
Affects: Ceph Monitor Charm
Status: Incomplete
Importance: High
Assigned to: Chris MacNaughton

Bug Description

2018-06-01 08:33:58 DEBUG mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
...
2018-06-01 10:47:20 DEBUG mon-relation-changed INFO:ceph-create-keys:ceph-mon is not in quorum: u'probing'
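
(For reference, ceph-create-keys is polling the monitor's admin socket and seeing the 'probing' state; the same state can be checked by hand with something like the following, assuming the stock socket path and that the mon id matches the hostname:)

$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status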

Then we get a traceback:

http://paste.ubuntu.com/p/Pgpxsh849C/

bundle & overlay:
http://paste.ubuntu.com/p/dBzPhQhMkG/

This is with bionic-queens; we have never seen this in many tests on xenial-queens. We're hitting this 100% of the time on bionic-queens.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
description: updated
tags: added: cdo-release-blocker
description: updated
Changed in charm-ceph-mon:
importance: Undecided → Critical
assignee: nobody → Chris MacNaughton (chris.macnaughton)
status: New → In Progress
Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

I'm having trouble recreating this failure; I'll continue trying to reproduce it, but it seems to have worked correctly on this simple deploy:

Model Controller Cloud/Region Version SLA
ruxton-ruxton-maas ruxton-ruxton-maas ruxton-maas 2.3.8 unsupported

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.4 active 3 ceph-mon jujucharms 327 ubuntu
ceph-osd 12.2.4 active 3 ceph-osd jujucharms 335 ubuntu
magpie active 3 magpie jujucharms 33 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/0 10.245.168.49 Unit is ready and clustered
ceph-mon/1* active idle 1/lxd/0 10.245.168.47 Unit is ready and clustered
ceph-mon/2 active idle 2/lxd/0 10.245.168.48 Unit is ready and clustered
ceph-osd/0* active idle 0 10.245.168.44 Unit is ready (1 OSD)
ceph-osd/1 active idle 1 10.245.168.45 Unit is ready (1 OSD)
ceph-osd/2 active idle 2 10.245.168.46 Unit is ready (1 OSD)
magpie/1 active idle 0/lxd/0 10.245.168.49 icmp ok, local hostname ok, dns ok, net mtu ok: 1500, 940 mbit/s
magpie/2* active idle 1/lxd/0 10.245.168.47 icmp ok, local hostname ok, dns ok, iperf leader, mtu: 1500
magpie/3 active idle 2/lxd/0 10.245.168.48 icmp ok, local hostname ok, dns ok, net mtu ok: 1500, 941 mbit/s

Machine State DNS Inst id Series AZ Message
0 started 10.245.168.44 tmrm8d bionic AZ1 Deployed
0/lxd/0 started 10.245.168.49 juju-df9140-0-lxd-0 bionic AZ1 Container started
1 started 10.245.168.45 66wnqf bionic AZ2 Deployed
1/lxd/0 started 10.245.168.47 juju-df9140-1-lxd-0 bionic AZ2 Container started
2 started 10.245.168.46 cbxwat bionic default Deployed
2/lxd/0 started 10.245.168.48 juju-df9140-2-lxd-0 bionic default Container started

Revision history for this message
David Ames (thedac) wrote :

@Jason,

In the crash dump, ceph-mon_*/etc/ceph does not contain ceph.conf, which is usually a symlink to /etc/alternatives/ceph.conf:

$ ls -l ceph-mon_*/etc/ceph
ceph-mon_0/etc/ceph:
total 12
-rw-r--r-- 1 thedac thedac 92 Apr 24 18:45 rbdmap

ceph-mon_1/etc/ceph:
total 12
-rw-r--r-- 1 thedac thedac 92 Apr 24 18:45 rbdmap

ceph-mon_2/etc/ceph:
total 12
-rw-r--r-- 1 thedac thedac 92 Apr 24 18:45 rbdmap

This could be a race in the charm, or simply the crashdump not including the symlink. The next time you see this failure, can you check whether /etc/ceph/ceph.conf exists?
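
For example, on one of the affected ceph-mon units (unit name is illustrative):

$ juju ssh ceph-mon/0 -- ls -l /etc/ceph/ceph.conf
$ juju ssh ceph-mon/0 -- readlink -f /etc/ceph/ceph.conf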

We will check the charm for this race.

Revision history for this message
David Ames (thedac) wrote :

@Chris,

This bug was reported just a couple of days after the following commit landed, which seems relevant:
https://github.com/openstack/charm-ceph-mon/commit/89f1b798a1b6893c4867a45da3453ea3e7c65eb6

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

David, it looks like it's not there on good runs either; crashdump must be ignoring it because it's a symlink.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

@thedac I understand that the timing seems suspicious, but given that the traceback is thrown inside the call to `bootstrap_monitor_cluster`, it would be difficult for that change to break it, especially for it to break only in the CDO/QA lab and nowhere else. The change you linked in #4 reduces load by not constantly re-running the bootstrap steps in that hook.

Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

Looking through the code in charms.ceph where we call `add_keyring_to_ceph`, it looks like this will have spent 90 minutes trying to load the keyring into the mon cluster.
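
To give a sense of the scale of that wait, it is roughly equivalent to polling the mon's admin socket every 10 seconds for 540 attempts. The loop below is only an illustration of the shape of the wait, not the actual charms.ceph code; the interval and retry count are assumptions that add up to about 90 minutes:

for i in $(seq 1 540); do
    sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok mon_status \
        | grep -Eq '"state": *"(leader|peon)"' && break
    sleep 10
done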

Revision history for this message
James Page (james-page) wrote :

Grepping the logs:

ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:25:18.236221 7f39afdf4700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x564d26c87800 :-1 s=STATE_OPEN pgs=39 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:25:30.523949 7f39aedf2700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x564d26c89000 :-1 s=STATE_OPEN pgs=28 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:41:01.205276 7f39afdf4700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x564d26c87800 :-1 s=STATE_OPEN pgs=41 cs=3 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:41:14.651865 7f39aedf2700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x564d26c89000 :-1 s=STATE_OPEN pgs=31 cs=3 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:41:26.693514 7f39aedf2700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x564d26c89000 :-1 s=STATE_OPEN pgs=36 cs=5 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:58:43.227801 7f972591d700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x55ea028dd800 :-1 s=STATE_OPEN pgs=48 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 09:58:55.515547 7f972491b700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x55ea028df000 :-1 s=STATE_OPEN pgs=40 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:14:27.356064 7f972591d700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x55ea028dd800 :-1 s=STATE_OPEN pgs=50 cs=3 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:14:39.644046 7f972491b700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x55ea028df000 :-1 s=STATE_OPEN pgs=43 cs=3 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:14:42.529463 7f972491b700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x55ea028df000 :-1 s=STATE_OPEN pgs=48 cs=5 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:32:02.076213 7fadb7934700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x564320d6e000 :-1 s=STATE_OPEN pgs=57 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:32:20.509295 7fadb1127700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x564320d6f800 :-1 s=STATE_OPEN pgs=53 cs=1 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:47:46.204373 7fadb7934700 0 -- 192.168.36.25:6789/0 >> 192.168.36.16:6789/0 conn(0x564320d6e000 :-1 s=STATE_OPEN pgs=59 cs=3 l=0).fault initiating reconnect
ceph-mon_0/var/log/ceph/ceph-mon.juju-ac7cf0-0-lxd-0.log:2018-06-01 10:48:04.636538 7fadb1127700 0 -- 192.168.36.25:6789/0 >> 192.168.36.20:6789/0 conn(0x564320d6f800 :-1 s=STATE_OPEN pgs=56 cs=3 l=0).f...


Revision history for this message
Chris MacNaughton (chris.macnaughton) wrote :

There's a bug filed upstream that has the exact log lines noticed in this bug: tracker.ceph.com/issues/21427

Without the ceph.conf that the monitors are using, there's really nowhere else I can go with this bug.
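
If this can be reproduced again, the effective configuration the monitor is running with can also be dumped from its admin socket, even when /etc/ceph/ceph.conf is missing from the crashdump (same assumptions about socket path and mon id as a standard install):

$ sudo ceph --admin-daemon /var/run/ceph/ceph-mon.$(hostname).asok config show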

Changed in charm-ceph-mon:
status: In Progress → Incomplete
importance: Critical → High
Revision history for this message
James Page (james-page) wrote :

Can we get /etc/netplan added to the juju-crashdump reports, please?

Bionic manages network interfaces differently, and it would be useful to see this information.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I think this is bug 1774666. The MTU is not being set properly on bond interfaces, but it is being set properly in containers bound to spaces connected to the bond interface. So the bond interface is stuck at 1500 (it should be 9000), while the container is at 9000.
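
A quick way to confirm the mismatch on an affected machine (the interface and unit names below are illustrative):

$ ip link show bond0 | grep -o 'mtu [0-9]*'                          # on the host
$ juju ssh ceph-mon/0 -- ip link show eth0 | grep -o 'mtu [0-9]*'    # inside the container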
