Ceph monitors not responding (Error connecting to cluster: TimedOut)

Bug #1629237 reported by Paul Bourke
This bug affects 3 people
Affects: kolla
Status: Invalid
Importance: Critical
Assigned to: Paul Bourke
Milestone:

Bug Description

Ceph mons are timing out, meaning the OSDs cannot bootstrap and the cluster is unusable.

[root@storage01 ~]# docker logs bootstrap_osd_0
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
<snip>
INFO:__main__:Writing out command to execute
2016-09-30 10:06:57.882178 7fdb08f02700 0 monclient(hunting): authenticate timed out after 300
2016-09-30 10:06:57.882301 7fdb08f02700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

docker exec ceph_mon ceph -s

hangs.
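
When "ceph -s" hangs like this, the mons have not formed a quorum, so any client command simply blocks. A sketch of how to inspect a single mon directly through its admin socket, which does not require quorum (the socket name below is an assumption based on the mon name in the logs; list the directory first to confirm):

# Confirm the admin socket name inside the mon container
docker exec ceph_mon ls /var/run/ceph/

# Query this mon directly; the "state" field shows probing/electing/leader/peon
docker exec ceph_mon ceph --admin-daemon /var/run/ceph/ceph-mon.192.168.7.253.asok mon_status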

Changed in kolla:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → newton-rc2
assignee: nobody → Paul Bourke (pauldbourke)
Revision history for this message
Paul Bourke (pauldbourke) wrote :

Some logs from ceph_mon; it seems to be continually trying to form a quorum:

2016-09-30 14:40:30.171053 7f5e2294e700 0 mon.192.168.7.253@2(electing).data_health(0) update_stats avail 90% total 34875 MB, used 1570 MB, avail 31435 MB
2016-09-30 14:40:36.153645 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:40:36.153786 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(175) init, last seen epoch 175
2016-09-30 14:40:51.187476 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:40:51.187574 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(177) init, last seen epoch 177
2016-09-30 14:41:06.223918 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:41:06.224077 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(179) init, last seen epoch 179
2016-09-30 14:41:21.264659 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:41:21.264792 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(181) init, last seen epoch 181
2016-09-30 14:41:30.171589 7f5e2294e700 0 mon.192.168.7.253@2(electing).data_health(0) update_stats avail 90% total 34875 MB, used 1570 MB, avail 31435 MB
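
Elections that restart every ~15 seconds like this usually mean the mon cannot reach a majority of its peers. A rough reachability sketch from a storage node, assuming the peer addresses come from the monmap (the IPs below are placeholders based on the logged mon address; substitute the addresses your monmap actually lists):

# The default mon port is 6789; check it is reachable on each peer
for ip in 192.168.7.251 192.168.7.252 192.168.7.253; do
    timeout 2 bash -c "echo > /dev/tcp/$ip/6789" && echo "$ip reachable" || echo "$ip unreachable"
done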

Changed in kolla:
importance: Critical → High
Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Tried several times. I cannot reproduce this issue.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I have been stuck with this issue for 3 weeks, since before I went on vacation.
Disabling dumb-init was going to be my next debugging step, so I'm glad that works.

I'll test locally on my failing systems and comment back if patching it out fixes it for me.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

By the way, I was reproducing this on an Ubuntu source build of master.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

In my case, removing dumb-init does not help:

diff --git a/docker/ceph/ceph-base/Dockerfile.j2 b/docker/ceph/ceph-base/Dockerfile.j2
index d3ebbe2..db91fe5 100644
--- a/docker/ceph/ceph-base/Dockerfile.j2
+++ b/docker/ceph/ceph-base/Dockerfile.j2
@@ -30,7 +30,9 @@ MAINTAINER {{ maintainer }}

 COPY extend_start.sh /usr/local/bin/kolla_extend_start
 RUN chmod 755 /usr/local/bin/kolla_extend_start \
- && usermod -a -G kolla ceph
+ && usermod -a -G kolla ceph ; \
+ sed -e "s?#!/usr/local/bin/dumb-init /bin/bash?#!/bin/bash?" -i /usr/local/bin/kolla_start
+

After rebuilding and redeploying I see the same behavior using the current head of master
(f9f619d208413aee57309b7999227e05bcbf2e62).
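
For anyone trying the same workaround, a quick sanity check that the sed in the diff above actually landed in the rebuilt image (paths as in the diff; this is only a sketch):

# The shebang of kolla_start should no longer reference dumb-init
docker exec ceph_mon head -n 1 /usr/local/bin/kolla_start

# PID 1 inside the container should then be bash/kolla_start rather than dumb-init
docker exec ceph_mon cat /proc/1/cmdline | tr '\0' ' '; echo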

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Reducing from 6 mons to 3 resolved this for me with the dumb-init change.
I will check without the dumb-init change tomorrow and see whether it was just caused by
the fact that I was deploying 6 mons.

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

I think the even number of Ceph mons caused your issue.
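
For context, a mon quorum needs a strict majority, i.e. floor(n/2) + 1 mons: with 3 mons you need 2 and tolerate 1 down; with 6 mons you need 4, tolerate only 2 down (no better than 5 mons), and a symmetric 3/3 split can never elect a leader. Once at least a majority is up, the quorum membership can be checked through the admin socket (socket name assumed, as above):

docker exec ceph_mon ceph --admin-daemon /var/run/ceph/ceph-mon.192.168.7.253.asok quorum_status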

Revision history for this message
Paul Bourke (pauldbourke) wrote :

@Jeffrey I have 3 mons, which usually works. I'll retest tomorrow and confirm whether this is still occurring, as it sounds like it's not consistent.

Changed in kolla:
status: Confirmed → Incomplete
description: updated
Steven Dake (sdake)
Changed in kolla:
importance: High → Critical
milestone: newton-rc2 → newton-rc3
Revision history for this message
Paul Bourke (pauldbourke) wrote :

Just deployed a master copy of Ceph (3 mons, 3 OSDs), no issues (git ce23dbe). I'm not 100% convinced this will not reappear, but I'm going to mark it as invalid for now.

Changed in kolla:
status: Incomplete → Invalid