Ceph monitors not responding (Error connecting to cluster: TimedOut)

Bug #1629237 reported by Paul Bourke
This bug affects 3 people
Affects: kolla
Status: Invalid
Importance: Critical
Assigned to: Paul Bourke
Milestone:

Bug Description

Ceph mons are timing out, meaning the OSDs cannot bootstrap and the cluster is unusable.

[root@storage01 ~]# docker logs bootstrap_osd_0
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
<snip>
INFO:__main__:Writing out command to execute
2016-09-30 10:06:57.882178 7fdb08f02700 0 monclient(hunting): authenticate timed out after 300
2016-09-30 10:06:57.882301 7fdb08f02700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

docker exec ceph_mon ceph -s

hangs.
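
When "ceph -s" hangs like this, the mons have not formed a quorum, so any client command simply blocks. A sketch of how to inspect a single mon directly through its admin socket, which does not require quorum (the socket name below is an assumption based on the mon name in the logs; list the directory first to confirm):

# Confirm the admin socket name inside the mon container
docker exec ceph_mon ls /var/run/ceph/

# Query this mon directly; the "state" field shows probing/electing/leader/peon
docker exec ceph_mon ceph --admin-daemon /var/run/ceph/ceph-mon.192.168.7.253.asok mon_status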

Changed in kolla:
status: New → Confirmed
importance: Undecided → Critical
milestone: none → newton-rc2
assignee: nobody → Paul Bourke (pauldbourke)
Revision history for this message
Paul Bourke (pauldbourke) wrote :

Some logs from ceph_mon; it seems to be continually trying to form a quorum:

2016-09-30 14:40:30.171053 7f5e2294e700 0 mon.192.168.7.253@2(electing).data_health(0) update_stats avail 90% total 34875 MB, used 1570 MB, avail 31435 MB
2016-09-30 14:40:36.153645 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:40:36.153786 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(175) init, last seen epoch 175
2016-09-30 14:40:51.187476 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:40:51.187574 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(177) init, last seen epoch 177
2016-09-30 14:41:06.223918 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:41:06.224077 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(179) init, last seen epoch 179
2016-09-30 14:41:21.264659 7f5e2214d700 0 log_channel(cluster) log [INF] : mon.192.168.7.253 calling new monitor election
2016-09-30 14:41:21.264792 7f5e2214d700 1 mon.192.168.7.253@2(electing).elector(181) init, last seen epoch 181
2016-09-30 14:41:30.171589 7f5e2294e700 0 mon.192.168.7.253@2(electing).data_health(0) update_stats avail 90% total 34875 MB, used 1570 MB, avail 31435 MB
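
Elections that restart every ~15 seconds like this usually mean the mon cannot reach a majority of its peers. A rough reachability sketch from a storage node, assuming the peer addresses come from the monmap (the IPs below are placeholders based on the logged mon address; substitute the addresses your monmap actually lists):

# The default mon port is 6789; check it is reachable on each peer
for ip in 192.168.7.251 192.168.7.252 192.168.7.253; do
    timeout 2 bash -c "echo > /dev/tcp/$ip/6789" && echo "$ip reachable" || echo "$ip unreachable"
done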

Changed in kolla:
importance: Critical → High
Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

Tried several times. I cannot reproduce this issue.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

I have been stuck with this issue for 3 weeks, since before I went on vacation.
Disabling dumb-init was going to be my next debugging step, so I'm glad that works.

I'll test locally on my failing systems and comment back if patching it out fixes it for me.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

By the way, I was reproducing this on an Ubuntu source build of master.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

In my case, removing dumb-init does not help:

diff --git a/docker/ceph/ceph-base/Dockerfile.j2 b/docker/ceph/ceph-base/Dockerfile.j2
index d3ebbe2..db91fe5 100644
--- a/docker/ceph/ceph-base/Dockerfile.j2
+++ b/docker/ceph/ceph-base/Dockerfile.j2
@@ -30,7 +30,9 @@ MAINTAINER {{ maintainer }}

 COPY extend_start.sh /usr/local/bin/kolla_extend_start
 RUN chmod 755 /usr/local/bin/kolla_extend_start \
- && usermod -a -G kolla ceph
+ && usermod -a -G kolla ceph ; \
+ sed -e "s?#!/usr/local/bin/dumb-init /bin/bash?#!/bin/bash?" -i /usr/local/bin/kolla_start
+

After rebuilding and redeploying I see the same behavior using the current head of master
(f9f619d208413aee57309b7999227e05bcbf2e62).
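
For anyone trying the same workaround, a quick sanity check that the sed in the diff above actually landed in the rebuilt image (paths as in the diff; this is only a sketch):

# The shebang of kolla_start should no longer reference dumb-init
docker exec ceph_mon head -n 1 /usr/local/bin/kolla_start

# PID 1 inside the container should then be bash/kolla_start rather than dumb-init
docker exec ceph_mon cat /proc/1/cmdline | tr '\0' ' '; echo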

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Reducing from 6 mons to 3 resolved this for me with the dumb-init change.
I will check without the dumb-init change tomorrow and see whether it was just caused by
the fact that I was deploying 6 mons.

Revision history for this message
Jeffrey Zhang (jeffrey4l) wrote :

I think the even number of Ceph mons caused your issue.
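
For context, a mon quorum needs a strict majority, i.e. floor(n/2) + 1 mons: with 3 mons you need 2 and tolerate 1 down; with 6 mons you need 4, tolerate only 2 down (no better than 5 mons), and a symmetric 3/3 split can never elect a leader. Once at least a majority is up, the quorum membership can be checked through the admin socket (socket name assumed, as above):

docker exec ceph_mon ceph --admin-daemon /var/run/ceph/ceph-mon.192.168.7.253.asok quorum_status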

Revision history for this message
Paul Bourke (pauldbourke) wrote :

@Jeffrey I have 3 mons, which usually works. I'll retest tomorrow and confirm whether this is still occurring, as it sounds like it's not consistent.

Changed in kolla:
status: Confirmed → Incomplete
description: updated
Steven Dake (sdake)
Changed in kolla:
importance: High → Critical
milestone: newton-rc2 → newton-rc3
Revision history for this message
Paul Bourke (pauldbourke) wrote :

Just deployed a master copy of Ceph (3 mons, 3 OSDs), no issues (git ce23dbe). I'm not 100% convinced this will not reappear, but I'm going to mark it as invalid for now.

Changed in kolla:
status: Incomplete → Invalid