The ceph cluster will get stuck when multiple mons are deployed.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla | Fix Released | Medium | Unassigned |
Train | Fix Released | Medium | Unassigned |
Bug Description
When I deployed Ceph with kolla and kolla-ansible (Rocky release), I set up five mon nodes:
```
[storage-mon]
ceph-node1
ceph-node2
ceph-node3
ceph-node4
ceph-node5
```
kolla-ansible gets stuck while performing this task:
```
TASK [ceph : Getting ceph mgr keyring]
```
I checked the status of the mon; it was stuck:
```
(ceph-mon)
[errno 110] error connecting to the cluster
```
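When the cluster cannot be reached, the monitor's admin socket can still be queried on the mon node itself, because it does not need quorum. A minimal sketch, assuming the kolla container is named `ceph_mon` and the monitor id is the host name; `mon_state` is a hypothetical helper that pulls the `state` field out of the `mon_status` JSON:

```shell
# Hypothetical helper: read `mon_status` JSON on stdin and print the first
# "state" field (for example probing, electing, leader, or peon).
mon_state() {
    sed -n 's/.*"state"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Assumed usage on a mon node (container name and mon id are assumptions):
#   docker exec ceph_mon ceph daemon "mon.$(hostname)" mon_status | mon_state
```

A mon that keeps reporting `electing` or `probing` has not joined a quorum.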
This is the first mon log:
```
2019-04-15 14:13:35.265201 7f3b1fd43000 0 mon.ceph-
2019-04-15 14:13:42.364548 7f3b1236a700 0 mon.ceph-
2019-04-15 14:13:42.365666 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:42.684252 7f3b1336c700 0 mon.ceph-
2019-04-15 14:13:42.690717 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:42.708722 7f3b12b6b700 0 mon.ceph-
2019-04-15 14:13:42.709649 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:42.733018 7f3b1236a700 0 mon.ceph-
2019-04-15 14:13:42.739142 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.267431 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.267443 7f3b15b71700 0 mon.ceph-
2019-04-15 14:13:43.271262 7f3b12b6b700 0 -- 192.168.
2019-04-15 14:13:43.271461 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.271470 7f3b15b71700 0 mon.ceph-
2019-04-15 14:13:43.276008 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.276346 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.276459 7f3b15b71700 0 log_channel(
2019-04-15 14:13:43.276560 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.277178 7f3b12b6b700 0 mon.ceph-
2019-04-15 14:13:43.277403 7f3b1336c700 0 mon.ceph-
2019-04-15 14:13:43.280056 7f3b12b6b700 0 -- 192.168.
2019-04-15 14:13:43.280079 7f3b12b6b700 0 -- 192.168.
2019-04-15 14:13:43.280299 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.280338 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.283014 7f3b12b6b700 0 -- 192.168.
2019-04-15 14:13:43.283333 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.293830 7f3b1236a700 0 -- 192.168.
2019-04-15 14:13:43.294164 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.294069 7f3b1336c700 0 -- 192.168.
2019-04-15 14:13:43.294603 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:43.296143 7f3b15b71700 0 log_channel(
2019-04-15 14:13:43.296201 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:48.299861 7f3b15b71700 0 log_channel(
2019-04-15 14:13:48.300003 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:53.304789 7f3b15b71700 0 log_channel(
2019-04-15 14:13:53.304845 7f3b15b71700 1 mon.ceph-
2019-04-15 14:13:58.309354 7f3b15b71700 0 log_channel(
2019-04-15 14:13:58.309428 7f3b15b71700 1 mon.ceph-
2019-04-15 14:14:03.314816 7f3b15b71700 0 log_channel(
2019-04-15 14:14:03.314882 7f3b15b71700 1 mon.ceph-
2019-04-15 14:14:08.319605 7f3b15b71700 0 log_channel(
2019-04-15 14:14:08.319667 7f3b15b71700 1 mon.ceph-
2019-04-15 14:14:13.324962 7f3b15b71700 0 log_channel(
2019-04-15 14:14:13.325064 7f3b15b71700 1 mon.ceph-
2019-04-15 14:14:18.329602 7f3b15b71700 0 log_channel(
2019-04-15 14:14:18.329730 7f3b15b71700 1 mon.ceph-
2019-04-15 14:14:23.335145 7f3b15b71700 0 log_channel(
2019-04-15 14:14:23.335462 7f3b15b71700 1 mon.ceph-
```
The log shows that the mon never leaves the election state.
I then traced how ceph-deploy deploys a mon and found that it initializes the mon differently from kolla.
- ceph-deploy
```
ceph-mon --cluster ceph --mkfs -i ceph-node1 --keyring /var/lib/
```
- kolla (kolla\
```
KEYRING_
# Generate keyring for current monitor
ceph-authtool --create-keyring "${KEYRING_TMP}" --import-keyring "${KEYRING_ADMIN}"
ceph-authtool "${KEYRING_TMP}" --import-keyring "${KEYRING_MON}"
mkdir -p "${MON_DIR}"
ceph-mon --mkfs -i "${HOSTNAME}" --monmap "${MONMAP}" --keyring "${KEYRING_TMP}"
rm "${KEYRING_TMP}"
```
When the --monmap option is removed from kolla's ceph-mon --mkfs call, the cluster deploys normally again.
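The two initializations differ in whether `--monmap` is passed to `ceph-mon --mkfs`. As a dry-run sketch of that difference (it only prints the command and never runs it), `mon_mkfs_cmd` below is a hypothetical helper, not kolla's code and not the actual patch; the flags are the ones that appear in the kolla snippet above:

```shell
# Hypothetical dry-run helper: print the ceph-mon --mkfs command line.
# Usage: mon_mkfs_cmd MON_ID KEYRING [MONMAP]
# When MONMAP is omitted or empty, --monmap is left out of the command.
mon_mkfs_cmd() {
    mon_id="$1"
    keyring="$2"
    monmap="${3:-}"

    cmd="ceph-mon --mkfs -i ${mon_id}"
    if [ -n "${monmap}" ]; then
        cmd="${cmd} --monmap ${monmap}"
    fi
    cmd="${cmd} --keyring ${keyring}"
    printf '%s\n' "${cmd}"
}

# Example paths below are placeholders chosen for illustration.
mon_mkfs_cmd ceph-node1 /tmp/ceph.mon.keyring /tmp/monmap   # with --monmap
mon_mkfs_cmd ceph-node1 /tmp/ceph.mon.keyring               # without --monmap
```

Printing both forms side by side makes it easy to confirm which variant a deployment script would run before touching a real monitor.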
https://review.openstack.org/652606
This commit fixes this problem.