Playbook ceph-install.yml won't bootstrap cluster

Bug #1741085 reported by JérômeH
This bug affects 1 person
Affects: OpenStack-Ansible
Status: Fix Released
Importance: Undecided
Assigned to: Andy McCrae
Milestone: (none listed)

Bug Description

Hi,

I think the following commit broke the bootstrapping of a new Ceph cluster: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/playbooks/ceph-install.yml?id=ce07654f03324af04d3c8bea2f96ab909cc16dc9
Ceph cannot reach a quorum because "serial: 1" makes the playbook install one monitor at a time.

Maybe add a flag "-eceph_bootstrap=True" that changes the serial value in the playbook.

Regards,
Jérôme

Revision history for this message
Logan V (loganv) wrote :

Hi Jerome,

It seems possible that ceph-mon is checking the quorum status of the cluster during the role, but I'd like to understand a little better where and how that happens before proposing a fix. Can you post a log of a ceph-install.yml playbook run so we can see where the deploy fails?

Revision history for this message
Andy McCrae (andrew-mccrae) wrote :

On Pike and later we shouldn't need serial anyway, I believe.
Ocata uses v2.2 of ceph-ansible, but Pike+ uses v3+.

In v3 the restarts happen in serial as part of the ceph-ansible handler:
https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-defaults/handlers/main.yml#L34-L47

I think we could remove serial from this playbook.

Revision history for this message
JérômeH (jeromeh) wrote :

Hi Logan,

Please find the run log attached.

Also, for your information, here is my host list:
# ansible --list-hosts ceph-mon
  hosts (4):
    compute1_ceph-mon_container-caa07057
    compute2_ceph-mon_container-48e7b72a
    infra1_ceph-mon_container-c5204c5b
    compute3_ceph-mon_container-e842e62b

Regards,
Jérôme
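
The arithmetic behind the reported failure can be sketched in a few lines of Python, using the host list above. This is only an illustration under two assumptions the thread does not confirm: that all four monitors are listed as initial members of the cluster, and that the ceph-mon role waits for quorum before the play moves on to the next batch. The helper names `quorum_size` and `batches` are made up for this sketch and are not part of Ansible or Ceph.

```python
def quorum_size(total_monitors: int) -> int:
    """Smallest strict majority of the monitors known to the cluster."""
    return total_monitors // 2 + 1


def batches(hosts, serial):
    """Split hosts into consecutive groups of `serial` hosts,
    mimicking how an integer play-level `serial` value batches a play."""
    return [hosts[i:i + serial] for i in range(0, len(hosts), serial)]


# Host list copied from the `ansible --list-hosts ceph-mon` output above.
mons = [
    "compute1_ceph-mon_container-caa07057",
    "compute2_ceph-mon_container-48e7b72a",
    "infra1_ceph-mon_container-c5204c5b",
    "compute3_ceph-mon_container-e842e62b",
]

needed = quorum_size(len(mons))        # monitors required for a majority
first_batch = batches(mons, 1)[0]      # what "serial: 1" deploys first
first_batch_has_quorum = len(first_batch) >= needed

print(f"monitors needed for quorum: {needed}")
print(f"monitors deployed in first batch: {len(first_batch)}")
print(f"first batch can form quorum: {first_batch_has_quorum}")
```

Under those assumptions, the first batch deploys one monitor while a majority of four requires three, so a role that waits for quorum inside the first batch would never get past it.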

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

We have no issues with ceph deployments (from scratch) under master, Pike, and Ocata, which makes me think this bug is invalid.

Maybe Andy's suggestion is worth pursuing, but not in the context of this bug.

I seem to be missing some information needed to understand what really happened on that installation.

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

OK, it seems that our gate tests don't run with the same number of mon hosts, which would lead to the issue.
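
A rough sketch of why a test run with a different monitor count could pass while a larger deployment fails. The thread does not say how many monitors the gate uses, so the range below is purely illustrative, and the function name `first_batch_reaches_quorum` is hypothetical. It assumes, as above, that every monitor is an initial member and that the role waits for quorum within each batch.

```python
def first_batch_reaches_quorum(total_monitors: int, serial: int = 1) -> bool:
    """True if the monitors started in the first serial batch
    already form a strict majority of all expected monitors."""
    needed = total_monitors // 2 + 1
    started = min(serial, total_monitors)
    return started >= needed


# Check a handful of cluster sizes with the play limited to one host at a time.
results = {n: first_batch_reaches_quorum(n, serial=1) for n in range(1, 6)}

for count, ok in results.items():
    print(f"{count} monitor(s): first batch reaches quorum = {ok}")
```

With one host per batch, only a single-monitor cluster reaches quorum in the first batch under these assumptions, which would be consistent with a smaller test environment not reproducing the problem.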

Revision history for this message
Jean-Philippe Evrard (jean-philippe-evrard) wrote :

We'll wait for someone to manually confirm this.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/536891

Changed in openstack-ansible:
assignee: nobody → Andy McCrae (andrew-mccrae)
status: New → In Progress
Revision history for this message
Marcin Dulak (marcin-dulak) wrote :

https://bugs.launchpad.net/openstack-ansible/+bug/1745287 brought me here.

Removing "serial: 1" solves the above bug.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/536891
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=6a4c1bdc6b6849c61f78540e7728be534ce7876b
Submitter: Zuul
Branch: master

commit 6a4c1bdc6b6849c61f78540e7728be534ce7876b
Author: Andy McCrae <email address hidden>
Date: Tue Jan 23 16:29:08 2018 +0000

    Install mon servers in parallel.

    The ceph-ansible roles handle restarting services in serial via a
    script, so we shouldn't need to run this in serial.

    Change-Id: Ia2f2694abe02af282e7a98fcc40410b0510cc1cd
    Closes-Bug: #1741085

Changed in openstack-ansible:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/559610

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/queens)

Reviewed: https://review.openstack.org/559610
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=5ea12067219877950496068f42c7dd41f051a425
Submitter: Zuul
Branch: stable/queens

commit 5ea12067219877950496068f42c7dd41f051a425
Author: Andy McCrae <email address hidden>
Date: Tue Jan 23 16:29:08 2018 +0000

    Install mon servers in parallel.

    The ceph-ansible roles handle restarting services in serial via a
    script, so we shouldn't need to run this in serial.

    Change-Id: Ia2f2694abe02af282e7a98fcc40410b0510cc1cd
    Closes-Bug: #1741085
    (cherry picked from commit 6a4c1bdc6b6849c61f78540e7728be534ce7876b)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 17.0.2

This issue was fixed in the openstack/openstack-ansible 17.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 18.0.0.0b1

This issue was fixed in the openstack/openstack-ansible 18.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/openstack-ansible 18.0.0.0b2

This issue was fixed in the openstack/openstack-ansible 18.0.0.0b2 development milestone.
