Pike + Ceph : Missing ceph-mgr container

Bug #1735122 reported by Sebastien
This bug affects 4 people
Affects: kolla-ansible
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I am trying to deploy Pike with Ceph using Ubuntu binary images.

Everything seems fine, except that the Ceph cluster is not in a healthy state:
###
(ceph-mon)[root@ocp1-cn10 /]# ceph -s
  cluster:
    id: c2b871dd-b578-4717-906e-0d5dfb7adfc0
    health: HEALTH_WARN
            no active mgr

  services:
    mon: 3 daemons, quorum 172.16.224.60,172.16.224.61,172.16.224.62
    mgr: no daemons active
    osd: 22 osds: 22 up, 22 in

  data:
    pools: 0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage: 0 kB used, 0 kB / 0 kB avail
    pgs:
###

Cinder is also complaining; see https://bugs.launchpad.net/kolla-ansible/+bug/1732833

According to Ceph's documentation (http://docs.ceph.com/docs/master/mgr/), the ceph-mgr daemon has been required for normal operation since the 12.x (Luminous) release.

Tags: ceph pike
Revision history for this message
Sebastien (termeau) wrote :

Creating a keyring for the manager and starting it in the monitor container fixed both Ceph and Cinder.

Manual steps:
############
# Enter the ceph-mon container
docker exec -it -u root ceph_mon bash
# Create the mgr keyring (replace XXX with the manager name, e.g. the short hostname)
ceph --cluster ceph auth get-or-create mgr.XXX mon 'allow profile mgr' osd 'allow *' mds 'allow *' -o /var/lib/ceph/mgr/ceph-XXX/keyring
# Start the manager daemon under the same id
/usr/bin/ceph-mgr -f --cluster ceph --setuser ceph --setgroup ceph --id XXX
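
For convenience, the two steps can be sketched as a single script. The container name ceph_mon comes from this report; the MGR_ID variable and the DRY_RUN guard are hypothetical additions for illustration, not part of the original fix:

```shell
#!/bin/sh
# Sketch of the manual workaround above. ceph_mon is kolla's container
# name from the report; MGR_ID and DRY_RUN are illustrative additions.
# DRY_RUN=1 (the default here) only prints the commands.
MGR_ID="${MGR_ID:-$(hostname -s 2>/dev/null || echo mgr0)}"
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"          # dry run: show the command only
    else
        "$@"               # real run: execute it
    fi
}

# Create the mgr keyring inside the monitor container ...
run docker exec -u root ceph_mon ceph --cluster ceph auth get-or-create \
    "mgr.${MGR_ID}" mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
    -o "/var/lib/ceph/mgr/ceph-${MGR_ID}/keyring"
# ... then start the manager daemon under that id.
run docker exec -u root ceph_mon /usr/bin/ceph-mgr -f --cluster ceph \
    --setuser ceph --setgroup ceph --id "${MGR_ID}"
```

Run with DRY_RUN=0 on the node hosting the ceph_mon container to actually apply the workaround.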

Revision history for this message
Sabbir Sakib (sakibsys) wrote :

I have been trying to install OpenStack on 3 bare-metal servers using kolla-ansible v6 for some time and have run into several issues. Without Ceph it works fine, though.

Here is the error I'm getting.

TASK [ceph : Getting ceph mgr keyring] ***********************************************************************************************************************************************************************
failed: [oscontroller01.xyz.pvt -> oscontroller01.xyz.pvt] (item=oscontroller01.xyz.pvt) => {"changed": false, "cmd": ["docker", "exec", "ceph_mon", "ceph", "auth", "get-or-create", "mgr.oscontroller01.xyz.pvt", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *"], "delta": "0:00:00.260286", "end": "2018-02-15 09:55:31.713964", "item": "oscontroller01.xyz.pvt", "msg": "non-zero return code", "rc": 22, "start": "2018-02-15 09:55:31.453678", "stderr": "Error EINVAL: bad entity name", "stderr_lines": ["Error EINVAL: bad entity name"], "stdout": "", "stdout_lines": []}

Here are my globals.yml and multinode files. I would appreciate it if you could take a look and let me know if I missed something or declared something incorrectly in either file.

globals.yml:
kolla_base_distro: "centos"
kolla_install_type: "binary"
openstack_release: "pike"
kolla_internal_vip_address: "10.88.120.110"
kolla_internal_fqdn: "{{ kolla_internal_vip_address }}"
kolla_external_vip_address: "{{ kolla_internal_vip_address }}"
kolla_external_fqdn: "{{ kolla_external_vip_address }}"
network_interface: "bond0"
kolla_external_vip_interface: "{{ network_interface }}"
api_interface: "{{ network_interface }}"
storage_interface: "{{ network_interface }}"
cluster_interface: "{{ network_interface }}"
tunnel_interface: "{{ network_interface }}"
dns_interface: "{{ network_interface }}"
neutron_external_interface: "bond1"
neutron_plugin_agent: "openvswitch"
openstack_logging_debug: "False"
enable_aodh: "yes"
enable_ceph: "yes"
enable_ceph_rgw: "yes"
enable_cinder: "yes"
enable_fluentd: "yes"
enable_haproxy: "yes"
enable_heat: "yes"
enable_horizon: "yes"
enable_neutron_provider_networks: "yes"
enable_horizon_neutron_lbaas: "{{ enable_neutron_lbaas | bool }}"
enable_neutron_lbaas: "yes"
glance_backend_file: "no"
glance_backend_ceph: "yes"
cinder_backend_ceph: "{{ enable_ceph }}"
nova_backend_ceph: "{{ enable_ceph }}"
nova_compute_virt_type: "kvm"
tempest_image_id:
tempest_flavor_ref_id:
tempest_public_network_id:
tempest_floating_network_name:

multinode file:
# These initial groups are the only groups required to be modified. The
# additional groups are for more control of the environment.
[control]
# These hostnames must be resolvable from your deployment host
oscontroller01.xyz.pvt

# The above can also be specified as follows:
#control[01:03] ansible_user=kolla

# The network nodes are where your l3-agent and loadbalancers will run
# This can be the same as a host in the control group
[network]
oscontroller01.xyz.pvt

# inner-compute is the group of compute nodes which do not have
# external reachability
[inner-compute]
oshyp01.xyz.pvt
oshyp02.xyz.pvt
# external-compute is the group of compute nodes which can reach
# outside
[external-compute]
oshyp01.xyz.pvt
oshyp02.x...

Revision history for this message
Sabbir Sakib (sakibsys) wrote :

Also, how did you solve the ceph-mgr container issue?

Revision history for this message
Benjamin Bendel (benvandamme) wrote :

I have a similar problem in stable/queens. Nine times out of ten the deployment gets stuck at the following task:

TASK [ceph : Getting ceph mgr keyring] ***************************************************
failed: [dir1.maas -> dir1.maas] (item=dir1.maas) => {"changed": false, "cmd": ["docker", "exec", "ceph_mon", "ceph", "auth", "get-or-create", "mgr.dir1.maas", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *"], "delta": "0:05:00.327827", "end": "2018-08-09 17:06:08.428261", "item": "dir1.maas", "msg": "non-zero return code", "rc": 1, "start": "2018-08-09 17:01:08.100434", "stderr": "[errno 110] error connecting to the cluster", "stderr_lines": ["[errno 110] error connecting to the cluster"], "stdout": "", "stdout_lines": []}
failed: [dir1.maas -> dir1.maas] (item=dir2.maas) => {"changed": false, "cmd": ["docker", "exec", "ceph_mon", "ceph", "auth", "get-or-create", "mgr.dir2.maas", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *"], "delta": "0:05:00.299945", "end": "2018-08-09 17:11:09.389523", "item": "dir2.maas", "msg": "non-zero return code", "rc": 1, "start": "2018-08-09 17:06:09.089578", "stderr": "[errno 110] error connecting to the cluster", "stderr_lines": ["[errno 110] error connecting to the cluster"], "stdout": "", "stdout_lines": []}
failed: [dir1.maas -> dir1.maas] (item=dir3.maas) => {"changed": false, "cmd": ["docker", "exec", "ceph_mon", "ceph", "auth", "get-or-create", "mgr.dir3.maas", "mon", "allow profile mgr", "osd", "allow *", "mds", "allow *"], "delta": "0:05:00.325244", "end": "2018-08-09 17:16:10.333002", "item": "dir3.maas", "msg": "non-zero return code", "rc": 1, "start": "2018-08-09 17:11:10.007758", "stderr": "[errno 110] error connecting to the cluster", "stderr_lines": ["[errno 110] error connecting to the cluster"], "stdout": "", "stdout_lines": []}

Revision history for this message
Eduardo Gonzalez (egonzalez90) wrote :

Hi, I cannot triage the bug. Are you still having issues? Please provide updated information.

Changed in kolla-ansible:
status: New → Incomplete
Revision history for this message
Sebastien Termeau (st-m) wrote :

The issue is that Ubuntu updated Ceph during the life cycle of the OS. Building Ubuntu images now results in a version of Ceph that kolla-ansible does not expect. I ended up using CentOS instead, and the problem is gone. I don't think it affects Queens.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for kolla-ansible because there has been no activity for 60 days.]

Changed in kolla-ansible:
status: Incomplete → Expired