Bootstrapping Ceph OSDs fails

Bug #1558853 reported by Ahmad Al-Shishtawy
This bug affects 1 person
Affects: kolla
Status: Fix Released
Importance: Critical
Assigned to: Sam Yaple
Milestone: mitaka-rc2

Bug Description

I'm trying to deploy OpenStack with Ceph on 10 servers, and it is failing at bootstrapping the Ceph OSDs.

I prepared 4 disks on each of the 10 servers for storage, as in the example below:

$ sudo parted /dev/sda print
Model: ATA ST4000NM0033-9ZM (scsi)
Disk /dev/sda: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
 1 1049kB 4001GB 4001GB ext4 KOLLA_CEPH_OSD_BOOTSTRAP
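
For reference, the bootstrap label on each disk can be created roughly as in the Kolla Ceph guide. A minimal sketch (/dev/sda here is just the example device, and the command wipes the whole disk):

$ sudo parted /dev/sda -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP 1 -1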

Here is the part that is failing:

TASK: [ceph | Bootstrapping Ceph OSDs] ****************************************
failed: [opst1] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'0e246458-0dcd-4524-b758-4a9adc30920e', u'fs_label': u''})) => {"changed": true, "failed": true, "item": [0, {"device": "/dev/sda", "fs_label": "", "fs_uuid": "0e246458-0dcd-4524-b758-4a9adc30920e"}]}
msg: Container exited with non-zero return code
failed: [opst1] => (item=(1, {u'device': u'/dev/sdb', u'fs_uuid': u'046daf31-042c-4a3c-9b18-a7ba97ada62b', u'fs_label': u''})) => {"changed": true, "failed": true, "item": [1, {"device": "/dev/sdb", "fs_label": "", "fs_uuid": "046daf31-042c-4a3c-9b18-a7ba97ada62b"}]}
msg: Container exited with non-zero return code
failed: [opst1] => (item=(2, {u'device': u'/dev/sdc', u'fs_uuid': u'ab562d7d-a4c6-4305-ae3f-d87f9e061e90', u'fs_label': u''})) => {"changed": true, "failed": true, "item": [2, {"device": "/dev/sdc", "fs_label": "", "fs_uuid": "ab562d7d-a4c6-4305-ae3f-d87f9e061e90"}]}
msg: Container exited with non-zero return code
failed: [opst1] => (item=(3, {u'device': u'/dev/sdd', u'fs_uuid': u'577cd6b0-ef23-4f27-8732-8c9b7bba21a4', u'fs_label': u''})) => {"changed": true, "failed": true, "item": [3, {"device": "/dev/sdd", "fs_label": "", "fs_uuid": "577cd6b0-ef23-4f27-8732-8c9b7bba21a4"}]}
msg: Container exited with non-zero return code

FATAL: all hosts have already failed -- aborting

Steven Dake (sdake)
Changed in kolla:
importance: Undecided → Critical
milestone: none → mitaka-rc2
status: New → Triaged
Revision history for this message
Sam Yaple (s8m) wrote :

This is most commonly a problem when you've attempted multiple Ceph deploys and not properly cleaned the environment.

Please remove all Ceph containers and volumes, as well as all Ceph config folders under /etc/kolla/*, on all nodes and attempt this again.

If you still have an issue, run docker logs ceph_osd_bootstrap_0 (or whatever container name you have) and post those logs here.
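
A rough sketch of that cleanup on each node (the volume names are the Kolla defaults; double-check against docker ps -a and docker volume ls before removing anything):

docker ps -a | grep -E 'ceph|bootstrap_osd' | awk '{print $1}' | xargs -r docker rm -f
docker volume rm ceph_mon ceph_mon_config
rm -rf /etc/kolla/ceph-*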

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Thanks for your reply!

I cleaned all servers with tools/cleanup-containers and tools/cleanup-host.
Removed config files and cleaned the local registry, then pulled the latest kolla source and rebuilt the images.
The problem still exists!

Here are the logs you mentioned. Thanks!

$ docker logs bootstrap_osd_0

INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Copying service configuration files
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.conf to /etc/ceph/ceph.conf
INFO:__main__:Setting permissions for /etc/ceph/ceph.conf
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.client.admin.keyring to /etc/ceph/ceph.client.admin.keyring
INFO:__main__:Setting permissions for /etc/ceph/ceph.client.admin.keyring
INFO:__main__:Writing out command to execute
2016-03-18 21:32:41.485882 7f83c8578700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.63:6789/0 pipe(0x7f83bc000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc004ef0).fault
2016-03-18 21:32:44.486210 7f83d010f700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.62:6789/0 pipe(0x7f83bc008280 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc00c520).fault
2016-03-18 21:32:50.486622 7f83d0210700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.62:6789/0 pipe(0x7f83bc008280 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc007650).fault
....
....
2016-03-18 21:37:29.507208 7f83d0210700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.62:6789/0 pipe(0x7f83bc0008c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc00f540).fault
2016-03-18 21:37:35.507671 7f83d010f700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.63:6789/0 pipe(0x7f83bc0008c0 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc00dae0).fault
2016-03-18 21:37:38.485583 7f83d1705700 0 monclient(hunting): authenticate timed out after 300
2016-03-18 21:37:38.485649 7f83d1705700 0 librados: client.admin authentication error (110) Connection timed out
Error connecting to cluster: TimedOut

Revision history for this message
Vikram Hosakote (vhosakot) wrote :

I think those messages ending in .fault are not expected.

2016-03-18 21:32:41.485882 7f83c8578700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.63:6789/0 pipe(0x7f83bc000c00 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc004ef0).fault
2016-03-18 21:32:44.486210 7f83d010f700 0 -- 10.0.112.61:0/1000028 >> 10.0.112.62:6789/0 pipe(0x7f83bc008280 sd=4 :0 s=1 pgs=0 cs=0 l=1 c=0x7f83bc00c520).fault

Looks like an issue with ceph_mon. Is the ceph_mon container up?

Can you send the output of the following?

docker ps -a | grep ceph
docker logs ceph_mon
docker exec ceph_mon ceph -s
docker exec ceph_mon ceph mon dump
docker exec ceph_mon ceph mon stat
docker exec ceph_mon ceph quorum_status
docker volume ls
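
Also, since the .fault lines above suggest the OSD bootstrap container cannot reach the monitors at all, a quick reachability check of the monitor port from the failing host may help (a sketch, using the addresses from the log above):

nc -zv 10.0.112.62 6789
nc -zv 10.0.112.63 6789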

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Here are the logs from one of the three control servers

docker ps -a | grep ceph
0b9666aa6c57 10.0.112.61:4000/kollaglue/centos-binary-ceph-osd:2.0.0 "kolla_start" 31 minutes ago Exited (1) 26 minutes ago bootstrap_osd_3
2cbc1914c3b9 10.0.112.61:4000/kollaglue/centos-binary-ceph-osd:2.0.0 "kolla_start" 36 minutes ago Exited (1) 31 minutes ago bootstrap_osd_2
2ec060d596cb 10.0.112.61:4000/kollaglue/centos-binary-ceph-osd:2.0.0 "kolla_start" 41 minutes ago Exited (1) 36 minutes ago bootstrap_osd_1
0949a7b8c99e 10.0.112.61:4000/kollaglue/centos-binary-ceph-osd:2.0.0 "kolla_start" 46 minutes ago Exited (1) 41 minutes ago bootstrap_osd_0
6bea26a16776 10.0.112.61:4000/kollaglue/centos-binary-ceph-mon:2.0.0 "kolla_start" 47 minutes ago Up 46 minutes ceph_mon

docker logs ceph_mon
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Copying service configuration files
INFO:__main__:Removing existing destination: /etc/ceph/ceph.conf
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.conf to /etc/ceph/ceph.conf
INFO:__main__:Setting permissions for /etc/ceph/ceph.conf
WARNING:__main__:/var/lib/kolla/config_files/ceph.client.admin.keyring does not exist, but is not required
WARNING:__main__:/var/lib/kolla/config_files/ceph.client.mon.keyring does not exist, but is not required
WARNING:__main__:/var/lib/kolla/config_files/ceph.client.radosgw.keyring does not exist, but is not required
WARNING:__main__:/var/lib/kolla/config_files/ceph.monmap does not exist, but is not required
INFO:__main__:Writing out command to execute
creating /tmp/ceph.mon.keyring
importing contents of /etc/ceph/ceph.client.admin.keyring into /tmp/ceph.mon.keyring
importing contents of /etc/ceph/ceph.client.mon.keyring into /tmp/ceph.mon.keyring
ceph-mon: set fsid to a80c84b4-cee4-4419-a950-63fdb70bef21
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-10.0.112.61 for mon.10.0.112.61
Running command: '/usr/bin/ceph-mon -d -i 10.0.112.61 --public-addr 10.0.112.61:6789'
2016-03-19 06:30:01.715236 7f361c7e8880 0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-mon, pid 1
2016-03-19 06:30:01.837093 7f361c7e8880 0 starting mon.10.0.112.61 rank 0 at 10.0.112.61:6789/0 mon_data /var/lib/ceph/mon/ceph-10.0.112.61 fsid a80c84b4-cee4-4419-a950-63fdb70bef21
starting mon.10.0.112.61 rank 0 at 10.0.112.61:6789/0 mon_data /var/lib/ceph/mon/ceph-10.0.112.61 fsid a80c84b4-cee4-4419-a950-63fdb70bef21
2016-03-19 06:30:01.837734 7f361c7e8880 1 mon.10.0.112.61@-1(probing) e0 preinit fsid a80c84b4-cee4-4419-a950-63fdb70bef21
2016-03-19 06:30:01.837863 7f361c7e8880 1 mon.10.0.112.61@-1(probing) e0 initial_members 10.0.112.61,10.0.112.62,10.0.112.63, filtering seed monmap
2016-03-19 06:30:01.839076 7f361c7e8880 -1 compacting monitor store ...
2016...

Ryan Hallisey (rthall14)
Changed in kolla:
assignee: nobody → Sam Yaple (s8m)
Changed in kolla:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla (master)

Reviewed: https://review.openstack.org/294862
Committed: https://git.openstack.org/cgit/openstack/kolla/commit/?id=5250a00781a214911fec78718ef6dfb91154b0de
Submitter: Jenkins
Branch: master

commit 5250a00781a214911fec78718ef6dfb91154b0de
Author: SamYaple <email address hidden>
Date: Fri Mar 18 13:52:32 2016 +0000

    Allow external ceph journals and fix bootstrap

    This allows us to specify external journals for osds which can greatly
    improve performance when the external journals are on the solid-state
    drives.

    The new lookup and startup methods fix the previous races we had
    preventing osds from being created properly.

    This retains the same functionality as before and is completely
    compatible with the previous method and labels, however this does set
    new labels for all new bootstrap OSDs. This was due to a limitation
    in the length of the name of a GPT partition.

    Closes-Bug: #1558853
    DocImpact
    Partially-Implements: blueprint ceph-improvements
    Change-Id: I61fd10cb35c67dabc53bd82270f26909ef51fc38

Changed in kolla:
status: In Progress → Fix Released
Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

I updated to the latest fix, cleaned up the hosts, rebuilt the images, and redeployed. It is still failing for me.

The logs are now different. I see /dev/sda2 in the logs, but I only have one partition per disk (sda1). Is this normal?

I have four disks per host (/dev/sda - /dev/sdd) and am trying to deploy on 10 hosts. All 40 disks look similar to the one below:

parted /dev/sdc print
Model: ATA ST4000NM0033-9ZM (scsi)
Disk /dev/sdc: 4001GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
 1 1049kB 4001GB 4001GB KOLLA_CEPH_OSD_BOOTSTRAP

TASK: [ceph | Bootstrapping Ceph OSDs] ****************************************
failed: [opst5] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'', u'journal_num': 2, u'partition_num': u'1', u'journal_device': u'/dev/sda', u'journal':u'/dev/sda2', u'partition': u'/dev/sda1', u'fs_label': u'', u'external_journal': False})) => {"changed": true, "failed": true, "item": [0, {"device": "/dev/sda", "external_journal": false, "fs_label": "", "fs_uuid": "", "journal": "/dev/sda2", "journal_device": "/dev/sda", "journal_num": 2, "partition": "/dev/sda1", "partition_num": "1"}]}
msg: Container exited with non-zero return code
failed: [opst4] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'', u'journal_num': 2, u'partition_num': u'1', u'journal_device': u'/dev/sda', u'journal':u'/dev/sda2', u'partition': u'/dev/sda1', u'fs_label': u'', u'external_journal': False})) => {"changed": true, "failed": true, "item": [0, {"device": "/dev/sda", "external_journal": false, "fs_label": "", "fs_uuid": "", "journal": "/dev/sda2", "journal_device": "/dev/sda", "journal_num": 2, "partition": "/dev/sda1", "partition_num": "1"}]}
msg: Container exited with non-zero return code
failed: [opst3] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'', u'journal_num': 2, u'partition_num': u'1', u'journal_device': u'/dev/sda', u'journal':u'/dev/sda2', u'partition': u'/dev/sda1', u'fs_label': u'', u'external_journal': False})) => {"changed": true, "failed": true, "item": [0, {"device": "/dev/sda", "external_journal": false, "fs_label": "", "fs_uuid": "", "journal": "/dev/sda2", "journal_device": "/dev/sda", "journal_num": 2, "partition": "/dev/sda1", "partition_num": "1"}]}
msg: Container exited with non-zero return code
failed: [opst2] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'', u'journal_num': 2, u'partition_num': u'1', u'journal_device': u'/dev/sda', u'journal':u'/dev/sda2', u'partition': u'/dev/sda1', u'fs_label': u'', u'external_journal': False})) => {"changed": true, "failed": true, "item": [0, {"device": "/dev/sda", "external_journal": false, "fs_label": "", "fs_uuid": "", "journal": "/dev/sda2", "journal_device": "/dev/sda", "journal_num": 2, "partition": "/dev/sda1", "partition_num": "1"}]}
msg: Container exited with non-zero return code
failed: [opst1] => (item=(0, {u'device': u'/dev/sda', u'fs_uuid': u'', u'journal_num': 2, u'partition_num': u'1', u'journal_device': u'/dev/sda', u'journal':u'/dev/sda2', u'partition': u'/dev/sda1', u'fs_label': u'', u'external_journal': False})) => {"changed": true, "failed":...


Revision history for this message
Sam Yaple (s8m) wrote :

The logs from the containers would be needed here. Please look at the bootstrap_osd_* logs and post them here.

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Here are the logs. Thanks!

$ docker volume ls
DRIVER VOLUME NAME
local ceph_mon
local a9d0c5a17f26bd403ae7bc7f2096d52ebcbe82ec7bb5051ac70ba8dfb0a56130
local kolla_logs
local heka_socket
local heka
local ceph_mon_config

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
ea387c55b9b5 10.0.112.61:4000/kollaglue/centos-binary-ceph-mon:2.0.0 "kolla_start" 56 minutes ago Up 56 minutes ceph_mon
8894cbccc227 10.0.112.61:4000/kollaglue/centos-binary-cron:2.0.0 "kolla_start" 56 minutes ago Up 56 minutes cron
25e79fb927e3 10.0.112.61:4000/kollaglue/centos-binary-kolla-toolbox:2.0.0 "/bin/sleep infinity" 56 minutes ago Up 56 minutes kolla_toolbox
66705fcdcc04 10.0.112.61:4000/kollaglue/centos-binary-heka:2.0.0 "kolla_start" 56 minutes ago Up 56 minutes heka
e1fd3ffeee50 registry:2 "/bin/registry /etc/d" 7 hours ago Up 7 hours 0.0.0.0:4000->5000/tcp registry

docker logs ceph_mon
INFO:__main__:Kolla config strategy set to: COPY_ALWAYS
INFO:__main__:Loading config file at /var/lib/kolla/config_files/config.json
INFO:__main__:Validating config file
INFO:__main__:Copying service configuration files
INFO:__main__:Removing existing destination: /etc/ceph/ceph.conf
INFO:__main__:Copying /var/lib/kolla/config_files/ceph.conf to /etc/ceph/ceph.conf
INFO:__main__:Setting permissions for /etc/ceph/ceph.conf
INFO:__main__:/var/lib/kolla/config_files/ceph.client.admin.keyring does not exist, but is not required
INFO:__main__:/var/lib/kolla/config_files/ceph.client.mon.keyring does not exist, but is not required
INFO:__main__:/var/lib/kolla/config_files/ceph.client.radosgw.keyring does not exist, but is not required
INFO:__main__:/var/lib/kolla/config_files/ceph.monmap does not exist, but is not required
INFO:__main__:Writing out command to execute
creating /tmp/ceph.mon.keyring
importing contents of /etc/ceph/ceph.client.admin.keyring into /tmp/ceph.mon.keyring
importing contents of /etc/ceph/ceph.client.mon.keyring into /tmp/ceph.mon.keyring
ceph-mon: set fsid to a80c84b4-cee4-4419-a950-63fdb70bef21
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-10.0.112.61 for mon.10.0.112.61
Running command: '/usr/bin/ceph-mon -d -i 10.0.112.61 --public-addr 10.0.112.61:6789'
2016-03-22 16:20:42.127794 7ff00533c880 0 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43), process ceph-mon, pid 1
2016-03-22 16:20:42.240972 7ff00533c880 0 starting mon.10.0.112.61 rank 0 at 10.0.112.61:6789/0 mon_data /var/lib/ceph/mon/ceph-10.0.112.61 fsid a80c84b4-cee4-4419-a950-63fdb70bef21
starting mon.10.0.112.61 rank 0 at 10.0.112.61:6789/0 mon_data /var/lib/ceph/mon/ceph-10.0.112.61 fsid a80c84b4-cee4-4419-a950-63fdb70bef21
2016-03...

Revision history for this message
Sam Yaple (s8m) wrote :

Ahmad, you need to properly clean out your environment.

remove all ceph containers as seen by `docker ps -a`
remove all ceph volumes as seen by `docker volume ls`
remove all ceph configs in /etc/kolla/ceph-*
umount anything in /var/lib/ceph/osd/*
remove any ceph /etc/fstab entries

And then reattempt your task. What you are seeing is most common in dev environments that are not properly cleaned, as shown above.
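
For the umount and fstab items, something along these lines on each storage node (a sketch only; check what is actually mounted and listed in /etc/fstab before editing it):

umount /var/lib/ceph/osd/* 2>/dev/null
sed -i '/\/var\/lib\/ceph\/osd/d' /etc/fstab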

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Thanks for your help! But it is still failing for me.

As I mentioned in a previous post, I cleaned up all 10 hosts with tools/cleanup-containers and tools/cleanup-host.

docker ps -a and docker volume ls show no containers or volumes on all 10 hosts except for a registry on one host.
/etc/kolla/ is cleaned on all, no ceph entries in /etc/fstab, /var/lib/ceph does not exist.
After making sure everything is clean, I deployed again but got the same failure.

Any hints or clues on where to look or what to try differently?
I followed the quick start guide and the Ceph guide. Are there any special requirements for a multi-node setup that are not in the docs?

Revision history for this message
Ahmad Al-Shishtawy (alshishtawy) wrote :

Problem solved!

It was a bad firewall configuration on the controller nodes, and in addition to unmounting /etc/kolla/ceph-* I had to remove the partitions and reinitialize the disks with the KOLLA_CEPH_OSD_BOOTSTRAP flag.
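
For anyone hitting the same symptom: the Ceph monitors listen on TCP 6789 (and the OSDs on 6800-7300 by default), so those ports must be reachable between the nodes, and after wiping the partitions the KOLLA_CEPH_OSD_BOOTSTRAP label has to be reapplied (as in the parted example near the top of this report). A sketch of the firewall part, assuming firewalld; adjust for whatever tooling the controllers use:

firewall-cmd --permanent --add-port=6789/tcp
firewall-cmd --permanent --add-port=6800-7300/tcp
firewall-cmd --reload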

Thanks for the help!

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 2.0.0

This issue was fixed in the openstack/kolla 2.0.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 1.1.0

This issue was fixed in the openstack/kolla 1.1.0 release.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/kolla 3.0.0.0b1

This issue was fixed in the openstack/kolla 3.0.0.0b1 development milestone.
