Different ceph-osd processes use the same journals on SSD

Bug #1280752 reported by Gleb
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Ryan Moe
Milestone: 4.1

Bug Description

{"ostf_sha": "83ada35fec2664089e07fdc0d34861ae2a4d948a", "fuelmain_sha": "17eed776b30886851ae0042fa7a30184f5cd8eb6", "astute_sha": "8b2059a37be9bd82df49f684822727b4df4c511b", "release": "4.0", "nailgun_sha": "ac02e18990cd652db6577ce42bdea9838076c63c", "fuellib_sha": "098f381ff8a528a39d3b6f17ea70955baeb159e8"}

Deployment of a ceph-osd node with 2 SSDs (used for journals).

After the successful deployment we have the following situation:

root@node-25:~# find /var/lib/ceph/osd/ -name journal -exec ls -la {} \;
lrwxrwxrwx 1 root root 9 Feb 15 11:15 /var/lib/ceph/osd/ceph-66/journal -> /dev/sdb2
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-57/journal -> /dev/sda9
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-8/journal -> /dev/sda3
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-49/journal -> /dev/sda8
lrwxrwxrwx 1 root root 9 Feb 15 11:15 /var/lib/ceph/osd/ceph-83/journal -> /dev/sdb4
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-9/journal -> /dev/sda3
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-45/journal -> /dev/sda7
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-21/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-29/journal -> /dev/sda5
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-41/journal -> /dev/sda7
lrwxrwxrwx 1 root root 9 Feb 15 11:15 /var/lib/ceph/osd/ceph-74/journal -> /dev/sdb3
lrwxrwxrwx 1 root root 9 Feb 15 11:15 /var/lib/ceph/osd/ceph-78/journal -> /dev/sdb3
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-17/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-6/journal -> /dev/sda2
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-25/journal -> /dev/sda5
lrwxrwxrwx 1 root root 9 Feb 15 11:15 /var/lib/ceph/osd/ceph-70/journal -> /dev/sdb2
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-33/journal -> /dev/sda6
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-1/journal -> /dev/sda2
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-62/journal -> /dev/sda9
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-37/journal -> /dev/sda6
lrwxrwxrwx 1 root root 9 Feb 15 11:14 /var/lib/ceph/osd/ceph-54/journal -> /dev/sda8

Every SSD partition is used twice, by two different ceph-osd processes.
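
A quick way to confirm the duplication is to count how many OSDs point at each journal target (a sketch against the layout shown above):

# count OSDs per journal target; any count above 1 means a shared journal partition
find /var/lib/ceph/osd/ -name journal -exec readlink {} \; | sort | uniq -c | sort -rn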

Here is the output of another command:
root@node-25:~# grep 'Running command' /root/ceph.log
2014-02-15 11:14:13,211 [node-25][INFO ] Running command: udevadm trigger --subsystem-match=block --action=add
2014-02-15 11:14:13,290 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdd2 /dev/sda2
2014-02-15 11:14:15,696 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sde2 /dev/sda2
2014-02-15 11:14:19,596 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdf2 /dev/sda3
2014-02-15 11:14:23,102 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdg2 /dev/sda3
2014-02-15 11:14:26,725 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdh2 /dev/sda4
2014-02-15 11:14:30,207 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdi2 /dev/sda4
2014-02-15 11:14:32,951 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdj2 /dev/sda5
2014-02-15 11:14:35,935 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdk2 /dev/sda5
2014-02-15 11:14:38,422 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdl2 /dev/sda6
2014-02-15 11:14:42,011 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdm2 /dev/sda6
2014-02-15 11:14:43,990 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdn2 /dev/sda7
2014-02-15 11:14:45,742 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdo2 /dev/sda7
2014-02-15 11:14:48,628 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdp2 /dev/sda8
2014-02-15 11:14:50,352 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdq2 /dev/sda8
2014-02-15 11:14:52,977 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdr2 /dev/sda9
2014-02-15 11:14:55,607 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sds2 /dev/sda9
2014-02-15 11:14:58,955 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdt2 /dev/sdb2
2014-02-15 11:15:02,629 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdu2 /dev/sdb2
2014-02-15 11:15:04,643 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdv2 /dev/sdb3
2014-02-15 11:15:07,295 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdw2 /dev/sdb3
2014-02-15 11:15:09,395 [node-25][INFO ] Running command: ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdx2 /dev/sdb4

Also, the number of journal partitions is not enough: we have 22 ceph-osd processes and only 21 journal partitions.

root@node-25:~# parted /dev/sda print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sda: 400GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
 1 17.4kB 25.2MB 25.1MB primary bios_grub
 2 25.2MB 10.8GB 10.7GB primary
 3 10.8GB 21.5GB 10.7GB primary
 4 21.5GB 32.2GB 10.7GB primary
 5 32.2GB 43.0GB 10.7GB primary
 6 43.0GB 53.7GB 10.7GB primary
 7 53.7GB 64.4GB 10.7GB primary
 8 64.4GB 75.2GB 10.7GB primary
 9 75.2GB 85.9GB 10.7GB primary
10 85.9GB 96.7GB 10.7GB primary
11 96.7GB 107GB 10.7GB primary
12 107GB 118GB 10.7GB primary

root@node-25:~# parted /dev/sdb print
Model: HP LOGICAL VOLUME (scsi)
Disk /dev/sdb: 400GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number Start End Size File system Name Flags
 1 17.4kB 25.2MB 25.1MB primary bios_grub
 2 25.2MB 10.8GB 10.7GB primary
 3 10.8GB 21.5GB 10.7GB primary
 4 21.5GB 32.2GB 10.7GB primary
 5 32.2GB 43.0GB 10.7GB primary
 6 43.0GB 53.7GB 10.7GB primary
 7 53.7GB 64.4GB 10.7GB primary
 8 64.4GB 75.2GB 10.7GB primary
 9 75.2GB 85.9GB 10.7GB primary
10 85.9GB 96.7GB 10.7GB primary
11 96.7GB 107GB 10.7GB primary
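
For reference, the number of OSD data directories can be compared with the number of journal-capable partitions on the two SSDs like this (a sketch; partition #1 on each SSD is the bios_grub partition and does not count as a journal):

# number of OSDs on the node
ls -d /var/lib/ceph/osd/ceph-* | wc -l
# number of GPT partitions per SSD (subtract 1 each for bios_grub)
parted -sm /dev/sda print | grep -cE '^[0-9]+:'
parted -sm /dev/sdb print | grep -cE '^[0-9]+:'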

Changed in fuel:
importance: Undecided → Critical
assignee: nobody → Dmitry Borodaenko (dborodaenko)
milestone: none → 4.1
Gleb (gleb-q)
description: updated
Revision history for this message
Andrey Korolyov (xdeller) wrote :
Mike Scherbakov (mihgen)
tags: added: ceph customer-found
Revision history for this message
Roman Sokolkov (rsokolkov) wrote :

I also have 3 nodes, each with 2 identical SSDs and 20 OSDs per server.

Andrey, are these steps correct to run before your script?

http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/#stopping-w-out-rebalancing

My guess:

node1# ceph osd set noout
node1# for i in id1 id2 id3...; do ceph osd stop osd.$i; done
node1# ./ultimate_dangerous_script_beware_of_dragons.sh /dev/sdb /dev/sdw <--- SSDs
node1# for i in id1 id2 id3...; do ceph osd start osd.$i; done
node1# ceph osd unset noout

And same for node2, node3.

Revision history for this message
Gleb (gleb-q) wrote :

There are some logs from the environment.
Unfortunately, the JSON and YAML files contain irrelevant disk information after the deployment
(see bug https://bugs.launchpad.net/fuel/+bug/1280978).

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Looks like the provisioning serializer generates incorrect data for the installer, leaving a single partition for the journal instead of a journal-per-disk mapping.

Changed in fuel:
status: New → Triaged
assignee: Dmitry Borodaenko (dborodaenko) → Fuel Python Team (fuel-python)
Revision history for this message
Gleb (gleb-q) wrote :

Here is exactly what I did in the GUI.
I have 22 SAS HDDs and 2 SSDs on the node.
So I assigned both SSDs entirely as Ceph Journal, assigned one of the SAS disks as Base System, and assigned the other 21 SAS disks as Ceph.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This is not how the user is expected to configure it. Currently, you would need to partition these disks into 21 partitions, each one serving as a dedicated Ceph journal for its corresponding data disk. This is not currently achievable through the UI; you can alter the disk configuration using the API.
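
For illustration, a correct one-to-one mapping would pair each data disk with its own, previously unused SSD partition, roughly like this (a sketch reusing device names from the log above; not the exact commands Fuel runs):

# each data disk gets its own journal partition; no journal partition is passed twice
ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdd2 /dev/sda2
ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sde2 /dev/sda3
ceph-disk-prepare --fs-type xfs --cluster ceph -- /dev/sdf2 /dev/sda4
# ...and so on for the remaining data disks and journal partitions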

Changed in fuel:
importance: Critical → Medium
milestone: 4.1 → 5.0
Revision history for this message
Gleb (gleb-q) wrote :

You should state this in the disk configuration dialog, in red letters.

I was told that automatic partitioning for journals works out of the box, so I configured it this way.
And if we look at the original post (find /var/lib/ceph/osd/ -name journal -exec ls -la {} \;), we can see that Fuel nevertheless created a number of journal partitions for the different OSDs, but paired them the wrong way.

Revision history for this message
Andrew Woodward (xarses) wrote :

It looks like the partitions were created correctly, but the Puppet fact that is supposed to pair them is broken.

See https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/ceph/lib/facter/ceph_osd.rb#L30-35

Andrew Woodward (xarses)
Changed in fuel:
milestone: 5.0 → 4.1
importance: Medium → High
assignee: Fuel Python Team (fuel-python) → Ryan Moe (rmoe)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/74541

Changed in fuel:
status: Triaged → In Progress
tags: added: release-notes
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/74541
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=491998df46eb83104ed53c3610f09dff373d45c1
Submitter: Jenkins
Branch: master

commit 491998df46eb83104ed53c3610f09dff373d45c1
Author: Ryan Moe <email address hidden>
Date: Tue Feb 18 15:41:12 2014 -0800

    Assign journal devices to OSDs sequentially

    If there are more OSDs than journals then the
    remaining OSDs will place their journals on their own disks.

    Closes-bug: #1280752
    Change-Id: If78615450b8daaece69dca9d6cd409f79eb3ac01
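
A minimal sketch of the assignment logic described in the commit message (illustrative only; the real fix lives in the Puppet fact in fuel-library, and the device lists here are made up):

# pair data disks with journal partitions one-to-one; once the journal
# partitions run out, the remaining OSDs keep their journal on their own disk
osds=(/dev/sdd /dev/sde /dev/sdf /dev/sdg)   # example data disks
journals=(/dev/sda2 /dev/sda3 /dev/sda4)     # example journal partitions
for i in "${!osds[@]}"; do
  if [ "$i" -lt "${#journals[@]}" ]; then
    echo "${osds[$i]} -> journal ${journals[$i]}"
  else
    echo "${osds[$i]} -> journal on the same disk"
  fi
done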

Changed in fuel:
status: In Progress → Fix Committed
Andrew Woodward (xarses)
Changed in fuel:
status: Fix Committed → Fix Released