Some issues with the bluestore code

Bug #1776888 reported by wangwei
This bug affects 1 person
Affects: kolla
Status: Won't Fix
Importance: Undecided
Assigned to: wangwei

Bug Description

I tested the latest bluestore code, namely Tone Zhang's patches:
kolla:
https://review.openstack.org/#/c/566810/
kolla-ansible:
https://review.openstack.org/#/c/566801/9

1.

PS:
I have only encountered the following problem on our cloud virtual machines; using partlabel works fine on a VMware virtual machine. If anyone else encounters this problem, please leave a description of your environment.

The following is the original description:

In my tests, the osd bootstrap failed when executing this command:

ceph-osd -i "${OSD_ID}" --mkfs -k "${OSD_DIR}"/keyring --osd-uuid "${OSD_UUID}"

The logs are as follows:

```
++ partprobe
++ ln -sf /dev/disk/by-partlabel/KOLLA_CEPH_DATA_BS_B_2 /var/lib/ceph/osd/ceph-2/block
++ '[' -n '' ']'
++ '[' -n '' ']'
++ ceph-osd -i 2 --mkfs -k /var/lib/ceph/osd/ceph-2/keyring --osd-uuid b5703869-87d1-4ab8-be11-ab24db2870cc
```

So I added the "-d" flag to debug the problem:
ceph-osd -d -i "${OSD_ID}" --mkfs -k "${OSD_DIR}"/keyring --osd-uuid "${OSD_UUID}"

```
++ ceph-osd -d -i 0 --mkfs -k /var/lib/ceph/osd/ceph-0/keyring --osd-uuid e14d5061-ae41-4c16-bf3c-2e9c5973cb54
2018-06-13 17:29:53.216034 7f808b6a0d80 0 ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a) luminous (stable), process (unknown), pid 78
2018-06-13 17:29:53.243358 7f808b6a0d80 0 stack NetworkStack max thread limit is 24, switching to this now. Higher thread values are unnecessary and currently unsupported.
2018-06-13 17:29:53.248479 7f808b6a0d80 1 bluestore(/var/lib/ceph/osd/ceph-0) mkfs path /var/lib/ceph/osd/ceph-0
2018-06-13 17:29:53.248676 7f808b6a0d80 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2018-06-13 17:29:53.248714 7f808b6a0d80 -1 bluestore(/var/lib/ceph/osd/ceph-0/block) _read_bdev_label failed to open /var/lib/ceph/osd/ceph-0/block: (2) No such file or directory
2018-06-13 17:29:53.249134 7f808b6a0d80 -1 bluestore(/var/lib/ceph/osd/ceph-0) _read_fsid unparsable uuid
2018-06-13 17:29:53.249141 7f808b6a0d80 1 bluestore(/var/lib/ceph/osd/ceph-0) mkfs using provided fsid e14d5061-ae41-4c16-bf3c-2e9c5973cb54
2018-06-13 17:29:53.249361 7f808b6a0d80 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2018-06-13 17:29:53.249372 7f808b6a0d80 1 bdev(0x563ef1b19600 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2018-06-13 17:29:53.249400 7f808b6a0d80 -1 bdev(0x563ef1b19600 /var/lib/ceph/osd/ceph-0/block) open open got: (2) No such file or directory
2018-06-13 17:29:53.249654 7f808b6a0d80 -1 bluestore(/var/lib/ceph/osd/ceph-0) mkfs failed, (2) No such file or directory
2018-06-13 17:29:53.249662 7f808b6a0d80 -1 OSD::mkfs: ObjectStore::mkfs failed with error (2) No such file or directory
2018-06-13 17:29:53.249950 7f808b6a0d80 -1 ** ERROR: error creating empty object store in /var/lib/ceph/osd/ceph-0: (2) No such file or directory
```

Further testing showed that after executing this command:

```
sgdisk "--change-name=2:KOLLA_CEPH_DATA_BS_B_${OSD_ID}" "--typecode=2:${CEPH_OSD_TYPE_CODE}" -- "${OSD_BS_BLOCK_DEV}"
```
It took about 3 seconds for the by-partlabel directory and the partlabel symlink to appear on my CentOS virtual machine, while the partuuid symlink was generated immediately, with no delay.

So I think using partuuid is better than partlabel when initializing Ceph.

In the ceph-deploy tool, the command that initializes the OSD is 'ceph-disk prepare', which also uses partuuid.
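
As a rough sketch of what a partuuid-based link could look like (OSD_BS_BLOCK_DEV, the partition number and OSD_DIR are assumptions taken from the surrounding scripts, not the actual patch):

```
# Sketch only: resolve the block partition by PARTUUID instead of by partlabel.
# "${OSD_BS_BLOCK_DEV}2" assumes the block partition is partition 2 on that disk.

# Let udev finish creating the /dev/disk/by-* symlinks before relying on them.
udevadm settle --timeout=600

# Read the PARTUUID of the block partition directly from the device.
BLOCK_PARTUUID=$(blkid -s PARTUUID -o value "${OSD_BS_BLOCK_DEV}2")

# Link the OSD block device by partuuid, which appeared immediately in my test,
# instead of by partlabel, which showed up only after a delay.
ln -sf /dev/disk/by-partuuid/"${BLOCK_PARTUUID}" "${OSD_DIR}"/block
```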

2.

In the current find_disks.py logic, both bluestore and filestore return all disk information, including osd partition, journal partition, block partition, wal partition and db partition.
```
"bs_db_device": "",
"bs_db_label": "",
"bs_db_partition_num": "",
"bs_wal_device": "",
"bs_wal_label": "",
"bs_wal_partition_num": "",
"device": "/dev/xvdb",
"external_journal": false,
"fs_label": "",
"fs_uuid": "cd711f44-2fa8-41c8-8f74-b43e96758edd",
"journal": "",
"journal_device": "",
"journal_num": 0,
"partition": "/dev/xvdb",
"partition_label": "KOLLA_CEPH_OSD_BOOTSTRAP_BS",
"partition_num": "1"
```
This is a bit confusing. In fact, a filestore OSD has only an osd partition and a journal partition, while a bluestore OSD has an osd data partition, a block partition, a wal partition and a db partition.

I think we should distinguish between the bluestore and filestore disk information, like this:

```bluestore
"osds_bootstrap": [
        {
            "bs_blk_device": "/dev/sdb",
            "bs_blk_partition": "/dev/sdb2",
            "bs_blk_partition_num": 2,
            "fs_label": "",
            "fs_uuid": "d5ca7d92-457e-484c-a7f3-7b0497249f87",
            "osd_device": "/dev/sdb",
            "osd_partition": "/dev/sdb1",
            "osd_partition_num": "1",
            "store_type": "bluestore",
            "use_entire_disk": true
        },

"osds_bootstrap": [
        {
            "bs_blk_device": "/dev/sdb",
            "bs_blk_partition": "/dev/sdb2",
            "bs_blk_partition_num": 2,
            "bs_db_device": "/dev/sdc",
            "bs_db_partition": "/dev/sdc2",
            "bs_db_partition_num": "2",
            "bs_wal_device": "/dev/sdc",
            "bs_wal_partition": "/dev/sdc1",
            "bs_wal_partition_num": "1",
            "fs_label": "",
            "fs_uuid": "f1590016-2bf2-4690-b9cb-497a95eacac0",
            "osd_device": "/dev/sdb",
            "osd_partition": "/dev/sdb1",
            "osd_partition_num": "1",
            "store_type": "bluestore",
            "use_entire_disk": true
        }
    ]

```

```filestore
"osds_bootstrap": [
        {
            "fs_label": "",
            "fs_uuid": "0d965d41-2027-4713-ba24-3e0f53ce5ec2",
            "journal_device": "/dev/sdb",
            "journal_num": 2,
            "journal_partition": "/dev/sdb2",
            "osd_device": "/dev/sdb",
            "osd_partition": "/dev/sdb1",
            "osd_partition_num": "1",
            "store_type": "filestore",
            "use_entire_disk": true
        },
        {
            "fs_label": "",
            "fs_uuid": "",
            "journal_device": "/dev/sdc",
            "journal_num": "2",
            "journal_partition": "/dev/disk/by-partuuid/81f04fbf-f272-4073-9217-cf02805dda17",
            "osd_device": "/dev/sdc",
            "osd_partition": "/dev/sdc1",
            "osd_partition_num": "1",
            "store_type": "filestore",
            "use_entire_disk": false
        }
    ]
```

3.

The osd partition labels after successful initialization are as follows:

```
KOLLA_CEPH_BSDATA_1
KOLLA_CEPH_DATA_BS_B_1
KOLLA_CEPH_DATA_BS_D_1
KOLLA_CEPH_DATA_BS_W_1

```
The prefixes differ, so we cannot find the disks with the same logic used for the filestore.

So I think a better naming scheme would be:

```
KOLLA_CEPH_DATA_BS_1
KOLLA_CEPH_DATA_BS_1_B
KOLLA_CEPH_DATA_BS_1_D
KOLLA_CEPH_DATA_BS_1_W
```
A consistent naming scheme can also reduce the amount of code.
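
For example, with a consistent prefix the block, db and wal links could be created in one small loop instead of three separate branches (a sketch only, assuming the renamed labels above and the OSD_ID/OSD_DIR variables from the bootstrap script):

```
# Sketch: with labels of the form KOLLA_CEPH_DATA_BS_<id>_<suffix>,
# block, db and wal share one code path.
for suffix in B D W; do
    label="KOLLA_CEPH_DATA_BS_${OSD_ID}_${suffix}"
    case "${suffix}" in
        B) target="block" ;;
        D) target="block.db" ;;
        W) target="block.wal" ;;
    esac
    # Only link partitions that actually exist for this OSD.
    if [ -e "/dev/disk/by-partlabel/${label}" ]; then
        ln -sf "/dev/disk/by-partlabel/${label}" "${OSD_DIR}/${target}"
    fi
done
```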

Similarly, the bootstrap partition labels for each OSD should follow the same approach:

```
KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1
KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_B
KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_W
KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_D
```
The simplest label is:

```
KOLLA_CEPH_OSD_BOOTSTRAP_BS
```

4.

With the naming scheme above, we can deploy in three ways.

   1) The disk has only one partition, or the label of its first partition is "KOLLA_CEPH_OSD_BOOTSTRAP_BS". If you use this label, kolla will by default split the entire disk into a 100M osd data partition and a block partition.

e.g:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS 1 -1
```
result:
```
Number Start End Size File system Name Flags
 1 1049kB 106MB 105MB xfs KOLLA_CEPH_DATA_BS_2
 2 106MB 107GB 107GB KOLLA_CEPH_DATA_BS_2_B
```

   2) The disk has only one partition, or the label of its first partition is "KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1", and you do not specify a block partition "KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_B". In that case kolla will initialize the entire disk as osd data and block. If you specify additional wal and db partitions (not on the same disk as the osd partition), kolla will initialize the wal and db according to your definition.

e.g:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 2048

sudo /sbin/parted /dev/loop0 -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_W 1 -1
```
result:
```
Disk /dev/xvdb: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
 1 1049kB 106MB 105MB xfs KOLLA_CEPH_DATA_BS_2
 2 106MB 107GB 107GB KOLLA_CEPH_DATA_BS_2_B

Model: Loopback device (loopback)
Disk /dev/loop0: 10.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
 1 1049kB 10.7GB 10.7GB KOLLA_CEPH_DATA_BS_2_W

```

   3) If you specify both the osd partition "KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1" and the block partition "KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_B", kolla will initialize the disk according to your definition. If you specify additional wal and db partitions, kolla will initialize the wal and db according to your definition as well. In this case you can define your partitions arbitrarily: the four partitions can be on the same disk or on different disks.

e.g:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 200
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_W 201 2249
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_D 2250 4298
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1_B 4299 100%
```
result:
```
Number Start End Size File system Name Flags
 1 1049kB 200MB 199MB xfs KOLLA_CEPH_DATA_BS_1
 2 201MB 2249MB 2048MB KOLLA_CEPH_DATA_BS_1_W
 3 2250MB 4298MB 2048MB KOLLA_CEPH_DATA_BS_1_D
 4 4299MB 107GB 103GB KOLLA_CEPH_DATA_BS_1_B

```

5.

When calculating OSD_INITIAL_WEIGHT, if the partition is a block partition (a raw device), the following command prints an error. The error does not affect the result, so "|| true" needs to be added to ignore it:

```
OSD_INITIAL_WEIGHT=$(parted --script ${WEIGHT_PARTITION} unit TB print | awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}')
```
The error looks like this:
```
++ [[ auto == \a\u\t\o ]]
+++ parted --script /dev/xvdb2 unit TB print
+++ awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}'
Error: /dev/xvdb2: unrecognised disk label
```
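
A minimal sketch of the proposed change (the same command with "|| true" appended; per the observation above, the Disk size line is still printed despite the error, so the parsed weight is unaffected):

```
# Sketch: "|| true" keeps the non-zero exit status from parted
# ("unrecognised disk label" on the raw block partition) from aborting the script.
OSD_INITIAL_WEIGHT=$(parted --script "${WEIGHT_PARTITION}" unit TB print | awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}' || true)
```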

6.

https://review.openstack.org/#/c/575346/
This patch added support for loop devices, but the added checks for whether a device is a loop device are not actually necessary; we can get the partition paths directly from find_disks.py. For example, prepare four loop devices:

```
sudo /sbin/parted /dev/loop0 -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO_B 1 -1
sudo /sbin/parted /dev/loop1 -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO 1 -1
sudo /sbin/parted /dev/loop2 -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO_D 1 -1
sudo /sbin/parted /dev/loop3 -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO_W 1 -1
```
The result of find_disks.py then looks like this:
```
"osds_bootstrap": [
        {
            "bs_blk_device": "/dev/loop0",
            "bs_blk_partition": "/dev/loop0p1",
            "bs_blk_partition_num": "1",
            "bs_blk_partition_uuid": "b4ac5b80-7015-431b-8256-407769d22907",
            "bs_db_device": "/dev/loop2",
            "bs_db_partition": "/dev/loop2p1",
            "bs_db_partition_num": "1",
            "bs_db_partition_uuid": "ae16d004-c8f9-4696-a514-ad6c0f23429e",
            "bs_wal_device": "/dev/loop3",
            "bs_wal_partition": "/dev/loop3p1",
            "bs_wal_partition_num": "1",
            "bs_wal_partition_uuid": "7d2c2986-639d-4889-a664-456137ec8fb2",
            "fs_label": "",
            "fs_uuid": "000980bc-6a28-4c0a-b18c-f6726bedeb69",
            "osd_device": "/dev/loop1",
            "osd_partition": "/dev/loop1p1",
            "osd_partition_num": "1",
            "osd_partition_uuid": "27ba1939-6cb9-4de9-9f14-027c4f6c856f",
            "store_type": "bluestore",
            "use_entire_disk": false
        }
    ]
```
So we only need to use the corresponding partition paths directly.
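
A sketch of what that could look like in the bootstrap script (OSD_PARTITION and OSD_BS_BLK_PARTITION are illustrative names for the values reported above, not necessarily the exact variables in the patch):

```
# Sketch: consume the partition paths reported by find_disks.py directly, so
# /dev/sdb1 and /dev/loop1p1 are handled by the same code path and no
# "is this a loop device?" branching is needed.
mkfs.xfs -f "${OSD_PARTITION}"
mount "${OSD_PARTITION}" "${OSD_DIR}"
ln -sf "${OSD_BS_BLK_PARTITION}" "${OSD_DIR}"/block
```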

7.

If the ceph luminous package is installed on the host where the osd container runs, the osd container will fail to start after the host reboots.

```
docker logs:
               NAMES
dd94b67a13f9 xxx/pasta-os/centos-source-ceph-osd:cephT-4.0.2.0002 "kolla_start" 2 minutes ago Restarting (1) 10 seconds ago ceph_osd_2
e6110c697e1c xxx/pasta-os/centos-source-ceph-osd:cephT-4.0.2.0002 "kolla_start" 2 minutes ago Restarting (1) 11 seconds ago

df -h logs:
/dev/sdc1 97M 5.3M 92M 6% /var/lib/ceph/osd/ceph-2
/dev/sdb1 97M 5.3M 92M 6% /var/lib/ceph/osd/ceph-0
```
The following commands need to be executed to fix it:

```
[root@ceph-node2 ~]# systemctl stop ceph-osd@0
[root@ceph-node2 ~]# systemctl stop ceph-osd@2

[root@ceph-node2 ~]# umount /var/lib/ceph/osd/ceph-2
[root@ceph-node2 ~]# umount /var/lib/ceph/osd/ceph-0
```
After that, restart the osd container and we can see the correct mounts, which should look like this:
```
/dev/sdc1 97M 5.3M 92M 6% /var/lib/ceph/osd/90a9ac9d-39bc-438e-a24b-aad71757d66a
/dev/sdb1 97M 5.3M 92M 6% /var/lib/ceph/osd/2fbe7fce-2290-4bcf-9961-4227c45e0e62
```
Ceph uses udev rules to auto-mount OSD data partitions; the corresponding OSD partition type GUID is "4fbd7e29-9d25-41b8-afd0-062c0ceff05d". So as long as we change the partition type GUID of the osd partition, we can avoid this behaviour.
https://github.com/ceph/ceph/blob/luminous/udev/95-ceph-osd.rules
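
One possible way to keep the host's udev rule from claiming the partition is to give the OSD data partition a type GUID that 95-ceph-osd.rules does not match, for example the generic Linux filesystem GUID (a sketch; the device and partition number are only an example):

```
# Sketch: retag the OSD data partition with the generic "Linux filesystem"
# type GUID so the host's 95-ceph-osd.rules, which matches the Ceph OSD type
# 4fbd7e29-9d25-41b8-afd0-062c0ceff05d, no longer auto-mounts it after reboot.
sgdisk "--typecode=1:0FC63DAF-8483-4772-8E79-3D69D8477DE4" -- /dev/sdb
partprobe /dev/sdb
```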

I did some optimizations for the above aspects (based on Tone Zhang's patch):
kolla:
https://review.openstack.org/#/c/575400/
kolla-ansible:
https://review.openstack.org/#/c/575408/

wangwei (wangwei-david)
description: updated
wangwei (wangwei-david)
Changed in kolla:
assignee: nobody → wangwei (wangwei-david)
Revision history for this message
wangwei (wangwei-david) wrote :

This is a log of the steps ceph-disk takes to initialize an OSD. I have simplified some of the content; please refer to it:

```
# Get the config of bluestore_block_size, bluestore_block_db_size and bluestore_block_wal_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size
# Get the config of xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
# Create the osd data partition
DEBUG:ceph_disk.main:Creating data partition num 1 size 100 on /dev/xvdb
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:59af5892-2460-4366-aa41-59be7ec71374 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/xvdb
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/xvdb
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/xvdb /usr/sbin/partprobe /dev/xvdb

# Create block.db partition
DEBUG:ceph_disk.main:Creating block.db partition num 1 size 1024 on /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=1:0:+1024M --change-name=1:ceph block.db --partition-guid=1:302a7204-e955-4cda-b8d6-459cee350086 --typecode=1:30cd0809-c2b2-499c-8879-2d6b785292be --mbrtogpt -- /dev/rbd0
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/rbd0 /usr/sbin/partprobe /dev/rbd0
DEBUG:ceph_disk.main:Block.db is GPT partition /dev/disk/by-partuuid/302a7204-e955-4cda-b8d6-459cee350086
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --typecode=1:30cd0809-c2b2-499c-8879-2d6b78529876 -- /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/chown ceph:ceph /dev/rbd0p1

# Create block.wal partition
DEBUG:ceph_disk.main:name = block.wal
DEBUG:ceph_disk.main:Creating block.wal partition num 2 size 576 on /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=2:0:+576M --change-name=2:ceph block.wal --partition-guid=2:3973042a-0df2-40c9-aff0-aad75bf55198 --typecode=2:5ce17fce-4087-4169-b7ff-056cc58472be --mbrtogpt -- /dev/rbd0
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/rbd0 /usr/sbin/partprobe /de...

Changed in kolla:
status: New → In Progress
wangwei (wangwei-david)
description: updated
Revision history for this message
Tone Zhang (tone.zhang) wrote :

Could you please rerun the test with the latest Kolla and Kolla-ansible code?

In the description, when you ran your test, the patches were still under review:
https://review.openstack.org/#/c/566810/
https://review.openstack.org/#/c/566801/9

Please rerun the test. Thanks a lot!

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone Zhang,

I just tested the latest master branch code and the result is the same.

The disk is prepared as follows:
```
Model: Xen Virtual Block Device (xvd)
Disk /dev/xvdb: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:

Number Start End Size File system Name Flags
 1 1049kB 107GB 107GB xfs KOLLA_CEPH_OSD_BOOTSTRAP_BS
```

And the find_disks result is:

```
{
  u'fs_uuid': u'f3dfc4a9-2913-44ea-a3cd-0f5b85436a21',
  u'partition': u'/dev/xvdb',
  u'external_journal': False,
  u'bs_blk_label': u'',
  u'bs_db_partition_num': u'',
  u'journal_device': u'',
  u'journal': u'',
  u'bs_wal_label': u'',
  u'bs_wal_partition_num': u'',
  u'fs_label': u'',
  u'journal_num': 0,
  u'bs_wal_device': u'',
  u'partition_num': u'1',
  u'bs_db_label': u'',
  u'bs_blk_partition_num': u'',
  u'device': u'/dev/xvdb',
  u'bs_db_device': u'',
  u'partition_label': u'KOLLA_CEPH_OSD_BOOTSTRAP_BS',
  u'bs_blk_device': u''
}
```

And the error logs are:

```
++ [[ False == \F\a\l\s\e ]]
++ [[ bluestore == \b\l\u\e\s\t\o\r\e ]]
++ [[ /dev/xvdb =~ /dev/loop ]]
++ sgdisk --zap-all -- /dev/xvdb1
Creating new GPT entries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
++ '[' -n '' ']'
++ sgdisk --zap-all -- /dev/xvdb
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
++ sgdisk --new=1:0:+100M --mbrtogpt -- /dev/xvdb
Creating new GPT entries.
The operation has completed successfully.
++ sgdisk --largest-new=2 --mbrtogpt -- /dev/xvdb
The operation has completed successfully.
++ sgdisk --zap-all -- /dev/xvdb2
Creating new GPT entries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
++ '[' -n '' ']'
++ '[' -n '' ']'
++ partprobe
++ [[ bluestore == \b\l\u\e\s\t\o\r\e ]]
+++ uuidgen
++ OSD_UUID=5a5c1c56-7618-4fca-9847-f58542add2e8
+++ ceph osd new 5a5c1c56-7618-4fca-9847-f58542add2e8
++ OSD_ID=1
++ OSD_DIR=/var/lib/ceph/osd/ceph-1
++ mkdir -p /var/lib/ceph/osd/ceph-1
++ [[ /dev/xvdb =~ /dev/loop ]]
++ mkfs.xfs -f /dev/xvdb1
meta-data=/dev/xvdb1 isize=512 agcount=4, agsize=6400 blks
         = sectsz=512 attr=2, projid32bit=1
         = crc=1 finobt=0, sparse=0
data = bsize=4096 blocks=25600, imaxpct=25
         = sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=855, version=2
         = sectsz=512 sunit=0 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
++ mount /dev/xvdb1 /var/lib/ceph/osd/ceph-1
++ ceph-osd -i 1 --mkkey
++ echo bluestore
++ '[' -n '' ']'
++ sgdisk --change-name=2:KOLLA_CEPH_DATA_BS_B_1 --typecode=2:4FBD7E29-9D25-41B8-AFD0-062C0CEFF...


Revision history for this message
Tone Zhang (tone.zhang) wrote :

Hi Wei,

Thanks!

Could you please show me the results of the "lsblk" and "blkid" commands? And could you please show me the "parted" command you used?

In the above log, kolla-ceph identifies at least two OSDs (the OSD ID is 1 on /dev/xvdb, not 0), but I only see one device for the Ceph OSD.

Thanks a lot.

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

Because I deployed three OSDs, but I only took one of them to show you the error log.
I have three Ceph nodes, with only one disk per node:

node1:
```
sudo sgdisk --zap-all -- /dev/xvdb
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS 1 -1
```
node2:
```
sudo sgdisk --zap-all -- /dev/xvdb
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 -1
```
node3:
```
sudo sgdisk --zap-all -- /dev/xvdb
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_OSD_BOOTSTRAP_BS_FOO1 1 -1
```

there is "lsblk" and "blkid" result:

node1:
```
[root@dev-ww-ceph001-xxx xxx]# lsblk /dev/xvdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdb 202:16 0 100G 0 disk
└─xvdb1 202:17 0 100G 0 part
[root@dev-ww-ceph001-xxx xxx]# blkid /dev/xvdb
/dev/xvdb: PTTYPE="gpt"

```

node2:
```
[root@dev-ww-ceph002-xxx irteamsu]# lsblk /dev/xvdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdb 202:16 0 100G 0 disk
└─xvdb1 202:17 0 100G 0 part
[root@dev-ww-ceph002-xxx irteamsu]# blkid /dev/xvdb
/dev/xvdb: PTTYPE="gpt"
```

node3:

```
[root@dev-ww-ceph003-xxx xxx]# lsblk /dev/xvdb
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdb 202:16 0 100G 0 disk
└─xvdb1 202:17 0 100G 0 part
[root@dev-ww-ceph003-xxx xxx]# blkid /dev/xvdb
/dev/xvdb: PTTYPE="gpt"

```

And I added "ls -al /dev/disk" before the following commands to show you why it goes wrong:
```
ls -al /dev/disk
partprobe || true
ls -al /dev/disk

ln -sf /dev/disk/by-partlabel/KOLLA_CEPH_DATA_BS_B_"${OSD_ID}" "${OSD_DIR}"/block

if [ -n "${OSD_BS_WAL_DEV}" ] && [ "${OSD_BS_BLK_DEV}" != "${OSD_BS_WAL_DEV}" ] && [ -n "${OSD_BS_WAL_PARTNUM}" ]; then
    ln -sf /dev/disk/by-partlabel/KOLLA_CEPH_DATA_BS_W_"${OSD_ID}" "${OSD_DIR}"/block.wal
fi

if [ -n "${OSD_BS_DB_DEV}" ] && [ "${OSD_BS_BLK_DEV}" != "${OSD_BS_DB_DEV}" ] && [ -n "${OSD_BS_DB_PARTNUM}" ]; then
    ln -sf /dev/disk/by-partlabel/KOLLA_CEPH_DATA_BS_D_"${OSD_ID}" "${OSD_DIR}"/block.db
fi

for (( i=10; i>=0; i=i-1 )); do
    ls -al /dev/disk
    sleep 1
    echo "sleep 1s"
done

ceph-osd -d -i "${OSD_ID}" --mkfs -k "${OSD_DIR}"/keyring --osd-uuid "${OSD_UUID}"
```

Here are the logs for each node:

node1:
```
++ [[ False == \F\a\l\s\e ]]
++ [[ bluestore == \b\l\u\e\s\t\o\r\e ]]
++ [[ /dev/xvdb =~ /dev/loop ]]
++ sgdisk --zap-all -- /dev/xvdb1
Creating new GPT entries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
++ '[' -n '' ']'
++ sgdisk --zap-all -- /dev/xvdb
GPT data structures destroyed! You may now partition the disk using fdisk or
other utilities.
++ sgdisk --new=1:0:+100M --mbrtogpt -- /dev/xvdb
Creating new GPT entries.
The operation has completed successfully.
++ sgdisk --largest-new=2 --mbrtogpt -- /dev/xvdb
The operation has completed successfully.
++ sgdisk --zap-all -- /dev/xvdb2
Creating new GPT entries.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot.
GPT data structures destroyed! You may now partition t...

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

I only have Xen virtual machines, no other kind of virtual machine, so I didn't test it on the others.

Revision history for this message
Tone Zhang (tone.zhang) wrote :

Hi Wei,

I have tried creating a partition and then accessing it by partlabel in my test bed with KVM/Qemu and Xen. I tested the operation with Debian, Ubuntu and CentOS, and I cannot reproduce the issue you saw.

Creating the VM on top of bare metal looks more reasonable to me, rather than VM-in-VM.

So could you please run your test with KVM/Qemu and observe the result? In fact, the delay is very hard to understand, because from the kernel's point of view there is no difference. Since it happens in a virtual machine, the VM's performance (even VM-in-VM) will be affected by both host and guest.

Thanks!

Revision history for this message
Tone Zhang (tone.zhang) wrote :

Hi Wei,

As I mentioned, I have tested the partition operations in my test bed with KVM/Qemu and Xen, and I cannot reproduce the issue you saw.

I suggest you re-run the test with other VMs.

Thanks a lot!

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

I tested that the partlabel is generated immediately in the virtual machine itself, but when running docker on that virtual machine, the phenomenon above appears.

Because my virtual machine is our company's cloud virtual machine, our workflow is to verify the deployment on virtual machines first and then deploy the cluster on physical machines to verify performance. I think this scenario is very common, so we should use the most common deployment method. As you said, Ceph supports both partuuid and partlabel, but partuuid is the more common one.

I'm glad you have implemented bluestore with partlabel, but the process is somewhat different from that of the filestore. I think the filestore implementation is better: less code and easier to read.

So I've made some optimizations based on your implementation. My patch handles the disk information more like the filestore does, and I referred to the initialization process of ceph-disk. I hope you can review it.

Thanks very much!^^

Revision history for this message
Tone Zhang (tone.zhang) wrote :

Hi Wei,

Thanks for your information.

I think the first point is why there is a delay in your test environment. I have tested several cases, but I cannot reproduce the issue in my company's lab (with different bare metal, VMs, distros and servers). And I believe you and I are not the only people testing the kolla ceph bluestore OSD.

I agree with you that testing Kolla with VMs is the common way, but there are several kinds of virtual machine projects, and the configuration of the host and the VM is just as significant. We cannot judge the fault based only on one particular environment.

If the issue depends on a particular distro, hardware or condition/configuration, it is better to validate with different environments and collect more information.

Thanks.

wangwei (wangwei-david)
description: updated
Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

Thank you for your feedback. I tested it on a VMware virtual machine and found no partlabel delay.
This issue may be related to the kernel of the cloud VM I am using, so I agree with you: let's continue to use partlabel, and if other people run into this problem, we can discuss it again.

On the basis of continuing to use partlabel, I made some optimizations to the bluestore deployment process. Please review points two through six of the bug description.

Thank you very much!

Revision history for this message
Tone Zhang (tone.zhang) wrote :

Hi Wei,

I appreciate your test and your comments. According to your feedback, I think we can close point 1. Correct?

For point 2, I think it is not a functional fault. The current version of Kolla and Kolla-ansible can handle filestore and bluestore well, correct?

For point 3, the label names have been defined clearly in the spec, and the spec was released several months ago. The code is aligned with the spec. The label names do not introduce any faults, do they?

For point 4, I think you plan to update the documentation. In fact, kolla ceph supports more than three deployment manners. I have some concerns with case 3. Deploying one bluestore OSD on one device is meaningful, but formatting the device with 4 partitions is meaningless. According to the Ceph documentation (http://docs.ceph.com/docs/master), block.wal and block.db should be faster than the primary device. Allocating block, block.wal and block.db on the same device is a bad solution and impacts performance negatively; I have tested it. So kolla ceph should not support case 3 as you described.

For point 5, could you please share the error information with me? I ran the command in my test bed and did not see the error. Thanks in advance.

Thanks!

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

For point 1, right, we can close it.

For point 2, working well doesn't mean good code. I think that is exactly the charm of the open source community: good code is easier for others to read, easier to understand, and more convenient to maintain, isn't it?

For point 3, can't things that have already been released be changed? Where would the meaning of open source be otherwise? The current label logic is clearly more cumbersome to process, so why not use the previous logic?

For point 4, I am not just updating the documentation; I am re-defining the disk information for the three deployment methods in the logic of find_disks.py. I know the meaning of each Ceph partition. The third way gives users a freer way to deploy Ceph: it can be used for testing, and it can also be used to customize the deployment users want. I know that putting all four partitions on one disk is meaningless for production, but it lets users test a kolla deployment with fewer disks, similar to deploying an OSD with 4 loop devices, doesn't it?

For point 5, you see no problem in your test because you have not changed this place for bluestore:

```
OSD_INITIAL_WEIGHT=$(parted --script ${OSD_PARTITION} unit TB print | awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}')
```
The partition used here should be the block partition for bluestore, shouldn't it?
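
A minimal sketch of what I mean (the variable names are assumptions based on the scripts above, not the exact ones in the code):

```
# Sketch: pick the partition used for the weight based on the store type:
# the block partition for bluestore, the osd data partition for filestore.
if [ "${OSD_STORE_TYPE}" = "bluestore" ]; then
    WEIGHT_PARTITION="${OSD_BS_BLK_PARTITION}"
else
    WEIGHT_PARTITION="${OSD_PARTITION}"
fi
OSD_INITIAL_WEIGHT=$(parted --script "${WEIGHT_PARTITION}" unit TB print | awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}' || true)
```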

Please take a look at my code flow and then compare it.

Thanks!

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,

I'll explain now why I made these changes. You only modified the documentation a few months ago, and I was not sure what your idea was until I saw your code. I think this approach is not good, so I made these comments when reviewing your code, but you didn't accept them. So now I have made these changes in the way I think is better.

Revision history for this message
wangwei (wangwei-david) wrote :

Hi Tone,
Since you don't understand why I said your implementation is not good, I will explain in detail:

1. in find_disks.py

 1) First of all, this script is not only used by ceph but also by swift, and other components will use it later. So we want to ensure every component can use it and to isolate each component's logic as much as possible.
But you added a lot of bluestore logic to the main function. Although it works, it is easy to misunderstand: if users do not know about "bluestore", they will not know what "_BS" means.

 2) I think your code for the bluestore disks is too redundant: "extract_disk_info_bs" has to be executed four times, once for each partition.

 3) And if the label is "KOLLA_CEPH_OSD_BOOTSTRAP_BS", your code will only recognize the last disk, not every one. This is the first deployment method I mentioned above, and your code does not support it.

 4) In the final result, you return all of the disk information even when the fields are empty. I don't think this can be ignored: why should the bluestore disk information also include the filestore disk information? Ansible supports passing null variables, and since we can solve this in kolla-ansible, why should we pass these useless variables in kolla?

2. in extend_start.sh

1) About the ceph type codes
Bluestore has four type codes, but you only listed three, and you set the osd type code on the block partition.
https://github.com/ceph/ceph/blob/luminous/udev/95-ceph-osd.rules

2) We all know the journal is a filestore partition, so why use USE_EXTERNAL_JOURNAL to determine the partitions of the bluestore?
And in your code this variable is always false for bluestore, so what is the significance of this variable?

3) You add a lot of logic to determine whether a device is a loop device, so why not use the disk partition variables directly?

4) Finally, about calculating the OSD weight: in the filestore the size of the osd partition is calculated, while in the bluestore the size of the block partition should be calculated, but you have not noticed this problem.

3. in bootstrap_osds.yml

I don't understand why you are passing these parameters. Do you actually use them?

```
OSD_BS_LABEL: "{{ item.1.partition_label | default('') }}"
OSD_BS_BLK_LABEL: "{{ item.1.bs_blk_label | default('') }}"
OSD_BS_WAL_LABEL: "{{ item.1.bs_wal_label | default('') }}"
OSD_BS_DB_LABEL: "{{ item.1.bs_db_label | default('') }}"
```
4. in start_osds.yml
Same problem as above:

```
OSD_BS_FSUUID: "{{ item.1['fs_uuid'] }}"
```

5. in ceph-guide.rst

Your description says that if there are multiple OSDs on the same node, the user should add a suffix. I think we should also support the case where all the disks on the same node have the label "KOLLA_CEPH_OSD_BOOTSTRAP_BS", which avoids the trouble of creating a lot of labels.

Our company has been using kolla to deploy Ceph since the Mitaka version. In our use of Ceph we sometimes do some custom development, and for the version upgrade and the bluestore implementation we are ahead of the community because of our work requirements. In my understanding, we should treat kolla as a tool, not a product. Our users are d...


wangwei (wangwei-david)
description: updated
Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Kolla-Ansible Ceph deployment and Kolla Ceph images have been deprecated and removed (in Ussuri) - I don't think that bug is relevant anymore.

Changed in kolla:
status: In Progress → Won't Fix