Some issues with the bluestore code
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
kolla | Won't Fix | Undecided | wangwei |
Bug Description
I tested the latest bluestore code, which is the patch from Tonezhang:
kolla:
https:/
kolla-ansible:
https:/
1.
PS:
I have only encountered the following problem on cloud virtual machines; using partlabel works fine on a VMware virtual machine. If anyone else encounters this problem, please leave a description of your environment.
The following is the original description:
In my tests, I encountered a problem where the osd bootstrap failed when executing this command:
ceph-osd -i "${OSD_ID}" --mkfs -k "${OSD_
the logs are as follows:
```
++ partprobe
++ ln -sf /dev/disk/
++ '[' -n '' ']'
++ '[' -n '' ']'
++ ceph-osd -i 2 --mkfs -k /var/lib/
```
So I added the "-d" parameter to debug the problem:
ceph-osd -d -i "${OSD_ID}" --mkfs -k "${OSD_
```
++ ceph-osd -d -i 0 --mkfs -k /var/lib/
2018-06-13 17:29:53.216034 7f808b6a0d80 0 ceph version 12.2.5 (cad919881333ac
2018-06-13 17:29:53.243358 7f808b6a0d80 0 stack NetworkStack max thread limit is 24, switching to this now. Higher thread values are unnecessary and currently unsupported.
2018-06-13 17:29:53.248479 7f808b6a0d80 1 bluestore(
2018-06-13 17:29:53.248676 7f808b6a0d80 -1 bluestore(
2018-06-13 17:29:53.248714 7f808b6a0d80 -1 bluestore(
2018-06-13 17:29:53.249134 7f808b6a0d80 -1 bluestore(
2018-06-13 17:29:53.249141 7f808b6a0d80 1 bluestore(
2018-06-13 17:29:53.249361 7f808b6a0d80 1 bdev create path /var/lib/
2018-06-13 17:29:53.249372 7f808b6a0d80 1 bdev(0x563ef1b19600 /var/lib/
2018-06-13 17:29:53.249400 7f808b6a0d80 -1 bdev(0x563ef1b19600 /var/lib/
2018-06-13 17:29:53.249654 7f808b6a0d80 -1 bluestore(
2018-06-13 17:29:53.249662 7f808b6a0d80 -1 OSD::mkfs: ObjectStore::mkfs failed with error (2) No such file or directory
2018-06-13 17:29:53.249950 7f808b6a0d80 -1 ** ERROR: error creating empty object store in /var/lib/
```
Through further testing, I found that after executing this command:
```
sgdisk "--change-
```
On my CentOS virtual machine, it took about 3 seconds for the by-partlabel folder and the partlabel symlink to be generated, whereas the partuuid symlink was generated immediately, with no delay.
So I think using partuuid is better than partlabel when initializing ceph.
In the ceph-deploy tool, the command that initializes the osd is 'ceph-disk prepare', which also uses partuuid.
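To work around the delay, the bootstrap can resolve the osd partition through by-partuuid after waiting for udev. A minimal sketch, assuming partition 1 on /dev/xvdb; the variable names are illustrative, not the actual kolla code:
```
# Illustrative sketch: resolve the osd partition via partuuid instead of
# partlabel. "sgdisk -i" prints the partition's unique GUID, and
# "udevadm settle" waits for the /dev/disk/by-partuuid symlink to appear.
part_uuid="$(sgdisk -i 1 /dev/xvdb | awk '/Partition unique GUID/ {print tolower($4)}')"
udevadm settle --timeout=10
osd_partition="/dev/disk/by-partuuid/${part_uuid}"
ls -l "${osd_partition}"
```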
2.
In the current find_disks.py logic, both bluestore and filestore return all disk information, including osd partition, journal partition, block partition, wal partition and db partition.
```
"bs_db_device": "",
"bs_db_label": "",
"bs_db_
"bs_wal_device": "",
"bs_wal_label": "",
"bs_wal_
"device": "/dev/xvdb",
"external_journal": false,
"fs_label": "",
"fs_uuid": "cd711f44-
"journal": "",
"journal_device": "",
"journal_num": 0,
"partition": "/dev/xvdb",
"partition_label": "KOLLA_
"partition_num": "1"
```
There is a bit of confusion here. In fact, a filestore osd has only an osd partition and a journal partition, while a bluestore osd has an osd data partition, a block partition, a wal partition and a db partition.
I think we should distinguish between the bluestore and filestore disk information, like this:
```bluestore
"osds_bootstrap": [
{
},
"osds_bootstrap": [
{
}
]
```
```filestore
"osds_bootstrap": [
{
},
{
}
]
```
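For illustration, if each entry carried its store type, a consumer could filter out only the relevant partitions; the "store_type" field below is a hypothetical example, not part of the current find_disks.py output:
```
# Hypothetical filter over the find_disks.py result saved as disks.json;
# the "store_type" field is an assumption made for this sketch.
jq -r '.osds_bootstrap[] | select(.store_type == "bluestore") | .partition' disks.json
```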
3.
The osd partition labels after successful initialization are as follows:
```
KOLLA_CEPH_BSDATA_1
KOLLA_CEPH_
KOLLA_CEPH_
KOLLA_CEPH_
```
The prefixes are different, so we can't find the disks with the same logic as the filestore.
So I think a good way is to name them like this:
```
KOLLA_CEPH_
KOLLA_CEPH_
KOLLA_CEPH_
KOLLA_CEPH_
```
A consistent naming scheme can reduce some code.
Similarly, the labels for the partitions of each osd should take the following approach:
```
KOLLA_CEPH_
KOLLA_CEPH_
KOLLA_CEPH_
KOLLA_CEPH_
```
The simplest label is:
```
KOLLA_CEPH_
```
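With a common prefix, all of an osd's bootstrap partitions can be discovered with a single match. A small sketch; the prefix below follows the existing KOLLA_CEPH_OSD_BOOTSTRAP convention, and the exact bluestore suffixes are assumed for illustration:
```
# List every bootstrap partition in one pass; the label prefix is an
# assumed example of the consistent scheme proposed above.
lsblk -o NAME,PARTLABEL | grep 'KOLLA_CEPH_OSD_BOOTSTRAP'
```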
4.
According to the naming method above, we can deploy in three ways.
1) The disk has only one partition or the label of the first partition is "KOLLA_
e.g.:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_
```
result:
```
Number Start End Size File system Name Flags
1 1049kB 106MB 105MB xfs KOLLA_CEPH_
2 106MB 107GB 107GB KOLLA_CEPH_
```
2) If a disk has only one partition, or the label of the first partition is "KOLLA_
e.g.:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/loop0 -s -- mklabel gpt mkpart KOLLA_CEPH_
```
result:
```
Disk /dev/xvdb: 107GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 106MB 105MB xfs KOLLA_CEPH_
2 106MB 107GB 107GB KOLLA_CEPH_
Model: Loopback device (loopback)
Disk /dev/loop0: 10.7GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 10.7GB 10.7GB KOLLA_CEPH_
```
3) If you specify the osd partition "KOLLA_
e.g.:
```
sudo /sbin/parted /dev/xvdb -s -- mklabel gpt mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/xvdb -s mkpart KOLLA_CEPH_
```
result:
```
Number Start End Size File system Name Flags
1 1049kB 200MB 199MB xfs KOLLA_CEPH_
2 201MB 2249MB 2048MB KOLLA_CEPH_
3 2250MB 4298MB 2048MB KOLLA_CEPH_
4 4299MB 107GB 103GB KOLLA_CEPH_
```
5.
In the calculation of OSD_INITIAL_WEIGHT, if the partition is a bluestore block partition, it is a raw device, so the following command prints an error. The error does not affect the result, but we need to add "|| true" to ignore it:
```
OSD_INITIAL_
```
The error like this:
```
++ [[ auto == \a\u\t\o ]]
+++ parted --script /dev/xvdb2 unit TB print
+++ awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}'
Error: /dev/xvdb2: unrecognised disk label
```
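For reference, a sketch of the calculation with the error suppressed; it follows the commands visible in the log above, and "${WEIGHT_PARTITION}" is an assumed placeholder for the partition being weighed:
```
# "|| true" keeps the assignment from failing when parted cannot read a
# partition table from a raw block partition; stderr is discarded so the
# "unrecognised disk label" message is not printed.
OSD_INITIAL_WEIGHT=$(parted --script "${WEIGHT_PARTITION}" unit TB print 2>/dev/null \
    | awk 'match($0, /^Disk.* (.*)TB/, a){printf("%.2f", a[1])}' || true)
```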
6.
https:/
This patch added support for loop devices, but it also added some checks for whether the device is a loop device, which are actually unnecessary. We can get what we need directly from find_disks.py. For example, we prepare four loop devices:
```
sudo /sbin/parted /dev/loop0 -s -- mklabel gpt mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/loop1 -s -- mklabel gpt mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/loop2 -s -- mklabel gpt mkpart KOLLA_CEPH_
sudo /sbin/parted /dev/loop3 -s -- mklabel gpt mkpart KOLLA_CEPH_
```
the result of find_disks.py looks like this:
```
"osds_bootstrap": [
{
}
]
```
So we only need to use the corresponding partition.
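For illustration, the kernel exposes a loop device's partitions as ordinary block devices (assuming the loop device was set up with partition scanning, e.g. losetup -P), so the partition path can be used directly:
```
# Partitions on a loop device appear as normal block devices such as
# /dev/loop0p1, so no loop-specific handling is needed.
lsblk -no NAME,TYPE /dev/loop0
# loop0       loop
# `-loop0p1   part
```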
7.
If the ceph luminous package is installed on the host where the osd container runs, then the osd container will fail to start after the host reboots.
```
docker logs:
dd94b67a13f9 xxx/pasta-
e6110c697e1c xxx/pasta-
df -h logs:
/dev/sdc1 97M 5.3M 92M 6% /var/lib/
/dev/sdb1 97M 5.3M 92M 6% /var/lib/
```
We need to execute the following commands to fix it:
```
[root@ceph-node2 ~]# systemctl stop ceph-osd@0
[root@ceph-node2 ~]# systemctl stop ceph-osd@2
[root@ceph-node2 ~]# umount /var/lib/
[root@ceph-node2 ~]# umount /var/lib/
```
At this point, restart the osd containers and we can see the correct mounts, which should be as follows:
```
/dev/sdc1 97M 5.3M 92M 6% /var/lib/
/dev/sdb1 97M 5.3M 92M 6% /var/lib/
```
Ceph uses udev rules to auto-mount osd partitions; the corresponding osd partition type GUID is "4fbd7e29-
https:/
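One way to keep the host's udev rules from racing with the containers after a reboot is to neutralize the rules installed by the ceph package. A hedged sketch, not the patch itself; the rule path is the one the CentOS luminous package installs:
```
# The ceph package ships udev rules that auto-activate osd partitions by
# their type GUID; moving them aside stops the host from mounting
# partitions that belong to the kolla containers.
sudo mv /usr/lib/udev/rules.d/95-ceph-osd.rules /root/95-ceph-osd.rules.bak
sudo udevadm control --reload
```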
I did some optimizations for the above aspects (based on tonezhang's patch):
kolla:
https:/
kolla-ansible:
https:/
This is the log of the steps that ceph-disk performs to initialize an osd. I have simplified some of the content. Please refer to it:
```
# Get the config of bluestore
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_db_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_size
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup bluestore_block_wal_size
# Get the config of xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
INFO:ceph_disk.main:Running command: /usr/bin/ceph-conf --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
# Create the osd data partition
DEBUG:ceph_disk.main:Creating data partition num 1 size 100 on /dev/xvdb
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=1:0:+100M --change-name=1:ceph data --partition-guid=1:59af5892-2460-4366-aa41-59be7ec71374 --typecode=1:89c57f98-2fe5-4dc0-89c1-f3ad0ceff2be --mbrtogpt -- /dev/xvdb
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/xvdb
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/xvdb /usr/sbin/partprobe /dev/xvdb
# Create block.db partition
DEBUG:ceph_disk.main:Creating block.db partition num 1 size 1024 on /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=1:0:+1024M --change-name=1:ceph block.db --partition-guid=1:302a7204-e955-4cda-b8d6-459cee350086 --typecode=1:30cd0809-c2b2-499c-8879-2d6b785292be --mbrtogpt -- /dev/rbd0
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/rbd0 /usr/sbin/partprobe /dev/rbd0
DEBUG:ceph_disk.main:Block.db is GPT partition /dev/disk/by-partuuid/302a7204-e955-4cda-b8d6-459cee350086
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --typecode=1:30cd0809-c2b2-499c-8879-2d6b78529876 -- /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/chown ceph:ceph /dev/rbd0p1
# Create block.wal partition
DEBUG:ceph_disk.main:name = block.wal
DEBUG:ceph_disk.main:Creating block.wal partition num 2 size 576 on /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/sbin/sgdisk --new=2:0:+576M --change-name=2:ceph block.wal --partition-guid=2:3973042a-0df2-40c9-aff0-aad75bf55198 --typecode=2:5ce17fce-4087-4169-b7ff-056cc58472be --mbrtogpt -- /dev/rbd0
DEBUG:ceph_disk.main:Calling partprobe on created device /dev/rbd0
INFO:ceph_disk.main:Running command: /usr/bin/udevadm settle --timeout=600
INFO:ceph_disk.main:Running command: /usr/bin/flock -s /dev/rbd0 /usr/sbin/partprobe /de...
```