ceph charm doesn't create all bcache backed osds
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Invalid | Undecided | Unassigned |
vaultlocker | Fix Released | Medium | Unassigned |
Bug Description
We have nodes with 11 disks each for use with ceph-osd. However, on deployment only some of the disks are sometimes created as OSDs, e.g. (with 5 identical nodes):
sas-ceph-osd/0* active idle 31 x.x.x.x Unit is ready (11 OSD)
sas-ceph-osd/1 active idle 32 x.x.x.x Unit is ready (3 OSD)
sas-ceph-osd/2 active idle 33 x.x.x.x Unit is ready (3 OSD)
sas-ceph-osd/3 active idle 34 x.x.x.x Unit is ready (11 OSD)
sas-ceph-osd/4 active idle 35 x.x.x.x Unit is ready (11 OSD)
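The affected units can be picked out of `juju status` by comparing the OSD count in each workload message with the number of disks per node. This is a minimal Python sketch, assuming the status lines are available as plain text in the format shown above. `short_units` is a hypothetical helper, not part of the charm:

```python
import re

# Matches a unit line whose workload message ends in "(N OSD)";
# the optional "*" marks the leader unit in juju status output.
OSD_COUNT_RE = re.compile(r"^(\S+?)\*?\s.*\((\d+) OSD\)\s*$")


def short_units(status_lines, expected_osds):
    """Return {unit: osd_count} for units reporting fewer OSDs than expected."""
    short = {}
    for line in status_lines:
        match = OSD_COUNT_RE.match(line.strip())
        if match is None:
            continue
        unit, count = match.group(1), int(match.group(2))
        if count < expected_osds:
            short[unit] = count
    return short


if __name__ == "__main__":
    lines = [
        "sas-ceph-osd/0* active idle 31 x.x.x.x Unit is ready (11 OSD)",
        "sas-ceph-osd/1 active idle 32 x.x.x.x Unit is ready (3 OSD)",
        "sas-ceph-osd/2 active idle 33 x.x.x.x Unit is ready (3 OSD)",
    ]
    print(short_units(lines, 11))  # {'sas-ceph-osd/1': 3, 'sas-ceph-osd/2': 3}
```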
lsblk shows that the required crypt and ceph devices have not been created on top of some of the bcache devices:
sdk 8:160 0 5.5T 0 disk
└─bcache5 252:640 0 5.5T 0 disk
└─crypt-
└─ceph-
sdi 8:128 0 5.5T 0 disk
└─bcache4 252:512 0 5.5T 0 disk
sdg 8:96 0 5.5T 0 disk
└─bcache9 252:1152 0 5.5T 0 disk
sde 8:64 0 5.5T 0 disk
└─bcache7 252:896 0 5.5T 0 disk
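The same check can be made from sysfs instead of reading lsblk output by eye. Each block device has a `holders/` directory listing the devices stacked on top of it, so a bcache device with an empty `holders/` has no crypt device on it. This is a minimal sketch, assuming the standard `/sys/block` layout. The helper name is hypothetical, and the sysfs root is a parameter only so the sketch can be exercised against a fake tree:

```python
import os


def bcache_without_holders(sys_block="/sys/block"):
    """List bcacheN devices that have nothing stacked on top of them.

    Every block device in sysfs has a holders/ directory with one entry per
    device (e.g. a dm-crypt mapping) built on top of it. An empty directory
    means no crypt device was created on that bcache device.
    """
    unused = []
    for name in sorted(os.listdir(sys_block)):  # lexicographic order
        if not name.startswith("bcache"):
            continue
        holders = os.path.join(sys_block, name, "holders")
        if os.path.isdir(holders) and not os.listdir(holders):
            unused.append(name)
    return unused
```

On the node above, this would be expected to list bcache4, bcache7 and bcache9 but not bcache5.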
The logs show:
root@osd-
2018-07-24 09:24:17 INFO juju-log secrets-
2018-07-24 09:24:18 DEBUG juju-log secrets-
root@osd-
2018-07-24 09:24:18 INFO juju-log secrets-
2018-07-24 09:24:19 DEBUG juju-log secrets-
but these devices do exist:
root@osd-
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 14 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 14 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
lrwxrwxrwx 1 root root 13 Jul 24 09:29 /dev/disk/
root@osd-
root@fnos-
Jul 24 08:39:25 fnos-sas01 kernel: [ 6.778545] bcache: bch_journal_
Jul 24 08:39:25 fnos-sas01 kernel: [ 6.778652] bcache: register_cache() registered cache device nvme0n1p2
Jul 24 08:39:25 fnos-sas01 kernel: [ 6.779003] bcache: register_bdev() registered backing device md3
Jul 24 08:39:25 fnos-sas01 kernel: [ 6.799664] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.134760] bcache: register_bdev() registered backing device sdj
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.135887] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.136131] bcache: register_bdev() registered backing device sdf
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.137269] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.137471] bcache: register_bdev() registered backing device sdc
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.138554] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.138743] bcache: register_bdev() registered backing device sdi
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.139625] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.139822] bcache: register_bdev() registered backing device sdk
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.141545] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.141887] bcache: register_bdev() registered backing device sdh
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.146830] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.147120] bcache: register_bdev() registered backing device sde
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.148142] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.148350] bcache: register_bdev() registered backing device sdd
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.149230] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.149464] bcache: register_bdev() registered backing device sdg
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.150392] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.241520] bcache: register_bdev() registered backing device sdl
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.242399] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.250523] bcache: register_bdev() registered backing device sdm
Jul 24 08:39:25 fnos-sas01 kernel: [ 7.251433] bcache: bch_cached_
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.102267] bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.134227] bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.140876] bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.141938] bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.142135] bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.143213] bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.144269] bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.146892] bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.151928] bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.152894] bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.166031] bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.193388] bcache: register_bcache() error /dev/nvme0n1p2: device already registered
Jul 24 08:39:25 fnos-sas01 kernel: [ 16.417310] bcache: register_bcache() error /dev/md3: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/md3: device already registered (emitting change event)
Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache() error /dev/nvme0n1p2: device already registered
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/md3: device already registered (emitting change event)
Jul 24 09:24:56 fnos-sas01 kernel: bcache: register_bcache() error /dev/nvme0n1p2: device already registered
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/nvme0n1p2: device already registered
Jul 24 09:25:19 fnos-sas01 kernel: bcache: register_bcache() error /dev/md3: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdi: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdf: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdh: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdk: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdl: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdg: device already registered (emitting change event)
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/nvme0n1p2: device already registered
Jul 24 09:25:37 fnos-sas01 kernel: bcache: register_bcache() error /dev/md3: device already registered (emitting change event)
Jul 24 09:29:57 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdc: device already registered (emitting change event)
Jul 24 09:29:57 fnos-sas01 kernel: bcache: register_bcache() error /dev/sde: device already registered (emitting change event)
Jul 24 09:29:57 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdd: device already registered (emitting change event)
Jul 24 09:29:57 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdj: device already registered (emitting change event)
Jul 24 09:29:57 fnos-sas01 kernel: bcache: register_bcache() error /dev/sdm: device already registered (emitting change event)
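Grouping the kernel messages above by timestamp shows how often the block devices were re-announced after boot. Each burst of `device already registered` messages most likely corresponds to one round of block-device events reaching the bcache registration rule. This is a minimal sketch, assuming syslog-formatted lines exactly as pasted above. `rounds_by_timestamp` is a hypothetical helper:

```python
import re
from collections import Counter

# Matches e.g. "Jul 24 09:24:32 fnos-sas01 kernel: bcache: register_bcache()
# error /dev/sdi: device already registered (emitting change event)"
ALREADY_REGISTERED_RE = re.compile(
    r"^(\w{3} +\d+ [\d:]{8}) \S+ kernel: .*register_bcache\(\) error "
    r"/dev/\S+: device already registered"
)


def rounds_by_timestamp(log_lines):
    """Count 'device already registered' messages per syslog timestamp."""
    counts = Counter()
    for line in log_lines:
        match = ALREADY_REGISTERED_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Applied to the full log, every burst after boot (09:24:32, 09:24:56, 09:25:19, 09:25:37, 09:29:57) covers the registered bcache devices again.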
Our suspicion is that the symlinks in /dev/disk/by-dname are recreated every time the charm runs udevadm trigger, and that this leads to a race condition: if a symlink has not been recreated by the time the charm tries to create an OSD from it, that OSD is not created.
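The suspected race can be reproduced in miniature without udev. One thread plays udev, removing a symlink and re-creating it a little later. The main thread plays the charm and checks for the link once. This is a minimal Python sketch, assuming (as suspected above) that the link is briefly absent while it is recreated. The paths and helper names are hypothetical, and `wait_for_path` only illustrates a polling mitigation; it is not what the charm does:

```python
import os
import tempfile
import threading
import time


def recreate_symlink(link, target, delay, removed):
    """Play udev: drop the by-dname link, signal it, then re-create it later."""
    os.unlink(link)
    removed.set()
    time.sleep(delay)
    os.symlink(target, link)


def check_once(path):
    """Play the charm: a single existence check, bailing if it fails."""
    return os.path.exists(path)


def wait_for_path(path, timeout, interval=0.05):
    """Hypothetical mitigation: poll until the path appears or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(interval)
    return os.path.exists(path)


def demo(delay=0.5, timeout=3.0):
    """Return (bailed, recovered) for one simulated race."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "bcache2-device")
        open(target, "w").close()
        link = os.path.join(tmp, "bcache2")
        os.symlink(target, link)

        removed = threading.Event()
        udev = threading.Thread(
            target=recreate_symlink, args=(link, target, delay, removed)
        )
        udev.start()
        removed.wait()  # the check below lands while the link is missing
        bailed = not check_once(link)
        recovered = wait_for_path(link, timeout)
        udev.join()
        return bailed, recovered
```

A single check made inside the window bails, while a bounded poll sees the link once it is back.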
Example of how this manifests itself:
unit-sas-ceph-osd-1.log:2018-07-24 09:24:17 INFO juju-log secrets-storage:224: Path /dev/disk/by-dname/bcache2 does not exist - bailing
stat /dev/disk/by-dname/bcache2 | grep Modify
Modify: 2018-07-24 09:29:57.220399555 +0000
/dev/disk/by-dname/bcache has clearly been recreated after a cold plug (a cold plug is what happens when systemd-udevd starts replaying uevents from the initramfs stage).
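To see which by-dname links were recreated and when, the modification time of each link itself can be collected. Plain `stat` (without `-L`) reports on the symlink rather than its target, and `os.lstat()` does the same. This is a minimal sketch, assuming the links live in /dev/disk/by-dname. The helper name is hypothetical:

```python
import os
from datetime import datetime, timezone


def symlink_mtimes(directory="/dev/disk/by-dname"):
    """Return {link name: modification time (UTC)} for symlinks in a directory.

    os.lstat() reports on the link itself, not its target, matching what
    `stat` without -L prints for a symlink. Non-symlink entries are skipped.
    """
    times = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.islink(path):
            st = os.lstat(path)
            times[name] = datetime.fromtimestamp(st.st_mtime, tz=timezone.utc)
    return times
```

In the report, bcache2's link time (09:29:57) is 5 min 40 s after the 09:24:17 "does not exist - bailing" message. It matches the timestamp of the last burst of kernel change events.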
vaultlocker effectively triggers a cold plug for all block devices via udevadm:
https://github.com/openstack-charmers/vaultlocker/blob/1.0.2/vaultlocker/dmcrypt.py#L89-L119
command = [
    'udevadm',
    'trigger',
    '--subsystem-match=block',
    '--action=add'
]
http://man7.org/linux/man-pages/man8/udevadm.8.html
udevadm trigger [options] [devpath|file...]
Request device events from the kernel. Primarily used to replay
events at system coldplug time.
-c, --action=ACTION
Type of event to be triggered. The default value is change.
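Because `--subsystem-match=block` selects every block device, each call replays add events for all disks, bcache devices and their by-dname links, not just the device being encrypted. One plausible narrowing restricts the event to a single device with `--name-match`. This sketch assumes a udevadm recent enough to support that option. The helper names are hypothetical, and the change actually released in vaultlocker may differ:

```python
import subprocess


def udevadm_trigger_command(device=None):
    """Build a `udevadm trigger --action=add` command line.

    With no device this reproduces the command quoted above: every block
    device gets an add event. With a device path, --name-match limits the
    event to that one device.
    """
    command = ["udevadm", "trigger"]
    if device is None:
        command.append("--subsystem-match=block")
    else:
        command.append("--name-match={}".format(device))
    command.append("--action=add")
    return command


def rescan(device=None, run=subprocess.check_output):
    """Run the trigger; `run` is injectable so the sketch is testable without udev."""
    return run(udevadm_trigger_command(device))
```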
settle is used, but only for a specific LUKS device (it stops waiting as soon as that device's by-uuid link exists):
https://github.com/openstack-charmers/vaultlocker/blob/1.0.2/vaultlocker/dmcrypt.py#L105-L119
command = [
    'udevadm',
    'settle',
    '--exit-if-exists=/dev/disk/by-uuid/{}'.format(uuid),
]
udevadm settle [options]
Watches the udev event queue, and exits if all current events are
handled.
-E, --exit-if-exists=FILE
Stop waiting if file exists.
This makes it a good candidate for a race when a node has a lot of work to do. About 3 of our 18 nodes hit this condition repeatedly.
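With `--exit-if-exists`, settle returns as soon as the one by-uuid link appears, even while events for other devices (including the by-dname links) are still queued. Settling without it waits until the whole event queue is empty or the timeout expires. This minimal sketch builds both forms of the command. `settle_all` and its 120-second timeout are illustrative assumptions, not vaultlocker's API:

```python
import subprocess


def udevadm_settle_command(exit_if_exists=None, timeout=None):
    """Build a `udevadm settle` command line.

    exit_if_exists: stop waiting as soon as this file exists, even if other
    events are still queued (the form quoted above).
    timeout: maximum seconds to wait for the queue to become empty.
    """
    command = ["udevadm", "settle"]
    if timeout is not None:
        command.append("--timeout={}".format(timeout))
    if exit_if_exists is not None:
        command.append("--exit-if-exists={}".format(exit_if_exists))
    return command


def settle_all(timeout=120, run=subprocess.check_output):
    """Hypothetical: wait for the whole udev queue, not just one device."""
    return run(udevadm_settle_command(timeout=timeout))
```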