Hosts randomly 'losing' disks, breaking ceph-osd service enumeration

Bug #1828617 reported by Andrey Grebennikov on 2019-05-10
Affects (importance, assignee):

Ubuntu Cloud Archive (status tracked in Stein):
  Queens: High, James Page
  Rocky: High, James Page
  Stein: High, James Page
  Train: High, James Page

ceph (Ubuntu) (status tracked in Eoan):
  Bionic: High, James Page
  Disco: High, James Page
  Eoan: High, James Page

Bug Description

[Impact]
For deployments where the bluestore DB and WAL devices are LVs on separate underlying devices, it's possible on reboot that those LVs have not yet been scanned and detected; the OSD boot process ignores this and tries to start the OSD as soon as the primary LV backing the OSD is detected, so the OSD crashes because the required block device symlinks are not present.

[Test Case]
Deploy ceph with bluestore + separate DB and WAL devices.
Reboot the servers.
OSD devices will fail to start after reboot (it's a race, so not always).

[Regression Potential]
Low - the fix has landed upstream and simply ensures that if a separate LV is expected for the DB or WAL device of an OSD, the OSD will not try to start until it is present.

[Original Bug Report]
Ubuntu 18.04.2 Ceph deployment.

Ceph OSD devices use LVM volumes pointing at udev-named physical devices.
LVM is supposed to create PVs from devices using the links in the /dev/disk/by-dname/ folder, which are created by udev.
However, on reboot it sometimes happens (it looks like a race condition) that the Ceph services cannot start and pvdisplay doesn't show any volumes. By the end of the boot process, however, /dev/disk/by-dname/ contains all the necessary devices.

The behaviour can be fixed manually by running "/sbin/lvm pvscan --cache --activate ay /dev/nvme0n1" to re-activate the LVM components, after which the services can be started.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in systemd (Ubuntu):
status: New → Confirmed
David A. Desrosiers (setuid) wrote :

This manifests itself as the following, as reported by lsblk(1). Note the missing Ceph LVM volume on the sixth NVMe disk (nvme5n1):

$ cat sos_commands/block/lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 1.8T 0 disk
|-sda1 8:1 0 512M 0 part /boot/efi
`-sda2 8:2 0 1.8T 0 part
  |-foobar--vg-root 253:0 0 1.8T 0 lvm /
  `-foobar--vg-swap_1 253:1 0 976M 0 lvm [SWAP]
nvme0n1 259:0 0 1.8T 0 disk
`-ceph--c576f63e--dfd4--48f7--9d60--6a7708cbccf6-osd--block--9fdd78b2--0745--47ae--b8d4--04d9803ab448 253:6 0 1.8T 0 lvm
nvme1n1 259:1 0 1.8T 0 disk
`-ceph--6eb6565f--6392--44a8--9213--833b09f7c0bc-osd--block--a7d3629c--724f--4218--9d15--593ec64781da 253:5 0 1.8T 0 lvm
nvme2n1 259:2 0 1.8T 0 disk
`-ceph--c14f9ee5--90d0--4306--9b18--99576516f76a-osd--block--bbf5bc79--edea--4e43--8414--b5140b409397 253:4 0 1.8T 0 lvm
nvme3n1 259:3 0 1.8T 0 disk
`-ceph--a821146b--7674--4bcc--b5e9--0126c4bd5e3b-osd--block--b9371499--ff99--4d3e--ab3f--62ec3cf918c4 253:3 0 1.8T 0 lvm
nvme4n1 259:4 0 1.8T 0 disk
`-ceph--2e39f75a--5d2a--49ee--beb1--5d0a2991fd6c-osd--block--a1be083e--1fa7--4397--acfa--2ff3d3491572 253:2 0 1.8T 0 lvm
nvme5n1 259:5 0 1.8T 0 disk

Xav Paice (xavpaice) on 2019-05-22
tags: added: canonical-bootstack
Xav Paice (xavpaice) wrote :

I'm seeing this in a slightly different manner, on Bionic/Queens.

We have encrypted LVs (thanks, Vault), and rebooting a host fairly consistently results in at least one OSD not coming back. The LVs appear in the list; the difference between a working and a non-working OSD is that the non-working one lacks the block.db and block.wal symlinks.

See https://pastebin.canonical.com/p/rW3VgMMkmY/ for some info.

If I made the links manually:

cd /var/lib/ceph/osd/ceph-4
ln -s /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-db-7478edfc-f321-40a2-a105-8e8a2c8ca3f6 block.db
ln -s /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-wal-7478edfc-f321-40a2-a105-8e8a2c8ca3f6 block.wal

This resulted in a permissions error accessing the device: "bluestore(/var/lib/ceph/osd/ceph-4) _open_db /var/lib/ceph/osd/ceph-4/block.db symlink exists but target unusable: (13) Permission denied"

ls -l /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/
total 0
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-db-053e000a-76ed-427e-98b3-e5373e263f2d -> ../dm-20
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-db-12e68fcb-d2b6-459f-97f2-d3eb4e28c75e -> ../dm-24
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-db-33de740d-bd8c-4b47-a601-3e6e634e489a -> ../dm-14
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-db-7478edfc-f321-40a2-a105-8e8a2c8ca3f6 -> ../dm-12
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-db-c2669da2-63aa-42e2-b049-cf00a478e076 -> ../dm-22
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-db-d38a7e91-cf06-4607-abbe-53eac89ac5ea -> ../dm-18
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-db-eb5270dc-1110-420f-947e-aab7fae299c9 -> ../dm-16
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-wal-053e000a-76ed-427e-98b3-e5373e263f2d -> ../dm-19
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-wal-12e68fcb-d2b6-459f-97f2-d3eb4e28c75e -> ../dm-23
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-wal-33de740d-bd8c-4b47-a601-3e6e634e489a -> ../dm-13
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-wal-7478edfc-f321-40a2-a105-8e8a2c8ca3f6 -> ../dm-11
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-wal-c2669da2-63aa-42e2-b049-cf00a478e076 -> ../dm-21
lrwxrwxrwx 1 root root 8 May 22 23:04 osd-wal-d38a7e91-cf06-4607-abbe-53eac89ac5ea -> ../dm-17
lrwxrwxrwx 1 ceph ceph 8 May 22 23:04 osd-wal-eb5270dc-1110-420f-947e-aab7fae299c9 -> ../dm-15

I tried changing the perms to ceph:ceph ownership, but no change.

I have also tried (using `systemctl edit lvm2-monitor.service`) adding the following to lvm2, but that's not changed the behavior either:

# cat /etc/systemd/system/lvm2-monitor.service.d/override.conf
[Service]
ExecStartPre=/bin/sleep 60

Xav Paice (xavpaice) wrote :

Added field-critical; there's a cloud deploy ongoing where I currently can't reboot any hosts, nor get some of the OSDs back from a host I rebooted, until we have a workaround.

Xav Paice (xavpaice) wrote :

Just one update: if I change the perms of the symlinks I made (chown -h), the OSD will actually start.

After rebooting, however, I found that the links I had made had gone again and the whole process needed repeating in order to start the OSD.
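The manual workaround described in the comments above can be collected into a small helper. This is an illustrative sketch only; the LV paths and OSD id are examples from this report and will differ per OSD, and OSD_BASE is a hypothetical override so the function can be exercised outside /var/lib/ceph:

```shell
# Workaround sketch: recreate the missing block.db/block.wal symlinks for one
# OSD and fix the ownership of the links themselves (chown -h, as noted above).
fix_osd_links() {
    local osd_id=$1 db_lv=$2 wal_lv=$3 owner=${4:-ceph:ceph}
    local osd_dir="${OSD_BASE:-/var/lib/ceph/osd}/ceph-$osd_id"
    ln -sfn "$db_lv"  "$osd_dir/block.db"
    ln -sfn "$wal_lv" "$osd_dir/block.wal"
    # the symlink itself must be ceph-owned, hence -h (operates on the link)
    chown -h "$owner" "$osd_dir/block.db" "$osd_dir/block.wal"
}
```

After relinking, the OSD can be started with systemctl; as the comment above notes, the links vanish again on reboot, so this is a stop-gap only.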

Steve Langasek (vorlon) wrote :

> LVM module is supposed to create PVs from devices using the links in /dev/disk/by-dname/
> folder that are created by udev.

Created by udev how? disk/by-dname is not part of the hierarchy that is populated by the standard udev rules, nor is this created by lvm2. Is there something in the ceph-osd packaging specifically which generates these links - and, in turn, depends on them for assembling LVs?

Can you provide udev logs (journalctl --no-pager -lu systemd-udevd.service; udevadm info -e) from the system following a boot when this race is hit?

Changed in systemd (Ubuntu):
status: Confirmed → Incomplete

Steve,
It is MAAS that creates these udev rules. We requested this feature so that we could use persistent names in further service configuration (using templating). We couldn't go with /dev/sdX names as they may change after a reboot, and can't use wwn names as they are unique per node and don't allow us to use templates with FCB.

James Page (james-page) wrote :

by-dname udev rules are created by MAAS/curtin as part of the server install I think.

James Page (james-page) wrote :

The ceph-osd package provides udev rules which should switch the owner of all ceph-related LVM VGs to ceph:ceph.

# OSD LVM layout example
# VG prefix: ceph-
# LV prefix: osd-
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="disk", \
  ENV{DM_LV_NAME}=="osd-*", \
  ENV{DM_VG_NAME}=="ceph-*", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660"
ACTION=="change", SUBSYSTEM=="block", \
  ENV{DEVTYPE}=="disk", \
  ENV{DM_LV_NAME}=="osd-*", \
  ENV{DM_VG_NAME}=="ceph-*", \
  OWNER="ceph", GROUP="ceph", MODE="660"
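For illustration, the rule's match conditions can be restated as shell globs; this hypothetical helper is not how udev evaluates rules, it only mirrors the two patterns the rules match on:

```shell
# Illustrative restatement of the udev rule's match conditions: a device-mapper
# device is claimed for ceph when its VG name matches "ceph-*" and its LV name
# matches "osd-*" (the same globs used by DM_VG_NAME/DM_LV_NAME above).
is_ceph_osd_lv() {
    local vg=$1 lv=$2
    case "$vg" in ceph-*) ;; *) return 1 ;; esac
    case "$lv" in osd-*) return 0 ;; *) return 1 ;; esac
}
```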

Corey Bryant (corey.bryant) wrote :

This feels similar to https://bugs.launchpad.net/charm-ceph-osd/+bug/1812925. First question, are you running with the latest stable charms which have the fix for that bug?

James Page (james-page) wrote :

Please can you confirm which version of the ceph-osd package you have installed; older versions rely on a charm shipped udev ruleset, rather than it being provided by the packaging.

Yes, it is the latest - the cluster is being re-deployed as part of a Bootstack handover.

Corey,
The bug you point to fixes the sequencing of ceph/udev. Here, however, udev can't create any devices because they don't exist at the time udev runs - when the host boots and settles down, no PVs exist at all.

Corey Bryant (corey.bryant) wrote :

Andrey, I don't know if you saw James' comment as yours may have coincided but if you can get the ceph-osd package version that would be helpful. Thanks!

Xav Paice (xavpaice) wrote :

Charm is cs:ceph-osd-284
Ceph version is 12.2.11-0ubuntu0.18.04.2

The udev rules are created by curtin during the maas install.

Here's an example udev rule:

cat bcache4.rules

# Written by curtin
SUBSYSTEM=="block", ACTION=="add|change", ENV{CACHED_UUID}=="7b0e872b-ac78-4c4e-af18-8ccdce5962f6", SYMLINK+="disk/by-dname/bcache4"

The problem here is that when the host boots, for some OSDs (random, changes each boot), there's no symlinks for block.db and block.wal in /var/lib/ceph/osd/ceph-${thing}. If I manually create those two symlinks (and make sure the perms are right for the links themselves), then the OSD starts.

Some of the OSDs do get those links though, which is interesting because on these hosts the ceph WAL and DB for all the OSDs are LVs on the same NVMe device, in fact the same partition even. The ceph OSD block device is an LV on a different device.

Changed in systemd (Ubuntu):
status: Incomplete → New
Xav Paice (xavpaice) wrote :
Download full text (11.3 KiB)

journalctl --no-pager -lu systemd-udevd.service >/tmp/1828617-1.out

Hostname obfuscated

lsblk:

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0 7:0 0 88.4M 1 loop /snap/core/6964
loop1 7:1 0 89.4M 1 loop /snap/core/6818
loop2 7:2 0 8.4M 1 loop /snap/canonical-livepatch/77
sda 8:0 0 1.8T 0 disk
├─sda1 8:1 0 476M 0 part /boot/efi
├─sda2 8:2 0 3.7G 0 part /boot
└─sda3 8:3 0 1.7T 0 part
  └─bcache7 252:896 0 1.7T 0 disk /
sdb 8:16 0 1.8T 0 disk
└─bcache0 252:0 0 1.8T 0 disk
sdc 8:32 0 1.8T 0 disk
└─bcache6 252:768 0 1.8T 0 disk
  └─crypt-7478edfc-f321-40a2-a105-8e8a2c8ca3f6 253:0 0 1.8T 0 crypt
    └─ceph--7478edfc--f321--40a2--a105--8e8a2c8ca3f6-osd--block--7478edfc--f321--40a2--a105--8e8a2c8ca3f6 253:2 0 1.8T 0 lvm
sdd 8:48 0 1.8T 0 disk
└─bcache4 252:512 0 1.8T 0 disk
  └─crypt-33de740d-bd8c-4b47-a601-3e6e634e489a 253:4 0 1.8T 0 crypt
    └─ceph--33de740d--bd8c--4b47--a601--3e6e634e489a-osd--block--33de740d--bd8c--4b47--a601--3e6e634e489a 253:5 0 1.8T 0 lvm
sde 8:64 0 1.8T 0 disk
└─bcache3 252:384 0 1.8T 0 disk
  └─crypt-eb5270dc-1110-420f-947e-aab7fae299c9 253:1 ...

Xav Paice (xavpaice) wrote :
Download full text (4.7 KiB)

udevadm info -e >/tmp/1828617-2.out

~# ls -l /var/lib/ceph/osd/ceph*
-rw------- 1 ceph ceph 69 May 21 08:44 /var/lib/ceph/osd/ceph.client.osd-upgrade.keyring

/var/lib/ceph/osd/ceph-11:
total 24
lrwxrwxrwx 1 ceph ceph 93 May 28 22:12 block -> /dev/ceph-33de740d-bd8c-4b47-a601-3e6e634e489a/osd-block-33de740d-bd8c-4b47-a601-3e6e634e489a
-rw------- 1 ceph ceph 37 May 28 22:12 ceph_fsid
-rw------- 1 ceph ceph 37 May 28 22:12 fsid
-rw------- 1 ceph ceph 56 May 28 22:12 keyring
-rw------- 1 ceph ceph 6 May 28 22:12 ready
-rw------- 1 ceph ceph 10 May 28 22:12 type
-rw------- 1 ceph ceph 3 May 28 22:12 whoami

/var/lib/ceph/osd/ceph-18:
total 24
lrwxrwxrwx 1 ceph ceph 93 May 28 22:12 block -> /dev/ceph-eb5270dc-1110-420f-947e-aab7fae299c9/osd-block-eb5270dc-1110-420f-947e-aab7fae299c9
lrwxrwxrwx 1 ceph ceph 94 May 28 22:12 block.db -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-db-eb5270dc-1110-420f-947e-aab7fae299c9
lrwxrwxrwx 1 ceph ceph 95 May 28 22:12 block.wal -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-wal-eb5270dc-1110-420f-947e-aab7fae299c9
-rw------- 1 ceph ceph 37 May 28 22:12 ceph_fsid
-rw------- 1 ceph ceph 37 May 28 22:12 fsid
-rw------- 1 ceph ceph 56 May 28 22:12 keyring
-rw------- 1 ceph ceph 6 May 28 22:12 ready
-rw------- 1 ceph ceph 10 May 28 22:12 type
-rw------- 1 ceph ceph 3 May 28 22:12 whoami

/var/lib/ceph/osd/ceph-24:
total 24
lrwxrwxrwx 1 ceph ceph 93 May 28 22:12 block -> /dev/ceph-d38a7e91-cf06-4607-abbe-53eac89ac5ea/osd-block-d38a7e91-cf06-4607-abbe-53eac89ac5ea
-rw------- 1 ceph ceph 37 May 28 22:12 ceph_fsid
-rw------- 1 ceph ceph 37 May 28 22:12 fsid
-rw------- 1 ceph ceph 56 May 28 22:12 keyring
-rw------- 1 ceph ceph 6 May 28 22:12 ready
-rw------- 1 ceph ceph 10 May 28 22:12 type
-rw------- 1 ceph ceph 3 May 28 22:12 whoami

/var/lib/ceph/osd/ceph-31:
total 24
lrwxrwxrwx 1 ceph ceph 93 May 28 22:12 block -> /dev/ceph-053e000a-76ed-427e-98b3-e5373e263f2d/osd-block-053e000a-76ed-427e-98b3-e5373e263f2d
lrwxrwxrwx 1 ceph ceph 94 May 28 22:12 block.db -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-db-053e000a-76ed-427e-98b3-e5373e263f2d
lrwxrwxrwx 1 ceph ceph 95 May 28 22:12 block.wal -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-wal-053e000a-76ed-427e-98b3-e5373e263f2d
-rw------- 1 ceph ceph 37 May 28 22:12 ceph_fsid
-rw------- 1 ceph ceph 37 May 28 22:12 fsid
-rw------- 1 ceph ceph 56 May 28 22:12 keyring
-rw------- 1 ceph ceph 6 May 28 22:12 ready
-rw------- 1 ceph ceph 10 May 28 22:12 type
-rw------- 1 ceph ceph 3 May 28 22:12 whoami

/var/lib/ceph/osd/ceph-38:
total 24
lrwxrwxrwx 1 ceph ceph 93 May 28 22:12 block -> /dev/ceph-c2669da2-63aa-42e2-b049-cf00a478e076/osd-block-c2669da2-63aa-42e2-b049-cf00a478e076
lrwxrwxrwx 1 ceph ceph 94 May 28 22:12 block.db -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-db-c2669da2-63aa-42e2-b049-cf00a478e076
lrwxrwxrwx 1 ceph ceph 95 May 28 22:12 block.wal -> /dev/ceph-wal-4de27554-2d05-440e-874a-9921dfc6f47e/osd-wal-c2669da2-63aa-42e2-b049-cf00a478e076
-rw------- 1 ceph ceph 37 May 28 22:12 ceph_fsid
-rw------- 1 ceph ceph 37 May 28 22:12 fsid
-rw------- 1 ceph ceph 56 May 28 22:12 keyring
-rw------- 1 ceph ceph ...


Corey Bryant (corey.bryant) wrote :

Thanks for all the details.

I need to confirm this but I think the block.db and block.wal symlinks are created as a result of 'ceph-volume lvm prepare --bluestore --data <device> --block.wal <wal-device> --block.db <db-device>'.

That's coded in the ceph-osd charm around here: https://opendev.org/openstack/charm-ceph-osd/src/branch/master/lib/ceph/utils.py#L1558

Can you confirm that the symlinks are ok prior to reboot? I'd like to figure out if they are correctly set up by the charm initially.
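A quick way to spot OSDs in the broken state (a block.db/block.wal symlink that exists but points at an absent device) is a loop like the following. This is a diagnostic sketch; the base directory is a parameter only so the check can be pointed at a test tree:

```shell
# Diagnostic sketch: list OSD data dirs whose block.db/block.wal symlinks
# are dangling (link present, target absent). Default base is the real path.
check_osd_links() {
    local base=${1:-/var/lib/ceph/osd} d f
    for d in "$base"/ceph-*/; do
        [ -d "$d" ] || continue
        for f in block.db block.wal; do
            if [ -L "$d$f" ] && [ ! -e "$d$f" ]; then
                echo "${d%/}: $f symlink dangling"
            fi
        done
    done
}
```

Note that OSDs deployed without separate DB/WAL devices legitimately have no such links at all, so only dangling links are reported, not missing ones.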

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

affects: systemd (Ubuntu) → ceph (Ubuntu)
Changed in ceph (Ubuntu):
status: New → Confirmed
Corey Bryant (corey.bryant) wrote :
Download full text (3.6 KiB)

I didn't recreate this but I did get a deployment on serverstack with bluestore WAL and DB devices. That's done with:

1) juju deploy --series bionic --num-units 1 --constraints mem=2G --config expected-osd-count=1 --config monitor-count=1 cs:ceph-mon ceph-mon

2) juju deploy --series bionic --num-units 1 --constraints mem=2G --storage osd-devices=cinder,10G --storage bluestore-wal=cinder,1G --storage bluestore-db=cinder,1G cs:ceph-osd ceph-osd

3) juju add-relation ceph-osd ceph-mon

James Page mentioned taking a look at the systemd bits.

ceph-osd systemd unit
---------------------
/lib/systemd/system/ceph-osd@.service calls:
ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i

Where /usr/lib/ceph/ceph-osd-prestart.sh has some logic that exits with an error code when certain things aren't ready. I think we might be able to add something in there. For example it currently has:

data="/var/lib/ceph/osd/${cluster:-ceph}-$id"

if [ -L "$journal" -a ! -e "$journal" ]; then
    udevadm settle --timeout=5 || :
    if [ -L "$journal" -a ! -e "$journal" ]; then
        echo "ceph-osd(${cluster:-ceph}-$id): journal not present, not starting yet." 1>&2
        exit 0
    fi
fi

The 'udevadm settle' watches the udev event queue and exits once all current events are handled, or after the 5-second timeout. Perhaps we can do something similar for this issue.
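That idea, applied to the bluestore devices, would mirror the journal check quoted above. The following is an illustrative sketch of such an addition, with a hypothetical function name; it is not the patch that was eventually landed:

```shell
# Sketch: defer OSD start while a bluestore symlink points at an absent
# device, mirroring the existing journal check in ceph-osd-prestart.sh.
wait_for_bluestore_devs() {
    local data=$1 dev
    for dev in "$data/block.db" "$data/block.wal"; do
        if [ -L "$dev" ] && [ ! -e "$dev" ]; then
            # give udev a chance to drain its event queue, then re-check
            command -v udevadm >/dev/null 2>&1 && udevadm settle --timeout=5 || :
            if [ -L "$dev" ] && [ ! -e "$dev" ]; then
                echo "$dev: symlink exists but target not present, not starting yet." 1>&2
                return 1
            fi
        fi
    done
    return 0
}
```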

Here's what I see in /var/log/ceph/ceph-osd.0.log during a system reboot:
-------------------------------------------------------------------------
2019-05-29 19:04:25.800237 7fa6940d1700 1 freelist shutdown
...
2019-05-29 19:04:25.800548 7fa6940d1700 1 bdev(0x557eca7a1680 /var/lib/ceph/osd/ceph-0/block.wal) close
2019-05-29 19:04:26.079227 7fa6940d1700 1 bdev(0x557eca7a1200 /var/lib/ceph/osd/ceph-0/block.db) close
2019-05-29 19:04:26.266085 7fa6940d1700 1 bdev(0x557eca7a1440 /var/lib/ceph/osd/ceph-0/block) close
2019-05-29 19:04:26.474086 7fa6940d1700 1 bdev(0x557eca7a0fc0 /var/lib/ceph/osd/ceph-0/block) close
...
2019-05-29 19:04:53.601570 7fdd2ec17e40 1 bdev create path /var/lib/ceph/osd/ceph-0/block.db type kernel
2019-05-29 19:04:53.601581 7fdd2ec17e40 1 bdev(0x561e50583200 /var/lib/ceph/osd/ceph-0/block.db) open path /var/lib/ceph/osd/ceph-0/block.db
2019-05-29 19:04:53.601855 7fdd2ec17e40 1 bdev(0x561e50583200 /var/lib/ceph/osd/ceph-0/block.db) open size 1073741824 (0x40000000, 1GiB) block_size 4096 (4KiB) rotational
2019-05-29 19:04:53.601867 7fdd2ec17e40 1 bluefs add_block_device bdev 1 path /var/lib/ceph/osd/ceph-0/block.db size 1GiB
2019-05-29 19:04:53.602131 7fdd2ec17e40 1 bdev create path /var/lib/ceph/osd/ceph-0/block type kernel
2019-05-29 19:04:53.602143 7fdd2ec17e40 1 bdev(0x561e50583440 /var/lib/ceph/osd/ceph-0/block) open path /var/lib/ceph/osd/ceph-0/block
2019-05-29 19:04:53.602464 7fdd2ec17e40 1 bdev(0x561e50583440 /var/lib/ceph/osd/ceph-0/block) open size 10733223936 (0x27fc00000, 10.0GiB) block_size 4096 (4KiB) rotational
2019-05-29 19:04:53.602480 7fdd2ec17e40 1 bluefs add_block_device bdev 2 path /var/lib/ceph/osd/ceph-0/block size 10.0GiB
2019-05-29 19:04:53.602499 7fdd2ec17e40 1 bdev create path /var/lib/ceph/osd/ceph-0/block.wal type kerne...


Corey Bryant (corey.bryant) wrote :

Couple typos in comment #19:
I think bluestore-wal and bluestore-db needed 2G.
Also s/exists/exits

Corey Bryant (corey.bryant) wrote :

I'm building a test package for ceph with additional logic added to /usr/lib/ceph/ceph-osd-prestart.sh to allow block.wal and block.db additional time to settle. This is just a version to test the fix. I'm not sure if the behavior is the same as journal file (symlink exists but file doesn't) but that's what I have in this change. Here's the PPA: https://launchpad.net/~corey.bryant/+archive/ubuntu/bionic-queens-1828617/+packages

Xav, Any chance you could try this out once it builds?

Xav Paice (xavpaice) wrote :

Thanks, will do. FWIW, the symlinks are in place before reboot.

Wouter van Bommel (woutervb) wrote :

Hi,

Installed the packages from the above PPA and rebooted the host, and 4 out of 7 OSDs came up. The 3 that were missing from `ceph osd tree` were not running the OSD daemon as they lacked the symlinks to the DB and the WAL.

Rebooted the server, and after the reboot other OSDs (again 3 out of 7) failed to start due to missing symlinks. This time it was a different set of OSDs. So the issue is not fixed with the debs in the PPA.

Regards,
Wouter

Corey Bryant (corey.bryant) wrote :

@Wouter, Thanks for testing. I'm rebuilding the package without the checks as they're probably preventing the udevadm settle from running. In the new build the 'udevadm settle --timeout=5' will run regardless. Let's see if that helps and then we can fine tune the checks surrounding the call later. Would you mind trying again once that builds (same PPA)?

Corey Bryant (corey.bryant) wrote :

@Wouter, since ceph takes so long to build you could also manually add 'udevadm settle --timeout=5' to /usr/lib/ceph/ceph-osd-prestart.sh across the ceph-osd units to test that.

James Page (james-page) wrote :

The ceph-volume tool assembles and primes the OSD directory using the LV tags written during the prepare action - it would be good to validate these are OK with 'sudo lvs -o lv_tags'

The tags will contain UUID information about all of the block devices associated with an OSD.
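The lv_tags output is a comma-separated list of key=value pairs, so a single field can be pulled out with a small helper. This is a hypothetical parsing sketch; the ceph.db_device/ceph.wal_device keys are those referenced in this report:

```shell
# Parsing sketch: extract one tag value from an lv_tags string
# (a comma-separated key=value list, e.g. from 'lvs --noheadings -o lv_tags').
lv_tag_value() {
    local tags=$1 key=$2
    printf '%s\n' "$tags" | tr ',' '\n' | sed -n "s/^$key=//p"
}
```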

James Page (james-page) wrote :

Any output in /var/log/ceph/ceph-volume-systemd.log would also be useful

James Page (james-page) wrote :

Some further references:

Each part of the OSD is queried for its underlying block device using blkid:

  https://github.com/ceph/ceph/blob/luminous/src/ceph-volume/ceph_volume/devices/lvm/activate.py#L114

I guess that if the block device was not visible/present at the point that code runs during activate, then the symlink for the block.db or block.wal devices would not be created, causing the OSD to fail to start.

Corey Bryant (corey.bryant) wrote :

Note that there may only be a short window during system startup to catch missing tags with 'sudo lvs -o lv_tags'.

Wouter van Bommel (woutervb) wrote :

Hi,

Added 'udevadm settle --timeout=5' to both of the two remaining if blocks in the referenced script. That did not make a difference.

See https://pastebin.ubuntu.com/p/8f2ZXMRNgv/ for the ceph-volume-systemd.log

At this boot, the osd's with numbers 4, 11 & 18 did not start, with the missing symlinks

Corey Bryant (corey.bryant) wrote :

Thanks for testing. That should rule out udev as the cause of the race.

A couple of observations from the log:

* There is a loop for each osd that calls 'ceph-volume lvm trigger' 30 times until the OSD is activated, for example for 4:
[2019-05-31 01:27:29,235][ceph_volume.process][INFO ] Running command: ceph-volume lvm trigger 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,435][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:35,530][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:35,531][systemd][WARNING] failed activating OSD, retries left: 30
[2019-05-31 01:27:44,122][ceph_volume.process][INFO ] stderr --> RuntimeError: could not find osd.4 with fsid 7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:27:44,174][systemd][WARNING] command returned non-zero exit status: 1
[2019-05-31 01:27:44,175][systemd][WARNING] failed activating OSD, retries left: 29
...

I wonder if we can have similar 'ceph-volume lvm trigger' calls for WAL and DB devices per OSD. Does that even make sense? Or perhaps another call with a similar goal. We should be able to determine if an OSD has a DB or WAL device from the lvm tags.

* The first 3 osd's that are activated are 18, 4, and 11 and they are the 3 that are missing block.db/block.wal symlinks. That's just more confirmation this is a race:
[2019-05-31 01:28:03,370][systemd][INFO ] successfully trggered activation for: 18-eb5270dc-1110-420f-947e-aab7fae299c9
[2019-05-31 01:28:12,354][systemd][INFO ] successfully trggered activation for: 4-7478edfc-f321-40a2-a105-8e8a2c8ca3f6
[2019-05-31 01:28:12,530][systemd][INFO ] successfully trggered activation for: 11-33de740d-bd8c-4b47-a601-3e6e634e489a

Corey Bryant (corey.bryant) wrote :

The 'ceph-volume lvm trigger' call appears to come from ceph source at src/ceph-volume/ceph_volume/systemd/main.py.

Corey Bryant (corey.bryant) wrote :

Upstream ceph bug opened: https://tracker.ceph.com/issues/40100

Corey Bryant (corey.bryant) wrote :

I've cherry-picked that patch to the package in the PPA if anyone can test. I'm fairly sure this will fix it as I've been testing and removing/adding the volume backed storage in my testing environment and it will wait for the wal/db devices for a while if they don't exist.

Changed in ceph (Ubuntu):
status: New → Triaged
importance: Undecided → Critical
assignee: nobody → Corey Bryant (corey.bryant)
Xav Paice (xavpaice) wrote :

After installing that PPA update and rebooting, the PV for the WAL didn't come online until I ran pvscan --cache. It seems a second reboot didn't do that though; it might have been a red herring from prior attempts.

Unfortunately, the OSDs didn't seem to come online in exactly the same way after installing the update.

Xav Paice (xavpaice) wrote :

Let me word that last comment differently.

I went to the host and installed the PPA update, then rebooted.

When the box booted up, the PV which hosts the WAL LVs wasn't listed in lsblk, 'pvs', or 'lvs'. I then ran pvscan --cache, which brought the LVs back online, but not the OSDs, so I rebooted.

After that reboot, the behavior of the OSDs was exactly the same as prior to the update - I reboot, and some OSDs don't come online, and are missing symlinks.

Corey Bryant (corey.bryant) wrote :

Do you have access to the /var/log/ceph/ceph-volume-systemd.log after the latest reboot? That should give us some details such as:

"[2019-05-31 20:43:44,334][systemd][WARNING] failed to find db volume, retries left: 17"

or similar for wal volume.

If you see that the retries have been exceeded in your case, you can tune them (the new loops use the same env vars):

http://docs.ceph.com/docs/mimic/ceph-volume/systemd/#failure-and-retries
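Per that documentation, the retry count and interval come from environment variables read by ceph-volume's systemd helper (CEPH_VOLUME_SYSTEMD_TRIES and CEPH_VOLUME_SYSTEMD_INTERVAL; the log above shows the default of 30 tries). They could be raised with a systemd drop-in along these lines (illustrative values):

```ini
# /etc/systemd/system/ceph-volume@.service.d/override.conf
[Service]
Environment=CEPH_VOLUME_SYSTEMD_TRIES=60
Environment=CEPH_VOLUME_SYSTEMD_INTERVAL=10
```

followed by a `systemctl daemon-reload`.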

As for the pvscan issue, I'm not sure if that is a ceph issue (?).

Xav Paice (xavpaice) wrote :

The pvscan issue is likely something different, just wanted to make sure folks are aware of it for completeness.

The logs /var/log/ceph/ceph-volume-systemd.log and ceph-volume.log are empty.

Corey Bryant (corey.bryant) wrote :

Any chance the log files got rotated and zipped? What does an ls of /var/log/ceph show?

Corey Bryant (corey.bryant) wrote :

I chatted with xav in IRC and he showed me a private link to the log files. The ceph-volume-systemd.log.1 had timestamps of 2019-06-03 which matches up with the last attempt (see comment #37).

I didn't find any logs from the new code in this log file. That likely means one of the following: there were no wal/db devices found in lvs tags (ie. 'sudo lvs -o lv_tags'), the new code isn't working, or the new code wasn't installed.

I added a few more logs to the patch to help understand better what's going on, and that's rebuilding in the PPA.

I'm attaching all the relevant code to show the log messages to look for.

Corey Bryant (corey.bryant) wrote :

Note that the code looks for wal/db devices in the block device's LV tags after it is found. In other words:

sudo lvs -o lv_tags | grep type=block | grep ceph.wal_device
sudo lvs -o lv_tags | grep type=block | grep ceph.db_device

This is the window where the following might not yet exist, yet we know they *should* exist based on the above tags:

sudo lvs -o lv_tags | grep type=wal
sudo lvs -o lv_tags | grep type=db
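The fix closes that window by polling: once the block device's tags say a wal/db device should exist, activation waits and re-checks until the corresponding LV shows up or retries run out. The shape of that loop, as a generic sketch rather than the upstream code:

```shell
# Generic retry loop of the kind visible in ceph-volume-systemd.log above:
# run a check until it succeeds, up to $1 attempts spaced $2 seconds apart.
wait_for() {
    local tries=$1 interval=$2; shift 2
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        echo "check failed, retries left: $tries" 1>&2
        if [ "$tries" -gt 0 ]; then sleep "$interval"; fi
    done
    return 1
}
```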

James Page (james-page) on 2019-06-11
Changed in ceph (Ubuntu):
importance: Critical → High
status: Triaged → In Progress
Corey Bryant (corey.bryant) wrote :

Py2 bug found in code review upstream. Updated PPA again with fix.

David A. Desrosiers (setuid) wrote :

Just adding that I've worked around this issue with the following added to the lvm2-monitor overrides (/etc/systemd/system/lvm2-monitor.service.d/custom.conf):

[Service]
ExecStartPre=/bin/sleep 60

This results in 100% success for every single boot, with no missed disks nor missed LVM volumes applied to those block devices.

We've also disabled NVMe multipathing on every Ceph storage node with the following in the /etc/d/g kernel boot args:

nvme_core.multipath=0

Note: this LP was cloned from an internal customer case where the customer's Ceph storage nodes were directly impacted by this issue, and this is the workaround currently deployed, until/unless we can find a consistent root cause for this issue in an upstream package.

Corey Bryant (corey.bryant) wrote :

@David, thanks for the update. We could really use some testing of the current proposed fix if you have a chance. That's in a PPA mentioned above. The new code will wait for wal/db devices to arrive and has env vars to adjust wait times - http://docs.ceph.com/docs/mimic/ceph-volume/systemd/#failure-and-retries.

As for the pvscan issue, I don't think that is related to ceph.

James Page (james-page) wrote :

Alternative fix proposed upstream - picking this in preference to Corey's fix as it's in the right part of the codebase for ceph-volume.

James Page (james-page) wrote :

Building in ppa:ci-train-ppa-service/3535 (will take a few hours).

James Page (james-page) wrote :
Changed in ceph (Ubuntu):
assignee: Corey Bryant (corey.bryant) → James Page (james-page)
James Page (james-page) on 2019-08-29
Changed in ceph (Ubuntu Bionic):
status: New → In Progress
Changed in ceph (Ubuntu Disco):
status: New → In Progress
assignee: nobody → James Page (james-page)
Changed in ceph (Ubuntu Bionic):
assignee: nobody → James Page (james-page)
importance: Undecided → High
Changed in ceph (Ubuntu Disco):
importance: Undecided → High
description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 14.2.2-0ubuntu2

---------------
ceph (14.2.2-0ubuntu2) eoan; urgency=medium

  [ Eric Desrochers ]
  * Ensure that daemons are not automatically restarted during package
    upgrades (LP: #1840347):
    - d/rules: Use "--no-restart-after-upgrade" and "--no-stop-on-upgrade"
      instead of "--no-restart-on-upgrade".
    - d/rules: Drop exclusion for ceph-[osd,mon,mds] for restarts.

  [ Jesse Williamson ]
  * d/p/civetweb-755-1.8-somaxconn-configurable*.patch: Backport changes
    to civetweb to allow tuning of SOMAXCONN in Ceph RADOS Gateway
    deployments (LP: #1838109).

  [ James Page ]
  * d/p/ceph-volume-wait-for-lvs.patch: Cherry pick inflight fix to
    ensure that required wal and db devices are present before
    activating OSD's (LP: #1828617).

  [ Steve Beattie ]
  * SECURITY UPDATE: RADOS gateway remote denial of service
    - d/p/CVE-2019-10222.patch: rgw: asio: check the remote endpoint
      before processing requests.
    - CVE-2019-10222

 -- James Page <email address hidden> Thu, 29 Aug 2019 13:54:25 +0100

Changed in ceph (Ubuntu Eoan):
status: In Progress → Fix Released
James Page (james-page) on 2019-08-30
no longer affects: cloud-archive/pike

Hello Andrey, or anyone else affected,

Accepted ceph into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/13.2.6-0ubuntu0.19.04.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco
Changed in ceph (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Łukasz Zemczak (sil2100) wrote :

Hello Andrey, or anyone else affected,

Accepted ceph into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/12.2.12-0ubuntu0.18.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

James Page (james-page) wrote :

Hello Andrey, or anyone else affected,

Accepted ceph into rocky-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:rocky-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-rocky-needed to verification-rocky-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-rocky-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-rocky-needed
James Page (james-page) wrote :

Hello Andrey, or anyone else affected,

Accepted ceph into stein-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:stein-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-stein-needed to verification-stein-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-stein-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-stein-needed