cephadm does not work with zfs root

Bug #1881747 reported by Bryant G Ly
This bug affects 5 people
Affects                 Status       Importance  Assigned to    Milestone
zfs-linux (Arch Linux)  New          Undecided   Unassigned
zfs-linux (Ubuntu)      In Progress  Undecided   Andrea Righi

Bug Description

When trying to install Ceph on Ubuntu 20.04 with ZFS as the root file system, the OSDs do not come up.

The OSDs give the following error:

May 29 16:51:11 ip-10-0-0-148 systemd[1]: <email address hidden>: Main process exited, code=exited, status=1/FAILURE
May 29 16:51:12 ip-10-0-0-148 systemd[1]: <email address hidden>: Failed with result 'exit-code'.
May 29 16:51:22 ip-10-0-0-148 systemd[1]: <email address hidden>: Scheduled restart job, restart counter is at 4.
May 29 16:51:22 ip-10-0-0-148 systemd[1]: Stopped Ceph osd.0 for a3ed1cb2-a1cb-11ea-8daf-a729fb450032.
May 29 16:51:22 ip-10-0-0-148 systemd[1]: Starting Ceph osd.0 for a3ed1cb2-a1cb-11ea-8daf-a729fb450032...
May 29 16:51:22 ip-10-0-0-148 docker[114525]: Error: No such container: ceph-a3ed1cb2-a1cb-11ea-8daf-a729fb450032-osd.0
May 29 16:51:22 ip-10-0-0-148 systemd[1]: Started Ceph osd.0 for a3ed1cb2-a1cb-11ea-8daf-a729fb450032.
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-b3cf0dc5-a5fb-45c5-af3c-b85ef0b115ee/osd-block-3bfa4417-18e5-49f9->
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/ln -snf /dev/ceph-b3cf0dc5-a5fb-45c5-af3c-b85ef0b115ee/osd-block-3bfa4417-18e5-49f9-95ee-4c5912f0fa22 /var/lib/ceph/osd/ceph-0/block
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-0/block
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--b3cf0dc5--a5fb--45c5--af3c--b85ef0b115ee-osd--block--3bfa4417--18e5--49f9--95ee--4c5912f0fa22
May 29 16:51:23 ip-10-0-0-148 bash[114543]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-0
May 29 16:51:23 ip-10-0-0-148 bash[114543]: --> ceph-volume lvm activate successful for osd ID: 0
May 29 16:51:24 ip-10-0-0-148 bash[115166]: debug 2020-05-29T16:51:24.602+0000 7f05cfb9cec0  0 set uid:gid to 167:167 (ceph:ceph)
May 29 16:51:24 ip-10-0-0-148 bash[115166]: debug 2020-05-29T16:51:24.602+0000 7f05cfb9cec0  0 ceph version 15.2.2 (0c857e985a29d90501a285f242ea9c008df49eb8) octopus (stable), process ceph-osd, pid 1
May 29 16:51:24 ip-10-0-0-148 bash[115166]: debug 2020-05-29T16:51:24.602+0000 7f05cfb9cec0  0 pidfile_write: ignore empty --pid-file
May 29 16:51:24 ip-10-0-0-148 bash[115166]: debug 2020-05-29T16:51:24.602+0000 7f05cfb9cec0 -1 missing 'type' file and unable to infer osd type

Using Ubuntu 20.04 without ZFS on the root file system works fine.

Andrea Righi (arighi)
Changed in zfs-linux (Ubuntu):
assignee: nobody → Andrea Righi (arighi)
Andrea Righi (arighi) wrote :

I've tried to reproduce the problem on a VM (that uses ZFS as rootfs) by setting up a single-node Ceph cluster, but the OSD comes up correctly:

$ sudo ceph -s | grep osd
    osd: 1 osds: 1 up (since 50m), 1 in (since 59m)

Could you provide more details about your particular ceph configuration / infrastructure, so that I can try to reproduce the problem in an environment more similar to yours? Thanks.

Bryant G Ly (bryantgly) wrote :

We are using the latest Ubuntu 20.04 and we have tried both a ceph-ansible deploy and a Docker deploy; both of them give us issues with a ZFS root fs. How are you deploying?

If you give me your list of commands + image I can retry.

Andrea Righi (arighi) wrote :

I was pretty much following this simple tutorial:
http://prashplus.blogspot.com/2018/01/ceph-single-node-setup-ubuntu.html

I'll try to add docker and ceph-ansible to the equation and see if I can reproduce it.

Changed in zfs-linux (Ubuntu):
status: New → In Progress
Andrea Righi (arighi) wrote :

BTW, how did you install ceph-ansible? I can't find a 20.04 package in the ansible ppa.

Bryant G Ly (bryantgly) wrote :

We tried Docker by itself, then tried ceph-ansible by itself, to deploy.
https://docs.ceph.com/ceph-ansible/master/
For ceph-ansible we used version 5.

Martin Strange (mstrange) wrote :

For what it's worth, I've now had the exact same problem, which led me here.

On a bare-metal 20.04 install using whole blank HDDs as OSDs (/dev/sda etc.), installing with cephadm worked fine with an XFS root, but when I later reinstalled and tried a ZFS root, I got the same behaviour described above, despite trying device zaps and everything else I could think of.

It seems that unit.run does two separate steps: first a "/usr/sbin/ceph-volume lvm activate 0" and then a "/usr/bin/ceph-osd -n osd.0".

The activate step does its work inside a tmpfs mounted at "/var/lib/ceph/osd/ceph-0", which is thrown away entirely when that container ends, so the "/var/lib/ceph/osd/ceph-0/block" symlink it creates is gone before the ceph-osd container starts up. As a result, ceph-osd no longer finds a "block" and declares an unknown OSD type.
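
To make that concrete, here is a simplified sketch of the two steps as separate container runs (the image name, volume mapping and flags below are illustrative, not the literal unit.run that cephadm generates):

```
# Step 1: activation runs in its own short-lived container. The host OSD dir
# (roughly /var/lib/ceph/<fsid>/osd.0) is bind-mounted at /var/lib/ceph/osd/ceph-0.
podman run --rm -v /var/lib/ceph/<fsid>/osd.0:/var/lib/ceph/osd/ceph-0 \
    <ceph-image> ceph-volume lvm activate 0 <osd-fsid> --no-systemd

# Step 2: the OSD daemon runs in a *different* container afterwards.
podman run --rm -v /var/lib/ceph/<fsid>/osd.0:/var/lib/ceph/osd/ceph-0 \
    <ceph-image> ceph-osd -n osd.0 -f

# If step 1 mounts a tmpfs over /var/lib/ceph/osd/ceph-0 inside its own
# container, the primed files and the "block" symlink never reach the
# bind-mounted host directory, so step 2 finds an empty OSD dir.
```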

I don't understand how that could ever possibly work, so maybe the ZFS root is not relevant, or maybe it somehow causes activate to use the tmpfs?

Note that if I run a single container manually and do the same activate followed by running ceph-osd, then the OSD does come up.

How is "/var/lib/ceph/osd/ceph-0/block" meant to persist between running the activate in one container and then running ceph-osd in a different one afterwards? Or is the "/usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-0" that activate runs somehow the source of this problem?

Martin Strange (mstrange) wrote :

Follow-up: it does seem to be the tmpfs mount that activate creates that causes the problem.

I manually started the activate container by running the podman command from unit.run for the activate step, but ran "bash -l" instead of the actual activate command.

Then I prevented the tmpfs mount from doing anything by running "rm /usr/bin/mount" and replacing it with a link to "/usr/bin/true", and then ran the original activate command:

# /usr/sbin/ceph-volume lvm activate 2 56b13799-3ef5-4ea5-91d5-474f829f12dc --no-systemd

Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2 <<< WHY DOES IT DO THIS?
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-6fc7e3e3-2ce6-47ab-aac8-adc5c6633dfb/osd-block-56b13799-3ef5-4ea5-91d5-474f829f12dc --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-6fc7e3e3-2ce6-47ab-aac8-adc5c6633dfb/osd-block-56b13799-3ef5-4ea5-91d5-474f829f12dc /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/mapper/ceph--6fc7e3e3--2ce6--47ab--aac8--adc5c6633dfb-osd--block--56b13799--3ef5--4ea5--91d5--474f829f12dc
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
--> ceph-volume lvm activate successful for osd ID: 2

Because the tmpfs mount was now effectively a no-op, this activation created the necessary files in the real OSD directory, and I was then able to systemctl restart the OSD service, after which it came up apparently OK.
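
For reference, a condensed sketch of that manual workaround (the fsid, OSD id and image below are placeholders; this only paraphrases the steps above, it is not an official procedure):

```
# On the host: start the "activate" container interactively instead of letting
# unit.run do it (copy the podman/docker command for the activate step from the
# OSD's unit.run and replace the ceph-volume invocation with "bash -l").
podman run --rm -it ... <ceph-image> bash -l

# Inside the container: neuter /usr/bin/mount so the tmpfs mount becomes a no-op.
rm /usr/bin/mount
ln -s /usr/bin/true /usr/bin/mount

# Inside the container: run the original activation, then leave.
/usr/sbin/ceph-volume lvm activate 2 <osd-fsid> --no-systemd
exit

# Back on the host: restart the OSD service; the files created by the
# activation now persist in the real OSD directory.
systemctl restart ceph-<cluster-fsid>@osd.2.service
```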

I also did another fresh install on the same hardware using a normal non-ZFS root, and this problem did not happen, so it does in some way appear to be an interaction with ZFS.

Martin Strange (mstrange) wrote :

I think the reason that ZFS behaves differently is because of this...

/usr/lib/python3.6/site-packages/ceph_volume/devices/raw/activate.py

from ceph_volume.util import system

    # (excerpt: conf, osd_id, tmpfs and prepare_utils are imported/defined
    #  elsewhere in activate.py)
    # mount on tmpfs the osd directory
    osd_path = '/var/lib/ceph/osd/%s-%s' % (conf.cluster, osd_id)
    if not system.path_is_mounted(osd_path):
        # mkdir -p and mount as tmpfs
        prepare_utils.create_osd_path(osd_id, tmpfs=tmpfs)

This "path_is_mounted" test that it does appears to misbehave on a ZFS root, causing it to then resort to using the tmpfs

The test is ultimately traced to "get_mounts" in
/usr/lib/python3.6/site-packages/ceph_volume/util/system.py

On Linux, this reads through /proc/mounts

On a ZFS root, the line it should be finding resembles this...

rpool/ROOT/ubuntu_4trzhh/var/lib /var/lib/ceph/osd/ceph-0 zfs rw,relatime,xattr,posixacl 0 0

...whereas on a normal ext4 root, it looks like this...

/dev/nvme0n1p2 /var/lib/ceph/osd/ceph-0 ext4 rw,relatime,errors=remount-ro 0 0

There's some logic in there about the device needing to start with a leading "/", and I think that is what confuses the test when the ZFS root device is "rpool/..." with no leading slash.
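
To see the effect directly, the check in get_mounts behaves roughly like the filter below over /proc/mounts (the awk one-liner is only a rough stand-in for the actual Python code):

```
# Rough stand-in for get_mounts(): only entries whose device field (column 1)
# starts with "/" count as mounts of the OSD directory.
awk '$2 == "/var/lib/ceph/osd/ceph-0" && $1 ~ /^\//' /proc/mounts

# On an ext4 root this matches:
#   /dev/nvme0n1p2 /var/lib/ceph/osd/ceph-0 ext4 rw,relatime,errors=remount-ro 0 0
# On a ZFS root the device is "rpool/ROOT/..." with no leading "/", so nothing
# matches, path_is_mounted() returns False, and ceph-volume mounts the tmpfs.
```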

Tobias Bossert (tobib) wrote (last edit ):

Related pull request: https://github.com/ceph/ceph/pull/46043

## Edit

Since it will probably take a while until the PR is merged, here is my script to "modify" the Ceph Docker image:

```
#!/bin/bash
# Use this file to create a patched ceph docker image
# Usage: ./patch_ceph_image.sh <ceph_version>
# Example: ./patch_ceph_image.sh v16.2.7
#
# Relevant pull request: https://github.com/ceph/ceph/pull/46043
#
# After creating the patched image, you most likely want to upload it to a docker registry
CEPH_VERSION_TAG=$1
CEPH_IMAGE_SOURCE="quay.io/ceph/ceph"
CONTAINER_NAME=ceph_custom_$CEPH_VERSION_TAG
docker pull $CEPH_IMAGE_SOURCE:$CEPH_VERSION_TAG
IMAGE_ID=`docker images | grep -E "$CEPH_IMAGE_SOURCE|$CEPH_VERSION_TAG" | awk -F ' ' '{print \$3}'`
docker create --name $CONTAINER_NAME $IMAGE_ID
# And directly exit the container
docker cp $CONTAINER_NAME:/usr/lib/python3.6/site-packages/ceph_volume/util/system.py .
patch system.py -p0 -c --fuzz=3 --ignore-whitespace --output system_$CEPH_VERSION_TAG.py --verbose << 'EOF'
*** system.py 2021-12-07 17:15:49.000000000 +0100
--- system_n.py 2022-06-02 13:25:32.579878573 +0200
***************
*** 287,293 ****
              device = fields[0]
          path = os.path.realpath(fields[1])
          # only care about actual existing devices
          if not os.path.exists(device) or not device.startswith('/'):
!             if device not in do_not_skip:
                  continue
          if device in devices_mounted.keys():
--- 287,294 ----
              device = fields[0]
          path = os.path.realpath(fields[1])
+         filesystem = fields[2]
          # only care about actual existing devices
          if not os.path.exists(device) or not device.startswith('/'):
!             if device not in do_not_skip and filesystem != 'zfs':
                  continue
          if device in devices_mounted.keys():
EOF
# Verify
diff -u system.py system_$CEPH_VERSION_TAG.py
read -p "Is the diff correct? (y/N)" CONT
if [ "$CONT" = "y" ]; then
  docker cp system_$CEPH_VERSION_TAG.py $CONTAINER_NAME:/usr/lib/python3.6/site-packages/ceph_volume/util/system.py
  MOD_CONTAINER_ID=`docker ps -a | grep $CONTAINER_NAME | cut -d' ' -f 1`
  docker commit $MOD_CONTAINER_ID ceph-oep-patched:$CEPH_VERSION_TAG
  docker rm $CONTAINER_NAME
else
  docker rm $CONTAINER_NAME
  exit
fi
```
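
If you use this, you then need to make the cluster pull the patched image instead of the stock one, e.g. (the registry name below is a placeholder):

```
# push the patched image somewhere all hosts can pull from
docker tag ceph-oep-patched:v16.2.7 <registry>/ceph-oep-patched:v16.2.7
docker push <registry>/ceph-oep-patched:v16.2.7

# tell cephadm to use it for (re)deployed daemons
ceph config set global container_image <registry>/ceph-oep-patched:v16.2.7
```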
