ceph-osd fails to start after machine restart due to incorrect ceph.conf

Bug #2049770 reported by Rafał Krzewski
This bug affects 1 person
Affects: Ceph OSD Charm
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm setting up a Charmed Kubernetes cluster on 3 bare metal machines managed by MAAS. Due to the limited number of physical machines, I am running a number of Juju units in LXD containers.

The overlay file looks as follows:

applications:
  ceph-mon:
    charm: ceph-mon
    channel: quincy/stable
    revision: 195
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-osd:
    charm: ceph-osd
    channel: quincy/stable
    revision: 576
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    options:
      osd-devices: /dev/nvme0n1
  ceph-fs:
    charm: ceph-fs
    channel: quincy/stable
    revision: 60
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-csi:
    charm: ceph-csi
    channel: stable
    revision: 37
    options:
      namespace: kube-system
      cephfs-enable: true
relations:
- [ceph-osd:mon, ceph-mon:osd]
- [ceph-fs:ceph-mds, ceph-mon:mds]
- [ceph-mon:client, ceph-csi:ceph-client]
- [kubernetes-control-plane:juju-info, ceph-csi:kubernetes]

The only workload that is running unconfined on the machines is kubernetes-control-plane. Everything else is running either in LXD or in Kubernetes.

Deployment works fine, all units start up and Ceph StorageClasses are available in Kubernetes.

Trouble begins after restarting a node: the ceph-osd unit on that node does not come up, and Juju reports the following status message: "No block devices detected using current configuration".

It turns out that the ceph-osd systemd service is not running:

root@stagnum3:/home/ubuntu# systemctl status ceph-osd@0
× ceph-osd@0.service - Ceph object storage daemon osd.0
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
     Active: failed (Result: exit-code) since Mon 2024-01-15 21:36:32 UTC; 2 days ago
    Process: 7258 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id 0 (code=exited, status=0/SUCCESS)
    Process: 7262 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id 0 --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
   Main PID: 7262 (code=exited, status=1/FAILURE)
        CPU: 105ms

Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 4.
Jan 15 21:36:32 stagnum3 systemd[1]: Stopped Ceph object storage daemon osd.0.
Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Failed with result 'exit-code'.
Jan 15 21:36:32 stagnum3 systemd[1]: Failed to start Ceph object storage daemon osd.0.

journalctl shows the following:

Jan 15 21:36:11 stagnum3 systemd[1]: Started Ceph object storage daemon osd.0.
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-15T21:36:11.682+0000 7f3717f67800 -1 auth: unable to find a keyring on /etc/ceph/ceph.osd.0.keyring: (2) No such file or directory
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-15T21:36:11.682+0000 7f3717f67800 -1 auth: unable to find a keyring on /etc/ceph/ceph.osd.0.keyring: (2) No such file or directory
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-15T21:36:11.682+0000 7f3717f67800 -1 AuthRegistry(0x56435af66138) no keyring found at /etc/ceph/ceph.osd.0.keyring, disabling cephx
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-15T21:36:11.682+0000 7f3717f67800 -1 auth: unable to find a keyring on /etc/ceph/ceph.osd.0.keyring: (2) No such file or directory
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-15T21:36:11.682+0000 7f3717f67800 -1 AuthRegistry(0x7ffdfa515c20) no keyring found at /etc/ceph/ceph.osd.0.keyring, disabling cephx
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: failed to fetch mon config (--no-mon-config to skip)

Indeed, /etc/ceph/ceph.osd.0.keyring does not exist. The keyring is at /var/lib/ceph/osd/ceph-0/keyring, and that location should be configured in /etc/ceph/ceph.conf.

ceph.conf has the following contents:

[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
keyring = /etc/ceph/$cluster.$name.keyring
mon host = 192.168.3.21 192.168.3.45 192.168.3.56
log to syslog = true
err to syslog = true
clog to syslog = true
mon cluster log to syslog = true
debug mon = 1/5
debug osd = 1/5

[client]
log file = /var/log/ceph.log

Notice that the file contains neither the fsid setting nor the public addr and cluster addr settings.
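
For clarity, the path in the journalctl errors above follows directly from Ceph's metavariable expansion of that keyring line (a worked example using only names that appear in this report):

# $cluster defaults to "ceph" and $name is <daemon type>.<id>, i.e. "osd.0",
# so the global keyring line resolves to:
#   /etc/ceph/$cluster.$name.keyring  ->  /etc/ceph/ceph.osd.0.keyring   (missing)
# while the keyring actually present on the host is in the OSD data directory:
ls -l /var/lib/ceph/osd/ceph-0/keyring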

In one of the earlier iterations of the cluster I had a similar situation: two nodes had an incorrect configuration, but the third one (I can't remember whether it was the leader node for the ceph-osd or ceph-mon Juju application) had a correct configuration containing fsid and addr settings, as well as an [osd] section with keyring = /var/lib/ceph/osd/$cluster-$id/keyring. I was able to recover the cluster by copying the file to the other nodes, substituting the addr settings, and restarting the ceph-osd service using systemctl. The files were overwritten with incorrect contents shortly afterwards, presumably by the Juju agent.
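
For reference, a minimal sketch of what those missing pieces might look like, based only on what is described above; the fsid and addr values are placeholders, not the real values from this cluster:

[global]
fsid = <cluster-fsid>
public addr = <host address in 192.168.3.0/24>
cluster addr = <host address in 192.168.3.0/24>

[osd]
keyring = /var/lib/ceph/osd/$cluster-$id/keyring

After restoring a correct file, the failed unit may also need systemd's start-rate limit (visible in the status output above) cleared with systemctl reset-failed ceph-osd@0 before a manual systemctl start ceph-osd@0 will succeed.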

If there is something I can do to help fix this, please let me know. I can tear down and reinstall the cluster if needed; I definitely can't hand the cluster over to its users until this is resolved.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

As you noted, the ceph.conf looks incomplete. Naturally the ceph-osd charm should manage that -- so something must have gone wrong there.

Would you be able to provide juju logs from the ceph-{osd,mon} units, and ideally sosreports for those as well?

TIA

Changed in charm-ceph-osd:
status: New → Incomplete
Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

juju status
...
ceph-osd/0 active idle 0 192.168.3.46 Unit is ready (1 OSD)
ceph-osd/1* active idle 1 192.168.3.57 Unit is ready (1 OSD)
ceph-osd/2 blocked idle 2 192.168.3.30 No block devices detected using current configuration
...

debug-log --replay --include ceph-osd/2
unit-ceph-osd-1: 17:22:02 INFO unit.ceph-osd/2.juju-log Updating status.
unit-ceph-osd-1: 17:22:02 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)

I scrolled back and can't see any other messages. Logs from the working unit look the same:

debug-log --replay --include ceph-osd/1
unit-ceph-osd-1: 17:27:59 INFO unit.ceph-osd/1.juju-log Updating status.
unit-ceph-osd-1: 17:28:00 INFO juju.worker.uniter.operation ran "update-status" hook (via explicit, bespoke hook script)

ceph-mon/1 is the leader

juju debug-log --replay --include ceph-mon/1
unit-ceph-mon-1: 17:27:29 WARNING unit.ceph-mon/1.juju-log 0 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
unit-ceph-mon-1: 17:27:29 INFO unit.ceph-mon/1.juju-log Updating status
unit-ceph-mon-1: 17:27:29 INFO unit.ceph-mon/1.juju-log Status updated
unit-ceph-mon-1: 17:27:30 INFO unit.ceph-mon/1.juju-log Updating status
unit-ceph-mon-1: 17:27:30 INFO unit.ceph-mon/1.juju-log Status updated
unit-ceph-mon-1: 17:27:30 INFO juju.worker.uniter.operation ran "update-status" hook (via hook dispatching script: dispatch)

Scrolling back does not reveal any other messages.

I think this is because it's been a few days since I rebooted machine 2, and the juju logs only seem to go back less than 24 hours.

I'll try rebooting another machine and see if any other messages show up and report back.

I've never used sosreport before. I see it's available on Ubuntu. Do you recommend any specific options I should use? Should I attach the results to the bug, or upload them somewhere else (where?) and post links? Do you need sosreports from one unit of each application, or from all of them?

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Maybe the juju unit log files have been rotated -- if so, the older log files would be available in /var/log/juju on the units.

Let's take a look at those first; we can check later whether we actually need the sosreports.
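
For completeness, a sketch of how those could be gathered, assuming standard log paths and the sos package from the Ubuntu archive (plugin names may differ between sos versions):

# rotated unit logs live on the machine hosting the unit
juju ssh ceph-osd/2 -- ls -l /var/log/juju/
juju scp ceph-osd/2:/var/log/juju/unit-ceph-osd-2.log .

# non-interactive sos report limited to the juju and ceph plugins
juju ssh ceph-osd/2 -- sudo sos report --batch -o juju,ceph_osd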

Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

I did an experiment: I installed the cluster step by step using the CLI instead of a bundle, to catch the point at which /etc/ceph/ceph.conf gets clobbered:

juju add-machine -n 3

juju deploy ceph-mon --channel quincy/stable -n 3 --to lxd:0,lxd:1,lxd:2

juju deploy ceph-osd --channel quincy/stable -n 3 --to 0,1,2\
  --config osd-devices=/dev/nvme0n1

juju integrate ceph-mon:osd ceph-osd:mon

# at this point /etc/ceph/ceph.conf shows up on machines 0..2 with correct contents

juju deploy ceph-fs --channel quincy/stable -n 3 --to lxd:0,lxd:1,lxd:2

juju integrate ceph-mon:mds ceph-fs:ceph-mds

juju deploy easyrsa --channel 1.28/stable -n 3 --to lxd:0,lxd:1,lxd:2

juju deploy etcd --channel 1.28/stable -n 3 --to lxd:0,lxd:1,lxd:2

juju integrate easyrsa:client etcd:certificates

juju deploy kubernetes-control-plane --channel 1.28/stable -n 3 --to 0,1,2 \
  --config extra_sans="127.0.0.1 192.168.3.5 k8s.stagnum.caltha.eu"\
  --config loadbalancer-ips=192.168.3.5\
  --config service-cidr=10.152.180.0/22\
  --config register-with-taints=""\
  --config proxy-extra-config="{
          mode: ipvs,
          ipvs: {
            strictARP: true
          }
        }"\
  --config sysctl="{
          net.bridge.bridge-nf-call-iptables: 0,
          net.ipv4.conf.all.forwarding: 1,
          net.ipv4.conf.all.rp_filter: 1,
          net.ipv4.neigh.default.gc_thresh1: 128,
          net.ipv4.neigh.default.gc_thresh2: 28672,
          net.ipv4.neigh.default.gc_thresh3: 32768,
          net.ipv6.neigh.default.gc_thresh1: 128,
          net.ipv6.neigh.default.gc_thresh2: 28672,
          net.ipv6.neigh.default.gc_thresh3: 32768,
          fs.inotify.max_user_instances: 8192,
          fs.inotify.max_user_watches: 1048576,
          kernel.panic: 10,
          kernel.panic_on_oops: 1,
          vm.overcommit_memory: 1
        }"\
  --config allow-privileged=true

juju integrate easyrsa:client kubernetes-control-plane:certificates

juju integrate etcd:db kubernetes-control-plane:etcd

juju deploy containerd --channel 1.28/stable

juju integrate kubernetes-control-plane:container-runtime containerd:containerd

juju deploy calico --channel 1.28/stable\
  --config cidr=92.168.64.0/20

juju integrate etcd:db calico:etcd

juju integrate kubernetes-control-plane:cni calico:cni

juju deploy kubeapi-load-balancer --channel 1.28/stable -n 3 --to lxd:0,lxd:1,lxd:2\
  --config extra_sans="127.0.0.1 192.168.3.5 k8s.stagnum.caltha.eu"

juju deploy keepalived --channel stable\
  --config vip_hostname=k8s.stagnum.caltha.eu\
  --config virtual_ip=192.168.3.5

juju integrate easyrsa:client kubeapi-load-balancer:certificates

juju integrate kubeapi-load-balancer:juju-info keepalived:juju-info

juju integrate kubernetes-control-plane:loadbalancer-internal kubeapi-load-balancer:lb-consumers

juju integrate kubernetes-control-plane:loadbalancer-external kubeapi-load-balancer:lb-consumers

juju deploy ceph-csi --channel stable\
  --config namespace=kube-system\
  --config cephfs-enable=true

juju integrate ceph-csi:kubernetes kubernetes-control-plane:juju-info

juju integrate ceph-csi:ceph-client ceph-mon:client

# as soon as ceph-csi starts, /etc/ceph/ceph.conf on machines 0...


Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Just to confirm your hypothesis: indeed, ceph-csi writes to /etc/ceph/ceph.conf, so you will need to separate them somehow:

https://github.com/charmed-kubernetes/ceph-csi-operator/blob/52dc3d10048c46a1f903238dfd253f1eab3e10b2/src/charm.py#L189

Right, as you say, ceph-osd needs to be unconfined for raw disk access.
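
For anyone who wants to catch the writer in the act on a live machine, a hedged sketch using auditd (assumes the auditd package is installed; the key name is arbitrary):

sudo auditctl -w /etc/ceph/ceph.conf -p wa -k cephconf
# ...wait for the file to change, then see which process wrote it:
sudo ausearch -k cephconf --interpret | tail -n 40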

Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

After redeploying the cluster with kubernetes-control-plane units in LXD containers, I've run into another problem:

  Warning FailedMount 10m kubelet MountVolume.MountDevice failed for volume "pvc-235b7bf7-b4b7-443d-a810-b1acc12eed45" : rpc error: code = Internal desc = rbd: map failed with error an error (exit status 22) occurred while running rbd args: [--id ceph-csi -m 192.168.3.45,192.168.3.46,192.168.3.60 --keyfile=***stripped*** map xfs-pool/csi-vol-79150a03-d4df-45f6-a339-d919a0184236 --device-type krbd --options noudev], rbd error output: rbd: mapping succeeded but /dev/rbd0 is not accessible, is host /dev mounted?
rbd: map failed: (22) Invalid argument

I can see the /dev/rbd0 device on machine 2, but not in the 2/lxd/5 container where kubelet is running.

I've tried setting ceph-csi cephfs-mounter=ceph-fuse and now I see the following message:

Warning FailedMount 64s (x5 over 9m13s) kubelet MountVolume.MountDevice failed for volume "pvc-235b7bf7-b4b7-443d-a810-b1acc12eed45" : rpc error: code = Internal desc = exit status 1

I don't know whether changing the setting on an already deployed cluster simply doesn't take effect, or whether I have run into yet another limitation. I've looked at the logs of snap.kubelet.daemon.service on 2/lxd/5 via systemctl, but they do not show any more details about the error.

The next thing I'm going to try is destroying the model and redeploying with cephfs-mounter=ceph-fuse from the get-go. Do you have any other suggestions?
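
If it helps, the overlay equivalent of that option would presumably just sit next to the existing ceph-csi options (a sketch mirroring the overlay at the top of this report):

  ceph-csi:
    charm: ceph-csi
    channel: stable
    options:
      namespace: kube-system
      cephfs-enable: true
      cephfs-mounter: ceph-fuse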

Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

Oh, is cephfs-mounter used with RBD devices at all?

If not, is there any way to make the host's /dev/rbd* devices available in the LXD container?

I have only 3 bare metal machines to build this cluster so I need to co-locate ceph-osd and kubernetes-control-plane units somehow.
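
Purely as an experiment (not a supported setup), LXD can pass a host block device into a container; a hedged sketch, with a hypothetical container name that you would look up via lxc list on the host:

lxc config device add juju-xxxxxx-2-lxd-5 rbd0 unix-block source=/dev/rbd0 path=/dev/rbd0

Note that each new RBD map creates a new device node (/dev/rbd1, ...), so a static rule like this at best confirms where the limitation lies.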

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Instead of an LXD container, you can consider a KVM instance. It will run isolated as a virtual machine, which avoids some of the privilege restrictions placed on LXD containers for security reasons.

For placement, instead of --to lxd:machine#, you simply use --to kvm:machine# and you get a KVM instance instead.
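
As a concrete example, reusing the channel and placement from the earlier CLI walkthrough (config options omitted), that would presumably look like:

juju deploy kubernetes-control-plane --channel 1.28/stable -n 3 --to kvm:0,kvm:1,kvm:2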

Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

Since running ceph-osd in KVM is a non-starter, I guess I should run kubernetes-control-plane in KVM then. But will it use host resources effectively? Note that the actual Kubernetes workloads would also run inside the KVM instances in this setup.

Maybe the ceph-csi charm could be modified to store its Ceph client configuration at a location other than /etc/ceph/ceph.conf? That would solve the problem I'm facing, and I don't think I'm the only person who would like to set up Charmed Kubernetes on a small number of bare metal machines.

Revision history for this message
Rafał Krzewski (rafal-krzewski) wrote :

I'm trying to deploy kubernetes-control-plane to KVM, but it's turning out to be difficult.

I've used the following placement directive for kubernetes-control-plane in the bundle definition:

    to:
    - kvm:0
    - kvm:1
    - kvm:2
    constraints: cores=60 mem=240G

It failed to deploy, with the following message displayed for the machine:

no obvious space for container "1/lxd/0", host machine has spaces: "stagnum", "undefined"

MAAS shows the following spaces:

stagnum untagged MAAS-provided fabric-0 192.168.3.0/24 16%
No space untagged No DHCP fabric-1 192.168.122.0/24 100%

I gather that "undefined" is the "No space" entry created for 192.168.122.0/24, likely coming from KVM/QEMU.

I tried changing the constraints to "cores=60 mem=240G spaces=stagnum" but now I'm getting the following deployment error:

matching subnets to zones: cannot use space "alpha" as deployment target: no subnets

$ juju spaces
Name Space ID Subnets
alpha 0
stagnum 1 192.168.3.0/24
undefined 2 192.168.122.0/24
$ juju show-space alpha
space:
  id: "0"
  name: alpha
  subnets: []
applications:
- calico
- ceph-csi
- ceph-fs
- ceph-mon
- ceph-osd
- containerd
- easyrsa
- etcd
- keepalived
- kubeapi-load-balancer
- kubernetes-control-plane
machine-count: 0

Where did the "alpha" space come from? And why is kubernetes-control-plane assigned to it despite the spaces=stagnum constraint? I didn't have to touch anything space-related until now...
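
For reference, a hedged sketch of two ways to pin applications to the stagnum space; whether either one resolves the kvm placement error above is untested:

# model-wide constraint
juju set-model-constraints spaces=stagnum

# or re-bind a single application's endpoints to the space
juju bind kubernetes-control-plane stagnum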

I went back and redeployed the cluster with unconfined kubernetes-control-plane. Apparently all applications are assigned to the "alpha" space with this setup as well:

$ juju spaces
Name Space ID Subnets
alpha 0
stagnum 1 192.168.3.0/24
undefined 2 192.168.122.0/24
$ juju show-space alpha
space:
  id: "0"
  name: alpha
  subnets: []
applications:
- calico
- ceph-fs
- ceph-mon
- ceph-osd
- containerd
- easyrsa
- etcd
- keepalived
- kubeapi-load-balancer
- kubernetes-control-plane
machine-count: 0
$ juju show-space stagnum
space:
  id: "1"
  name: stagnum
  subnets:
  - cidr: 192.168.3.0/24
    provider-id: "1"
    vlan-tag: 0
applications: []
machine-count: 18

Despite that, all machines in the model are getting IPs in the 192.168.3.0/24 block:

Machine State Address Inst id Base AZ Message
0 started 192.168.3.51 stagnum1 ubuntu@22.04 default Deployed
0/lxd/0 started 192.168.3.49 juju-02d0c2-0-lxd-0 ubuntu@22.04 default Container started
0/lxd/1 started 192.168.3.64 juju-02d0c2-0-lxd-1 ubuntu@22.04 default Container started
0/lxd/2 started 192.168.3.27 juju-02d0c2-0-lxd-2 ubuntu@22.04 default Container started
0/lxd/3 started 192.168.3.39 juju-02d0c2-0-lxd-3 ubuntu@22.04 default Container started
0/lxd/4 started 192.168.3.37 juju-02d0c2-0-lxd-4 ubuntu@22.04 default Container started

For the next test, I've destroyed the model, recreated it and ran `juju mod...


Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Ceph OSD Charm because there has been no activity for 60 days.]

Changed in charm-ceph-osd:
status: Incomplete → Expired