ceph-osd fails to start after machine restart due to incorrect ceph.conf
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Expired | Undecided | Unassigned |
Bug Description
I'm setting up a Charmed Kubernetes cluster on 3 bare-metal machines managed by MAAS. Due to the limited number of physical machines, I am running a number of Juju units in LXD containers.
The overlay file looks as follows:
applications:
  ceph-mon:
    charm: ceph-mon
    channel: quincy/stable
    revision: 195
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-osd:
    charm: ceph-osd
    channel: quincy/stable
    revision: 576
    num_units: 3
    to:
    - "0"
    - "1"
    - "2"
    options:
      osd-devices: /dev/nvme0n1
  ceph-fs:
    charm: ceph-fs
    channel: quincy/stable
    revision: 60
    num_units: 3
    to:
    - lxd:0
    - lxd:1
    - lxd:2
  ceph-csi:
    charm: ceph-csi
    channel: stable
    revision: 37
    options:
      namespace: kube-system
      cephfs-
relations:
- [ceph-osd:mon, ceph-mon:osd]
- [ceph-fs:ceph-mds, ceph-mon:mds]
- [ceph-mon:client, ceph-csi:
- [kubernetes-
The only workload that is running unconfined on the machines is kubernetes-
Deployment works fine: all units start up, and Ceph StorageClasses are available in Kubernetes.
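For reference, this is how I verify the storage classes from a host with kubectl access (the exact class names depend on the ceph-csi configuration, so treat them as illustrative):

# list the storage classes created by ceph-csi
kubectl get storageclass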
Trouble begins after restarting a node. The ceph-osd unit on the node does not come up. Juju reports the following status message: "No block devices detected using current configuration".
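For reference, the commands I use to inspect the unit and its charm configuration (application names as in the overlay above):

# overall unit status as Juju sees it
juju status ceph-osd
# confirm the osd-devices option is still what the overlay set
juju config ceph-osd osd-devices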
It turns out that the ceph-osd systemd service is not running:
root@stagnum3:~# systemctl status ceph-osd@0
× ceph-osd@0.service - Ceph object storage daemon osd.0
Loaded: loaded (/lib/systemd/
Active: failed (Result: exit-code) since Mon 2024-01-15 21:36:32 UTC; 2 days ago
Process: 7258 ExecStartPre=
Process: 7262 ExecStart=
Main PID: 7262 (code=exited, status=1/FAILURE)
CPU: 105ms
Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Scheduled restart job, restart counter is at 4.
Jan 15 21:36:32 stagnum3 systemd[1]: Stopped Ceph object storage daemon osd.0.
Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Start request repeated too quickly.
Jan 15 21:36:32 stagnum3 systemd[1]: ceph-osd@0.service: Failed with result 'exit-code'.
Jan 15 21:36:32 stagnum3 systemd[1]: Failed to start Ceph object storage daemon osd.0.
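Since systemd hit its start rate limit ("Start request repeated too quickly"), a manual retry needs the failed state cleared first. A sketch, using osd.0 from the output above:

# clear the failed/rate-limited state, retry, and watch the log
systemctl reset-failed ceph-osd@0
systemctl start ceph-osd@0
journalctl -u ceph-osd@0 -b --no-pager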
journalctl shows the following:
Jan 15 21:36:11 stagnum3 systemd[1]: Started Ceph object storage daemon osd.0.
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: 2024-01-
Jan 15 21:36:11 stagnum3 ceph-osd[6585]: failed to fetch mon config (--no-mon-config to skip)
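"failed to fetch mon config" means the OSD cannot locate or reach a monitor at startup, which points at the local ceph.conf. Two quick checks from the affected node (assuming a usable admin keyring is present there):

# does the cluster answer at all? fail fast instead of hanging
ceph --connect-timeout 10 -s
# does the local config even name the monitors and the cluster fsid?
grep -E '^(fsid|mon host)' /etc/ceph/ceph.conf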
Indeed, /etc/ceph/ceph.conf has the following contents:
[global]
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
keyring = /etc/ceph/
mon host = 192.168.3.21 192.168.3.45 192.168.3.56
log to syslog = true
err to syslog = true
clog to syslog = true
mon cluster log to syslog = true
debug mon = 1/5
debug osd = 1/5
[client]
log file = /var/log/ceph.log
Notice that the file contains neither the fsid setting nor the public addr and cluster addr settings.
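For comparison, a complete [global] section on a healthy node also carries roughly the following (the fsid and addresses below are placeholders, not values from my cluster):

[global]
fsid = 00000000-0000-0000-0000-000000000000
public addr = 192.168.3.x
cluster addr = 192.168.3.x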
In one of the earlier iterations of the cluster I had a similar situation: two nodes had incorrect configuration, but the third one (I can't remember whether it was the leader node for the ceph-osd or the ceph-mon Juju application) had a correct configuration that contained the fsid and addr settings, as well as an [osd] section with
keyring = /var/lib/
If there is something I can do to help fix this, please let me know. I can tear down and reinstall the cluster if needed; I definitely can't hand the cluster over to the users until this is resolved.
As you noted, the ceph.conf looks incomplete. Naturally the ceph-osd charm should manage that -- so something must have gone wrong there.
Would you be able to provide juju logs from the ceph-{osd,mon} units, and ideally sosreports for those as well?
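Something like the following should capture them (unit names assumed; adjust to the actual deployment):

# full juju log history for the affected units
juju debug-log --replay --include unit-ceph-osd-0 > ceph-osd-0.log
juju debug-log --replay --include unit-ceph-mon-0 > ceph-mon-0.log
# sosreport on each affected machine (from the sosreport package)
sosreport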
TIA