ceph-csi units blocked with `Config manifests require the definition of 'fsid'`

Bug #2064309 reported by Marian Gasparovic
This bug affects 1 person
Affects: Ceph CSI Charm
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned
Milestone: 1.30

Bug Description

When running jobs/integration/validation.py::test_kubelet_extra_config, the ceph-csi units go to a blocked state, reporting `Config manifests require the definition of 'fsid'`.

Logs - https://oil-jenkins.canonical.com/artifacts/585af18b-43be-4813-bb27-b44b232af98b/index.html

Tags: cdo-qa
Changed in charmed-kubernetes-testing:
milestone: none → 1.30
status: New → Triaged
importance: Undecided → High
Adam Dyess (addyess) wrote :
Kevin W Monroe (kwmonroe) wrote :

Hi @marian, from the logs this looks like your env is a microk8s charm based deployment that has ceph-csi charm units in the model. Ceph/mk8s integration wouldn't use the ceph-csi charm, so perhaps the env is dirty from previous validation runs?

If not, it looks similar to: https://bugs.launchpad.net/charm-ceph-csi/+bug/2035393

Which we determined to be invalid because the actual root cause was an invalid namespace prior to having the ceph manifests deployed: https://bugs.launchpad.net/charm-ceph-csi/+bug/2053265

That was fixed in charmed k8s 1.29+ck1 (released late April 2024). I'm gonna mark this bug as invalid for now and ask that you confirm with charmed-k8s + ceph-csi at 1.29+ck1 (or greater) to weed out any env and/or duplicate fixes. Thank you!

Changed in charmed-kubernetes-testing:
status: Triaged → Invalid
importance: High → Undecided
milestone: 1.30 → none
Jeffrey Chang (modern911) wrote :

The ceph-csi node is already in a blocked state before we start k8s-suite.
We skip checking ceph-csi status in the SKU, so this could be a configuration issue.

This is not a mk8s deployment, but a charmed k8s 1.30 beta.
And this problem doesn't happen with 1.29/stable.
We might need to tweak config for 1.30 a bit.

affects: charmed-kubernetes-testing → charm-ceph-csi
Adam Dyess (addyess) wrote :
Changed in charm-ceph-csi:
milestone: none → 1.30
status: Invalid → Fix Committed
Adam Dyess (addyess) wrote :

https://solutions.qa.canonical.com/testruns/da629ee9-0528-4dfe-9e8f-59fe8bf3e538

aight, on this run -- i notice that the charm is failing to get the `ceph fsid` on the ceph-csi units. (same as the kubernetes-control-plane units)

* the charm apt installs the ceph tools
* configures the ceph client using connection details from the ceph-mon units
* runs ceph cli commands on that unit

i see reports in the charm

```
2024-06-25 00:26:51 ERROR unit.ceph-csi/1.juju-log server.go:325 get_ceph_fsid: Failed to get CephFS ID, reporting as empty string
```

which comes from the charm code:
```python
    def get_ceph_fsid(self) -> str:
        """Get the Ceph FSID (cluster ID)"""
        try:
            return self.ceph_cli("fsid").strip()
        except subprocess.SubprocessError:
            logger.error("get_ceph_fsid: Failed to get CephFS ID, reporting as empty string")
            return ""
```

i guess i need more details since either ceph-mon isn't providing the correct connection details, or the cli tools cannot reach the ceph-mon hosts
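One way to tell those two cases apart is a plain TCP probe from the ceph-csi unit to each monitor address. The following is a minimal sketch and not part of the charm: `can_reach` and `probe_monitors` are hypothetical helpers, and it assumes the monitors listen on Ceph's default ports 3300 (msgr2) and 6789 (msgr1).

```python
import socket


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


def probe_monitors(mon_hosts, ports=(3300, 6789), timeout=3.0):
    """Map each (host, port) pair to whether it accepted a connection."""
    return {
        (host, port): can_reach(host, port, timeout)
        for host in mon_hosts
        for port in ports
    }
```

If every probe fails while the addresses match what ceph-mon published on the relation, the problem is more likely network reachability than bad connection details.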

Adam Dyess (addyess) wrote :

Thanks so much to @asbalderson for helping me troubleshoot this.

it turned out that ceph-csi charm and ceph-mon charm were using a juju space `public-space` to communicate over the `ceph-client` relation.

However, since `ceph-csi` is a subordinate -- the primary `kubernetes-control-plane` didn't have an interface in the `public-space` space and was using an address from the `oam-space` to communicate.

This resulted in the ceph client on the ceph-csi unit not being able to contact the ceph-mon agent.

Perhaps this bug could BEST be resolved by providing some indication to the user that the ceph client is timing out trying to contact ceph-mon.
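A rough sketch of what such an indication might look like, assuming the CLI call goes through `subprocess.run` with a timeout. `fetch_fsid`, its return shape and the 10 second default are hypothetical and are not the charm's actual code; the only real command used is `ceph fsid`.

```python
import subprocess
from typing import Optional, Sequence, Tuple


def fetch_fsid(
    cmd: Sequence[str] = ("ceph", "fsid"), timeout: float = 10.0
) -> Tuple[str, Optional[str]]:
    """Run the fsid command and return (fsid, reason).

    reason is None on success, otherwise a short message that says
    why the fsid is empty, so a timeout is not hidden behind "".
    """
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, check=True, timeout=timeout
        )
        return proc.stdout.strip(), None
    except subprocess.TimeoutExpired:
        return "", f"Timed out after {timeout}s contacting ceph-mon"
    except subprocess.CalledProcessError as err:
        detail = (err.stderr or "").strip()
        return "", f"ceph exited with code {err.returncode}: {detail}"
    except FileNotFoundError:
        return "", "ceph CLI not found"
```

The reason string could then be shown in the unit's blocked status message instead of the generic `fsid` complaint.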
