Opening NFS tab in the dashboard leads to ceph mgr crash - orchestrator._interface.NoOrchestrator: No orchestrator configured

Bug #2039955 reported by Nobuto Murata
This bug affects 2 people
Affects               Status     Importance  Assigned to  Milestone
Ceph Dashboard Charm  New        Undecided   Unassigned
ceph (Ubuntu)         Confirmed  Undecided   Unassigned

Bug Description

Whenever the NFS tab in the Ceph dashboard is opened, a NoOrchestrator exception is raised and recorded as a ceph-mgr module crash (although it's not an actual process crash).

Other tabs that require the orchestrator handle the situation well: they print the following message and no exception is raised.

====
Orchestrator is not available
Orchestrator is unavailable: No orchestrator configured (try `ceph orch set backend`)
Please consult the documentation on how to configure and enable the management functionality.
====

With the NFS tab, in contrast, an exception is raised.

https://dashboard.example.com:8443/#/nfs
====
NFS-Ganesha is not configured

Remote method threw exception: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/nfs/module.py", line 169, in cluster_ls
    return available_clusters(self)
  File "/usr/share/ceph/mgr/nfs/utils.py", line 38, in available_clusters
    completion = mgr.describe_service(service_type='nfs')
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1488, in inner
    completion = self._oremote(method_name, args, kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 1555, in _oremote
    raise NoOrchestrator()
orchestrator._interface.NoOrchestrator: No orchestrator configured (try `ceph orch set backend`)
Please consult the documentation on how to configure and enable the management functionality.
====

# ceph health
HEALTH_WARN 2 mgr modules have recently crashed

# ceph crash ls
ID ENTITY NEW
2023-10-20T00:40:40.362363Z_2f461bb5-343c-4cb4-8134-99ae29ddc60c mgr.juju-ffeb43-0-lxd-0 *
2023-10-20T02:24:37.980204Z_9bf106e2-0dd2-4a88-b0f4-647dfa82697f mgr.juju-ffeb43-0-lxd-0 *

# ceph crash info 2023-10-20T00:40:40.362363Z_2f461bb5-343c-4cb4-8134-99ae29ddc60c
{
    "backtrace": [
        " File \"/usr/share/ceph/mgr/nfs/module.py\", line 169, in cluster_ls\n return available_clusters(self)",
        " File \"/usr/share/ceph/mgr/nfs/utils.py\", line 38, in available_clusters\n completion = mgr.describe_service(service_type='nfs')",
        " File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1488, in inner\n completion = self._oremote(method_name, args, kwargs)",
        " File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1555, in _oremote\n raise NoOrchestrator()",
        "orchestrator._interface.NoOrchestrator: No orchestrator configured (try `ceph orch set backend`)"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-10-20T00:40:40.362363Z_2f461bb5-343c-4cb4-8134-99ae29ddc60c",
    "entity_name": "mgr.juju-ffeb43-0-lxd-0",
    "mgr_module": "nfs",
    "mgr_module_caller": "ActivePyModule::dispatch_remote cluster_ls",
    "mgr_python_exception": "NoOrchestrator",
    "os_id": "22.04",
    "os_name": "Ubuntu 22.04.3 LTS",
    "os_version": "22.04.3 LTS (Jammy Jellyfish)",
    "os_version_id": "22.04",
    "process_name": "ceph-mgr",
    "stack_sig": "b01db59d356dd52f69bfb0b128a216e7606f54a60674c3c82711c23cf64832ce",
    "timestamp": "2023-10-20T00:40:40.362363Z",
    "utsname_hostname": "juju-ffeb43-0-lxd-0",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-87-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#97-Ubuntu SMP Mon Oct 2 21:09:21 UTC 2023"
}

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: ceph-mgr-dashboard 17.2.6-0ubuntu0.22.04.1
ProcVersionSignature: Ubuntu 5.15.0-87.97-generic 5.15.122
Uname: Linux 5.15.0-87-generic x86_64
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
CasperMD5CheckResult: unknown
CloudArchitecture: x86_64
CloudID: lxd
CloudName: lxd
CloudPlatform: lxd
CloudSubPlatform: LXD socket API v. 1.0 (/dev/lxd/sock)
Date: Fri Oct 20 09:49:25 2023
PackageArchitecture: all
ProcEnviron:
 TERM=screen-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: ceph
UpgradeStatus: No upgrade log present (probably fresh install)

tags: added: field-ceph-dashboard
Nobuto Murata (nobuto) wrote :

Subscribing ~field-high.

Even though it may be an upstream issue, we should look into this: whenever somebody clicks the tab, the whole Ceph cluster status turns into HEALTH_WARN, which will trigger alerts for operators.

Nobuto Murata (nobuto) wrote :

The test cluster was deployed with the steps in:
https://bugs.launchpad.net/charm-ceph-dashboard/+bug/2039763/comments/1

Samuel Allan (samuelallan) wrote :

Definitely an upstream issue, not related to the ceph-dashboard charm.

Exploring the ceph repository:

`src/pybind/mgr/dashboard/controllers/nfs.py`

```
    @Endpoint()
    @ReadPermission
    def status(self):
        status = {'available': True, 'message': None}
        try:
            # This is the call that triggers the crash; the crash comes
            # from the ceph nfs mgr module, not from this controller.
            # NOTE: running `sudo ceph nfs cluster ls` prints:
            #   Error ENOENT: No orchestrator configured (try `ceph orch set backend`)
            # but does not show a traceback.
            # This may be limited to the python API?
            mgr.remote('nfs', 'cluster_ls')
        except (ImportError, RuntimeError) as error:
            logger.exception(error)
            status['available'] = False
            status['message'] = str(error)  # type: ignore

        return status
```

When the orchestrator is not present, we see this traceback:

```
{
    "archived": "2023-11-20 04:58:57.151697",
    "backtrace": [
        " File \"/usr/share/ceph/mgr/nfs/module.py\", line 169, in cluster_ls\n return available_clusters(self)",
        " File \"/usr/share/ceph/mgr/nfs/utils.py\", line 38, in available_clusters\n completion = mgr.describe_service(service_type='nfs')",
        " File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1488, in inner\n completion = self._oremote(method_name, args, kwargs)",
        " File \"/usr/share/ceph/mgr/orchestrator/_interface.py\", line 1555, in _oremote\n raise NoOrchestrator()",
        "orchestrator._interface.NoOrchestrator: No orchestrator configured (try `ceph orch set backend`)"
    ],
    "ceph_version": "17.2.6",
    "crash_id": "2023-11-20T04:47:16.737623Z_8a944527-1cc1-4ed5-b58b-86bf97bcf3b1",
    "entity_name": "mgr.juju-108031-1-lxd-1",
    "mgr_module": "nfs",
    "mgr_module_caller": "ActivePyModule::dispatch_remote cluster_ls",
    "mgr_python_exception": "NoOrchestrator",
    "os_id": "22.04",
    "os_name": "Ubuntu 22.04.3 LTS",
    "os_version": "22.04.3 LTS (Jammy Jellyfish)",
    "os_version_id": "22.04",
    "process_name": "ceph-mgr",
    "stack_sig": "b01db59d356dd52f69bfb0b128a216e7606f54a60674c3c82711c23cf64832ce",
    "timestamp": "2023-11-20T04:47:16.737623Z",
    "utsname_hostname": "juju-108031-1-lxd-1",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-88-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#98-Ubuntu SMP Mon Oct 2 15:18:56 UTC 2023"
}

```

I guess this is the part that maps directly to the `cluster_ls` method:
```
"mgr_module_caller": "ActivePyModule::dispatch_remote cluster_ls",
```

This is `cluster_ls`, in `src/pybind/mgr/nfs/module.py`.

```
    # this raises an error, causing a module crash, if the orchestrator is not available
    def cluster_ls(self) -> List[str]:
        return available_clusters(self)
```

^ This is the root of the traceback we're seeing.

I guess the reason we're seeing a crash is that this method doesn't catch any errors thrown from `available_clusters`.
For reference, other methods I've checked here will handle the error.
For example:

(in `src/pybind/mgr/nfs/...

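For illustration, a minimal, self-contained sketch of the defensive handling being discussed: catch the missing-orchestrator error inside `cluster_ls` so it never escapes `dispatch_remote` and gets recorded as a module crash. The stubs below stand in for the real ceph-mgr imports, and the fallback behaviour (returning an empty list) is an assumption, not the merged fix:

```
from typing import List

# Stubs standing in for the real ceph-mgr imports:
#   from orchestrator import NoOrchestrator
#   from nfs.utils import available_clusters
class NoOrchestrator(Exception):
    pass

def available_clusters(mgr) -> List[str]:
    # Pretend no orchestrator backend is configured.
    raise NoOrchestrator("No orchestrator configured (try `ceph orch set backend`)")

class NFSModule:  # hypothetical stand-in for the nfs mgr module
    def cluster_ls(self) -> List[str]:
        try:
            return available_clusters(self)
        except NoOrchestrator:
            # No orchestrator means no NFS clusters can exist yet, so
            # report none instead of letting the exception be recorded
            # as a mgr module crash.
            return []

print(NFSModule().cluster_ls())  # -> []
```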

Luciano Lo Giudice (lmlogiudice) wrote :

I'm not sure that patch is the correct fix. Looking at how other controllers for the dashboard operate, it would appear that the `NFSGanesha.status` and `NFSGaneshaCluster.list` methods should be decorated with the `@raise_if_no_orchestrator` and `@handle_orchestrator_error` decorators, like so:

```
class NFSGanesha(RESTController):

    @Endpoint()
    @ReadPermission
    @raise_if_no_orchestrator()
    @handle_orchestrator_error('nfs')
    def status(self):
        ...
```

Samuel Allan (samuelallan) wrote :

Hmm, I see what you mean, but I'm not sure about it in the context of the status-check endpoints: the other status endpoints don't have this error handler, and the client expects a specific JSON response.

The core issue is also that the mgr module crashes whenever the cluster_ls method is called, which is what I was trying to solve by catching the missing-orchestrator error there and introducing a new method to check whether nfs is available (a rough sketch of that idea follows).
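
A rough sketch of that second idea, reusing the stubs from the earlier sketch: a hypothetical `is_available` method on the nfs module that the dashboard's `status()` endpoint could call via `mgr.remote('nfs', 'is_available')` instead of `cluster_ls`. The method name is illustrative only; see the merged PR below for the actual fix:

```
class NFSModule:
    def is_available(self) -> bool:
        """Report whether NFS management is usable, without raising."""
        try:
            available_clusters(self)
            return True
        except NoOrchestrator:
            return False
```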

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Ponnuvel Palaniyappan (pponnuvel) wrote :

The proposed patch [0] to fix this has been merged into main.

I have created the backport PRs:
Reef: https://github.com/ceph/ceph/pull/58283
Squid: https://github.com/ceph/ceph/pull/58285
Quincy: https://github.com/ceph/ceph/pull/58284

[0] https://github.com/ceph/ceph/pull/56876
