Cannot scale-out ceph-radosgw application if multi-site replication is enabled

Bug #2062405 reported by Ionut-Madalin Balutoiu
This bug affects 2 people
Affects: Ceph RADOS Gateway Charm
Status: Fix Committed
Importance: Undecided
Assigned to: Ionut-Madalin Balutoiu
Milestone:

Bug Description

If the multi-site replication feature is enabled, the "ceph-radosgw" Juju application (primary or secondary) cannot be scaled out.

See this doc for the multi-site replication feature details:
https://ubuntu.com/ceph/docs/setting-up-multi-site

After multi-site replication is established via the following Juju relation:

```
juju relate primary-ceph-radosgw:primary secondary-ceph-radosgw:secondary
```

any scale-out operation via the following commands will fail:

```
juju add-unit primary-ceph-radosgw
```

or

```
juju add-unit secondary-ceph-radosgw
```

This is the error from "juju debug-log UNIT_ID" of any new Juju unit:

```
2024-04-18T18:55:00.016+0000 7f5d891b1080 -1 Errors while parsing config file!
2024-04-18T18:55:00.016+0000 7f5d891b1080 -1 can't open ceph.conf: (2) No such file or directory
unable to get monitor info from DNS SRV with service name: ceph-mon
2024-04-18T18:55:00.072+0000 7f5d891b1080 -1 failed for service _ceph-mon._tcp
2024-04-18T18:55:00.072+0000 7f5d891b1080 -1 monclient: get_monmap_and_config cannot identify monitors to contact
failed to fetch mon config (--no-mon-config to skip)
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/leader-settings-changed", line 1210, in <module>
    assess_status(CONFIGS)
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/utils.py", line 334, in assess_status
    assess_status_func(configs)()
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1828, in _assess_status_func
    state, message = _determine_os_workload_status(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1042, in _determine_os_workload_status
    state, message = _ows_check_charm_func(
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1206, in _ows_check_charm_func
    charm_state, charm_message = charm_func_with_configs()
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/contrib/openstack/utils.py", line 1043, in <lambda>
    state, message, lambda: charm_func(configs))
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/utils.py", line 249, in check_optional_config_and_relations
    if not multisite.is_multisite_configured(config('zone'),
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/multisite.py", line 701, in is_multisite_configured
    local_zones = list_zones()
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
    return f(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/multisite.py", line 125, in list_zones
    _zones = _list('zone')
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/multisite.py", line 72, in _list
    result = json.loads(_check_output(cmd))
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/charmhelpers/core/decorators.py", line 40, in _retry_on_exception_inner_2
    return f(*args, **kwargs)
  File "/var/lib/juju/agents/unit-ceph-radosgw-2/charm/hooks/multisite.py", line 33, in _check_output
    return subprocess.check_output(cmd).decode('UTF-8')
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['radosgw-admin', '--id=rgw.juju-bafcdf-11', 'zone', 'list']' returned non-zero exit status 1.
```

This is happening because the multi-site functions are part of `check_optional_config_and_relations`, which
is called by `assess_status` after every successful hook in the main hook entrypoint:

```
if __name__ == '__main__':
    try:
        hooks.execute(sys.argv)
    except UnregisteredHookError as e:
        log('Unknown hook {} - skipping.'.format(e))
    except ValueError as e:
        # Handle any invalid configuration values
        status_set(WORKLOAD_STATES.BLOCKED, str(e))
    else:
        assess_status(CONFIGS)
```
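To illustrate why the failure surfaces as an unhandled traceback, here is a minimal sketch of the same `try`/`except`/`else` shape. The names `run_hook`, `hook` and `assess` are hypothetical stand-ins, not charm code. The point is that the `else` branch runs after every hook that raises nothing, and an exception raised inside it is not caught by the preceding `except` clauses:

```python
import subprocess


def run_hook(hook, assess):
    """Mimic the entrypoint's try/except/else shape with stand-in callables."""
    try:
        hook()
    except ValueError as e:
        # Stand-in for status_set(WORKLOAD_STATES.BLOCKED, str(e))
        return 'blocked: {}'.format(e)
    else:
        # Runs only when hook() raised nothing. Anything raised here is
        # NOT handled by the except clause above and propagates upward.
        assess()
    return 'ok'


def failing_assess():
    # Stand-in for a status check whose radosgw-admin call exits non-zero.
    raise subprocess.CalledProcessError(1, ['radosgw-admin', 'zone', 'list'])


if __name__ == '__main__':
    try:
        run_hook(lambda: None, failing_assess)
    except subprocess.CalledProcessError as e:
        print('status assessment failed after a successful hook:', e)
```

In the sketch, a hook that succeeds still ends in an error because the status assessment itself raises, which matches the shape of the traceback above.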

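It seems that `check_optional_config_and_relations` does not return early when the unit is not yet ready for service (that is, when the ceph conf and keyring files have not been created yet), so the multi-site checks call `radosgw-admin` before it can reach the monitors.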
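One way to picture the missing guard is the sketch below. It is not the charm's code: `ready_for_service`, `check_status_sketch`, the file paths passed in, and the returned status strings are all assumptions made for illustration. The idea is only that the multi-site check is skipped until the conf and keyring files exist:

```python
import os


def ready_for_service(conf_path, keyring_path):
    """Hypothetical readiness test: both files must already exist."""
    return os.path.exists(conf_path) and os.path.exists(keyring_path)


def check_status_sketch(conf_path, keyring_path, multisite_check):
    """Return a (state, message) pair, bailing out early if not ready.

    multisite_check is a stand-in callable for checks such as
    is_multisite_configured, which shell out to radosgw-admin.
    """
    if not ready_for_service(conf_path, keyring_path):
        # Early return: do not run radosgw-admin before the unit can
        # reach the monitors.
        return 'waiting', 'unit not ready for service'
    if not multisite_check():
        return 'waiting', 'multi-site configuration incomplete'
    return 'unknown', ''
```

With a guard of this kind, a freshly added unit would report a waiting status instead of raising `CalledProcessError` during status assessment.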

OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-radosgw (master)
Changed in charm-ceph-radosgw:
status: New → In Progress
Changed in charm-ceph-radosgw:
assignee: nobody → Ionut-Madalin Balutoiu (ionutbalutoiu)
Chris Valean (cvalean)
Changed in charm-ceph-radosgw:
status: In Progress → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-radosgw (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-radosgw/+/916331
Committed: https://opendev.org/openstack/charm-ceph-radosgw/commit/1cac43fadcc6623c7805c715f665346f48fbd521
Submitter: "Zuul (22348)"
Branch: master

commit 1cac43fadcc6623c7805c715f665346f48fbd521
Author: Ionut Balutoiu <email address hidden>
Date: Thu Oct 19 13:40:26 2023 +0300

    Fix scale-out in the multi-site replication scenario

    If the multi-site relation is established, the `ceph-radosgw` application
    cannot be scaled out.

    This is happening because the multi-site functions are part of
    `check_optional_config_and_relations`, which is called by `assess_status`
    after every successful hook in the main hook entrypoint:
    ```
    if __name__ == '__main__':
        try:
            hooks.execute(sys.argv)
        except UnregisteredHookError as e:
            log('Unknown hook {} - skipping.'.format(e))
        except ValueError as e:
            # Handle any invalid configuration values
            status_set(WORKLOAD_STATES.BLOCKED, str(e))
        else:
            assess_status(CONFIGS)
    ```

    The multi-site functions (for example: `is_multisite_configured` or
    `check_cluster_has_buckets`) will fail since the unit is not yet ready
    for service.

    This change ensures that the unit is ready for service before calling
    any multi-site functions.

    Closes-Bug: #2062405
    Change-Id: I63c21a0b545bb456df9b09d8c16cc43cd7eec2f3
    Signed-off-by: Ionut Balutoiu <email address hidden>

Changed in charm-ceph-radosgw:
status: Confirmed → Fix Committed