ceph-mon does not process broker requests when it should be

Bug #1942224 reported by Dmitrii Shcherbakov
This bug affects 1 person
Affects: Ceph Monitor Charm
Status: Fix Released
Importance: Critical
Assigned to: Dmitrii Shcherbakov
Milestone: 21.10

Bug Description

A xenial-mitaka deployment became blocked due to an unexpected state of the radosgw service on a radosgw unit: the service is masked when it should not be.

https://paste.ubuntu.com/p/WjXfc4WK5Y/ (tox -e func)
https://paste.ubuntu.com/p/9pBxmm7gbR/ (juju show-status-log -n 100 ceph-radosgw/0)

juju ssh ceph-radosgw/0

root@juju-41bcdc-zaza-5cde965dcd88-5:~# grep mask /var/log/juju/unit-ceph-radosgw-0.log
2021-08-31 13:34:17 WARNING unit.ceph-radosgw/0.config-changed logger.go:60 Failed to restart radosgw.service: Unit radosgw.service is masked.
2021-08-31 13:34:23 WARNING unit.ceph-radosgw/0.config-changed logger.go:60 Failed to restart radosgw.service: Unit radosgw.service is masked.
2021-08-31 13:35:33 WARNING unit.ceph-radosgw/0.identity-service-relation-changed logger.go:60 Failed to restart radosgw.service: Unit radosgw.service is masked.

ubuntu@juju-41bcdc-zaza-5cde965dcd88-5:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.7 LTS
Release: 16.04
Codename: xenial

ubuntu@juju-41bcdc-zaza-5cde965dcd88-5:~$ dpkg -l | grep radosgw
ii radosgw 10.2.11-0ubuntu0.16.04.3 amd64 REST gateway for RADOS distributed object store

ubuntu@juju-41bcdc-zaza-5cde965dcd88-5:~$ systemctl status radosgw
● radosgw.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)
ubuntu@juju-41bcdc-zaza-5cde965dcd88-5:~$ service radosgw status
● radosgw.service
   Loaded: masked (/dev/null; bad)
   Active: inactive (dead)

By the looks of it, at the time of a manual check the radosgw unit correctly determines whether systemd should be used, so the service name is "radosgw". However, this service was previously masked somehow, which is only possible from this charm if it thinks that the systemd code path should be used (a rough sketch of that decision follows the links below).

https://github.com/openstack/charm-ceph-radosgw/blob/c97fced7947e55870240557ae41dad7872a3c9d2/hooks/utils.py#L374-L398
https://github.com/openstack/charm-ceph-radosgw/blob/c97fced7947e55870240557ae41dad7872a3c9d2/hooks/hooks.py#L306-L314
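
For reference, the decision at the first link works roughly as follows. This is a hedged paraphrase, not the verbatim charm source, and the per-host relation key name ('rgw.<hostname>_key') is an assumption based on the behaviour seen in the pdb session below:

    # Hedged sketch: paraphrase of systemd_based_radosgw() from the linked
    # charm-ceph-radosgw utils.py; the exact relation key name is an assumption.
    import socket

    from charmhelpers.core.hookenv import related_units, relation_get, relation_ids


    def systemd_based_radosgw():
        """Use the systemd-based radosgw code path only if the mons have
        published a per-host cephx key for this unit."""
        host = socket.gethostname()
        for rid in relation_ids('mon'):
            for unit in related_units(rid):
                if relation_get('rgw.{}_key'.format(host), rid=rid, unit=unit):
                    return True
        # No per-host key found on any mon unit (as in the pdb session
        # below), so the legacy "radosgw" service name is used.
        return False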

root@juju-41bcdc-zaza-5cde965dcd88-5:~# for i in {0..2} ; do juju-run -u ceph-radosgw/0 -- 'relation-get -r `relation-ids mon` - ceph-mon/'$i ; done
egress-subnets: 10.5.1.50/32
ingress-address: 10.5.1.50
private-address: 10.5.1.50
egress-subnets: 10.5.3.121/32
ingress-address: 10.5.3.121
private-address: 10.5.3.121
auth: cephx
ceph-public-address: 10.5.2.242
egress-subnets: 10.5.2.242/32
fsid: 33bc4e50-0a60-11ec-8059-fa163e8fda4d
ingress-address: 10.5.2.242
private-address: 10.5.2.242
radosgw_key: AQBLMC5hyoihCxAAn+2rFM7Y1Z1yM10SG2RO6Q==

(Pdb) debug systemd_based_radosgw()
ENTERING RECURSIVE DEBUGGER
> <string>(1)<module>()->None
((Pdb)) s
--Call--
> /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/utils.py(374)systemd_based_radosgw()
-> def systemd_based_radosgw():

((Pdb)) relation_get(rid=rid, unit='ceph-mon/0')
{'private-address': '10.5.1.50', 'ingress-address': '10.5.1.50', 'egress-subnets': '10.5.1.50/32'}
((Pdb)) relation_get(rid=rid, unit='ceph-mon/1')
{'private-address': '10.5.3.121', 'ingress-address': '10.5.3.121', 'egress-subnets': '10.5.3.121/32'}
((Pdb)) relation_get(rid=rid, unit='ceph-mon/2')
{'private-address': '10.5.2.242', 'ingress-address': '10.5.2.242', 'egress-subnets': '10.5.2.242/32', 'radosgw_key': 'AQBLMC5hyoihCxAAn+2rFM7Y1Z1yM10SG2RO6Q==', 'auth': 'cephx', 'ceph-public-address': '10.5.2.242', 'fsid': '33bc4e50-0a60-11ec-8059-fa163e8fda4d'}

((Pdb)) relation_get('key_name', rid=rid, unit='ceph-radosgw/0')
((Pdb))

# ...

> /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/utils.py(377)systemd_based_radosgw()
-> for rid in relation_ids('mon'):
((Pdb)) n
> /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/utils.py(381)systemd_based_radosgw()
-> return False

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

The radosgw service is masked earlier, at install hook execution time, when it is paused:

https://github.com/openstack/charm-ceph-radosgw/blob/c97fced7947e55870240557ae41dad7872a3c9d2/hooks/hooks.py#L174-L181
def install():
# ...
    log('Disable service "{}" until we have keys for it.'
        .format(service_name()), level=DEBUG)
    service_pause(service_name())

https://github.com/openstack/charm-ceph-radosgw/blob/c97fced7947e55870240557ae41dad7872a3c9d2/hooks/charmhelpers/core/host.py#L178-L200 (service_pause)
def service_pause(service_name, init_dir="/etc/init", initd_dir="/etc/init.d",
                  **kwargs):
# ...
    if init_is_systemd(service_name=service_name):
        service('disable', service_name)
        service('mask', service_name)

https://github.com/openstack/charm-ceph-radosgw/blob/c97fced7947e55870240557ae41dad7872a3c9d2/hooks/charmhelpers/core/host.py#L316-L326 (init_is_systemd)

(Pdb) from charmhelpers.core.host import init_is_systemd
(Pdb) init_is_systemd()
True
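
For context, the counterpart that is expected to undo the mask once keys are available is charmhelpers' service_resume. A rough sketch of its systemd branch (a paraphrase, not the verbatim charmhelpers source):

    # Hedged sketch: paraphrase of the systemd branch of
    # charmhelpers.core.host.service_resume(); not the verbatim source.
    from charmhelpers.core.host import (
        init_is_systemd,
        service,
        service_running,
        service_start,
    )


    def service_resume_sketch(service_name, **kwargs):
        """Unmask, enable and start a service previously paused with
        service_pause()."""
        if init_is_systemd(service_name=service_name):
            service('unmask', service_name)
            service('enable', service_name)
        started = service_running(service_name, **kwargs)
        if not started:
            started = service_start(service_name, **kwargs)
        return started

The charm is only expected to resume the service once it has keys and pools from the mons, which never happens here because the broker request stays incomplete (see the next comment).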

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Looks like a pool creation request to the mon was sent but never completed:

(Pdb)
> /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/storage/linux/ceph.py(2163)is_request_complete()
-> for rid in states.keys():
(Pdb) l
2158         Returns True if a similair request has been completed
2159
2160         @param request: A CephBrokerRq object
2161         """
2162         states = get_request_states(request, relation=relation)
2163  ->     for rid in states.keys():
2164             if not states[rid]['complete']:
2165                 return False
2166
2167         return True
2168
2168
(Pdb) states
{'mon:15': {'complete': False, 'sent': True}}

(Pdb) where
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/config-changed(750)<module>()
-> hooks.execute(sys.argv)
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/hookenv.py(963)execute()
-> self._hooks[hook_name]()
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/hardening/harden.py(93)_harden_inner2()
-> return f(*args, **kwargs)
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/config-changed(261)config_changed()
-> _config_changed()
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py(1910)wrapped_f()
-> pre_restarts_wait_f)
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/host.py(837)restart_on_change_helper()
-> r = lambda_f()
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py(1904)<lambda>()
-> (lambda: f(*args, **kwargs)),
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/config-changed(231)_config_changed()
-> mon_relation(r_id, unit)
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/config-changed(374)mon_relation()
-> _mon_relation()
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py(1910)wrapped_f()
-> pre_restarts_wait_f)
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/core/host.py(837)restart_on_change_helper()
-> r = lambda_f()
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/openstack/utils.py(1904)<lambda>()
-> (lambda: f(*args, **kwargs)),
  /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/config-changed(290)_mon_relation()
-> if is_request_complete(rq, relation='mon'):
> /var/lib/juju/agents/unit-ceph-radosgw-0/charm/hooks/charmhelpers/contrib/storage/linux/ceph.py(2163)is_request_complete()
-> for rid in states.keys():

(Pdb) request.__dict__
{'request_id': '6b4eb3fa-0b17-11ec-8cfe-fa163e92b932', 'ops': [{'group-namespace': None, 'replicas': 3, 'max-objects': None, 'compression-min-blob-size-ssd': None, 'compression-algorithm': None, 'compression-max-blob-size-hdd': None, 'group': 'objects', 'max-bytes': None, 'name': 'default.rgw.buckets.data', 'compression-min-blob-size': None, 'pg_num': None, 'weight': 20, 'op': 'create-pool', 'compression-mode': None, 'compression-max-blob-size': None, 'app-name': 'rgw', 'rbd-mirroring-mode': 'pool', 'compression-min-blob-size-hdd': None, 'compression-max-blob-size-ssd': None, 'compression-required-ratio': None}, {'group-namespace': None, 'replicas': 3, ...
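
For orientation, the broker request lifecycle that is_request_complete() relies on works roughly like this. A hedged sketch of my reading of charmhelpers.contrib.storage.linux.ceph; the response key layout ('broker-rsp-<unit>', 'request-id', 'exit-code') is an assumption, not quoted source:

    # Hedged sketch of how a CephBrokerRq is deemed complete; paraphrase,
    # not verbatim charmhelpers code.
    import json

    from charmhelpers.core.hookenv import local_unit, related_units, relation_get


    def is_request_complete_for_rid(request, rid):
        """A request is complete once some mon unit has published a response
        whose request-id matches ours and whose exit-code is zero."""
        # Responses are published per requesting unit, e.g.
        # 'broker-rsp-ceph-radosgw-0' for ceph-radosgw/0.
        rsp_key = 'broker-rsp-{}'.format(local_unit().replace('/', '-'))
        for unit in related_units(rid):
            raw = relation_get(rsp_key, rid=rid, unit=unit)
            if not raw:
                continue
            rsp = json.loads(raw)
            if (rsp.get('request-id') == request.request_id and
                    rsp.get('exit-code') == 0):
                return True
        # Here the mons never publish a matching response, hence the
        # {'sent': True, 'complete': False} state above.
        return False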

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Based on further investigation by ~gnuoy, it appears that this code path does not cause notify_radosgws (or the rest of the notification functions) to be called; a rough sketch of the problematic shape follows the link:

https://github.com/openstack/charm-ceph-mon/blob/93e9bb39afdcf2e6093307cede8c09720d464ad0/hooks/ceph_hooks.py#L488
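
Heavily simplified and paraphrased, not verbatim charm-ceph-mon code; names other than notify_rbd_mirrors()/notify_radosgws() are assumptions:

    # Hedged sketch of the problematic branch around the linked line.
    def mon_relation():
        if ceph.is_bootstrapped():
            # Cluster already bootstrapped: only rbd-mirror relations get
            # notified on this path, so pending broker requests from radosgw
            # (and other client) relations can be left unanswered.
            notify_rbd_mirrors()
        else:
            # Bootstrap path: on success the osd, radosgw, client and
            # rbd-mirror relations are all notified.
            ...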

summary: - [xenial-mitaka] radosgw service is incorrectly masked resulting in:
- "Services not running that should be: radosgw"
+ ceph-mon does not process broker requests when it should
summary: - ceph-mon does not process broker requests when it should
+ ceph-mon does not process broker requests when it should be
no longer affects: charm-ceph-radosgw
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: New → In Progress
Changed in charm-ceph-mon:
importance: Undecided → Critical
assignee: nobody → Dmitrii Shcherbakov (dmitriis)
milestone: none → 21.10
Revision history for this message
Felipe Reyes (freyes) wrote :

I might have found this same behavior in a CI run for ceph-fs[0]

2021-09-01 20:25:09 DEBUG update-status inactive
2021-09-01 20:25:09 WARNING update-status Unit /<email address hidden> is masked, ignoring.

And on the ceph-mon units I see them sometimes acting as the leader and processing requests, and at other times not.

[0] https://openstack-ci-reports.ubuntu.com/artifacts/015/806936/1/check/hirsute-wallaby-bluestore/015a757/index.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/806980
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/82743ab7e5b8d0c98fd09a6c12ac23110acd89ec
Submitter: "Zuul (22348)"
Branch: master

commit 82743ab7e5b8d0c98fd09a6c12ac23110acd89ec
Author: Dmitrii Shcherbakov <email address hidden>
Date: Wed Sep 1 23:19:53 2021 +0300

    Notify more relations when cluster is bootstrapped

    Currently mon_relation only calls notify_rbd_mirrors when the cluster is
    already bootstrapped which leads to broker requests not being handled
    for other relations in some cases.

    The change also moves the bootstrap attempt code into a separate
    function and adds unit tests for mon_relation to cover different
    branches for various inputs.

    Closes-Bug: #1942224
    Change-Id: Id9b611d128acb7d49a9a9ad9c096b232fefd6c68
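
The shape of the merged change, roughly (a hedged paraphrase of the commit message above; the names notify_relations() and attempt_mon_cluster_bootstrap() are illustrative assumptions, not quotes from the diff):

    # Hedged sketch of the fixed flow described in the commit message.
    def notify_relations():
        # Notify every relation type that may be waiting on broker requests,
        # not just rbd-mirror.
        notify_osds()
        notify_radosgws()
        notify_client()
        notify_rbd_mirrors()


    def mon_relation():
        if ceph.is_bootstrapped():
            # Already bootstrapped: still notify all relations so pending
            # broker requests (e.g. the radosgw pool creation above) get
            # processed.
            notify_relations()
        elif attempt_mon_cluster_bootstrap():
            # Bootstrap attempt moved into its own, unit-testable helper.
            notify_relations()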

Changed in charm-ceph-mon:
status: In Progress → Fix Committed
Changed in charm-ceph-mon:
status: Fix Committed → Fix Released