Ceph pools requested over relations are not created (and fail without any notification)

Bug #1872043 reported by Vladimir Grevtsev
This bug affects 2 people
Affects             Status   Importance  Assigned to  Milestone
Ceph Monitor Charm  Triaged  Medium      Unassigned
charms.ceph         Triaged  Medium      Unassigned

Bug Description

bionic-train, latest charm revisions from the charm store.

juju status: https://pastebin.canonical.com/p/RxKc4cHyDj/
full bundle: https://pastebin.canonical.com/p/K2zT4jnMcV/

When I tried to upload an image, Glance returned an HTTP 500:

ubuntu@OrangeBox84:~/fce-demo$ openstack image create --file /home/ubuntu/cirros-0.4.0-x86_64-disk.img cirros
HTTP 500 Internal Server Error: The server has either erred or is incapable of performing the requested operation.

(verbose: https://pastebin.canonical.com/p/6MVh3RfWMg/)

Here's an error from glance.log:

2020-04-10 09:36:28.841 81289 DEBUG glance_store.capabilities [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb079961543484faeb64056888cb2d6 3eb266f3a48a41cf8869136e3d1566ed - fb0b7aaddfd8479c9477e4d033707e47 fb0b7aaddfd8479c9477e4d033707e47] Store glance_store._drivers.rbd.Store doesn't support updating dynamic storage capabilities. Please overwrite 'update_capabilities' method of the store to implement updating logics if needed. update_capabilities /usr/lib/python3/dist-packages/glance_store/capabilities.py:95
2020-04-10 09:36:28.841 81289 DEBUG glance_store.driver [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb079961543484faeb64056888cb2d6 3eb266f3a48a41cf8869136e3d1566ed - fb0b7aaddfd8479c9477e4d033707e47 fb0b7aaddfd8479c9477e4d033707e47] Late loading location class glance_store._drivers.rbd.StoreLocation get_store_location_class /usr/lib/python3/dist-packages/glance_store/driver.py:117
2020-04-10 09:36:28.841 81289 DEBUG glance_store.location [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb079961543484faeb64056888cb2d6 3eb266f3a48a41cf8869136e3d1566ed - fb0b7aaddfd8479c9477e4d033707e47 fb0b7aaddfd8479c9477e4d033707e47] Registering scheme rbd with {'store': <glance_store._drivers.rbd.Store object at 0x7f76bdc53908>, 'location_class': <class 'glance_store._drivers.rbd.StoreLocation'>, 'store_entry': 'glance.store.rbd.Store'} register_scheme_map /usr/lib/python3/dist-packages/glance_store/location.py:142
2020-04-10 09:36:28.858 81289 ERROR glance.api.v2.image_data [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb079961543484faeb64056888cb2d6 3eb266f3a48a41cf8869136e3d1566ed - fb0b7aaddfd8479c9477e4d033707e47 fb0b7aaddfd8479c9477e4d033707e47] Failed to upload image data due to internal error: rados.ObjectNotFound: [errno 2] error opening pool 'b'glance''
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb079961543484faeb64056888cb2d6 3eb266f3a48a41cf8869136e3d1566ed - fb0b7aaddfd8479c9477e4d033707e47 fb0b7aaddfd8479c9477e4d033707e47] Caught error: [errno 2] error opening pool 'b'glance'': rados.ObjectNotFound: [errno 2] error opening pool 'b'glance''
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi Traceback (most recent call last):
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/common/wsgi.py", line 1474, in __call__
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi request, **action_args)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/common/wsgi.py", line 1517, in dispatch
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi return method(*args, **kwargs)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/common/utils.py", line 417, in wrapped
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi return func(self, req, *args, **kwargs)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/api/v2/image_data.py", line 299, in upload
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self._restore(image_repo, image)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self.force_reraise()
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi six.reraise(self.type_, self.value, self.tb)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi raise value
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/api/v2/image_data.py", line 164, in upload
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi image.set_data(data, size, backend=backend)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/domain/proxy.py", line 198, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self.base.set_data(data, size, backend=backend)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/notifier.py", line 485, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi _send_notification(notify_error, 'image.upload', msg)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self.force_reraise()
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi six.reraise(self.type_, self.value, self.tb)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/six.py", line 693, in reraise
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi raise value
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/notifier.py", line 432, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self.repo.set_data(data, size, backend=backend)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/api/policy.py", line 208, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi return self.image.set_data(*args, **kwargs)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/quota/__init__.py", line 318, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi self.image.set_data(data, size=size, backend=backend)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance/location.py", line 489, in set_data
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi verifier=verifier)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance_store/backend.py", line 491, in add_to_backend_with_multihash
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi image_id, data, size, hashing_algo, store, context, verifier)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance_store/backend.py", line 468, in store_add_to_backend_with_multihash
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi image_id, data, size, hashing_algo, context=context, verifier=verifier)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance_store/driver.py", line 279, in add_adapter
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi metadata_dict) = store_add_fun(*args, **kwargs)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance_store/capabilities.py", line 176, in op_checker
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi return store_op_fun(store, *args, **kwargs)
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "/usr/lib/python3/dist-packages/glance_store/_drivers/rbd.py", line 508, in add
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi with conn.open_ioctx(self.pool) as ioctx:
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "rados.pyx", line 516, in rados.requires.wrapper.validate_func
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi File "rados.pyx", line 1203, in rados.Rados.open_ioctx
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi rados.ObjectNotFound: [errno 2] error opening pool 'b'glance''
2020-04-10 09:36:28.898 81289 ERROR glance.common.wsgi
2020-04-10 09:36:28.941 81289 INFO eventlet.wsgi.server [req-278632e9-1b4b-4035-987f-d0337e0d4201 9eb07996

And no pools were created by the glance charm:

ubuntu@OrangeBox84:~/fce-demo$ sudo juju ssh ceph-mon/0 'sudo ceph df'
RAW STORAGE:
    CLASS    SIZE       AVAIL      USED       RAW USED    %RAW USED
    ssd      108 GiB    104 GiB    5.1 MiB    4.0 GiB     3.71
    TOTAL    108 GiB    104 GiB    5.1 MiB    4.0 GiB     3.71

POOLS:
    POOL                         ID    STORED     OBJECTS    USED      %USED    MAX AVAIL
    default.rgw.buckets.data     1     0 B        0          0 B       0        33 GiB
    default.rgw.control          2     0 B        8          0 B       0        33 GiB
    default.rgw.data.root        3     0 B        0          0 B       0        33 GiB
    default.rgw.gc               4     0 B        0          0 B       0        33 GiB
    default.rgw.log              5     0 B        175        0 B       0        33 GiB
    default.rgw.intent-log       6     0 B        0          0 B       0        33 GiB
    default.rgw.meta             7     0 B        0          0 B       0        33 GiB
    default.rgw.usage            8     0 B        0          0 B       0        33 GiB
    default.rgw.users.keys       9     0 B        0          0 B       0        33 GiB
    default.rgw.users.email      10    0 B        0          0 B       0        33 GiB
    default.rgw.users.swift      11    0 B        0          0 B       0        33 GiB
    default.rgw.users.uid        12    0 B        0          0 B       0        33 GiB
    default.rgw.buckets.extra    13    0 B        0          0 B       0        33 GiB
    default.rgw.buckets.index    14    0 B        0          0 B       0        33 GiB
    .rgw.root                    15    1.2 KiB    4          48 KiB    0        33 GiB
Connection to 172.27.85.9 closed.
ubuntu@OrangeBox84:~/fce-demo$

(Compared with another deployment, several pools are missing: glance, gnocchi, etc. https://pastebin.canonical.com/p/gq4yqrB7hM/)

Looking into the ceph-mon unit log, I found the following (full log: http://paste.ubuntu.com/p/YJJwtRYgRW/ ):

unit-ceph-mon-0: 09:10:48 DEBUG unit.ceph-mon/0.client-relation-changed Error ERANGE: pg_num 8 size 3 would mean 1020 total pgs, which exceeds max 1000 (mon_max_pg_per_osd 250 * num_in_osds 4)
unit-ceph-mon-0: 09:10:48 ERROR unit.ceph-mon/0.juju-log client:41: Command '['ceph', '--id', 'admin', 'osd', 'pool', 'create', 'glance', '8']' returned non-zero exit status 34.
unit-ceph-mon-0: 09:10:48 ERROR unit.ceph-mon/0.juju-log client:41: Unexpected error occurred while processing requests: {'api-version': 1, 'ops': [{'op': 'create-pool', 'name': 'glance', 'replicas': 3, 'pg_num': None, 'weight': 5, 'group': 'images', 'group-namespace': None, 'app-name': 'rbd', 'max-bytes': None, 'max-objects': None}], 'request-id': '0ccc9d81-7b0b-11ea-931c-00163e06321a'}
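
For context, the check the monitor applies here can be sketched as follows (an illustration of the arithmetic in the ERANGE message above, not the actual Ceph code; the 996 existing PG replicas are inferred from 1020 - 8 * 3):

    def pool_creation_allowed(existing_pg_replicas, new_pg_num, new_size,
                              mon_max_pg_per_osd, num_in_osds):
        # Projected total PG replicas across all pools after the create;
        # the mon rejects the create with ERANGE if it exceeds the cap.
        projected = existing_pg_replicas + new_pg_num * new_size
        return projected <= mon_max_pg_per_osd * num_in_osds

    # 996 + 8 * 3 = 1020 > 250 * 4 = 1000, hence the ERANGE above.
    pool_creation_allowed(996, 8, 3, 250, 4)  # -> False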

Vladimir Grevtsev (vlgrevtsev) wrote:

The problem was worked around by customizing the ceph-mon config and restarting the MON daemons:

$ juju config ceph-mon config-flags
{'global': {'mon max pg per osd': 100000}}
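
For reference, the override can be applied with a command along these lines (a sketch; exact quoting depends on your shell), followed by restarting the MON daemons:

$ juju config ceph-mon config-flags="{'global': {'mon max pg per osd': 100000}}"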

Changed in charm-ceph-mon:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

Hello Vladimir,

For this particular scenario I believe the cause is that, by the time pool creation was attempted, there were not enough OSDs up for the proper number of PGs per OSD to be calculated. The calculation assumes the intended number of OSDs, but when the pools are actually created, Ceph considers the number of OSDs that are up instead.

Assuming this is the cause, the recommended workaround would be to retry the failed hook (with juju resolve ceph-mon/x) a bit later, when it is most likely that all OSDs are up, so the PGs per OSD are calculated properly and don't hit the limit.

Have you by any chance tried retrying the hook before changing "config-flags" in the ceph-mon charm?

Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

After talking to Vlad, I see that no hook failed, so there was no hook to retry. This is clearly another problem, which should be addressed first.

Changed in charm-ceph-mon:
assignee: Rodrigo Barbieri (rodrigo-barbieri2010) → nobody
James Page (james-page) wrote:

Pools are created as part of broker request handling in the ceph-mon charm. AFAICT the broker handlers don't deal with the pool creation step actually failing; errors are only returned to the consuming application when some sort of pre-creation validation fails:

    try:
        pool = ReplicatedPool(service=service,
                              op=request)
    except KeyError:
        msg = "Missing parameter."
        log(msg, level=ERROR)
        return {'exit-code': 1, 'stderr': msg}

    if not pool_exists(service=service, name=pool_name):
        log("Creating pool '{}' (replicas={})".format(pool.name, replicas),
            level=INFO)
        pool.create()
    else:
        log("Pool '{}' already exists - skipping create".format(pool.name),
            level=DEBUG)

    # Set/update properties that are allowed to change after pool creation.
    pool.update()

James Page (james-page) wrote:

Actually that's not quite true: the process_requests handler catches all exceptions, and a CalledProcessError was probably raised during the pool.create() call.

It might be that the request-id is missing from the response data bag in this instance:

    except Exception as exc:
        log(str(exc), level=ERROR)
        msg = ("Unexpected error occurred while processing requests: %s" %
               reqs)
        log(msg, level=ERROR)
        return {'exit-code': 1, 'stderr': msg}
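
A minimal sketch of one way the handler could echo the request-id back on failure (hypothetical; the fix that was eventually proposed for charms.ceph may differ), so the consuming charm has something to match against:

    except Exception as exc:
        log(str(exc), level=ERROR)
        msg = ("Unexpected error occurred while processing requests: %s" %
               reqs)
        log(msg, level=ERROR)
        # Hypothetical: copy the request-id from the incoming request
        # dict so consumers can pair this failure with their
        # outstanding request.
        return {'exit-code': 1, 'stderr': msg,
                'request-id': reqs.get('request-id')}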

James Page (james-page) wrote:

I think the missing request-id key is probably not helping here:

    broker_key = get_broker_rsp_key()
    for unit in related_units(rid):
        rdata = relation_get(rid=rid, unit=unit)
        if rdata.get(broker_key):
            rsp = CephBrokerRsp(rdata.get(broker_key))
            if rsp.request_id == request.request_id:
                if not rsp.exit_code:
                    return True

as the last part of this code block will never match when an unexpected error like this occurs.
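
For illustration, if the error response did carry a matching request-id, the loop above could surface the failure instead of treating the request as unanswered (a hypothetical variant, not current charm-helpers behaviour):

    broker_key = get_broker_rsp_key()
    for unit in related_units(rid):
        rdata = relation_get(rid=rid, unit=unit)
        if rdata.get(broker_key):
            rsp = CephBrokerRsp(rdata.get(broker_key))
            if rsp.request_id == request.request_id:
                if not rsp.exit_code:
                    return True
                # Hypothetical: report a matched-but-failed response
                # rather than falling through silently.
                log('Broker request {} failed: {}'.format(
                    rsp.request_id, rsp.exit_msg), level=ERROR)
                return False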

James Page (james-page)
Changed in charms.ceph:
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote: Related fix proposed to charms.ceph (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charms.ceph/+/824519

James Page (james-page) wrote:

Note that using the 'expected-osd-count' configuration option will help avoid this situation. Also ensure that the total of the weights configured on consuming applications is not more than 100, otherwise this type of situation may occur.

Later Ceph releases use PG autoscaling, so they most likely won't hit this issue, as the initial PG counts are typically much smaller.
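
For reference, the requested pg_num lines up with a rough sketch of the weight-based sizing the charms perform (an illustration assuming a 100-PGs-per-OSD target; the real charm-helpers code is more involved):

    def estimated_pg_num(num_osds, weight_percent, replicas,
                         pgs_per_osd=100):
        # Share of the cluster's PG budget for this pool, divided
        # across the replicas, rounded up to the next power of two.
        raw = num_osds * pgs_per_osd * (weight_percent / 100.0) / replicas
        pg = 2
        while pg < raw:
            pg *= 2
        return pg

    # 4 OSDs up, weight=5, replicas=3 -> 8, matching the failing
    # 'osd pool create glance 8' call in the unit log above.
    estimated_pg_num(4, 5, 3)  # -> 8

If the weights across all consuming applications sum to at most 100, the resulting pools total roughly pgs_per_osd PG replicas per OSD, which stays comfortably under the mon_max_pg_per_osd cap once all OSDs are up.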

Changed in charm-ceph-mon:
status: New → Triaged
importance: Undecided → Medium