When the Ceph pool backing Glance fills up (the cluster goes read-only), Glance's IO calls never return, and the worker handling the API call is effectively a zombie.
If enough such requests are made, for example 4 when you have 4 workers, Glance can no longer respond to any request at all; you need to restart Glance to get responses again.
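For context, the hang can be reproduced outside Glance with a few lines of python-rados against the full pool. This is a minimal sketch under assumptions: the pool name ('glance') and the ceph.conf path are placeholders for illustration.

import rados

# Connect using the cluster's ceph.conf (path assumed for illustration).
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Open the pool backing Glance ('glance' is an assumed name).
ioctx = cluster.open_ioctx('glance')

# With the OSDs flagged full, this write blocks indefinitely: librados has
# no OSD op timeout by default, which is what pins the Glance worker.
ioctx.write_full('probe-object', b'x' * 4096)

ioctx.close()
cluster.shutdown()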
ceph status:

  cluster:
    id:     ce9a32e4-9768-457a-b811-225b710aeb58
    health: HEALTH_ERR
            3 full osd(s)
            3 pool(s) full
            1 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum bm0.lxd (age 2h)
    mgr: bm0.lxd(active, since 2h)
    osd: 3 osds: 3 up (since 2h), 3 in (since 2h)

  data:
    pools:   3 pools, 161 pgs
    objects: 6.92k objects, 47 GiB
    usage:   143 GiB used, 6.8 GiB / 150 GiB avail
    pgs:     161 active+clean

ceph osd dump | grep ratio:

full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
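Those numbers are consistent: 143 GiB used out of 150 GiB is about 95.3% utilization, which crosses the 0.95 full_ratio, so the OSDs are flagged full and writes to the pool are blocked.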
Here's the response from the Apache 2 HTTP proxy in front of Glance:
openstack image delete 0a582014-832a-4f2a-9944-4111812fe6b2
Failed to delete image with name or ID '0a582014-832a-4f2a-9944-4111812fe6b2': HttpException: 502: Server Error for url: http://10.206.54.243:80/openstack-glance/v2/images/0a582014-832a-4f2a-9944-4111812fe6b2, The proxy server could not handle the request. Reason: Error reading from remote server: 502 Proxy Error: Proxy Error: Apache/2.4.52 (Ubuntu) Server at 10.206.54.243 Port 9292: The proxy server received an invalid response from an upstream server.
Failed to delete 1 of 1 images.
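(The 502 is presumably Apache's proxy timeout firing: the Glance worker is still blocked in librados, so no response ever comes back from the upstream.)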
The last log for these requests at debug level is:
DEBUG glance_store.location [None req-4cdf1de9-fbe2-49a8-92d4-db0902773af2 e7cc50bfcb1246479c5b9397048377fe d0c1adff192b40e9989460336bab7c8c - - e152fb5db324433ba53d8ead347c6802 e152fb5db324433ba53d8ead347c6802] Registering scheme rbd with {'ceph': {'store': <glance_store._drivers.rbd.Store object at 0x7ff3c5d1b820>, 'location_class': <class 'glance_store._drivers.rbd.StoreLocation'>, 'store_entry': 'rbd'}} register_scheme_backend_map /usr/lib/python3/dist-packages/glance_store/location.py:132
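Nothing is logged after this point; the request enters the rbd store driver and the worker never returns.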
To recover, I raised the full_ratio so the cluster would accept writes again, then deleted images. But Glance should have a mechanism to detect this situation, or at least a timeout, instead of leaving its workers hung indefinitely.
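(Raising the ratio can be done with e.g. ceph osd set-full-ratio 0.97, remembering to restore 0.95 afterwards.)

On the detection side, one possible shape for such a mechanism is to pass librados timeout options when opening the cluster connection, so blocked ops fail instead of hanging forever. This is a hedged sketch with python-rados: the option names are standard Ceph client settings, the 30-second values are illustrative assumptions, and as far as I can tell glance_store's rbd driver currently only exposes rados_connect_timeout, which bounds the initial connection rather than in-flight ops.

import rados

# Sketch: bound librados operations so a full or unresponsive cluster returns
# an error to Glance instead of blocking the worker forever. Values are
# illustrative, not tested recommendations.
cluster = rados.Rados(
    conffile='/etc/ceph/ceph.conf',
    conf={
        'rados_osd_op_timeout': '30',  # fail OSD reads/writes after 30s
        'rados_mon_op_timeout': '30',  # fail monitor ops after 30s
        'client_mount_timeout': '30',  # bound the initial connect
    },
)
cluster.connect()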