cinder-ceph and glance are stuck at waiting - Creating pool 'glance' (replicas=0) - float floor division by zero

Bug #2065470 reported by Nobuto Murata
Affects                             Status          Importance   Assigned to        Milestone
Juju Charmed Operator - MicroCeph   Fix Committed   Undecided    Hemanth Nakkina
OpenStack Snap                      Fix Committed   High         Hemanth Nakkina

Bug Description

I haven't taken a deeper look yet but those two units are stuck at waiting.

$ snap list openstack
Name Version Rev Tracking Publisher Notes
openstack 2024.1 503 2024.1/edge canonical✓ -

$ sunbeam cluster bootstrap --role control --role compute --role storage
Use proxy to access external network resources? [y/n] (n):
Management networks shared by hosts (CIDRs, separated by comma) (192.168.123.0/24):
MetalLB address allocation range (supports multiple ranges, comma separated) (192.168.123.81-192.168.123.90):
Disks to attach to MicroCeph (comma separated list) (/dev/sdc): /dev/vdc
⠇ Deploying OpenStack Control Plane to Kubernetes (this may take a while) ... waiting for services to come online (23/25)Timed out while waiting for model 'openstack' to be ready
Error: Timed out while waiting for model 'openstack' to be ready

$ juju status -m openstack | grep waiting
cinder-ceph waiting 1 cinder-ceph-k8s 2024.1/edge 64 10.152.183.178 no installing agent
glance waiting 1 glance-k8s 2024.1/edge 84 10.152.183.75 no installing agent
cinder-ceph/0* waiting idle 10.1.32.209 (workload) Not all relations are ready
glance/0* waiting idle 10.1.32.212 (ceph) integration incomplete

Revision history for this message
Nobuto Murata (nobuto) wrote :

Looks like the Ceph client side (e.g. glance) is requesting a pool with replicas=0, which causes a "float floor division by zero" error on the Ceph cluster side.

unit-microceph-0: 22:05:51 INFO juju.worker.uniter.operation ran "ceph-relation-changed" hook (via hook dispatching script: dispatch)
unit-microceph-0: 22:06:26 INFO unit.microceph/0.juju-log ceph:3: _on_relation_changed event
unit-microceph-0: 22:06:27 INFO unit.microceph/0.juju-log ceph:3: mon cluster in quorum and osds bootstrapped - providing client with keys, processing broker requests
unit-microceph-0: 22:06:28 INFO unit.microceph/0.juju-log ceph:3: Processing broker req {"api-version": 1, "ops": [{"op": "create-pool", "name": "glance", "replicas": 0, "pg_num": null, "crush-profile": null, "app-name": "rbd", "compression-algorithm": null, "compression-mode": null, "compression-required-ratio": null, "compression-min-blob-size": null, "compression-min-blob-size-hdd": null, "compression-min-blob-size-ssd": null, "compression-max-blob-size": null, "compression-max-blob-size-hdd": null, "compression-max-blob-size-ssd": null, "group": null, "max-bytes": null, "max-objects": null, "group-namespace": null, "rbd-mirroring-mode": "pool", "weight": 40}], "request-id": "bf3080dc48147200ada9e7b6031e32a526b48cc9"}
unit-microceph-0: 22:06:28 INFO unit.microceph/0.juju-log ceph:3: Processing 1 ceph broker requests
unit-microceph-0: 22:06:29 INFO unit.microceph/0.juju-log ceph:3: Creating pool 'glance' (replicas=0)
unit-microceph-0: 22:06:31 ERROR unit.microceph/0.juju-log ceph:3: float floor division by zero
unit-microceph-0: 22:06:31 ERROR unit.microceph/0.juju-log ceph:3: Unexpected error occurred while processing requests: {'api-version': 1, 'ops': [{'op': 'create-pool', 'name': 'glance', 'replicas': 0, 'pg_num': None, 'crush-profile': None, 'app-name': 'rbd', 'compression-algorithm': None, 'compression-mode': None, 'compression-required-ratio': None, 'compression-min-blob-size': None, 'compression-min-blob-size-hdd': None, 'compression-min-blob-size-ssd': None, 'compression-max-blob-size': None, 'compression-max-blob-size-hdd': None, 'compression-max-blob-size-ssd': None, 'group': None, 'max-bytes': None, 'max-objects': None, 'group-namespace': None, 'rbd-mirroring-mode': 'pool', 'weight': 40}], 'request-id': 'bf3080dc48147200ada9e7b6031e32a526b48cc9'}
unit-microceph-0: 22:06:31 INFO unit.microceph/0.juju-log ceph:3: {"exit-code": 1, "stderr": "Unexpected error occurred while processing requests: {'api-version': 1, 'ops': [{'op': 'create-pool', 'name': 'glance', 'replicas': 0, 'pg_num': None, 'crush-profile': None, 'app-name': 'rbd', 'compression-algorithm': None, 'compression-mode': None, 'compression-required-ratio': None, 'compression-min-blob-size': None, 'compression-min-blob-size-hdd': None, 'compression-min-blob-size-ssd': None, 'compression-max-blob-size': None, 'compression-max-blob-size-hdd': None, 'compression-max-blob-size-ssd': None, 'group': None, 'max-bytes': None, 'max-objects': None, 'group-namespace': None, 'rbd-mirroring-mode': 'pool', 'weight': 40}], 'request-id': 'bf3080dc48147200ada9e7b6031e32a526b48cc9'}"}
unit-microceph-0: 22:06:32 WARNING unit.microceph/0.ceph-relat...


affects: snap-openstack → sunbeam-charms
Revision history for this message
Nobuto Murata (nobuto) wrote :

"0" is explicitly set to the glance charm for example.

$ juju config -m openstack glance ceph-osd-replication-count
-> 0

And it's likely from this logic.
https://github.com/canonical/snap-openstack/blob/599e01aa263729d8f411241531bc424934b9ce05/sunbeam-python/sunbeam/commands/openstack.py#L139-L153
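
For context, the linked code derives the replication count from the number of OSDs reported by MicroCeph. A minimal sketch of that kind of logic follows; the function name and the cap of 3 are illustrative assumptions, not the actual snap-openstack code:

    # Illustrative sketch only: derive ceph-osd-replication-count from the OSD
    # count reported by the microceph charm's "list-disks" action.
    def compute_replica_scale(osd_count: int) -> int:
        # Cap the replica count at 3, but never above the number of OSDs seen.
        return min(osd_count, 3)

    # If "list-disks" reports no OSDs, the computed count is 0:
    # compute_replica_scale(0) == 0  ->  broker request with "replicas": 0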

I have at least one OSD since I'm in the bootstrap phase. I will dig further into why it's considered to be 0.

$ sudo ceph status
  cluster:
    id: 78bb216d-fa9a-4938-b32b-6ac1f43448e9
    health: HEALTH_WARN
            1 pool(s) have no replicas configured

  services:
    mon: 1 daemons, quorum sunbeam-1 (age 2h)
    mgr: sunbeam-1(active, since 2h)
    osd: 1 osds: 1 up (since 2h), 1 in (since 2h)

  data:
    pools: 1 pools, 1 pgs
    objects: 2 objects, 449 KiB
    usage: 27 MiB used, 16 GiB / 16 GiB avail
    pgs: 1 active+clean

Changed in sunbeam-charms:
status: New → Invalid
Changed in charm-microceph:
status: New → Invalid
summary: - cinder-ceph and glance are stuck at waiting
+ cinder-ceph and glance are stuck at waiting - Creating pool 'glance'
+ (replicas=0) - float floor division by zero
Revision history for this message
Nobuto Murata (nobuto) wrote :

Okay, the snap relies on the "list-disks" action in the microceph charm, and it returns no disks.

$ juju run microceph/leader list-disks --format yaml
Running operation 15 with 1 task
  - task 16 on unit-microceph-0

Waiting for task 16...
microceph/0:
  id: "16"
  results:
    osds: '[]'
    return-code: 0
    stdout: |
      {'osds': [], 'unpartitioned-disks': []}
    unpartitioned-disks: '[]'
  status: completed
  timing:
    completed: 2024-05-14 08:07:40 +0000 UTC
    enqueued: 2024-05-14 08:07:39 +0000 UTC
    started: 2024-05-14 08:07:39 +0000 UTC
  unit: microceph/0

Changed in charm-microceph:
status: Invalid → New
Revision history for this message
Nobuto Murata (nobuto) wrote :

The microceph charm reef/edge rev.32 gives the following in the debug output.

unit-microceph-0: 09:09:38 DEBUG unit.microceph/0.juju-log Emitting Juju event list_disks_action.
unit-microceph-0: 09:09:38 DEBUG unit.microceph/0.juju-log Running command sudo microceph disk list
unit-microceph-0: 09:09:38 DEBUG unit.microceph/0.juju-log Command finished. stdout=Disks configured in MicroCeph:
+-----+-----------+-------------------------------------------+
| OSD | LOCATION | PATH |
+-----+-----------+-------------------------------------------+
| 1 | sunbeam-1 | /dev/disk/by-path/virtio-pci-0000:06:00.0 |
+-----+-----------+-------------------------------------------+
, stderr=
unit-microceph-0: 09:09:38 DEBUG unit.microceph/0.list-disks {'osds': [], 'unpartitioned-disks': []}

So it doesn't add `--json` to the `microceph disk list` call, and that's why it fails to parse the output.
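
A minimal sketch of the kind of parsing the action needs is below; the JSON key name is an assumption for illustration, not the charm's actual code:

    # Illustrative sketch: without --json, "microceph disk list" prints the
    # ASCII table shown above, which cannot be parsed as data, so the action
    # ends up reporting empty lists.
    import json
    import subprocess

    def list_configured_disks() -> list:
        out = subprocess.run(
            ["microceph", "disk", "list", "--json"],
            capture_output=True, text=True, check=True,
        ).stdout
        data = json.loads(out)
        # "ConfiguredDisks" is an assumed key name, used here for illustration.
        return data.get("ConfiguredDisks", [])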

reef/edge rev. 32 (2024-03-21) should be missing the following change based on the dates:
https://github.com/canonical/charm-microceph/pull/51

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

This seems to be fixed in the microceph charm's latest/edge. I will find out the next release date of the microceph charm.

Changed in snap-openstack:
status: New → Triaged
Revision history for this message
Nobuto Murata (nobuto) wrote :

Nah, it wasn't fixed in the latest/edge either.

This time, the "list-disks" action returns something instead of an empty result.

$ juju run microceph/leader list-disks --format yaml
Running operation 19 with 1 task
  - task 20 on unit-microceph-0

Waiting for task 20...
microceph/0:
  id: "20"
  results:
    osds: '[{''osd'': 1, ''path'': ''/dev/disk/by-path/virtio-pci-0000:06:00.0'',
      ''location'': ''sunbeam-1''}]'
    return-code: 0
    unpartitioned-disks: '[]'
  status: completed
  timing:
    completed: 2024-05-14 13:55:25 +0000 UTC
    enqueued: 2024-05-14 13:55:24 +0000 UTC
    started: 2024-05-14 13:55:24 +0000 UTC
  unit: microceph/0

And the replica count is now set to 1 instead of 0.

$ juju config -m openstack glance ceph-osd-replication-count
1

Then the microceph charm errors out with "Error EPERM: configuring pool size as 1 is disabled by default".

unit-microceph-0: 09:46:30 INFO unit.microceph/0.juju-log ceph:3: Creating pool 'glance' (replicas=1)
unit-microceph-0: 09:46:31 DEBUG unit.microceph/0.juju-log ceph:3: dict_items([('op', 'create-pool'), ('name', 'glance'), ('replicas', 1), ('pg_num', None), ('crush-profile', None), ('app-name', 'rbd'), ('compression-algorithm', None), ('compression-mode', None), ('compression-required-ratio', None), ('compression-min-blob-size', None), ('compression-min-blob-size-hdd', None), ('compression-min-blob-size-ssd', None), ('compression-max-blob-size', None), ('compression-max-blob-size-hdd', None), ('compression-max-blob-size-ssd', None), ('group', None), ('max-bytes', None), ('max-objects', None), ('group-namespace', None), ('rbd-mirroring-mode', 'pool'), ('weight', 40)])
unit-microceph-0: 09:46:31 DEBUG unit.microceph/0.juju-log ceph:3: validating rbd-mirroring-mode pool <class 'str'>, ('image', 'pool')
unit-microceph-0: 09:46:33 WARNING unit.microceph/0.ceph-relation-changed pool 'glance' created
unit-microceph-0: 09:46:34 WARNING unit.microceph/0.ceph-relation-changed Error EPERM: configuring pool size as 1 is disabled by default.
unit-microceph-0: 09:46:34 ERROR unit.microceph/0.juju-log ceph:3: Command '['ceph', '--id', 'admin', 'osd', 'pool', 'set', 'glance', 'size', '1', '--yes-i-really-mean-it']' returned non-zero exit status 1.
unit-microceph-0: 09:46:34 ERROR unit.microceph/0.juju-log ceph:3: Unexpected error occurred while processing requests: {'api-version': 1, 'ops': [{'op': 'create-pool', 'name': 'glance', 'replicas': 1, 'pg_num': None, 'crush-profile': None, 'app-name': 'rbd', 'compression-algorithm': None, 'compression-mode': None, 'compression-required-ratio': None, 'compression-min-blob-size': None, 'compression-min-blob-size-hdd': None, 'compression-min-blob-size-ssd': None, 'compression-max-blob-size': None, 'compression-max-blob-size-hdd': None, 'compression-max-blob-size-ssd': None, 'group': None, 'max-bytes': None, 'max-objects': None, 'group-namespace': None, 'rbd-mirroring-mode': 'pool', 'weight': 40}], 'request-id': 'dfbce574f05a92706b57c8991e06697d94739d43'}
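
For reference, Ceph rejects size-1 pools by default; the mon_allow_pool_size_one guard has to be lifted before the pool-size command in the log can succeed. A sketch of that extra step, assuming the same admin keyring the failing command already uses (context for the EPERM, not necessarily the fix the charm adopts):

    # Sketch of the monitor-side guard behind the EPERM above: size=1 pools are
    # refused unless mon_allow_pool_size_one is set to true, even when
    # --yes-i-really-mean-it is passed on the pool command.
    import subprocess

    def force_single_replica(pool: str) -> None:
        subprocess.check_call(
            ["ceph", "--id", "admin", "config", "set", "global",
             "mon_allow_pool_size_one", "true"])
        subprocess.check_call(
            ["ceph", "--id", "admin", "osd", "pool", "set", pool, "size", "1",
             "--yes-i-really-mean-it"])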

Revision history for this message
Nobuto Murata (nobuto) wrote :

Okay, an ugly workaround is:
- use latest/edge for the microceph charm
- have two or more OSD disks on the bootstrap node

That way snap-openstack gets misdirected into setting ceph-osd-replication-count to more than 1 (because of the incorrect logic in https://bugs.launchpad.net/snap-openstack/+bug/2065698), so the model can go green.

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

There is a new release in reef/edge; I will test that.

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

The fix is now available in the microceph charm's reef/candidate channel.

Changed in charm-microceph:
status: New → Fix Committed
Changed in snap-openstack:
status: Triaged → Fix Committed
Revision history for this message
Nobuto Murata (nobuto) wrote :

Have you actually tested the one-OSD-per-host scenario?

https://bugs.launchpad.net/sunbeam-charms/+bug/2065470/comments/7

replicas=0 can be fixed by edge, but replicas=1 also fails to complete the bootstrapping.

Changed in charm-microceph:
status: Fix Committed → New
Changed in snap-openstack:
status: Fix Committed → New
James Page (james-page)
no longer affects: sunbeam-charms
Changed in snap-openstack:
status: New → In Progress
assignee: nobody → Hemanth Nakkina (hemanth-n)
importance: Undecided → High
Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :
Changed in snap-openstack:
status: In Progress → Fix Committed
Changed in charm-microceph:
status: New → Fix Committed
assignee: nobody → Hemanth Nakkina (hemanth-n)