Comment 1 for bug 2047584

Boris Lukashev (rageltman) wrote (last edit):

Destroying the unit and re-adding it cleared the pending state, but after adding disks to a few nodes I am now seeing the same hang on `juju run ceph-osd/X add-disk osd-devices='/dev/sdX'`.

The disks are not failing to add; the additions never start on the units and are not picked up in any way. The unit logs show nothing beyond the periodic update-status runs (except the odd discard-disablement warnings):
```
2024-01-06 23:20:38 INFO unit.ceph-osd/3.juju-log server.go:325 Updating status.
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd0 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd1 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd2 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd6 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via explicit, bespoke hook script)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run update-status hook" for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for ceph-osd/3 uniter (run update-status hook)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:124 lock released for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/0" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/1" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/2" already joined relation 2
```
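
In case it helps anyone else digging into this: turning up the unit logging level should at least show whether the action ever reaches the uniter. A rough sketch (the `logging-config` value and the unit name here are just examples):

```
# Bump unit-agent logging to DEBUG for the model (example value)
juju model-config logging-config="<root>=INFO;unit=DEBUG"

# Tail the affected unit's log and watch for the action being delivered
juju debug-log --include unit-ceph-osd-3 --tail
```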

What can cause actions to never be picked up by units like this? The new OSD units (on new machines) added their disks just fine; the older ones are stuck in this indeterminate state. That seems dangerous if any OSDs fail, since "no task execution" means "no ability to replace OSDs".
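
For anyone trying to reproduce, the first sanity check would be whether the affected unit agents are idle rather than lost or in error (sketch only; the unit name is an example):

```
# Agent/workload state for the application, plus recent status history for one unit
juju status ceph-osd
juju show-status-log ceph-osd/3
```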

Cancelling all pending operations and re-running them (roughly the cycle sketched after the output below) doesn't help; the tasks remain "pending" no matter what I do from the CLI:
```
$ juju operations --status running,pending
  ID  Status   Started  Finished  Task IDs  Summary
2517  pending                     2518      add-disk run on unit-ceph-osd-9
2519  pending                     2520      add-disk run on unit-ceph-osd-8
2521  pending                     2522      add-disk run on unit-ceph-osd-1
2523  pending                     2524      add-disk run on unit-ceph-osd-6
2525  pending                     2526      add-disk run on unit-ceph-osd-2
2527  pending                     2528      add-disk run on unit-ceph-osd-7
2529  pending                     2530      add-disk run on unit-ceph-osd-5
2531  pending                     2532      add-disk run on unit-ceph-osd-3
2533  pending                     2534      add-disk run on unit-ceph-osd-0
2535  pending                     2536      add-disk run on unit-ceph-osd-4
$ juju show-operation 2517
summary: add-disk run on unit-ceph-osd-9
status: pending
action:
  name: add-disk
  parameters:
    osd-devices: /dev/sdj
timing:
  enqueued: 2024-01-06 23:16:41 +0000 UTC
tasks:
  "2518":
    host: ceph-osd/9
    status: pending
    timing:
      enqueued: 2024-01-06 23:16:41 +0000 UTC
```
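
For completeness, the cancel/re-run cycle was roughly the following (assuming the Juju 3.x command names; the IDs and device are taken from the output above):

```
# Cancel the stuck task for ceph-osd/9 (task ID from show-operation above)
juju cancel-task 2518

# Re-enqueue the same action; the new task just goes straight back to "pending"
juju run ceph-osd/9 add-disk osd-devices='/dev/sdj'

# Inspect the newly created task by the ID printed by the previous command
juju show-task <new-task-id>
```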
Methinks a plunger of some sort is needed... a `juju reset-unit-state` or whatnot.
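
Short of an actual `juju reset-unit-state`, the only plunger I know of would be bouncing the jujud agent on the affected machines - a sketch only, since the service name depends on the Juju version (on recent releases the unit agents run inside the machine agent):

```
# Find the jujud service on the machine hosting a stuck unit...
juju ssh ceph-osd/3 'systemctl list-units "jujud*" --no-legend'

# ...and restart it (substitute the machine number reported above)
juju ssh ceph-osd/3 'sudo systemctl restart jujud-machine-<n>'
```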