Destroying the unit and re-adding it cleared the pending state, but after adding disks to a few nodes, I am seeing the hang recur on `juju run ceph-osd/X add-disk osd-devices='/dev/sdX'`.
The disks are not failing to add; the additions never start on the units and are not picked up in any way. The unit logs show nothing during update-status (apart from the unrelated SSD discard-disablement warnings):
```
2024-01-06 23:20:38 INFO unit.ceph-osd/3.juju-log server.go:325 Updating status.
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd0 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd1 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd2 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd6 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via explicit, bespoke hook script)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run update-status hook" for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for ceph-osd/3 uniter (run update-status hook)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:124 lock released for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/0" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/1" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/2" already joined relation 2
```
What can cause actions to not be picked up by units like this? The new OSD units (on new machines) added their disks just fine; the older ones are stuck in this indeterminate state. This seems dangerous if any OSDs fail, since "no task execution" means "no ability to replace OSDs".
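For diagnosing this, the introspection helpers installed on Juju machines can at least show whether the uniter worker for the stuck unit is alive and whether anything is contending for the machine lock. A hedged sketch (assumes a Juju 3.x systemd deployment; `ceph-osd/3` is just the example unit from the logs above):

```shell
# Sketch: inspect the uniter worker state for a stuck unit.
# Assumptions: Juju 3.x, systemd-managed agents, ceph-osd/3 as example unit.
check_uniter_state() {
  # juju_engine_report is the introspection helper shipped on Juju machines;
  # it dumps the dependency-engine state of each worker, including the uniter.
  juju ssh ceph-osd/3 -- juju_engine_report | grep -A3 uniter

  # The machine-lock log (seen rotating in the unit log above) shows whether
  # hook/action executions are being queued and released on that machine.
  juju ssh ceph-osd/3 -- tail -n 20 /var/log/juju/machine-lock.log
}
```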
Cancelling all pending operations and re-running them doesn't help; the tasks stay "pending" no matter what I do from the CLI:
```
$ juju operations --status running,pending
ID Status Started Finished Task IDs Summary
2517 pending 2518 add-disk run on unit-ceph-osd-9
2519 pending 2520 add-disk run on unit-ceph-osd-8
2521 pending 2522 add-disk run on unit-ceph-osd-1
2523 pending 2524 add-disk run on unit-ceph-osd-6
2525 pending 2526 add-disk run on unit-ceph-osd-2
2527 pending 2528 add-disk run on unit-ceph-osd-7
2529 pending 2530 add-disk run on unit-ceph-osd-5
2531 pending 2532 add-disk run on unit-ceph-osd-3
2533 pending 2534 add-disk run on unit-ceph-osd-0
2535 pending 2536 add-disk run on unit-ceph-osd-4
$ juju show-operation 2517
summary: add-disk run on unit-ceph-osd-9
status: pending
action:
name: add-disk
parameters:
osd-devices: /dev/sdj
timing:
enqueued: 2024-01-06 23:16:41 +0000 UTC
tasks:
"2518":
host: ceph-osd/9
status: pending
timing:
enqueued: 2024-01-06 23:16:41 +0000 UTC
```
methinks a plunger of some sort is needed... a `juju reset-unit-state` or whatnot
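Lacking an official plunger command, the closest workaround I'm aware of is bouncing the machine agent on the affected host, which restarts the uniter and forces it to re-query its queue of pending actions. A hedged sketch, not a verified fix (assumes systemd-managed agents; `ceph-osd/3` is a placeholder for the stuck unit):

```shell
# Sketch of a "plunger": restart the Juju machine agent so the uniter worker
# comes back up and re-reads pending actions from the controller.
# Assumptions: systemd-managed jujud-machine-* services; ceph-osd/3 is an example.
plunge_unit() {
  # Discover the jujud-machine-* service on the unit's host, then restart it.
  juju ssh ceph-osd/3 -- \
    "sudo systemctl restart \$(systemctl list-units 'jujud-machine-*' --no-legend | awk '{print \$1}')"
}
```

Less disruptive than destroying and re-adding the unit, though it still won't explain why the older units stopped picking up actions in the first place.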