Ceph-osd unit operations stuck in pending
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
Added a dozen machines to my ceph model (3.3) and used the add-disk action to inform them of the drive layout. All but one succeeded in their configuration, but the third machine added is stuck with the add-disk action in the pending state and is unable to execute any action at all. Logs show no errors; it just sits there saying:
```
2023-12-27 19:11:42 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
```
which I'm guessing means the Juju controller is not telling the unit to do anything. I'm not trying to destroy units and re-provision machines if I don't have to. Is this a known/fixable issue for which my google-fu is failing, or did I break something novel again? :)
Destroying the unit and re-adding it cleared the pending state, but after adding disks to a few more nodes I am seeing the same hang occur on `juju run ceph-osd/X add-disk osd-devices='/dev/sdX'`.
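Before resorting to destroy/re-add, one less invasive thing worth trying is bouncing the Juju machine agent on the affected node so the uniter reconnects to the controller and re-checks its queue. This is a generic Juju troubleshooting step, not a confirmed fix for this bug, and the machine number below is a placeholder:

```shell
# Sketch: build the restart command for the agent hosting the stuck unit.
# "7" is a placeholder machine id - take the real one from
# `juju status ceph-osd`. Juju runs its machine agent as the systemd
# service "jujud-machine-<id>".
machine=7
restart_cmd="juju ssh ${machine} -- sudo systemctl restart jujud-machine-${machine}"
echo "${restart_cmd}"   # review the printed command, then run it for real
```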
Disks are not failing to add; the additions never start on the units and are not being picked up in any way. Unit logs show nothing beyond the update-status hook (except the weird discard disablement thing):
```
2024-01-06 23:20:38 INFO unit.ceph-osd/3.juju-log server.go:325 Updating status.
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd0 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd1 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd2 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd6 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via explicit, bespoke hook script)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run update-status hook" for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for ceph-osd/3 uniter (run update-status hook)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:124 lock released for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/0" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/1" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/2" already joined relation 2
```
What can cause actions not to be picked up by units like this? The new OSD units (on new machines) added their disks just fine; the older ones are hanging in this indeterminate state. This seems dangerous if any OSDs fail, since "no task execution" means "no ability to replace OSDs".
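For clearing a backlog like this in bulk, the pending task IDs can be scraped from the plain `juju operations` output and fed to `juju cancel-task` (Juju 3.x). A minimal sketch, assuming the Started/Finished columns are blank for pending rows, so the task ID lands in the third whitespace-separated field:

```shell
# List pending operations, pull the task ID (third field while the
# Started/Finished columns are still empty), and cancel each task.
juju operations --status pending \
  | awk 'NR > 1 && $2 == "pending" { print $3 }' \
  | xargs -r -n1 juju cancel-task
```

Note that `cancel-task` takes IDs from the Task IDs column, not the operation ID column.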
Cancelling all pending operations and restarting them doesn't help; the operation tasks remain "pending" no matter what I do in the CLI:
```
$ juju operations --status running,pending
ID Status Started Finished Task IDs Summary
2517 pending 2518 add-disk run on unit-ceph-osd-9
2519 pending 2520 add-disk run on unit-ceph-osd-8
2521 pending 2522 add-disk run on unit-ceph-osd-1
2523 pending 2524 add-disk run on unit-ceph-osd-6
2525 pending 2526 add-disk run on unit-ceph-osd-2
2527 pending 2528 add-disk run on unit-ceph-osd-7
2529 pending ...