Ceph-osd unit operations stuck in pending

Bug #2047584 reported by Boris Lukashev
This bug affects 1 person
Affects: Ceph OSD Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Added a dozen machines to my Ceph model (Juju 3.3) and used the add-disk action to inform them of the drive layout. All but one succeeded in their configuration, but the third machine added is stuck with its add-disk action in the pending state and cannot execute any action at all. Logs show no errors; the unit just sits there saying:
```
2023-12-27 19:11:42 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
```
which I'm guessing means the Juju controller is not telling the unit to do anything. I'm not trying to destroy units and re-provision machines if I don't have to. Is this a known/fixable issue for which my google-fu is failing, or did I break something novel again? :)
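For anyone triaging, the controller-side view of a stuck operation can be pulled up with something like this (a sketch; the operation and task IDs are illustrative, take real ones from `juju operations`):
```
# List queued work, then drill into one operation and its task;
# show-task prints the task's status and any captured output.
juju operations --status pending,running
juju show-operation 2517
juju show-task 2518
```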

Revision history for this message
Boris Lukashev (rageltman) wrote (last edit ):

Destroying the unit and re-adding it cleared the pending state, but after adding disks to a few more nodes I am seeing the same hang on `juju run ceph-osd/X add-disk osd-devices='/dev/sdX'`.
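For reference, the destroy/re-add cycle was along these lines (unit and machine numbers are illustrative; `--to` reuses the existing machine so nothing gets re-provisioned):
```
# Remove the wedged unit, then place a fresh one on the same machine.
juju remove-unit ceph-osd/3
juju add-unit ceph-osd --to 3
```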

Disks are not failing to add; the additions never start on the units and are not picked up in any way. Unit logs show nothing during status updates (except the odd discard-disablement warnings):
```
2024-01-06 23:20:38 INFO unit.ceph-osd/3.juju-log server.go:325 Updating status.
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd0 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd1 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd2 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd6 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via explicit, bespoke hook script)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run update-status hook" for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for ceph-osd/3 uniter (run update-status hook)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:124 lock released for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/0" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/1" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/2" already joined relation 2
```

What can cause actions to not be picked up by units like this? The new OSD units (on new machines) added their disks just fine; the older ones are hanging in this indeterminate state. That seems dangerous if any OSDs fail, since "no task execution" means "no ability to replace OSDs".
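For reference, the least invasive nudge would presumably be restarting the machine agent that hosts the stuck uniter, something like this (machine number illustrative; in Juju 3.x the unit agents run inside the machine agent):
```
# Restart the Juju machine agent; the uniter should reconnect to the
# controller and, in theory, pick up its queued tasks.
juju ssh 3 -- sudo systemctl restart jujud-machine-3.service
```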

Cancelling all pending ops and restarting them doesn't help; the operation tasks stay "pending" no matter what I do in the CLI:
```
$ juju operations --status running,pending
  ID  Status   Started  Finished  Task IDs  Summary
2517  pending                     2518      add-disk run on unit-ceph-osd-9
2519  pending                     2520      add-disk run on unit-ceph-osd-8
2521  pending                     2522      add-disk run on unit-ceph-osd-1
2523  pending                     2524      add-disk run on unit-ceph-osd-6
2525  pending                     2526      add-disk run on unit-ceph-osd-2
2527  pending                     2528      add-disk run on unit-ceph-osd-7
2529  pending                     ...
```
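The cancel/retry cycle above corresponds to commands like these (task ID illustrative, taken from the listing; `cancel-task` being the Juju 3.x rename of `cancel-action`):
```
# Cancel a wedged task, then queue the action again; here the re-queued
# task goes straight back to "pending".
juju cancel-task 2518
juju run ceph-osd/9 add-disk osd-devices='/dev/sdX'
```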


Revision history for this message
Boris Lukashev (rageltman) wrote (last edit ):

Debugging the hooks manually, it seems that I have to trip `/var/lib/juju/agents/unit-ceph-osd-*/charm/hooks/config-changed` in order for the unit to pick up the updated OSD...
Wondering if this has anything to do with the v3 semantic changes around `juju run ...`
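Tripping the hook by hand looks roughly like this (a sketch assuming the standard agent layout and the on-machine `juju-exec` tool, the Juju 3.x rename of `juju-run`):
```
# Run config-changed in the unit's hook context on the affected machine;
# after this the unit notices the new OSD device.
juju ssh ceph-osd/9 sudo juju-exec ceph-osd/9 \
    /var/lib/juju/agents/unit-ceph-osd-9/charm/hooks/config-changed
```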

In any case, it seems the units don't pick up on config changes, so they don't run pending actions.
The status update refreshes OSD states and correctly reports everything about the prior state, but it does not try to find the new disk:
```
unit-ceph-osd-9: 01:21:17 DEBUG unit.ceph-osd/9.juju-log got journal devs: set()
unit-ceph-osd-9: 01:21:17 INFO unit.ceph-osd/9.juju-log Skipping osd devices previously processed by this unit: ...
unit-ceph-osd-9: 01:21:17 DEBUG unit.ceph-osd/9.juju-log Checking for pristine devices: "[]"
unit-ceph-osd-9: 01:21:17 INFO unit.ceph-osd/9.juju-log ceph bootstrapped, rescanning disks
unit-ceph-osd-9: 01:21:17 INFO unit.ceph-osd/9.juju-log Making dir /var/lib/charm/ceph-osd ceph:ceph 555
```
