Ceph-osd unit operations stuck in pending
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | New | Undecided | Unassigned |
Bug Description
Added a dozen machines to my ceph model (3.3) and used the add-disk action to inform them of the drive layout. All but one succeeded in their configuration, but the third machine added is stuck with the add-disk action in the pending state and is unable to execute any action at all. Logs show no errors; it just sits there saying:
```
2023-12-27 19:11:42 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
```
which I'm guessing means the Juju controller is not telling the unit to do anything. I'm not trying to destroy units and re-provision machines if I don't have to. Is this a known/fixable issue for which my google-fu is failing, or did I break something novel again? :)
Destroying the unit and re-adding it cleared the pending state, but after adding disks to a few more nodes I am seeing the same hang occur on `juju run ceph-osd/X add-disk osd-devices='/dev/sdX'`.
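Before resorting to destroy/re-add, one less invasive thing worth trying is bouncing the Juju machine agent on the affected node so the uniter reconnects to the controller and re-checks its queue. This is a generic Juju troubleshooting step, not a confirmed fix for this bug, and the machine number below is a placeholder:

```shell
# Sketch: build the restart command for the agent hosting the stuck unit.
# "7" is a placeholder machine id - take the real one from
# `juju status ceph-osd`. Juju runs its machine agent as the systemd
# service "jujud-machine-<id>".
machine=7
restart_cmd="juju ssh ${machine} -- sudo systemctl restart jujud-machine-${machine}"
echo "${restart_cmd}"   # review the printed command, then run it for real
```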
Disks are not failing to add; the additions never start on the units and are not being picked up in any way. Unit logs show nothing beyond the update-status hook (except the weird discard disablement thing):
```
2024-01-06 23:20:38 INFO unit.ceph-osd/3.juju-log server.go:325 Updating status.
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd0 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd1 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd2 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:38 WARNING unit.ceph-osd/3.juju-log server.go:325 SSD Discard autodetection: /dev/disk/by-dname/osd6 is forcing discard off(sata <= 3.0)
2024-01-06 23:20:39 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via explicit, bespoke hook script)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:135 committing operation "run update-status hook" for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:206 created rotating log file "/var/log/juju/machine-lock.log" with max size 10 MB and max backups 5
2024-01-06 23:20:39 DEBUG juju.machinelock machinelock.go:190 machine lock "machine-lock" released for ceph-osd/3 uniter (run update-status hook)
2024-01-06 23:20:39 DEBUG juju.worker.uniter.operation executor.go:124 lock released for ceph-osd/3
2024-01-06 23:20:39 DEBUG juju.worker.uniter resolver.go:194 no operations in progress; waiting for changes
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/0" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/1" already joined relation 2
2024-01-06 23:20:39 DEBUG juju.worker.uniter.relation resolver.go:285 unit "ceph-mon/2" already joined relation 2
```
What can cause actions not to be picked up by units like this? The new OSD units (on new machines) added their disks just fine; the older ones are hanging in this indeterminate state. This seems dangerous if any OSDs fail, since "no task execution" means "no ability to replace OSDs".
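For clearing a backlog like this in bulk, the pending task IDs can be scraped from the plain `juju operations` output and fed to `juju cancel-task` (Juju 3.x). A minimal sketch, assuming the Started/Finished columns are blank for pending rows, so the task ID lands in the third whitespace-separated field:

```shell
# List pending operations, pull the task ID (third field while the
# Started/Finished columns are still empty), and cancel each task.
juju operations --status pending \
  | awk 'NR > 1 && $2 == "pending" { print $3 }' \
  | xargs -r -n1 juju cancel-task
```

Note that `cancel-task` takes IDs from the Task IDs column, not the operation ID column.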
Cancelling all pending operations and restarting them doesn't help; the operation tasks remain "pending" no matter what I do in the CLI:
```
$ juju operations --status running,pending
ID Status Started Finished Task IDs Summary
2517 pending 2518 add-disk run on unit-ceph-osd-9
2519 pending 2520 add-disk run on unit-ceph-osd-8
2521 pending 2522 add-disk run on unit-ceph-osd-1
2523 pending 2524 add-disk run on unit-ceph-osd-6
2525 pending 2526 add-disk run on unit-ceph-osd-2
2527 pending 2528 add-disk run on unit-ceph-osd-7
2529 pending ...