config_changed->check_for_upgrade incorrectly assumes upgrade in progress if OSD not yet mounted / ceph-disk@ has not started
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Fix Released | High | Trent Lloyd |
Bug Description
The charm sometimes gets stuck waiting for another host to upgrade, even when no upgrade is actually in progress. You will observe "Waiting on <other_host> to finish upgrading" in the juju unit status, and the config-changed hook stays executing for a very long time.
The hook can stay running for days or weeks, blocking the charm from any further actions. The code appears to have a 10 minute timeout, but this was observed not to apply in practice. (Perhaps the timeout was added after the version in use on some sites?)
= Analysis of the cause =
On startup of a busy server with a large number of OSDs, the ceph-disk units sometimes fail to start in a timely manner, or at all.
There are a number of causes for this that will be filed in a separate bug, but it comes down to the following observed situations:
(a) They are starting but have not yet finished starting before config_changed fires - particularly likely on hyper-converged nodes. This is a general race: the charm assumes OSDs are always mounted without actually checking (see the sketch after this list).
(b) They time out while starting due to the aggressive 120 second timeout, which is sometimes not enough (this is a bug)
(c) They fail while starting due to "ceph_disk.
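As a concrete illustration of situation (a), a hook can observe a unit mid-startup. The sketch below uses the real systemctl is-active command; the helper function and unit naming are only illustrative, not charm code:

    import subprocess

    def osd_service_active(osd_id):
        # 'systemctl is-active' prints the unit state; during the
        # startup race it reports 'activating' rather than 'active',
        # so a hook firing at that moment sees a not-yet-mounted OSD.
        result = subprocess.run(
            ['systemctl', 'is-active', 'ceph-osd@{}'.format(osd_id)],
            stdout=subprocess.PIPE, universal_newlines=True)
        return result.stdout.strip() == 'active'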
The root cause is mostly irrelevant, however: if for any reason the Ceph OSD disk filesystem is not mounted when config_changed is called, the charm incorrectly assumes that an upgrade to jewel was previously started but failed, and attempts to resume it. It then gets stuck waiting on another unit to signal that its upgrade has completed - which never happens, since the other unit is not in an upgrading state.
The cause is that check_for_upgrade calls ceph.dirs_need_ownership_update, which returns True when any OSD directory is not owned by the ceph user - the condition left behind by an interrupted hammer->jewel upgrade.
However this condition is also true if the OSD simply isn't mounted yet, as the underlying mountpoint directory exists and is owned by root:root (which would later be replaced with the owner of the Ceph FS once it is mounted).
When this condition occurs, ceph.roll_osd_cluster is invoked to resume the supposed upgrade, and the unit waits on the previous unit in the cluster before proceeding.
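For illustration, here is a minimal sketch of the failing heuristic (the real implementation lives in charms.ceph; the names and paths are simplified and the loop body is an assumption about its behaviour):

    import os
    import pwd

    OSD_BASE = '/var/lib/ceph/osd'

    def dirs_need_ownership_update():
        # Intended meaning: a hammer->jewel upgrade chowns OSD data
        # from root to ceph, so any directory still owned by root is
        # taken as evidence of an interrupted upgrade.
        ceph_uid = pwd.getpwnam('ceph').pw_uid
        for entry in os.listdir(OSD_BASE):
            path = os.path.join(OSD_BASE, entry)
            # Problem: if the OSD filesystem is not mounted yet, this
            # stats the bare mountpoint directory, which is owned by
            # root:root, so the function returns True even though no
            # upgrade was ever started.
            if os.stat(path).st_uid != ceph_uid:
                return True
        return False

This stat-based heuristic is what makes an unmounted mountpoint indistinguishable from a half-finished upgrade.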
The workaround is to restart any failed ceph-disk@xxxx units, ensure the relevant ceph-osd@NN service then starts, and then kill the config-changed hook. The next time it runs, it will work as expected.
In addition to this issue, it was observed in production that even after a longer period of time (20-30 minutes) the hook was still executing and stuck. It is not clear why the 10 minute timeout did not kick in.
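To make the stuck state concrete, the rolling-upgrade wait is conceptually a loop like the following. This is a hedged sketch only: get_upgrade_state stands in for reading the per-unit upgrade key from the monitor cluster, and the 600 second value reflects the ~10 minute timeout described above:

    import time

    WAIT_TIMEOUT = 600  # assumed ~10 minute timeout

    def wait_on_previous_node(previous_node, get_upgrade_state):
        start = time.time()
        # Block until the previous unit marks its upgrade 'done'.
        while get_upgrade_state(previous_node) != 'done':
            if time.time() - start > WAIT_TIMEOUT:
                # Give up waiting and proceed; the days-long hangs
                # reported here suggest this branch was never reached.
                break
            time.sleep(30)

Since the other unit never entered an upgrading state, the 'done' marker is never written, and only the timeout branch could unblock the hook.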
This occurred on a production cloud, and I also triggered it once on my test bench, where I have 4 VMs using an openstack-base setup. Unfortunately I do not currently have logs from either case, but the above describes the scenario fairly accurately.
Lastly, I tried to create an easy reproducer for this scenario using the latest charm; however, when unmounting the filesystem you now seem to trigger "Non-pristine devices detected" instead, due to the recent changes in that area. I did nevertheless manage to trigger this issue with versions of the charm both with and without that code. So the underlying issue should still be fixed, although this change may mask it in most cases.
Additionally, one of the items I raised at the start of the bug is the general race where the charm assumes all OSDs are mounted whenever a hook runs, which may not be the case. More general changes are needed to fix logic like that, but this chown case is still real, and udevadm settle won't fix the case where the ceph-disk service timed out (as opposed to simply not having finished yet).
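A minimal guard along these lines (os.path.ismount is a real stdlib call; the surrounding function is hypothetical) would at least stop the chown heuristic from firing on a bare, unmounted mountpoint:

    import os

    def osd_dirs_mounted(osd_base='/var/lib/ceph/osd'):
        # Only treat OSD directories as inspectable once their
        # filesystems are actually mounted; a bare mountpoint owned
        # by root:root says nothing about upgrade state.
        return all(
            os.path.ismount(os.path.join(osd_base, entry))
            for entry in os.listdir(osd_base))

As noted above, this does not help when ceph-disk has timed out entirely; it only narrows the window for the false positive.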
Changed in charm-ceph-osd:
importance: Undecided → High
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
Changed in charm-ceph-osd:
status: New → Confirmed
description: updated
Changed in charm-ceph-osd:
assignee: nobody → Trent Lloyd (lathiat)
Changed in charm-ceph-osd:
status: Confirmed → In Progress
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: series-upgrade
Changed in charm-ceph-osd:
milestone: 20.01 → 19.10
Changed in charm-ceph-osd:
status: In Progress → Fix Released
The reason for the timeout not working is possibly Bug #1664435?