Upgrade timeout logic is faulty
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph Monitor Charm | Triaged | Medium | Unassigned |
Ceph OSD Charm | Triaged | Medium | Unassigned |
OpenStack Ceph Charm (Retired) | Invalid | Medium | Unassigned |
charms.ceph | Triaged | Medium | Unassigned |
Bug Description
When upgrading a ceph cluster, the charm code orders the nodes and upgrades them one node at a time. The charms use ceph's ability to provide an arbitrary key/value store in the monitor and record the progress of the upgrade in that store.
This allows nodes to watch this central store for the progress of the upgrade. As a node begins its upgrade path, it stores its start time (via time.time()) in the ceph monitor's key/value store. The node that is next in line reads the value stored in the key and compares it to a timestamp from 10 minutes ago to decide whether the previous node should be considered timed out.
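The bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the charm code: `FakeMonitorStore`, the `<node>_start` key name, and the use of `sorted()` for the ordering are all assumptions made for the example, with an in-memory dict standing in for the monitor's key/value store.

```python
import time


class FakeMonitorStore:
    """In-memory stand-in for the monitor's key/value store (hypothetical)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        # Values are kept as strings, matching what the report says a read returns
        self._data[key] = str(value)

    def get(self, key):
        return self._data.get(key)


def mark_upgrade_start(store, node):
    """Record the moment `node` begins its upgrade (key name is illustrative)."""
    store.set("{}_start".format(node), time.time())


def previous_node(nodes, me):
    """Return the node that upgrades just before `me`, or None if `me` is first.

    Sorting is an assumed ordering; any rule works as long as every node
    computes the same order.
    """
    ordered = sorted(nodes)
    position = ordered.index(me)
    return ordered[position - 1] if position > 0 else None
```

A waiting node would look up `previous_node(...)` and then read that node's start key from the store. Note that the value it gets back is a string, not the float that was written.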
The problem is that the value read back from the monitor's key is returned as a string, which is then compared to the floating-point value from the time.time() call. As a result, the node never times out the previous node.
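The type mismatch can be reproduced in isolation. The snippet below is a minimal sketch, not the charm code, and the variable names are made up for the example. It assumes the check has the shape "cutoff > stored start time". On CPython 2, a number always sorts before a string, so that comparison is silently False every time, which would match the "never times out" behaviour; whether the charms ran under Python 2 at the time is an assumption here. Python 3 refuses the comparison outright, which is what the snippet shows.

```python
import time

stored_start = str(time.time())    # what comes back from the key store
cutoff = time.time() - 10 * 60     # "10 minutes ago", a float

try:
    timed_out = cutoff > stored_start
except TypeError:
    # Python 3 will not order a float against a str
    timed_out = None
```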
This is, however, a good thing. In the currently released charms (16.10), the upgrade path always recursively chowns the OSD directories, which in a production cluster is unlikely to finish in 10 minutes. Because the ceph charms would then stop all services on the cluster at the same time, a working timeout would effectively cause an outage of the entire cluster.
Instead of fixing this code to add a timeout, I propose the timeout logic be removed completely and error conditions be revisited in order to prevent a sweeping cluster outage.
no longer affects: | ceph (Ubuntu) |
Changed in ceph (Juju Charms Collection): | |
status: | New → Invalid |
Changed in ceph-mon (Juju Charms Collection): | |
status: | New → Invalid |
Changed in ceph-osd (Juju Charms Collection): | |
status: | New → Invalid |
Changed in charm-ceph-osd: | |
status: | New → Triaged |
Changed in charm-ceph-mon: | |
status: | New → Triaged |
Changed in charm-ceph: | |
status: | New → Triaged |
importance: | Undecided → Medium |
Changed in charm-ceph-mon: | |
importance: | Undecided → Medium |
Changed in charm-ceph-osd: | |
importance: | Undecided → Medium |
no longer affects: | ceph (Juju Charms Collection) |
no longer affects: | ceph-mon (Juju Charms Collection) |
no longer affects: | ceph-osd (Juju Charms Collection) |
Changed in charm-ceph: | |
status: | Triaged → Invalid |
milestone: | none → 18.08 |
Changed in charm-ceph-mon: | |
milestone: | none → 18.08 |
Changed in charm-ceph-osd: | |
milestone: | none → 18.08 |
Changed in charm-ceph: | |
milestone: | 18.08 → none |
Changed in charm-ceph-mon: | |
milestone: | 18.08 → 18.11 |
Changed in charm-ceph-osd: | |
milestone: | 18.08 → 18.11 |
Changed in charm-ceph-mon: | |
milestone: | 18.11 → 19.04 |
Changed in charm-ceph-osd: | |
milestone: | 18.11 → 19.04 |
Changed in charm-ceph-mon: | |
milestone: | 19.04 → 19.07 |
Changed in charm-ceph-osd: | |
milestone: | 19.04 → 19.07 |
tags: | added: 4010 |
Changed in charm-ceph-mon: | |
milestone: | 19.07 → 19.10 |
Changed in charm-ceph-osd: | |
milestone: | 19.07 → 19.10 |
Changed in charm-ceph-mon: | |
milestone: | 19.10 → 20.01 |
Changed in charm-ceph-osd: | |
milestone: | 19.10 → 20.01 |
tags: | added: ceph-upgrade |
Changed in charm-ceph-mon: | |
milestone: | 20.01 → 20.05 |
Changed in charm-ceph-osd: | |
milestone: | 20.01 → 20.05 |
Changed in charm-ceph-mon: | |
milestone: | 20.05 → 20.08 |
Changed in charm-ceph-osd: | |
milestone: | 20.05 → 20.08 |
Changed in charm-ceph-mon: | |
milestone: | 20.08 → none |
Changed in charm-ceph-osd: | |
milestone: | 20.08 → none |
I'm marking this as Medium because the logic doesn't work and therefore the timeout never triggers (thankfully). I think the error conditions in this path need to be rethought, along with what constitutes a viable timeout (or whether one even makes sense). On many machines, if a restart is required for any reason during this code path, the reboot alone could take upwards of 10 minutes. If the upgrade path goes from Hammer to Jewel, updating the permissions on the OSDs in a production cluster is known to take longer (I've seen 40-60 minutes or more per OSD).
The bug that prevents the timeout from triggering can easily be fixed by casting the value read from the key store to a float; however, I think it is a bad idea to fix that part.
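For reference, the cast mentioned above would look something like the sketch below. It is shown only to make the point concrete, not as a recommended change; `previous_node_timed_out` and `TIMEOUT_SECONDS` are hypothetical names, and the shape of the comparison is an assumption based on the description.

```python
import time

TIMEOUT_SECONDS = 10 * 60  # the 10-minute window from the bug description


def previous_node_timed_out(stored_start, now=None, timeout=TIMEOUT_SECONDS):
    """Return True when the previous node began more than `timeout` seconds ago.

    `stored_start` is the raw string read from the key store, e.g. "1480000000.0".
    """
    if now is None:
        now = time.time()
    # The cast is the whole fix: both sides of the comparison are now floats
    return (now - timeout) > float(stored_start)


# A recursive chown that has been running for 40 minutes trips the check
started = "1480000000.0"
forty_minutes_later = float(started) + 40 * 60
print(previous_node_timed_out(started, now=forty_minutes_later))  # prints True
```

With the cast in place, any node whose permission update runs past 10 minutes would be treated as timed out, which is exactly the situation the comment above warns about.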