Upgrade timeout logic is faulty

Bug #1664435 reported by Billy Olsen
This bug affects 2 people
Affects                         Status    Importance   Assigned to   Milestone
Ceph Monitor Charm              Triaged   Medium       Unassigned
Ceph OSD Charm                  Triaged   Medium       Unassigned
OpenStack Ceph Charm (Retired)  Invalid   Medium       Unassigned
charms.ceph                     Triaged   Medium       Unassigned

Bug Description

When upgrading a ceph cluster, the charm code orders the nodes and starts performing upgrades a single node at a time. The charms make use of ceph's ability to provide an arbitrary key/value store in the monitor and will mark the progress of the upgrades in this key storage.

This allows nodes to watch this central storage for progress of the upgrade. As a node begins its upgrade path, it stores its start time (via time.time()) in the ceph monitor's key value storage. The node which is to upgrade after the current node will read the value stored in the key and compare it to a timestamp from 10 minutes ago in order to determine if the previous node should be considered timed out or not.
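
Roughly, the bookkeeping works as sketched below. monitor_key_get is the helper quoted later in this bug; monitor_key_set and the exact key layout are assumptions about the companion setter in charms.ceph:

    import time

    # Sketch: how an upgrading node records its start time, and how the next
    # node in the ordered list reads it back. The key layout mirrors the key
    # name visible in the log output quoted further down in this bug.

    def record_upgrade_start(upgrade_key, service, hostname, version):
        start_key = "{}_{}_{}_start".format(service, hostname, version)
        # time.time() is a float, but the monitor key/value store only holds
        # strings, so the value has to be serialized here...
        monitor_key_set(upgrade_key, start_key, str(time.time()))

    def read_previous_start(upgrade_key, service, previous_node, version):
        start_key = "{}_{}_{}_start".format(service, previous_node, version)
        # ...and comes back as a string, not a float, on this side.
        return monitor_key_get(upgrade_key, start_key)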

The problem is that the value read back from the monitor's key is returned as a string, which is then compared to a floating-point value from the time.time() call. As a result, the node never times out the previous node.
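
Concretely, the failure mode looks roughly like this (an illustrative sketch, not the charm's literal code; upgrade_key and start_key stand in for the real key names, and the Python 2 runtime the charms used at the time is assumed, where mixed-type comparisons succeed silently instead of raising):

    import time

    previous_node_start_time = monitor_key_get(upgrade_key, start_key)  # a str, e.g. "1485112800.5"
    ten_minutes_ago = time.time() - (10 * 60)                           # a float

    # Under CPython 2 any str compares greater than any float, so this check
    # is always False: the "previous node timed out" branch can never be
    # taken, no matter how long the previous node has actually been running.
    if previous_node_start_time < ten_minutes_ago:
        pass  # would mark the previous node as timed out -- unreachable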

This is, however, a good thing. In the current released form of the charms (16.10), the upgrade path always recursively chowns the OSD directories, which in a production cluster is unlikely to finish within 10 minutes. Since the ceph charms would then be stopping all services on the cluster at the same time, the timeout working as intended would effectively lead to an entire-cluster outage.

Instead of fixing this code so that the timeout works, I propose that the timeout logic be removed completely and the error conditions revisited in order to prevent a sweeping cluster outage.

no longer affects: ceph (Ubuntu)
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I'm marking this as Medium because the logic doesn't work and therefore the timeout never triggers (thankfully). I think the error conditions in this path need to be rethought, along with what constitutes a viable timeout (or whether one even makes sense). On many machines, if a restart is required for any reason during this code path, the reboot alone could take upwards of 10 minutes. If the upgrade path goes from Hammer to Jewel, then upgrading the permissions on the OSDs in a production cluster is known to take longer (I've seen 40-60 minutes or more per OSD).

The bug that prevents the timeout from triggering can easily be fixed by casting the value read from the key store to a float; however, I think it is a bad idea to fix that part.
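
For completeness, the trivial cast would look roughly like this (a sketch only; key names as in the sketches above, and, per the comment, probably not something to enable on its own):

    # Cast the stored string back to a float so the comparison can fire.
    previous_node_start_time = float(monitor_key_get(upgrade_key, start_key))
    if previous_node_start_time < time.time() - (10 * 60):
        pass  # the previous node would now genuinely be considered timed out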

James Page (james-page)
Changed in ceph (Juju Charms Collection):
status: New → Invalid
Changed in ceph-mon (Juju Charms Collection):
status: New → Invalid
Changed in ceph-osd (Juju Charms Collection):
status: New → Invalid
James Page (james-page)
Changed in charm-ceph-osd:
status: New → Triaged
Changed in charm-ceph-mon:
status: New → Triaged
Changed in charm-ceph:
status: New → Triaged
importance: Undecided → Medium
Changed in charm-ceph-mon:
importance: Undecided → Medium
Changed in charm-ceph-osd:
importance: Undecided → Medium
no longer affects: ceph (Juju Charms Collection)
no longer affects: ceph-mon (Juju Charms Collection)
no longer affects: ceph-osd (Juju Charms Collection)
Changed in charm-ceph:
status: Triaged → Invalid
milestone: none → 18.08
Changed in charm-ceph-mon:
milestone: none → 18.08
Changed in charm-ceph-osd:
milestone: none → 18.08
Changed in charm-ceph:
milestone: 18.08 → none
James Page (james-page)
Changed in charm-ceph-mon:
milestone: 18.08 → 18.11
Changed in charm-ceph-osd:
milestone: 18.08 → 18.11
David Ames (thedac)
Changed in charm-ceph-mon:
milestone: 18.11 → 19.04
Changed in charm-ceph-osd:
milestone: 18.11 → 19.04
David Ames (thedac)
Changed in charm-ceph-mon:
milestone: 19.04 → 19.07
Changed in charm-ceph-osd:
milestone: 19.04 → 19.07
Revision history for this message
Gábor Mészáros (gabor.meszaros) wrote :

I've just stumbled upon this bug and have to add a related issue:

If the upgrade is started with at least one ceph-osd node down, that node will never add its start_time to the KV store, but the next node calculates its position and waits for the DOWN node to come back. This effectively blocks the upgrade until that node is repaired or removed from the cluster.

My suggestion is to check whether the previous node is IN, and only consider such nodes during the upgrade.
If the DOWN node comes back during that time it will trigger rebalancing anyway, so I don't expect the upgrade of that node itself to cause much trouble (though its rebalancing might). Also, bringing nodes back is rather an ops task, and the engineer can decide whether it's acceptable to do so during an upgrade or not. But blocking the upgrade indefinitely is not acceptable.
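
A sketch of the kind of check being suggested, shelling out to ceph osd tree --format=json (illustrative only; the charm would presumably reuse its existing helpers rather than subprocess, and previous_node_has_live_osds is a hypothetical name):

    import json
    import subprocess

    def previous_node_has_live_osds(hostname):
        """Return True only if the host still has at least one OSD that ceph
        reports as up; hosts that are down or removed are skipped so the
        upgrade does not block waiting on them."""
        tree = json.loads(subprocess.check_output(
            ['ceph', 'osd', 'tree', '--format=json']))
        nodes_by_id = {n['id']: n for n in tree['nodes']}
        for node in tree['nodes']:
            if node.get('type') == 'host' and node.get('name') == hostname:
                return any(nodes_by_id.get(osd_id, {}).get('status') == 'up'
                           for osd_id in node.get('children', []))
        return False  # host no longer in the CRUSH map: nothing to wait for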

The affected code path is in def wait_on_previous_node(upgrade_key, service, previous_node, version):
        previous_node_start_time = monitor_key_get(
            upgrade_key,

Here monitor_key_get errors out with this message in the logs:
2019-07-30 09:58:13 DEBUG config-changed Error ENOENT: error obtaining 'osd_wa0b1b-sto200j313m50-pl_jewel_start': (2) No such file or directory
2019-07-30 09:58:13 INFO juju-log Monitor config-key get failed with message: b''
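
For reference, this is how the wait described above ends up spinning forever when the key was never written (an illustrative reconstruction, not a verbatim quote of charms.ceph; previous_node_finished is a hypothetical stand-in for the real completion check):

    import time

    # The DOWN node never wrote its "<service>_<host>_<version>_start" key,
    # so monitor_key_get() keeps failing with the ENOENT shown above.
    while not previous_node_finished(upgrade_key, previous_node):
        start = monitor_key_get(upgrade_key, start_key)  # empty, forever
        if start and float(start) < time.time() - (10 * 60):
            break        # timeout branch: unreachable while the key is absent
        time.sleep(30)   # poll again; the upgrade blocks here indefinitely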

tags: added: 4010
David Ames (thedac)
Changed in charm-ceph-mon:
milestone: 19.07 → 19.10
Changed in charm-ceph-osd:
milestone: 19.07 → 19.10
David Ames (thedac)
Changed in charm-ceph-mon:
milestone: 19.10 → 20.01
Changed in charm-ceph-osd:
milestone: 19.10 → 20.01
tags: added: ceph-upgrade
James Page (james-page)
Changed in charm-ceph-mon:
milestone: 20.01 → 20.05
Changed in charm-ceph-osd:
milestone: 20.01 → 20.05
David Ames (thedac)
Changed in charm-ceph-mon:
milestone: 20.05 → 20.08
Changed in charm-ceph-osd:
milestone: 20.05 → 20.08
Revision history for this message
Michael Quiniola (qthepirate) wrote :

This is still happening.

I had removed a node from the stack, but the node was still listed in the CRUSH map. So, when the charms started upgrading one by one, I noticed one of them was waiting on another that no longer existed. In CRUSH, that bucket (host) and its associated OSDs were labeled down. I removed them, but there is currently no way (that I can see) to get the upgrades completed normally. I tried restarting the waiting node's service (jujud-unit-ceph-osd-*) and the waiting node continued, but that breaks the upgrade logic and the NEXT node in line doesn't get the notification that the ORIGINAL waiting node has completed.

Revision history for this message
Michael Quiniola (qthepirate) wrote :

Actually, I just found a workaround.

Set source to the original setting (I was going from stein to ussuri), so:

juju config ceph-osd source="cloud:bionic-stein"

Wait for agents to receive updated config

juju ssh into each OSD box that's waiting and restart the service for the juju agent:

sudo service jujud-unit-ceph-osd-<number> restart

Then, once everything evens out and is "ready", make sure the CRUSH map is correct, and then run the upgrade again:

juju config ceph-osd source="cloud:bionic-ussuri"

James Page (james-page)
Changed in charm-ceph-mon:
milestone: 20.08 → none
Changed in charm-ceph-osd:
milestone: 20.08 → none