Ceph OSD Charm

jewel upgrade takes down pgs when chown takes >10 min

Bug #1762852 reported by Shane Peters on 2018-04-10

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Ceph OSD Charm	Fix Released	High	Unassigned	Ceph OSD Charm 20.08

Bug Description

Part of the upgrade process to jewel involves a recursive 'chown' on files under /var/lib/ceph. If this step takes longer than 10 minutes, the node is assumed dead and the charm moves on to upgrading the next node in the cluster.

Subsequently, if enough OSDs are taken down for upgrade at the same time, whole placement groups can go down.
The following log shows this timeout being reached even though the chown is still being performed on the node.

juju log:
2018-04-09 15:56:27 INFO juju-log node-15 is not finished. Waiting
2018-04-09 15:56:27 INFO config-changed obtained 'osd_node-15_jewel_start'
2018-04-09 15:56:27 DEBUG worker.uniter.jujuc server.go:178 hook context id "ceph-osd/3-config-changed-2244033196891562383"; dir "/var/lib/juju/agents/unit-ceph-osd-3/charm"
2018-04-09 15:56:27 INFO juju-log Waited 10 mins on node node-15. current time: 1523288787.3321168 > previous node start time: 1523288768.8523903 Moving on
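
The failure mode described above can be illustrated with a minimal sketch of a fixed-deadline check. This is not the charm's actual code; the function name and parameters are hypothetical, and only the 10-minute limit comes from this report. The point is that a check based solely on the start time cannot tell a dead node from a slow one.

```python
import time

# The 10-minute limit described in this report.
FIXED_TIMEOUT_SECS = 10 * 60


def previous_node_presumed_dead(previous_start, now=None,
                                timeout=FIXED_TIMEOUT_SECS):
    """Hypothetical helper: True once `timeout` seconds have elapsed
    since the previous node recorded its start time."""
    if now is None:
        now = time.time()
    return (now - previous_start) > timeout


# A node still busy with a long chown looks exactly like a dead node
# once the fixed limit has passed.
start = 0.0
print(previous_node_presumed_dead(start, now=9 * 60))   # False
print(previous_node_presumed_dead(start, now=11 * 60))  # True
```

With only a start timestamp to go on, the waiting unit proceeds at the 10-minute mark regardless of whether the previous unit is still making progress.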


James Page (james-page) wrote on 2018-04-11:

I think general best practice would be to use the 'pause-health' and 'resume-health' actions on one of the ceph-mon units before commencing the ceph-osd upgrade; however, there is another bug here in that the charm moves on after 10 minutes of waiting, which is insufficient.


James Page (james-page) wrote on 2018-04-11:

OSDs are upgraded in turn, so I think a general heartbeat timestamp from the unit currently performing the upgrade is probably a good idea - for units with lots of OSDs and lots of data, 10 minutes is just too short...
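
One way the heartbeat idea could look on the upgrading unit is sketched below. Everything here is an assumption for illustration: the `Heartbeat` class, the plain dict standing in for the shared key store, and the key name `osd_node-15_jewel_alive` (modelled on the `osd_node-15_jewel_start` key seen in the log) are hypothetical and not taken from the charm.

```python
import threading
import time


class Heartbeat:
    """Hypothetical helper: refresh a timestamp under `key` in `store`
    every `interval` seconds for as long as the `with` block runs."""

    def __init__(self, store, key, interval=60.0, clock=time.time):
        self._store = store
        self._key = key
        self._interval = interval
        self._clock = clock
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # Event.wait() returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval):
            self._store[self._key] = self._clock()

    def __enter__(self):
        self._store[self._key] = self._clock()
        self._thread.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self._stop.set()
        self._thread.join()
        return False


# Usage: the sleep stands in for the long-running recursive chown.
store = {}
with Heartbeat(store, "osd_node-15_jewel_alive", interval=0.05):
    time.sleep(0.2)
```

A waiting unit would then compare the current time against this timestamp rather than against the start time alone.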

Changed in charm-ceph-osd:
status:	New → Triaged
importance:	Undecided → High

Alex Kavanagh (ajkavanagh) on 2019-11-08

tags:	added: ceph-upgrade

Alex Kavanagh (ajkavanagh) on 2020-07-02

Changed in charm-ceph-osd:
assignee:	nobody → Alex Kavanagh (ajkavanagh)
status:	Triaged → In Progress


OpenStack Infra (hudson-openstack) wrote on 2020-07-06: Fix proposed to charm-ceph-osd (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739535


OpenStack Infra (hudson-openstack) wrote on 2020-07-17: Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/739535
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=6b0a11b4048027d7de11d28bbf388bd0c337400f
Submitter: Zuul
Branch: master

commit 6b0a11b4048027d7de11d28bbf388bd0c337400f
Author: Alex Kavanagh <email address hidden>
Date: Mon Jul 6 17:13:32 2020 +0100

Add a progress watchdog for OSD upgrades

    This patch (in charms.ceph [1], copied here) adds the concept of a
    watchdog to the upgrade_monitor so that the charm can achieve two
    objectives of 1. Waiting for much longer, but 2. detecting whether the
    previous node has died / gone away. This is needed for 'large' OSDs
    where the time to upgrade a node may exceed the current limit of 10
    minutes, but also not to wait for 30 minutes on a dead previous node.
    The watchdog implements two timeouts and an additional 'alive' key from
    the previous node to indicate that it is still running. Otherwise,
    functionality is identical.

    [1] See depends on below
    Depends-On: Ia450e936c2096f092af3be5a369b7abaf5023b16
    Closes-Bug: #1762852

Change-Id: I6204a5ade684f0564c4be2d30df467c75baa6dba
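
The two-timeout watchdog described in the commit message can be sketched roughly as follows. This is not the merged implementation: the function name, key names, return values, and the specific timeout values (10 minutes without an 'alive' update, 30 minutes overall) are assumptions chosen to match the figures mentioned in this report, and `get_key` is a hypothetical stand-in for reading a key from the monitor cluster.

```python
import time


def wait_on_previous_node(get_key, start_key, alive_key, done_key,
                          alive_timeout=10 * 60, max_wait=30 * 60,
                          poll=5.0, clock=time.time, sleep=time.sleep):
    """Hypothetical watchdog. Returns 'done' when the previous node has
    finished, 'dead' when it has stopped reporting progress, and
    'timeout' when the overall wait limit is exceeded."""
    wait_started = clock()
    while True:
        if get_key(done_key) is not None:
            return "done"
        now = clock()
        started = get_key(start_key)
        alive = get_key(alive_key)
        # Prefer the most recent sign of life; fall back to the start time.
        last_seen = alive if alive is not None else started
        if last_seen is not None and now - float(last_seen) > alive_timeout:
            return "dead"
        if now - wait_started > max_wait:
            return "timeout"
        sleep(poll)
```

The short timeout bounds how long a silent node is tolerated, while the long timeout bounds the total wait even for a node that keeps reporting that it is alive.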

Changed in charm-ceph-osd:
status:	In Progress → Fix Committed

Alex Kavanagh (ajkavanagh) on 2020-07-23

Changed in charm-ceph-osd:
assignee:	Alex Kavanagh (ajkavanagh) → nobody

James Page (james-page) on 2020-08-03

Changed in charm-ceph-osd:
milestone:	none → 20.08

Alex Kavanagh (ajkavanagh) on 2020-08-14

Changed in charm-ceph-osd:
status:	Fix Committed → Fix Released
