Ceph-mon units are stuck on waiting state after charm upgrade

Bug #1861996 reported by Giuseppe Petralia
This bug affects 2 people
Affects             Status        Importance  Assigned to  Milestone
Ceph Monitor Charm  Invalid       High        Unassigned
Ceph OSD Charm      Fix Released  High        James Page

Bug Description

After a charm upgrade from 18.05 to 19.10, the ceph-mon units are stuck in the "waiting" state with the message:

"Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count"

It seems that all the relations between ceph-mon and the ceph-osd units are returning bootstrapped-osds = 0.

However, the OSDs are running and cluster health is not affected.

$ juju status ceph-mon
Model      Controller        Cloud/Region  Version  SLA          Timestamp
openstack  juju-controller1  prodmaas      2.7.0    unsupported  10:46:57+01:00

App       Version  Status   Scale  Charm     Store       Rev  OS      Notes
ceph-mon  10.2.11  waiting  3      ceph-mon  local       1    ubuntu
filebeat  6.8.6    active   3      filebeat  jujucharms  25   ubuntu
nrpe-lxd           active   3      nrpe      jujucharms  60   ubuntu
telegraf           active   3      telegraf  jujucharms  29   ubuntu

Unit           Workload  Agent  Machine   Public address  Ports          Message
ceph-mon/0     waiting   idle   4/lxd/0   10.116.178.19                  Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (36)
  filebeat/6   active    idle             10.116.178.19                  Filebeat ready.
  nrpe-lxd/0   active    idle             10.116.178.19   icmp,5666/tcp  ready
  telegraf/12  active    idle             10.116.178.19   9103/tcp       Monitoring ceph-mon/0
ceph-mon/1     waiting   idle   7/lxd/0   10.116.178.21                  Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (36)
  filebeat/16  active    idle             10.116.178.21                  Filebeat ready.
  nrpe-lxd/2   active    idle             10.116.178.21   icmp,5666/tcp  ready
  telegraf/14  active    idle             10.116.178.21   9103/tcp       Monitoring ceph-mon/1
ceph-mon/2*    waiting   idle   10/lxd/0  10.116.178.22                  Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (36)
  filebeat/24  active    idle             10.116.178.22                  Filebeat ready.
  nrpe-lxd/9   active    idle             10.116.178.22   icmp,5666/tcp  ready
  telegraf/21  active    idle             10.116.178.22   9103/tcp       Monitoring ceph-mon/2

$ juju status ceph-osd
Unit                Workload  Agent  Machine  Public address   Ports          Message
ceph-osd/0          active    idle   3        213.173.196.139                 Unit is ready (3 OSD)
  nrpe-physical/29  active    idle            213.173.196.139                 ready
ceph-osd/1          active    idle   4        213.173.196.193                 Unit is ready (3 OSD)
  nrpe-physical/13  active    idle            213.173.196.193  icmp,5666/tcp  ready
ceph-osd/2          active    idle   5        213.173.196.138                 Unit is ready (3 OSD)
  nrpe-physical/9   active    idle            213.173.196.138  icmp,5666/tcp  ready
ceph-osd/3          active    idle   6        213.173.196.192                 Unit is ready (3 OSD)
  nrpe-physical/8   active    idle            213.173.196.192  icmp,5666/tcp  ready
ceph-osd/4          active    idle   7        213.173.196.140                 Unit is ready (3 OSD)
  nrpe-physical/33  active    idle            213.173.196.140                 ready
ceph-osd/5          active    idle   8        213.173.196.190                 Unit is ready (3 OSD)
  nrpe-physical/18  active    idle            213.173.196.190  icmp,5666/tcp  ready
ceph-osd/6          active    idle   9        213.173.196.141                 Unit is ready (3 OSD)
  nrpe-physical/11  active    idle            213.173.196.141  icmp,5666/tcp  ready
ceph-osd/7          active    idle   10       213.173.196.142                 Unit is ready (3 OSD)
  nrpe-physical/14  active    idle            213.173.196.142                 ready
ceph-osd/8          active    idle   11       213.173.196.143                 Unit is ready (3 OSD)
  nrpe-physical/7   active    idle            213.173.196.143                 ready
ceph-osd/9*         active    idle   12       213.173.196.200                 Unit is ready (3 OSD)
  nrpe-physical/21  active    idle            213.173.196.200                 ready
ceph-osd/10         active    idle   13       213.173.196.135                 Unit is ready (3 OSD)
  nrpe-physical/3   active    idle            213.173.196.135                 ready
ceph-osd/11         active    idle   14       213.173.196.136                 Unit is ready (3 OSD)
  nrpe-physical/1   active    idle            213.173.196.136  icmp,5666/tcp  ready

Giuseppe Petralia (peppepetra) wrote :

Workaround:

To check the bootstrapped-osds relation data, run:
juju run --unit ceph-mon/0 'relation-list -r $(relation-ids osd) | xargs -I{} sh -c '\''echo {}; relation-get -r $(relation-ids osd) - {}; echo'\'''
It should show bootstrapped-osds: "0"
To fix it, set the value manually on all ceph-osd units (making sure the OSD count is the correct one).
First, get the RELATION id by executing:
juju run --unit ceph-mon/0 relation-ids osd (for example osd:46)
Then set the value on every ceph-osd unit:
juju run --application ceph-osd "relation-set -r RELATION bootstrapped-osds=3"
Monitor with the first command; if some ceph-osd units still do not report the correct data, set the value manually per unit:
juju run --unit ceph-osd/8 "relation-set -r RELATION bootstrapped-osds=3"
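
The per-unit step can be scripted when units carry different OSD counts. The following is a minimal sketch, not part of either charm: it assumes the OSD count of each unit can be read from the "Unit is ready (N OSD)" status message shown in the juju status output, and it only prints the relation-set commands from the workaround instead of running them. The function names and the regular expression are illustrative assumptions, and RELATION still has to be replaced with the real relation id.

```python
import re

# Matches unit lines such as:
#   ceph-osd/8 active idle 11 213.173.196.143 Unit is ready (3 OSD)
# Assumption: the workload message has the form "Unit is ready (N OSD)".
UNIT_LINE = re.compile(r"^(ceph-osd/\d+)\*?\s.*Unit is ready \((\d+) OSD\)\s*$")


def osd_counts(status_text):
    """Return {unit_name: osd_count} parsed from `juju status ceph-osd` text."""
    counts = {}
    for line in status_text.splitlines():
        match = UNIT_LINE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def workaround_commands(counts, relation="RELATION"):
    """Build one relation-set command per unit, ordered by unit number."""
    ordered = sorted(counts.items(), key=lambda item: int(item[0].split("/")[1]))
    return [
        'juju run --unit {} "relation-set -r {} bootstrapped-osds={}"'.format(
            unit, relation, count
        )
        for unit, count in ordered
    ]


if __name__ == "__main__":
    sample = (
        "ceph-osd/8 active idle 11 213.173.196.143 Unit is ready (3 OSD)\n"
        "  nrpe-physical/7 active idle 213.173.196.143 ready\n"
        "ceph-osd/9* active idle 12 213.173.196.200 Unit is ready (3 OSD)\n"
    )
    for command in workaround_commands(osd_counts(sample)):
        print(command)
```

Summing the parsed counts also gives a quick sanity check against the expected-osd-count value reported by ceph-mon (36 in this deployment).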

Andrew McLeod (admcleod)
Changed in charm-ceph-mon:
importance: Undecided → Medium
Changed in charm-ceph-osd:
importance: Undecided → Medium
tags: added: charm-upgrade
Andrew McLeod (admcleod) wrote :

What distribution/cloud/source/release does this relate to?

Changed in charm-ceph-mon:
status: New → Triaged
Changed in charm-ceph-osd:
status: New → Incomplete
status: Incomplete → Triaged
Changed in charm-ceph-mon:
status: Triaged → Incomplete
Changed in charm-ceph-osd:
status: Triaged → Incomplete
Giuseppe Petralia (peppepetra) wrote :

Cloud is OpenStack Ocata.
Ubuntu 16.04

Charms:

ceph-osd revision 268
ceph-mon revision 25

Ceph packages:

~# dpkg -l | grep ceph
ii ceph 10.2.11-0ubuntu0.16.04.1 amd64 distributed storage and file system
ii ceph-common 10.2.11-0ubuntu0.16.04.1 amd64 common utilities to mount and interact with a ceph storage cluster
ii libcephfs1 10.2.11-0ubuntu0.16.04.1 amd64 Ceph distributed file system client library
ii python-ceph 10.2.11-0ubuntu0.16.04.1 all Meta-package for python libraries for the Ceph libraries
ii python-cephfs 10.2.11-0ubuntu0.16.04.1 amd64 Python libraries for the Ceph libcephfs library

Changed in charm-ceph-mon:
status: Incomplete → New
Changed in charm-ceph-osd:
status: Incomplete → New
Peter Sabaini (peter-sabaini) wrote :

Some additional notes:

 - In ceph-osd commit 63f9ac2c7cd0db8f212bba6d278af2b0316b7760 the
   bootstrapped-osds relation attribute was added; it didn't exist
   before then

 - So to get the bootstrapped-osds relation value at all you need to
   upgrade not only ceph-mon but also ceph-osd

 - In commit ed99bd2b5708d0494a9139e6805a03940b41eefb for ceph-mon
   a check for ~sufficient_osds()~ was introduced; the commit message
   notes that upgraders should get the required relation data
   post-upgrade, but the bootstrapped-osds value is only set in
   ~prepare_disks_and_activate()~, which we don't trigger during an
   upgrade
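
To illustrate the interaction described in these notes, here is a simplified model of the check, not the charm's actual code. It assumes that ceph-mon sums the bootstrapped-osds value published by each ceph-osd unit and compares the total with expected-osd-count; the function signature and the dict layout are hypothetical. A unit that has published "0", or has never published the key, contributes nothing to the total.

```python
def sufficient_osds(relation_data, expected_osd_count):
    """Simplified stand-in for the ceph-mon check (hypothetical signature).

    relation_data maps a ceph-osd unit name to the settings it has published
    on the relation, e.g. {"ceph-osd/0": {"bootstrapped-osds": "3"}}.
    """
    bootstrapped = 0
    for settings in relation_data.values():
        # Relation values are strings; a missing key is treated as 0.
        bootstrapped += int(settings.get("bootstrapped-osds") or 0)
    return bootstrapped >= expected_osd_count


# 12 units with 3 OSDs each, as in the juju status output of this report.
units = ["ceph-osd/{}".format(n) for n in range(12)]

# After the upgrade: the OSDs are running, but no unit has published a count.
after_upgrade = {unit: {} for unit in units}

# After the workaround (or any hook run that publishes the value).
after_workaround = {unit: {"bootstrapped-osds": "3"} for unit in units}

print(sufficient_osds(after_upgrade, 36))     # False
print(sufficient_osds(after_workaround, 36))  # True
```

In this model the monitors stay in "waiting" even though all 36 OSDs are up, because the check relies on relation data rather than on cluster state.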

James Troup (elmo) wrote :

I just ran into this on another cloud upgrade (same versions as in #3), subscribing field-high.

Ryan Beisner (1chb1n)
Changed in charm-ceph-mon:
importance: Medium → High
Changed in charm-ceph-osd:
importance: Medium → High
James Page (james-page) wrote :

I wonder if this is a casualty of config-changed being skipped when the charm's configuration has not actually changed: prepare_disks_and_activate is called as part of a config-changed execution, but not as part of the upgrade-charm hook.
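
This hypothesis can be modelled in a few lines. It is a toy sketch under stated assumptions, not the charm's hook code: the hook registry, the relation_data dict, the publish_on_upgrade switch, and the publish_bootstrapped_osds helper are all hypothetical, and only the hook names (config-changed, upgrade-charm) and prepare_disks_and_activate come from the discussion. It shows that if the value is published only on the config-changed path and that hook is skipped, an upgrade leaves the relation without a value, while also publishing from upgrade-charm closes the gap.

```python
HOOKS = {}
relation_data = {}  # stands in for what this ceph-osd unit publishes to ceph-mon


def hook(name):
    """Register a function as the handler for a named hook (toy registry)."""
    def register(func):
        HOOKS[name] = func
        return func
    return register


def publish_bootstrapped_osds(count=3):
    # Hypothetical helper: a real charm would count the OSDs on the unit.
    relation_data["bootstrapped-osds"] = str(count)


def prepare_disks_and_activate():
    # Disk preparation is omitted; only the relation update matters here.
    publish_bootstrapped_osds()


@hook("config-changed")
def config_changed():
    prepare_disks_and_activate()


@hook("upgrade-charm")
def upgrade_charm(publish_on_upgrade=False):
    # publish_on_upgrade=True models a fix that publishes during the upgrade.
    if publish_on_upgrade:
        publish_bootstrapped_osds()


def run_upgrade(config_was_changed, publish_on_upgrade):
    """Run the hooks an upgrade would trigger and return the published data."""
    relation_data.clear()
    HOOKS["upgrade-charm"](publish_on_upgrade)
    if config_was_changed:  # config-changed is skipped when nothing changed
        HOOKS["config-changed"]()
    return dict(relation_data)


print(run_upgrade(config_was_changed=False, publish_on_upgrade=False))  # {}
print(run_upgrade(config_was_changed=False, publish_on_upgrade=True))   # {'bootstrapped-osds': '3'}
```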

James Page (james-page) wrote :
Changed in charm-ceph-osd:
assignee: nobody → James Page (james-page)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/732101
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-osd/commit/?id=21ffd9d18b6d91b0526b1714a0eb029401610266
Submitter: Zuul
Branch: master

commit 21ffd9d18b6d91b0526b1714a0eb029401610266
Author: James Page <email address hidden>
Date: Mon Jun 1 08:41:42 2020 +0100

    Ensure bootstrapped OSD presented to ceph-mon

    On charm upgrade ensure that the number of bootstrapped OSD's
    is presented to the ceph-mon application.

    This ensures that the ceph-mon application does not switch
    into a 'waiting' state after upgrade from earlier versions
    of the ceph-* charms.

    Change-Id: If1425ef837a74212f002985f648ac1ecf9257201
    Closes-Bug: 1861996

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Alex Kavanagh (ajkavanagh) wrote :

Is there still a bug on ceph-mon? I.e., was it only a bug on ceph-osd, and ceph-mon was already doing the "right thing"?

Changed in charm-ceph-mon:
status: New → Incomplete
James Page (james-page)
Changed in charm-ceph-mon:
status: Incomplete → Invalid
Changed in charm-ceph-osd:
milestone: none → 20.08
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/20.05)

Fix proposed to branch: stable/20.05
Review: https://review.opendev.org/742411

OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-osd (stable/20.05)

Change abandoned by James Page (<email address hidden>) on branch: stable/20.05
Review: https://review.opendev.org/742411
Reason: Close to release so abandoning this change

Changed in charm-ceph-osd:
status: Fix Committed → Fix Released
tags: added: openstack-upgrade