Luminous -> Mimic upgrade fails on one osd unit, no alert in juju status

Bug #1923200 reported by Michael Skalka
This bug affects 2 people
Affects: Ceph OSD Charm
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

While upgrading the ceph cluster from luminous to mimic in a Bionic Queens OpenStack deployment on the latest stable charms, 6 of the 7 ceph-osd units upgraded successfully; however, one unit stayed on luminous:

root@obayifo:~# ceph-osd --version
ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)

root@obayifo:~# apt-cache policy ceph-osd
ceph-osd:
  Installed: 12.2.13-0ubuntu0.18.04.6
  Candidate: 12.2.13-0ubuntu0.18.04.6
  Version table:
 *** 12.2.13-0ubuntu0.18.04.6 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     12.2.13-0ubuntu0.18.04.4 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     12.2.4-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

Whereas on a healthy node:

root@waldron:~# ceph-osd --version
ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)

root@waldron:~# apt-cache policy ceph-osd
ceph-osd:
  Installed: 13.2.8-0ubuntu0.18.10.1~cloud0
  Candidate: 13.2.8-0ubuntu0.18.10.1~cloud0
  Version table:
 *** 13.2.8-0ubuntu0.18.10.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/rocky/main amd64 Packages
        100 /var/lib/dpkg/status
     12.2.13-0ubuntu0.18.04.6 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
     12.2.13-0ubuntu0.18.04.4 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     12.2.4-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

From what I can tell in the logs, the charm caught the upgrade request but somehow thought it was already on mimic:

2021-04-09 12:47:28 DEBUG juju.machinelock machinelock.go:172 machine lock acquired for ceph-osd/2 uniter (run config-changed hook)
2021-04-09 12:47:28 DEBUG juju.worker.uniter.operation executor.go:132 preparing operation "run config-changed hook" for ceph-osd/2
2021-04-09 12:47:28 DEBUG juju.worker.uniter.operation executor.go:132 executing operation "run config-changed hook" for ceph-osd/2
2021-04-09 12:47:28 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running config-changed hook
2021-04-09 12:47:28 DEBUG juju.worker.uniter.runner runner.go:705 starting jujuc server {unix @/var/lib/juju/agents/unit-ceph-osd-2/agent.socket <nil>}
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 DEBUG juju-log Hardening function 'config_changed'
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "config-get" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 DEBUG juju-log No hardening applied to 'config_changed'
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 INFO juju-log old_version: mimic
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 INFO juju-log new_version: mimic
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 WARNING juju-log Support for use of upstream ``apt_pkg`` module in conjunctionwith charm-helpers is deprecated since 2019-06-25
2021-04-09 12:47:29 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:29 ERROR juju-log Invalid upgrade path from mimic to mimic. Valid paths are: ['firefly -> hammer', 'hammer -> jewel', 'jewel -> luminous', 'luminous -> mimic', 'mimic -> nautilus', 'nautilus -> octopus']
2021-04-09 12:47:29 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912

This is despite the version obviously being luminous, per the output above. Unattended upgrades are turned off for this cloud, so nothing should have touched the ceph-osd packages other than entering the new release configuration "cloud:bionic-rocky" into the osd charm.

Juju status for reference:

Model Controller Cloud/Region Version SLA Timestamp
openstack foundations-maas maas_cloud 2.8.10 unsupported 13:35:57Z

App Version Status Scale Charm Store Rev OS Notes
bcache-tuning active 7 bcache-tuning jujucharms 5 ubuntu
ceph-osd 13.2.8 active 7 ceph-osd jujucharms 308 ubuntu
lldpd active 0 lldpd jujucharms 7 ubuntu
neutron-openvswitch 12.1.1 active 0 neutron-openvswitch jujucharms 280 ubuntu
nova-compute 17.0.13 active 7 nova-compute jujucharms 325 ubuntu
nrpe-host active 7 nrpe jujucharms 70 ubuntu
ntp 3.2 active 7 ntp jujucharms 45 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-osd/0* active idle 0 10.244.49.66 Unit is ready (1 OSD)
  bcache-tuning/4 active idle 10.244.49.66 bcache devices tuned
  nrpe-host/4 active idle 10.244.49.66 icmp,5666/tcp ready
  ntp/5 active idle 10.244.49.66 123/udp chrony: Ready
ceph-osd/1 active idle 1 10.244.49.67 Unit is ready (1 OSD)
  bcache-tuning/5 active idle 10.244.49.67 bcache devices tuned
  nrpe-host/5 active idle 10.244.49.67 icmp,5666/tcp ready
  ntp/6 active idle 10.244.49.67 123/udp chrony: Ready
ceph-osd/2 active idle 2 10.244.49.68 Unit is ready (1 OSD)
  bcache-tuning/2 active idle 10.244.49.68 bcache devices tuned
  nrpe-host/2* active idle 10.244.49.68 icmp,5666/tcp ready
  ntp/3 active idle 10.244.49.68 123/udp chrony: Ready
ceph-osd/3 active idle 3 10.244.49.72 Unit is ready (1 OSD)
  bcache-tuning/6 active idle 10.244.49.72 bcache devices tuned
  nrpe-host/6 active idle 10.244.49.72 icmp,5666/tcp ready
  ntp/7 active idle 10.244.49.72 123/udp chrony: Ready
ceph-osd/4 active idle 4 10.244.49.71 Unit is ready (1 OSD)
  bcache-tuning/0* active idle 10.244.49.71 bcache devices tuned
  nrpe-host/1 active idle 10.244.49.71 icmp,5666/tcp ready
  ntp/2 active idle 10.244.49.71 123/udp chrony: Ready
ceph-osd/5 active idle 5 10.244.49.70 Unit is ready (1 OSD)
  bcache-tuning/1 active idle 10.244.49.70 bcache devices tuned
  nrpe-host/0 active idle 10.244.49.70 icmp,5666/tcp ready
  ntp/1 active idle 10.244.49.70 123/udp chrony: Ready
ceph-osd/6 active idle 6 10.244.49.73 Unit is ready (1 OSD)
  bcache-tuning/3 active idle 10.244.49.73 bcache devices tuned
  nrpe-host/3 active idle 10.244.49.73 icmp,5666/tcp ready
  ntp/4 active idle 10.244.49.73 123/udp chrony: Ready

Revision history for this message
Michael Skalka (mskalka) wrote :

Subscribing field-high as this has impacted an SQA upgrade test

Revision history for this message
Michael Skalka (mskalka) wrote :

crashdump

Michael Skalka (mskalka)
tags: added: charm-upgrade
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In summary: I think this is likely a side effect of calling config-get too late in the previous hook execution on ceph-osd/2, capturing the updated config value and writing it to .juju-persistent-config (that's how inter-hook-execution config changes are tracked in charmhelpers). As a result the next hook execution doesn't think that the value has changed.

This is due to behavior in Juju where the config values a unit sees are captured at the time it calls config-get, rather than being fixed at the beginning of the hook execution (this is by design in Juju).

Specifically in the ceph-osd charm case the "source" config key value is used to determine the old version and compare it to the new version.

My suggestion would be to take the installed version into account here rather than just relying on the previous config value persisted by charm-helpers in the previous hook execution.
https://github.com/openstack/charm-ceph-osd/blob/43c93d9749d49afc991b3325a96a4ccc4def75f8/hooks/ceph_hooks.py#L134-L139
https://github.com/openstack/charm-ceph-osd/blob/43c93d9749d49afc991b3325a96a4ccc4def75f8/lib/charms_ceph/utils.py#L3179-L3187

https://github.com/openstack/charm-ceph-osd/blob/43c93d9749d49afc991b3325a96a4ccc4def75f8/hooks/charmhelpers/core/hookenv.py#L392 (config.previous)
https://github.com/openstack/charm-ceph-osd/blob/43c93d9749d49afc991b3325a96a4ccc4def75f8/hooks/charmhelpers/core/hookenv.py#L345 (.juju-persistent-config)
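
To make the failure mode concrete, here is a minimal Python sketch (this is not the actual charm-helpers or ceph-osd code; the .juju-persistent-config file name and the 'source' key come from the links above, while the release mapping and function names are simplified assumptions):

import json
import os

# '.juju-persistent-config' is written by charm-helpers in the unit's charm
# directory; it holds the config as seen by the previous hook execution.
PERSIST_FILE = '.juju-persistent-config'

# Assumed mapping for illustration: on bionic, 'distro' resolves to luminous
# and 'cloud:bionic-rocky' resolves to mimic.
RELEASE = {'distro': 'luminous', 'cloud:bionic-rocky': 'mimic'}

def load_previous():
    """Return the config snapshot persisted at the end of the last hook run."""
    if os.path.exists(PERSIST_FILE):
        with open(PERSIST_FILE) as f:
            return json.load(f)
    return {}

def save_current(current):
    """Persist the config this hook saw; it becomes 'previous' next time."""
    with open(PERSIST_FILE, 'w') as f:
        json.dump(current, f)

def check_for_upgrade(current):
    old_version = RELEASE.get(load_previous().get('source', 'distro'))
    new_version = RELEASE.get(current.get('source', 'distro'))
    if old_version == new_version:
        # Failure mode on ceph-osd/2: a hook that ran after 'source' had
        # already been changed to cloud:bionic-rocky called config-get and
        # persisted that value, so old == new here and the charm logs
        # "Invalid upgrade path from mimic to mimic" instead of upgrading.
        print('Invalid upgrade path from %s to %s' % (old_version, new_version))
        return
    print('Upgrading from %s to %s' % (old_version, new_version))

On a healthy unit, the last persisted snapshot still had source=distro when config-changed fired, so the old value resolves to luminous and the new one to mimic; on ceph-osd/2 a late config-get in the previous hook execution had already persisted cloud:bionic-rocky, so both sides resolve to mimic and the upgrade is skipped.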

More info from the deployment below:

ubuntu@playground-cpe-9f305f1c-537b-477e-afca-1ac364a24afd:~$ juju run --application ceph-osd 'ceph-osd --version'
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/1
- Stdout: |
    ceph version 12.2.13 (584a20eb0237c657dc0567da126be145106aa47e) luminous (stable)
  UnitId: ceph-osd/2
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/5
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/0
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/3
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/4
- Stdout: |
    ceph version 13.2.8 (5579a94fafbc1f9cc913a0f5d362953a5d9c3ae0) mimic (stable)
  UnitId: ceph-osd/6

juju ssh ceph-osd/2

/var/log/juju/unit-ceph-osd-2.log
# ...
2021-04-09 12:47:28 DEBUG juju.worker.uniter.operation executor.go:132 executing operation "run config-changed hook" for ceph-osd/2
2021-04-09 12:47:28 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running config-changed hook
2021-04-09 12:47:28 DEBUG juju.worker.uniter.runner runner.go:705 starting jujuc server {unix @/var/lib/juju/agents/unit-ceph-osd-2/agent.socket <nil>}
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "juju-log" for ceph-osd/2-config-changed-6219614913432438912
2021-04-09 12:47:28 DEBUG juju-log Hardening function 'config_changed'
2021-04-09 12:47:28 DEBUG jujuc server.go:211 running hook tool "config-get" for ceph-osd/2-config-changed-6219614913432438912
2021-04...


Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Billy Olsen (billy-olsen) wrote :

A workaround for this is to reset the persistently-saved config on the unit to what it was previously and re-run the config-changed hook. For example, if the original source value was 'distro' and the new source value is 'cloud:bionic-rocky', then the following sequence should work:

juju run --unit ceph-osd/<unit#> -- sed -i 's/cloud:bionic-rocky/distro/g' .juju-persistent-config
juju run --unit ceph-osd/<unit#> -- hooks/config-changed
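
For context on why this works: the sed command restores the persisted "previous" value of the source option to 'distro', so when config-changed is re-run the charm again sees a luminous -> mimic transition and takes the upgrade path. The relative paths assume the commands run in the unit's charm directory (where charm-helpers writes .juju-persistent-config), which should be the working directory for juju run --unit.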

Revision history for this message
Michael Skalka (mskalka) wrote :

Dropping the field-high subscription as the immediate issue has been resolved.
