OSDs are not starting after upgrade from Mimic to Nautilus

Bug #1871362 reported by Vladimir Grevtsev
This bug report is a duplicate of: Bug #1861789: [SRU] ceph 14.2.8
This bug affects 1 person
Affects: Ceph OSD Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

ubuntu@OrangeBox84:~/fce-demo$ juju status
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 10:56:56Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 13.2.7 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 13.2.7 active 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active idle 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* active idle 0 172.27.85.186 Unit is ready (1 OSD)
  ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 active idle 1 172.27.85.187 Unit is ready (1 OSD)
  ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 active idle 2 172.27.85.188 Unit is ready (1 OSD)
  ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 active idle 3 172.27.85.189 Unit is ready (1 OSD)
  ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started

ubuntu@OrangeBox84:~/fce-demo$ juju config ceph-mon source
cloud:bionic-stein

ubuntu@OrangeBox84:~/fce-demo$ juju config ceph-mon source="cloud:bionic-train"
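
# (side note: at any point the running mon daemon versions can be cross-checked
# against the installed package versions with the stock "ceph versions" command, e.g.:)
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-mon/0 'sudo ceph versions'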

# wait for the upgrade to finish; everything looks OK
ubuntu@OrangeBox84:~/fce-demo$ juju status
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 11:03:41Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 14.2.4 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 13.2.7 active 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active idle 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* active idle 0 172.27.85.186 Unit is ready (1 OSD)
  ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 active idle 1 172.27.85.187 Unit is ready (1 OSD)
  ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 active idle 2 172.27.85.188 Unit is ready (1 OSD)
  ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 active idle 3 172.27.85.189 Unit is ready (1 OSD)
  ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started

# check package version
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-mon/0 'sudo apt-cache policy ceph-mon'
ceph-mon:
  Installed: 14.2.4-0ubuntu0.19.10.1~cloud0
  Candidate: 14.2.4-0ubuntu0.19.10.1~cloud0
  Version table:
 *** 14.2.4-0ubuntu0.19.10.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main amd64 Packages
        100 /var/lib/dpkg/status
     12.2.12-0ubuntu0.18.04.5 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     12.2.4-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages
Connection to 172.27.85.191 closed.

# check cluster health; everything looks fine except for a known warning (https://bugs.launchpad.net/charm-ceph-mon/+bug/1840701)
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-mon/0 'sudo ceph -s'
  cluster:
    id: ecb243d2-78bd-11ea-abad-00163ec99c97
    health: HEALTH_WARN
            3 monitors have not enabled msgr2

  services:
    mon: 3 daemons, quorum juju-5a5ca4-2-lxd-0,juju-5a5ca4-0-lxd-0,juju-5a5ca4-1-lxd-0 (age 3m)
    mgr: juju-5a5ca4-1-lxd-0(active), standbys: juju-5a5ca4-2-lxd-0, juju-5a5ca4-0-lxd-0
    osd: 4 osds: 4 up, 4 in
    rgw: 1 daemon active (juju-5a5ca4-3-lxd-0)

  data:
    pools: 15 pools, 62 pgs
    objects: 187 objects, 1.1 KiB
    usage: 4.0 GiB used, 104 GiB / 108 GiB avail
    pgs: 62 active+clean

Connection to 172.27.85.191 closed.
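
# (for reference: once all three mons run Nautilus, the msgr2 warning above can
# normally be cleared with the stock "ceph mon enable-msgr2" command; whether the
# charm should do that itself is what the linked LP#1840701 is about)
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-mon/0 'sudo ceph mon enable-msgr2'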

# proceed to the OSD upgrade
ubuntu@OrangeBox84:~/fce-demo$ juju config ceph-osd source="cloud:bionic-train"

# wait until the agents become idle - the OSDs are now broken

ubuntu@OrangeBox84:~/fce-demo$ juju status
Model Controller Cloud/Region Version SLA Timestamp
ceph-test foundations-maas maas_cloud 2.6.10 unsupported 11:11:08Z

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 14.2.4 active 3 ceph-mon jujucharms 45 ubuntu
ceph-osd 14.2.4 blocked 4 ceph-osd jujucharms 299 ubuntu
ceph-radosgw 13.2.7 active 1 ceph-radosgw jujucharms 285 ubuntu
ntp 3.2 active 4 ntp jujucharms 39 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active executing 0/lxd/0 172.27.85.191 Unit is ready and clustered
ceph-mon/1 active idle 1/lxd/0 172.27.85.192 Unit is ready and clustered
ceph-mon/2* active executing 2/lxd/0 172.27.85.190 Unit is ready and clustered
ceph-osd/0* blocked idle 0 172.27.85.186 No block devices detected using current configuration
  ntp/2 active idle 172.27.85.186 123/udp chrony: Ready
ceph-osd/1 blocked idle 1 172.27.85.187 No block devices detected using current configuration
  ntp/1* active idle 172.27.85.187 123/udp chrony: Ready
ceph-osd/2 blocked idle 2 172.27.85.188 No block devices detected using current configuration
  ntp/0 active idle 172.27.85.188 123/udp chrony: Ready
ceph-osd/3 blocked idle 3 172.27.85.189 No block devices detected using current configuration
  ntp/3 active idle 172.27.85.189 123/udp chrony: Ready
ceph-radosgw/0* active idle 3/lxd/0 172.27.85.193 80/tcp Unit is ready

Machine State DNS Inst id Series AZ Message
0 started 172.27.85.186 node01 bionic default Deployed
0/lxd/0 started 172.27.85.191 juju-5a5ca4-0-lxd-0 bionic default Container started
1 started 172.27.85.187 node02 bionic default Deployed
1/lxd/0 started 172.27.85.192 juju-5a5ca4-1-lxd-0 bionic default Container started
2 started 172.27.85.188 node03 bionic default Deployed
2/lxd/0 started 172.27.85.190 juju-5a5ca4-2-lxd-0 bionic default Container started
3 started 172.27.85.189 node04 bionic default Deployed
3/lxd/0 started 172.27.85.193 juju-5a5ca4-3-lxd-0 bionic default Container started
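
# ("No block devices detected using current configuration" is the charm's blocked
# status message; assuming the OSDs were prepared with ceph-volume lvm, their
# layout should still be listable on the host even while the daemons are down, e.g.:)
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-osd/0 'sudo ceph-volume lvm list'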

# SSH to the OSD node
ubuntu@OrangeBox84:~/fce-demo$ juju ssh ceph-osd/0
ubuntu@node01:~$ sudo apt-cache policy ceph-osd
ceph-osd:
  Installed: 14.2.4-0ubuntu0.19.10.1~cloud0
  Candidate: 14.2.4-0ubuntu0.19.10.1~cloud0
  Version table:
 *** 14.2.4-0ubuntu0.19.10.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/train/main amd64 Packages
        100 /var/lib/dpkg/status
     12.2.12-0ubuntu0.18.04.5 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     12.2.4-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages
ubuntu@node01:~$
ubuntu@node01:~$ sudo systemctl status ceph-osd@3.service
● ceph-osd@3.service - Ceph object storage daemon osd.3
   Loaded: loaded (/lib/systemd/system/ceph-osd@.service; indirect; vendor preset: enabled)
   Active: failed (Result: core-dump) since Tue 2020-04-07 11:09:12 UTC; 2min 32s ago
 Main PID: 58443 (code=dumped, signal=ABRT)

Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Main process exited, code=dumped, status=6/ABRT
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Failed with result 'core-dump'.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Service hold-off time over, scheduling restart.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Scheduled restart job, restart counter is at 3.
Apr 07 11:09:12 node01 systemd[1]: Stopped Ceph object storage daemon osd.3.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Start request repeated too quickly.
Apr 07 11:09:12 node01 systemd[1]: ceph-osd@3.service: Failed with result 'core-dump'.
Apr 07 11:09:12 node01 systemd[1]: Failed to start Ceph object storage daemon osd.3.
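
# (systemd has hit the unit's start limit here ("Start request repeated too
# quickly"), so the failed state needs to be cleared before any further manual
# restart attempt, e.g.:)
ubuntu@node01:~$ sudo systemctl reset-failed ceph-osd@3.service
ubuntu@node01:~$ sudo systemctl start ceph-osd@3.service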

# OSD log
ubuntu@node01:~$ sudo cat /var/log/ceph/ceph-osd.3.log | pastebinit
http://paste.ubuntu.com/p/5MMp3y8YCs/

# attempt to start OSD manually
ubuntu@node01:~$ sudo /usr/bin/ceph-osd -f --cluster ceph --id 3 --setuser ceph --setgroup ceph 2>&1 | pastebinit
http://paste.ubuntu.com/p/rpPt9NqyG9/
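
# (optional check: to rule out on-disk corruption as the cause of the abort, the
# store can be fsck'd while the OSD is down; this assumes BlueStore and the
# default data path for osd.3:)
ubuntu@node01:~$ sudo ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-3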

Vladimir Grevtsev (vlgrevtsev) wrote:

Sounds like this is an upstream bug: https://tracker.ceph.com/issues/42223, with a backport available at https://github.com/ceph/ceph/pull/31644.

The fix was released in Nautilus 14.2.5 (https://docs.ceph.com/docs/master/releases/nautilus/).
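
Once a >= 14.2.5 build is available in the configured pocket (the duplicate target bug #1861789 tracks the 14.2.8 SRU), the affected hosts should in principle recover with a plain package upgrade and a daemon restart, roughly:

sudo apt update && sudo apt install --only-upgrade ceph-osd
sudo systemctl reset-failed && sudo systemctl start ceph-osd.target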

summary: - OSD are not starting after upgrade from Mimic to Nautilus
+ OSDs are not starting after upgrade from Mimic to Nautilus