ceph-osd can't connect after upgrade to focal

Bug #1874939 reported by Christian Huebner
This bug affects 5 people
Affects: ceph (Ubuntu)
Status: Opinion
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Upon upgrading a Ceph node with do-release-upgrade from eoan to focal, the OSD doesn't connect after the upgrade. I rolled back the change (VBox snapshot) and tried again, same result. I also tried to hold back the Ceph packages and upgrade after the fact, but again same result.

Expected behavior: OSD connects to cluster after upgrade.

Actual behavior: OSD log shows endlessly repeated 'tick_without_osd_lock' messages. OSD will stay down from perspective of the cluster.

Extract from debug log of OSD:

2020-04-24T16:25:35.811-0700 7fd70e83d700 5 osd.0 16499 heartbeat osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta 0x3fffd441), peers [] op hist [])
2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 check_full_status cur ratio 0.769796, physical ratio 0.769796, new state none
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700
2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 queries
2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start
2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish
2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700

This repeats over and over again.

strace of the process yields lots of unfinished futex access attempts:

[pid 2130] futex(0x55b17b8e216c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79
[pid 2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2200] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1
[pid 2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2200] futex(0x7ffc4aa8b770, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2100] <... futex resumed> ) = 0
[pid 2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2127] <... clock_nanosleep resumed> NULL) = 0
[pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...>
[pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...>
[pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0
[pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...>
[pid 2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 detached

Suspected cause: OSD cannot connect to the monitor.

Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the result was the same.

Research done: I checked Launchpad and the Ceph bug tracker and couldn't find anything similar. I tried restarting the process, rebooting the node, reverting the change and re-upgrading, holding the Ceph packages back and upgrading them manually after do-release-upgrade, and stracing the process.
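
For reference, a minimal set of read-only checks (a sketch only; osd.0 is just an example daemon id) to see whether the stuck OSD is even talking to a monitor:

$ sudo ceph daemon osd.0 status            # ask the stuck OSD over its admin socket
$ ss -tnp | grep -E ':(3300|6789)'         # any TCP sessions to the mon v2/v1 ports?
$ sudo ceph -s                             # cluster view from a working mon node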

Impact: Right now, upgrading Ceph to 20.04 LTS appears to be broken.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: ceph-osd 15.2.1-0ubuntu1
ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30
Uname: Linux 5.4.0-26-generic x86_64
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
CasperMD5CheckResult: skip
Date: Fri Apr 24 16:30:53 2020
ProcEnviron:
 SHELL=/bin/bash
 LANG=en_US.UTF-8
 TERM=xterm-256color
 PATH=(custom, no user)
SourcePackage: ceph
UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago)

Revision history for this message
Dan Hill (hillpd) wrote :

Eoan packages Nautilus, while Focal packages Octopus:
 ceph | 14.2.2-0ubuntu3 | eoan
 ceph | 14.2.4-0ubuntu0.19.10.2 | eoan-security
 ceph | 14.2.8-0ubuntu0.19.10.1 | eoan-updates
 ceph | 15.2.1-0ubuntu1 | focal
 ceph | 15.2.1-0ubuntu2 | focal-proposed

When upgrading your cluster, make sure to follow the Octopus upgrade guidelines [0]. Specifically, the Mon and Mgr nodes must be upgraded and their services restarted before upgrading OSD nodes.

[0] https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus
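
Roughly, for the Ubuntu packages, that ordering could look like the following (a sketch only, assuming the stock systemd targets ceph-mon.target, ceph-mgr.target and ceph-osd.target; restart one node at a time and check health in between):

$ sudo ceph osd set noout                        # avoid rebalancing during the restarts
$ sudo systemctl restart ceph-mon.target         # on each mon node, one at a time
$ sudo ceph mon dump | grep min_mon_release      # should show 15 (octopus) once all mons are done
$ sudo systemctl restart ceph-mgr.target         # on each mgr node
$ sudo systemctl restart ceph-osd.target         # on each osd node, one at a time
$ sudo ceph osd require-osd-release octopus      # once every OSD is running octopus
$ sudo ceph osd unset noout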

Revision history for this message
Christian Huebner (ossarchitect) wrote : Re: [Bug 1874939] Re: ceph-osd can't connect after upgrade to focal

This would work if all nodes have a single function only (mon, mgr, osd). I
tried everything to update the monitors first, but due to the dependencies
between the Ceph packages the monitor and mgr daemons cannot simply be
updated separately from the OSDs. What I don't get, though, is that once all
three monitors and mgrs are updated the OSDs do not fall back in line after
a reboot.
I will try to force the install of ceph-base, ceph-common and mon/mgr and
then force-upgrade the OSDs to test whether that will work. If not, at
least a workflow should be considered that allows upgrading hyper-converged
clusters, which are becoming more and more important for edge sites.

Revision history for this message
Dan Hill (hillpd) wrote :

The same guidelines apply to hyper-converged architectures.

Package updates are not applied until the corresponding service restarts. Ceph packaging does not automatically restart any services; this is by design, so you can safely install the new packages on a hyper-converged host and then control the order in which the service updates are applied.
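
One way to see this in practice (a sketch; osd.0 is just an example daemon id) is to compare the on-disk package version with what each daemon is actually running:

$ dpkg -s ceph-osd | grep ^Version       # version installed on disk
$ sudo ceph daemon osd.0 version         # version the running daemon was started from
$ sudo ceph versions                     # cluster-wide summary of running daemon versions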

Revision history for this message
Christian Huebner (ossarchitect) wrote :

I redid the whole upgrade:
* do-release-upgrade and finished without reboot (all 4 nodes)
** so ceph daemons should not have been restarted
* restarted all ceph mons sequentially
** verified I get octopus as min mon release
* restarted all ceph-mgrs sequentially
** verified that all ceph-mgr daemons are running
* restarted all OSDs
** OSDs show
"2020-04-29T16:25:52.132-0700 7f43d2788700 1 osd.4 16945 tick checking mon for new map"
* All the logs are full of failed futex requests (connection timed out / unfinished)

Revision history for this message
Christian Huebner (ossarchitect) wrote :

I just shut down Ceph on all four nodes completely, then did the do-release-upgrade. Before the upgrade I verified that all Ceph services were down so I would be able to start them in the correct order.

After the upgrade (without reboot!) I found that all Ceph services on all Ceph nodes had been started and thus the upgrade of Ceph again failed.

There needs to be either a warning that do-release-upgrade cannot be used for Ceph upgrades, or do-release-upgrade needs to be fixed so Ceph services are not restarted.

Revision history for this message
Christian Huebner (ossarchitect) wrote :

I tried to do the upgrade by hand: disable all the Ceph services so they are not autostarted, then do the upgrade. (By the way, a manpage has been moved from ceph-deploy to ceph-base, so the apt upgrade fails. do-release-upgrade uses --force-overwrite to get around this, but that's not a clean solution.) The workaround is to first uninstall ceph-deploy and then do the upgrade, but this should be fixed.
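
A sketch of that workaround, assuming ceph-deploy is not otherwise needed on the node:

$ sudo apt remove ceph-deploy      # the moved manpage ships here; removing it avoids the file conflict
$ sudo apt full-upgrade            # or re-run the interrupted upgrade, no --force-overwrite needed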

I restarted all services manually in the correct order. mon and mgr work fine, the OSDs do not.

The result is mostly the same. This time at least all OSDs came up, but like before they hang in peering. I'll continue research on this. The OSDs still log that they are waiting for a new monmap.

Although started from the 15.2.1 binary, they show up in the ceph report as 14.2.8, probably because they have not been converted yet (which should happen automatically when the OSDs connect to the monitors for the first time). The next step is tracing the OSDs to see where they hang, but it is probably still some futex deadlock.

Revision history for this message
Christian Huebner (ossarchitect) wrote :

One note on importance: if someone runs do-release-upgrade on a converged Ceph node, it will destroy the node. So far I have not seen any recovery procedure. The only reason I was able to rapidly redo the upgrade is because it runs on snapshots and thus can be recovered after destruction. That is not something that can be expected of the smaller-scale clusters that are likely to be upgraded earliest.

Revision history for this message
Christian Huebner (ossarchitect) wrote :

I accomplished the upgrade by marking all Ceph packages held, then digging myself through the dependency jungle to upgrade the packages afterwards. This obviously is not a production-ready way to do it, but at least Ceph Octopus is running on 20.04 now.
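
Roughly, the hold-then-upgrade idea looks like this (a sketch only; the package list is illustrative and the real dependency chain is messier than this):

$ sudo apt-mark hold ceph ceph-base ceph-common ceph-mon ceph-mgr ceph-osd
$ sudo do-release-upgrade                  # Ceph packages stay at 14.2.x
# after the release upgrade, on each node:
$ sudo apt-mark unhold ceph ceph-base ceph-common ceph-mon ceph-mgr ceph-osd
$ sudo apt full-upgrade                    # installs 15.2.x but does not restart the daemons
# then restart mons, mgrs and OSDs in that order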

This really needs to be fixed.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ceph (Ubuntu):
status: New → Confirmed
Revision history for this message
Jay Ring (jay-ring) wrote :

Just writing in to confirm this bug.

It's very serious.

Lost a whole node. No real warning. Extremely frustrating.

Revision history for this message
James Page (james-page) wrote :

Working on reproduction for debugging and triage.

Changed in ceph (Ubuntu):
status: Confirmed → In Progress
assignee: nobody → James Page (james-page)
Revision history for this message
James Page (james-page) wrote :

ceph-mon eoan->focal upgrade testing

The ceph-mon@`hostname` systemd units are not restarted until the reboot step of the upgrade process on each node; the mixed-version cluster operated as expected as each mon was upgraded.

Revision history for this message
Jay Ring (jay-ring) wrote :

You may need more than one node to reproduce the problem.

I had a 3 node system.

I ran do-release-upgrade on node 1.

The OSDs on node 1 connected to the monitor quorum, which had un-upgraded monitors on hosts 2 & 3.

The upgraded OSDs on node 1 immediately died and could not be revived.

Revision history for this message
James Page (james-page) wrote :

OK, further fact discovery from my testing.

I have a six-machine cluster deployed - three machines host mon,mgr and three machines host osd.

Upgrading the mon,mgr machines first, followed by the three osd machines, using do-release-upgrade and allowing the tool to reboot each machine at the end, resulted in an upgraded and functioning cluster.

I also validated that the process of upgrading the packages does not stop or restart the daemons - so they will keep running on the 14.2.x series from eoan until either they are restarted OR the do-release-upgrade tool is permitted to reboot the box.

I appreciate that the reporters of this bug are deploying all daemons on all three machines, which is different from what I have tested - I'll look at that next.

However it should be possible to complete do-release-upgrade up to the point where it requests a reboot - don't reboot - drop to the CLI, get all machines to this point, and then:

  restart the mons across all three machines
  restart the mgrs across all three machines
  restart the osds across all three machines

validating health between each step. I'm going to test this now.
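
For the health validation, a few read-only checks (a sketch, run from any node with an admin keyring) are usually enough:

$ sudo ceph -s                                  # overall health and mon quorum
$ sudo ceph versions                            # which daemons are already on 15.2.x
$ sudo ceph mon dump | grep min_mon_release     # flips to 15 (octopus) once all mons are upgraded
$ sudo ceph osd stat                            # all OSDs up/in before moving to the next step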

This is in line with the upstream-documented process for upgrading a ceph cluster:

 https://docs.ceph.com/docs/master/releases/octopus/#upgrading-from-mimic-or-nautilus

After this has been completed a reboot of each machine will be required to complete the release upgrade.

Revision history for this message
James Page (james-page) wrote :

As a side note - even if there is a bug here (and it sounds like there might be) I would recommend placing the mon and mgr daemons in LXD containers on top of the machines hosting the OSDs - this will allow you to manage them independently of the upgrade process, for both Ceph upgrades and Ubuntu release upgrades.
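
As a very rough sketch of that layout (the container name and package set are illustrative; the container needs access to the cluster network and its own mon data directory):

$ lxc launch ubuntu:20.04 ceph-mon-0
$ lxc exec ceph-mon-0 -- apt install ceph-mon ceph-mgr
# then create (or migrate) the mon store inside the container and point mon_host at its address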

Revision history for this message
James Page (james-page) wrote :

Testing phase 2 - three machine all-in-one deploy.

Deployed using eoan - mon,mgr and 1 x osd on each machine

Deployment seeded with pools and lightweight test data - RBDs in each pool.

Each machine upgraded in turn (1,2 and then 0) using do-release-upgrade.

ceph versions checked throughout the upgrade - mixed versions observed.

OSDs booted OK after the machine reboots post do-release-upgrade.

During upgrade process:

$ sudo ceph mon dump | grep min_mon_release
dumped monmap epoch 1
min_mon_release 14 (nautilus)

$ sudo ceph versions
{
    "mon": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "mgr": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "osd": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 1,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 2
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable)": 3,
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 6
    }
}

Post upgrade of last machine:

$ sudo ceph mon dump | grep min_mon_release
dumped monmap epoch 2
min_mon_release 15 (octopus)

$ sudo ceph versions
{
    "mon": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "mgr": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "osd": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.1 (9fd2f65f91d9246fae2c841a6222d34d121680ee) octopus (stable)": 9
    }
}

Revision history for this message
James Page (james-page) wrote :

Marking 'Incomplete' for now as unable to reproduce.

Changed in ceph (Ubuntu):
status: In Progress → Incomplete
Revision history for this message
James Page (james-page) wrote :

Other ideas - could impacted users please validate networking, especially the MTU configuration, between the machines in their cluster before, during and after the upgrade.

Ceph can be very sensitive to MTU mismatches and will just hang when something is not quite right.
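
A quick way to check (a sketch; replace eth0 and the peer address with your cluster interface and another node) is to compare the interface MTU on every node and confirm that large frames actually pass end to end:

$ ip link show eth0 | grep -o 'mtu [0-9]*'    # should match on every node
$ ping -M do -s 8972 -c 3 10.5.0.5            # 8972 + 28 bytes of headers = 9000; fails if a hop drops jumbo frames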

Revision history for this message
Christian Huebner (ossarchitect) wrote :

I filed this bug specifically for hyperconverged environments. Upgrading monitor nodes first and then upgrading separate OSD nodes is probably doable, but in a hyperconverged environment you cannot separate them.

I tried do-release-upgrade (a couple of times) without rebooting at the end, but found the monitors and OSDs were upgraded and deadlocked at the end.
I tried shutting down all Ceph services first and then running do-release-upgrade, which started my Ceph services and destroyed my cluster.
I tried manually upgrading Ceph, which is thwarted by the dependencies; it's all or nothing.

I finally accomplished the upgrade by marking all Ceph packages held, then digging myself through the dependency jungle to upgrade the packages in the right sequence. This was an absolute nightmare and took me more than an hour per node. It obviously is not a production-ready way to do it, but at least Ceph Octopus is running on 20.04 now.

There are two asks here:

Separate the dependencies so that ceph-mon, ceph-mgr and ceph-osd can be installed separately (with the appropriate dependencies), but in a way that upgrading ceph-mon does not try to upgrade ceph-osd as well. There is no good reason why an upgrade of ceph-mon should go down and back up the dependency tree and try to upgrade ceph-osd too. In fact, I would not want monitor packages on my OSD nodes, and vice versa, in a traditional cluster.

And fix do-release-upgrade so that a Ceph cluster does not get restarted when the upgrade procedure ends. I can vouch for the services being restarted; I tried it several times, once even with the services shut down before do-release-upgrade was started.

An upgrade procedure that breaks customer data should be fixed.

Revision history for this message
James Page (james-page) wrote :

Hi Christian

On Fri, May 22, 2020 at 8:10 AM Christian Huebner <
<email address hidden>> wrote:

> i filed this bug specifically for hyperconverged environments. Upgrading
> monitor nodes first and then upgrading separate OSD nodes is probably
> doable, but in a hyperconverged environment you can not separate.
>

I appreciate that which is why I have endeavoured to reproduce your issue
on a hyperconverged deployment as well.

> I tried do-release-upgrade (a couple of times) without rebooting at the
> end, but found the monitors and OSDs were upgraded and deadlocked at the
> end.
> I tried shutting down all Ceph services first and then do-release upgrade.
> Which started my Ceph services and destroyed my cluster.
> I tried manually upgrading Ceph, which is thwarted by the dependencies,
> it's all or nothing.
>
> I finally accomplished the upgrade by marking all Ceph packages held,
> then digging myself through the dependency jungle to upgrade the
> packages in the right sequence. This was an absolute nightmare and took
> me more than an hour per node. Obviously is not a production ready way
> to do so, but at least Ceph Octopus is running in 20.04 now now.
>
> There are two asks here:
>
> Separate the dependencies so that ceph-mon, ceph-mgr and ceph-osd can be
> installed separately (with the appropriate dependencies, but in a way
> that upgrading ceph-mon does not try to upgrade ceph-osd also. There is
> no good reason why upgrade of ceph-mon should go down and back up the
> dependency tree and try to upgrade ceph-osd too. In fact, I would not
> want monitor packages on my OSD nodes and vice versa in a traditional
> cluster.
>

The various binary packages produced from the ceph source code are strongly
versioned against each other so that you can't end up with an
inappropriate/broken mix of binaries on disk at the same time.

Upgrading the ceph-mon package results in an upgrade of the ceph-osd
package because they both depend on ceph-base with a strict dependency
on a matching binary version.

This is how we enforce a known-good set of bits on disk - and is why the
package maintainer scripts don't restart the daemons on upgrade, so
that the restart process can be managed with appropriate upgrade ordering.

> And fix do-release-upgrades, so a Ceph cluster does not get restarted
> when the upgrade procedure ends. I can vouch for the services being
> restarted, i tried it several times, once even with the services shut
> down before do-release-upgrade was started.
>

If you shut down the services, the postinst script starts
'ceph-{mon,osd,mgr}.target', so they would get started back up; but targets
and services won't get restarted - I tested, validated and checked the
installed maintainer scripts.

I think you'd have to disable and mask the targets *and* services to ensure
that starting the target does not force the daemons to start as well, but I
did not observe any restart behaviour during my upgrade testing (other than
due to the reboot of the system).
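
If someone wants to try that, a sketch (the osd instance id is illustrative; mask every ceph-osd@N / ceph-mon@<host> / ceph-mgr@<host> instance present on the node) would be:

$ sudo systemctl disable --now ceph.target ceph-mon.target ceph-mgr.target ceph-osd.target
$ sudo systemctl mask ceph-mon@$(hostname).service ceph-mgr@$(hostname).service ceph-osd@0.service
# run do-release-upgrade, then unmask and start mon -> mgr -> osd in that order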

>
> An upgrade procedure that breaks customer data should be fixed.
>

Agreed but the first step is reproduction of the issue so that we can
actually identify what the problem is...


Revision history for this message
Jay Ring (jay-ring) wrote :

"However it should be possible to complete the do-release-upgrade to the point of requesting a reboot - don't - drop to the CLI and get all machines to this point and then:

  restart the mons across all three machines
  restart the mgrs across all three machines
  restart the osds across all three machines"

Yes, I believe this would work.

However, that's not normally how I would do an upgrade. Normally, I upgrade one machine, make sure it works, and then upgrade the next. I have done it this way since I built the cluster back in Firefly. When I did it this time, it destroyed every OSD on the node that I upgraded.

This was very unexpected and disappointing, to say the least.

I wanted to warn others and try to prevent it from happening to them. I accept some of the blame. Part of it is on me, part of it is on Ceph, part of it is on Ubuntu.

Revision history for this message
Jay Ring (jay-ring) wrote :

"As a side note - even if there is a bug here (and it sounds like there might be) I would recommend placing the mon and mgr daemons in LXD containers ontop of the machines hosting the osd's"

Yes. I would strongly suggest doing this also. That is how Ceph now recommends it anyway. However, older installs are not usually set up this way.

And there is no warning that, if you aren't set up this way, do-release-upgrade will destroy the node.

I would have been happy to make the change, I just didn't know it was necessary.

Also, and not to complain, but if you are set up this way, there is no reason for the monitor package to be installed outside of the container - and it probably should not be.

This would suggest to me that ceph-mon should "conflict" with ceph-osd, since they should never be installed in the same context/container/host. This would force a user to remove either the monitor or the OSDs, preventing a reboot from destroying the node.

In a perfect world, ceph-osd would notice that it is connecting to an old monitor and politely disconnect without destroying all its OSDs.

For now, however, I suggest some sort of stop-gap measure that prevents users from nuking their cluster without warning.

Revision history for this message
James Page (james-page) wrote :


Although not best practice (upgrading machine at a time, rather than mons,
mgrs and osd ingroups) when I tried this earlier today it did actually work
- hence why I think I'm missing something about impacted deployments.

My testing did a fresh deploy of eoan with nautilus and then upgraded to
focal; maybe deployments which have been around for a while have different
on-disk state or characteristics which cause this issue.

I'm endeavouring to get to a point where we understand *why* this happens
in certain situations.

tl;dr I need more details about impacted deployments to be able to debug
this further.

Revision history for this message
James Page (james-page) wrote :

Something was tickling my brain about upgrades that we dealt with in the ceph charms a while back.

The mons can run both the v1 and v2 messenger ports; however, if a port is specified in mon_host in ceph.conf, it is possible that the v2 port is disabled, which is why the OSD can't connect back to the cluster.

Could impacted users please provide the mon_host details from their ceph.conf files.
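
For context, a hedged example with placeholder addresses - ceph.conf can list bare IPs (the client then tries both messenger ports) or spell out the v2/v1 endpoints explicitly:

mon_host = 10.0.0.1,10.0.0.2,10.0.0.3
# or, explicitly:
mon_host = [v2:10.0.0.1:3300,v1:10.0.0.1:6789],[v2:10.0.0.2:3300,v1:10.0.0.2:6789],[v2:10.0.0.3:3300,v1:10.0.0.3:6789]
# an entry with an explicit :6789 port is taken as a v1 address, which matches the situation described above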

Revision history for this message
James Page (james-page) wrote :

For example, the test deployment I have uses:

mon_host = 10.5.0.8,10.5.0.5,10.5.0.19

Revision history for this message
James Page (james-page) wrote :

To confirm:

tcp 0 0 10.5.0.8:3300 0.0.0.0:* LISTEN 64045 27128 784/ceph-mon
tcp 0 0 10.5.0.8:6789 0.0.0.0:* LISTEN 64045 27129 784/ceph-mon

3300 == v2
6789 == v1

Revision history for this message
Jay Ring (jay-ring) wrote :

/etc/ceph/ceph.conf
mon host = 192.168.120.1 192.168.120.2 192.168.120.3

ceph mon dump:
epoch 7
fsid <redacted>
last_changed 2020-05-16T23:16:32.234657-0500
created 2016-04-08T10:30:10.123758-0500
min_mon_release 15 (octopus)
0: [v2:192.168.120.1:3300/0,v1:192.168.120.1:6789/0] mon.temple-h1
1: [v2:192.168.120.2:3300/0,v1:192.168.120.2:6789/0] mon.temple-h2
2: [v2:192.168.120.3:3300/0,v1:192.168.120.3:6789/0] mon.temple-h3

netstat -ltup |grep ceph-mon:
tcp 0 0 temple-h1:3300 0.0.0.0:* LISTEN 1722/ceph-mon
tcp 0 0 temple-h1:6789 0.0.0.0:* LISTEN 1722/ceph-mon

I doubt this matters, but it might. These drives were formatted with ceph-disk, not ceph-volume. They are, however, mounted in the right place, and the block device is linked to the correct partition.

systemd has been ignoring enable/disable instructions for a while; I don't know why. I assume new detection code.

Revision history for this message
Jay Ring (jay-ring) wrote :

tail -f /var/log/ceph/ceph-osd.13.log
2020-05-22T17:27:43.909-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:28:14.825-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:28:44.838-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:29:14.914-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:29:45.718-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:30:16.515-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:30:46.539-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:31:16.543-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:31:46.671-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map
2020-05-22T17:32:16.792-0500 7f44708ca700 1 osd.13 46107 tick checking mon for new map

Revision history for this message
madar (gaspar-akos) wrote :

I am in the middle of a mimic -> nautilus -> octopus upgrade, and got the same 'tick checking mon for new map' cycle from my 15.2.3 OSD daemons. After

$ ceph osd require-osd-release mimic

the Octopus OSDs can connect to the cluster.

Revision history for this message
Trent Lloyd (lathiat) wrote :

This issue appears to be documented here: https://docs.ceph.com/en/latest/releases/nautilus/#instructions

Complete the upgrade by disallowing pre-Nautilus OSDs and enabling all new Nautilus-only functionality:

# ceph osd require-osd-release nautilus
Important: This step is mandatory. Failure to execute this step will make it impossible for OSDs to communicate after msgrv2 is enabled.
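
A sketch of how to check and fix this, run from a mon node with an admin keyring:

$ sudo ceph osd dump | grep require_osd_release    # what the cluster currently requires
$ sudo ceph osd require-osd-release nautilus       # raise it if it still shows an older release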

Revision history for this message
TWENTY |20 (tw20) wrote :

I have the same issue.
That's why I've been testing a few things over the last few days:

Upgrade process:
Luminous -> Mimic -> Nautilus -> Octopus
(All Versions run under Bionic)

It doesn't matter whether I activate msgr2 or not. I always get the problem after upgrading to Octopus:
2021-01-11T09:46:33.674+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:04.490+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:47:34.514+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map
2021-01-11T09:48:05.451+0000 7fb8cf2d1700 1 osd.0 194 tick checking mon for new map

With a freshly installed Ceph Mimic, updated to Nautilus and then Octopus, I don't get this problem.
The problem apparently only comes from the Luminous to Mimic update step, which then affects Octopus at the latest.

Workaround: execute this command on one of the Ceph monitors:
ceph osd require-osd-release mimic
After that, the Octopus OSDs can connect again.

Perhaps it is a good idea to run the "ceph osd require-osd-release [version]" command after every update.

e.g.:
After update Luminous -> Mimic
-> command: ceph osd require-osd-release mimic

After update Mimic -> Nautilus
-> command: ceph osd require-osd-release nautilus

After update Nautilus -> Octopus
-> command: ceph osd require-osd-release octopus

Apparently this is not done by the charms yet. Maybe the charms should do that or it should be mentioned in the charm documentation. What do you think about that?

Revision history for this message
Jay Ring (jay-ring) wrote :

That sounds promising.

I replaced my node a while ago so I can't verify this one way or the other, but it certainly sounds like it may be the problem - including why James Page could not duplicate it in his fresh install.

One of the reasons I bothered confirming the bug report was so that future searches for this error would lead to whatever solution was eventually found. Hopefully it will help them.

James Page (james-page)
Changed in ceph (Ubuntu):
assignee: James Page (james-page) → nobody
status: Incomplete → Opinion