[SRU] ceph_osd crash in _committed_osd_maps when failed to encode first inc map

Bug #1891567 reported by Dan Hill
This bug affects 2 people
Affects               Status        Importance  Assigned to  Milestone
Ubuntu Cloud Archive  Invalid       Critical    Unassigned
  Ussuri              Fix Released  Critical    Unassigned
  Victoria            Invalid       Critical    Unassigned
ceph (Ubuntu)         Fix Released  Critical    Unassigned
  Focal               Fix Released  Critical    Unassigned
  Groovy              Fix Released  Critical    Unassigned

Bug Description

[Impact]
Upstream tracker: issue#46443 [0].

The ceph-osd service can crash when processing osd map updates.

When the osd encounters a CRC error while processing an incremental map update, it requests a full map update from its peers. A recent change introduced an uninitialized variable in this code path; dereferencing it crashes the osd.

The uninitialized variable was introduced in Nautilus 14.2.10 and Octopus 15.2.1.

[Test Case]
# Inject osd_inject_bad_map_crc_probability = 1
sudo ceph daemon osd.{id} config set osd_inject_bad_map_crc_probability 1

# Trigger some osd map updates by restarting a different osd
sudo systemctl restart ceph-osd@{diff-id}
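
The two steps above can be wrapped in a small helper that also checks whether the osd survived and then turns the injection back off. This is only a rough sketch and not part of the SRU: the function name run_crc_test, the default 30-second wait, and the assumption that both osds run on the local host as ceph-osd@{id} systemd units are illustrative choices, so adjust them for your deployment.

```shell
# Hypothetical helper (not from the bug report): inject CRC failures on one
# osd, force osd map churn by restarting another, then check the first osd.
# Assumes both osds run on this host as ceph-osd@<id> systemd units.
run_crc_test() {
    target_id="$1"        # osd that gets the injected CRC failures
    other_id="$2"         # a different osd, restarted to generate map updates
    wait_secs="${3:-30}"  # assumed settle time; tune as needed

    # Step 1 of the test case: make every incremental map fail its CRC check.
    sudo ceph daemon "osd.${target_id}" config set \
        osd_inject_bad_map_crc_probability 1 || return 2

    # Step 2 of the test case: trigger some osd map updates.
    sudo systemctl restart "ceph-osd@${other_id}" || return 2
    sleep "${wait_secs}"

    # An affected build is expected to have crashed by this point.
    if sudo systemctl is-active --quiet "ceph-osd@${target_id}"; then
        # Turn the injection back off so the osd returns to normal operation.
        sudo ceph daemon "osd.${target_id}" config set \
            osd_inject_bad_map_crc_probability 0
        echo "osd.${target_id} still running"
        return 0
    fi
    echo "osd.${target_id} is not active (possible crash)"
    return 1
}
```

For example, `run_crc_test 0 1` injects on osd.0 and restarts osd.1. Because systemd may restart a crashed osd on its own, a passing result is worth confirming against the osd log as well.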

[Regression Potential]
The fix makes handle_osd_maps() return early when it encounters a CRC error, which prevents the map commit when the failure occurs while processing an incremental map update. This makes the full map update take longer, but it should prevent the crash reported in this bug. Additionally, _committed_osd_maps() now asserts that first <= last, although that assertion is not expected to ever fail.

[Other Info]
Upstream has released a fix for this issue in Nautilus 14.2.11. The SRU for this point release is being tracked by LP: #1891077

Upstream has merged a fix for this issue in Octopus [1], but there is no current release target. The ceph packages in focal, groovy, and the ussuri cloud archive are exposed to this critical regression.

[0] https://tracker.ceph.com/issues/46443
[1] https://github.com/ceph/ceph/pull/36340

Dan Hill (hillpd)
description: updated
Changed in ceph (Ubuntu Focal):
status: New → Triaged
Changed in ceph (Ubuntu Groovy):
status: New → Triaged
Changed in ceph (Ubuntu Focal):
importance: Undecided → Critical
Changed in ceph (Ubuntu Groovy):
importance: Undecided → Critical
Corey Bryant (corey.bryant) wrote :

I'm checking in the #ceph IRC channel on OFTC to see whether a 15.2.5 release is coming soon for Octopus.

Corey Bryant (corey.bryant) wrote :

ceph 15.2.3-0ubuntu2 is uploaded to groovy and ceph 15.2.3-0ubuntu0.20.04.2 has been uploaded to the focal unapproved queue [1] to fix this bug.

[1] https://launchpad.net/ubuntu/focal/+queue?queue_state=1&queue_text=ceph

Robie Basak (racb) wrote :

A fix for this is in groovy-proposed (since 15.2.3-0ubuntu2 like Corey said) but not migrated yet, so I'll adjust that task to Fix Committed so it doesn't look like the SRU is going in ahead of it.

Changed in ceph (Ubuntu Groovy):
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 15.2.3-0ubuntu3

---------------
ceph (15.2.3-0ubuntu3) groovy; urgency=medium

  * d/control: Drop BD on obsolete cython (LP: #1891820).

ceph (15.2.3-0ubuntu2) groovy; urgency=medium

  * d/p/fix-crash-in-committed-osd-maps.patch: Fix ceph-osd crash
    when processing osd map updates (LP: #1891567).

 -- Corey Bryant <email address hidden> Mon, 17 Aug 2020 13:46:06 -0400

Changed in ceph (Ubuntu Groovy):
status: Fix Committed → Fix Released
description: updated
Dan Hill (hillpd)
description: updated
Robie Basak (racb) wrote : Please test proposed package

Hello Dan, or anyone else affected,

Accepted ceph into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/15.2.3-0ubuntu0.20.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in ceph (Ubuntu Focal):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-focal
Corey Bryant (corey.bryant) wrote :

Ceph currently isn't in the victoria cloud archive, so I'm marking that task invalid.

Corey Bryant (corey.bryant) wrote :

Hello Dan, or anyone else affected,

Accepted ceph into ussuri-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:ussuri-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-ussuri-needed to verification-ussuri-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-ussuri-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-ussuri-needed
Ponnuvel Palaniyappan (pponnuvel) wrote :

I have tested the ussuri-proposed packages and they fix the issue.

Set up a Nautilus cluster with the following versions:

# ceph versions
{
    "mon": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 1
    },
    "mgr": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 1
    },
    "osd": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 14.2.9 (581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable)": 5
    }
}

# dpkg -l | grep -i ceph
ii ceph 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 distributed storage and file system
ii ceph-base 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 common ceph daemon libraries and management tools
ii ceph-common 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-mgr 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 manager for the ceph distributed file system
ii ceph-mon 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 monitor server for the ceph storage system
ii ceph-osd 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 OSD server for the ceph storage system
ii libcephfs2 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 Ceph distributed file system client library
ii python3-ceph-argparse 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 Python 3 utility libraries for Ceph CLI
ii python3-cephfs 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 Python 3 libraries for the Ceph libcephfs library
ii python3-rados 14.2.9-0ubuntu0.19.10.1~cloud0 amd64 Python 3 libra...


Ponnuvel Palaniyappan (pponnuvel) wrote :

Also tested the same with Octopus:

# ceph versions
{
    "mon": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 1
    },
    "mgr": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 1
    },
    "osd": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 3
    },
    "mds": {},
    "overall": {
        "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)": 5
    }
}

# ceph report | grep ceph_version
report 2214250888
            "ceph_version": "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)",
            "ceph_version_short": "15.2.3",
            "ceph_version": "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)",
            "ceph_version_short": "15.2.3",
            "ceph_version": "ceph version 15.2.3 (d289bbdec69ed7c1f516e0a093594580a76b78d0) octopus (stable)",
            "ceph_version_short": "15.2.3",
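
A version check like the one above can also be scripted. The sketch below reads `ceph versions` output on standard input and reports whether every daemon shows the same version string. The function names are made up for illustration, and it simply matches the literal "ceph version ..." keys seen in the output rather than parsing the JSON properly.

```shell
# Hypothetical helper: list the distinct version strings found in
# `ceph versions` output supplied on stdin.
distinct_ceph_versions() {
    grep -o '"ceph version [^"]*"' | sort -u
}

# Succeeds only when exactly one distinct version string is present.
all_daemons_same_version() {
    count=$(distinct_ceph_versions | wc -l)
    [ "$((count))" -eq 1 ]
}
```

One possible use is `sudo ceph versions | all_daemons_same_version && echo "all daemons match"`, which prints the message only when the cluster is not running a mix of versions.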

tags: added: verification-ussuri-done
removed: verification-ussuri-needed
Ponnuvel Palaniyappan (pponnuvel) wrote :

Tests for Focal:

$ for osd in {0..2}; do juju ssh ceph-osd/$osd 'sudo dpkg -l | grep ceph'; done
ii ceph 15.2.3-0ubuntu0.20.04.2 amd64 distributed storage and file system
ii ceph-base 15.2.3-0ubuntu0.20.04.2 amd64 common ceph daemon libraries and management tools
ii ceph-common 15.2.3-0ubuntu0.20.04.2 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-mds 15.2.3-0ubuntu0.20.04.2 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 15.2.3-0ubuntu0.20.04.2 amd64 manager for the ceph distributed file system
ii ceph-mgr-modules-core 15.2.3-0ubuntu0.20.04.2 all ceph manager modules which are always enabled
ii ceph-mon 15.2.3-0ubuntu0.20.04.2 amd64 monitor server for the ceph storage system
ii ceph-osd 15.2.3-0ubuntu0.20.04.2 amd64 OSD server for the ceph storage system
ii libcephfs2 15.2.3-0ubuntu0.20.04.2 amd64 Ceph distributed file system client library
ii python3-ceph-argparse 15.2.3-0ubuntu0.20.04.2 amd64 Python 3 utility libraries for Ceph CLI
ii python3-ceph-common 15.2.3-0ubuntu0.20.04.2 all Python 3 utility libraries for Ceph
ii python3-cephfs 15.2.3-0ubuntu0.20.04.2 amd64 Python 3 libraries for the Ceph libcephfs library
Connection to 10.5.2.78 closed.
ii ceph 15.2.3-0ubuntu0.20.04.2 amd64 distributed storage and file system
ii ceph-base 15.2.3-0ubuntu0.20.04.2 amd64 common ceph daemon libraries and management tools
ii ceph-common 15.2.3-0ubuntu0.20.04.2 amd64 common utilities to mount and interact with a ceph storage cluster
ii ceph-mds 15.2.3-0ubuntu0.20.04.2 amd64 metadata server for the ceph distributed file system
ii ceph-mgr 15.2.3-0ubuntu0.20.04.2 amd64 manager for the ceph distributed file system
ii ceph-mgr-modules-core 15.2.3-0ubuntu0.20.04.2 all ceph manager modules which are always enabled
ii ceph-mon 15.2.3-0ubuntu0.20.04.2 amd64 monitor server for the ceph storage system
ii ceph-osd 15.2.3-0ubuntu0.20.04.2 amd64 OSD server for the ceph storage system
ii libcephfs2 15.2.3-0ubuntu0.20.04.2 amd64 Ceph distributed file system client library
ii python3-ceph-argparse 15.2.3-0ubuntu0.20.04.2 amd64 Python 3 utility libraries for Ceph CLI
ii python3-ceph-common ...


tags: added: verification-needed-done
removed: verification-needed-focal
Dan Hill (hillpd)
tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-done
Brian Murray (brian-murray) wrote :

I don't see that the Test Case from the bug description was carried out in the comment marking this as verified for Ubuntu 20.04, so I'm flipping the tags back to verification-needed.

tags: added: verification-needed verification-needed-focal
removed: verification-done verification-done-focal
Ponnuvel Palaniyappan (pponnuvel) wrote :

@Brian, I have repeated the steps for focal and attached the text file with relevant logs/output. Can you please check again?

tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Brian Murray (brian-murray) wrote : Update Released

The verification of the Stable Release Update for ceph has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 15.2.3-0ubuntu0.20.04.2

---------------
ceph (15.2.3-0ubuntu0.20.04.2) focal; urgency=medium

  * d/p/fix-crash-in-committed-osd-maps.patch: Fix ceph-osd crash
    when processing osd map updates (LP: #1891567).

 -- Corey Bryant <email address hidden> Fri, 14 Aug 2020 11:46:05 -0400

Changed in ceph (Ubuntu Focal):
status: Fix Committed → Fix Released
Corey Bryant (corey.bryant) wrote :

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package ceph - 15.2.3-0ubuntu0.20.04.2~cloud0
---------------

 ceph (15.2.3-0ubuntu0.20.04.2~cloud0) bionic-ussuri; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 ceph (15.2.3-0ubuntu0.20.04.2) focal; urgency=medium
 .
   * d/p/fix-crash-in-committed-osd-maps.patch: Fix ceph-osd crash
     when processing osd map updates (LP: #1891567).
