Bionic: CollectD ceph plugin is incompatible with Ceph 12+ (Luminous)

Bug #1774032 reported by David McBride
This bug affects 2 people
Affects            Status        Importance  Assigned to      Milestone
collectd (Ubuntu)  Fix Released  Undecided   Unassigned
Xenial             Won't Fix     Undecided   Unassigned
Bionic             Fix Released  Medium      Eric Desrochers

Bug Description

[IMPACT]
A subset of Ceph metrics is no longer logged with Ceph Luminous (v12) and later.

The version of collectd shipped with Ubuntu 18.04 (Bionic) provides a ceph plugin that is incompatible with the version of Ceph shipped in the same distribution.

The version of collectd is 5.7.2-2ubuntu1
The version of ceph is 12.2.4-0ubuntu1.1

This patch for collectd is required for correct interoperation with Ceph 12+:

 https://github.com/collectd/collectd/pull/2464

The first version of collectd to contain this patch is 5.8.0.

Without this patch, errors of the following form will be logged by collectd, and many ceph-specific metrics will not be collected:

May 28 09:31:56 stor-a collectd[2141]: ceph plugin: cconn_handle_event(name=osd.2,i=0,st=4): error 1
May 28 09:31:56 stor-a collectd[2141]: ceph plugin: ds Bluestore.kvFlushLat.avgtime was not properly initialized.
May 28 09:31:56 stor-a collectd[2141]: ceph plugin: JSON handler failed with status -1.
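A quick way to tell whether a host is affected is to count log lines that match the three error signatures quoted above. The following is a minimal sketch, not part of the original report: the function name count_ceph_plugin_errors is hypothetical, and the pattern is derived only from the sample messages shown here, so other message variants may exist.

```shell
# Sketch: count collectd log lines matching the three error signatures
# quoted in this report. Reads log text on stdin and prints the count.
# The function name is hypothetical.
count_ceph_plugin_errors() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function's status at success.
    grep -c -E 'ceph plugin: (cconn_handle_event\(.*\): error|ds .* was not properly initialized|JSON handler failed)' || true
}

# Usage (log path as used elsewhere in this report):
#   count_ceph_plugin_errors < /var/log/syslog
```

A non-zero count on a Bionic host running Luminous suggests the unpatched plugin is in use.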

[TEST CASE]

* Install Bionic with a Ceph environment running Luminous or later.

* Install collectd
  ** Enable Ceph plugin in /etc/collectd/collectd.conf
  ** Configure Ceph plugin daemon in /etc/collectd/collectd.conf

* Restart collectd
  ** systemctl stop collectd.service
  ** systemctl start collectd.service

* Force collection
  ** collectd -C /etc/collectd/collectd.conf

* Monitor /var/log/syslog
  ** tail -f /var/log/syslog | grep -i collectd
  ** tail -f /var/log/syslog | grep -i "ceph plugin"

* Check Ceph plugin metrics from visualisation system.
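The "Enable" and "Configure" steps above could look roughly like the sketch below, which prints a collectd.conf fragment for one Ceph daemon. This is an illustration under stated assumptions: the function name print_ceph_plugin_conf is hypothetical, the daemon name osd.2 is taken from the log excerpt in this report, and the admin socket path follows Ceph's usual /var/run/ceph/ceph-<name>.asok layout, which may differ on a given cluster.

```shell
# Sketch: print a collectd.conf fragment that loads the ceph plugin and
# registers one daemon. Function name and arguments are hypothetical.
print_ceph_plugin_conf() {
    daemon="$1"                                # e.g. osd.2
    # Assumed default admin socket path; pass a second argument to override.
    socket="${2:-/var/run/ceph/ceph-$1.asok}"
    cat <<EOF
LoadPlugin ceph
<Plugin ceph>
  <Daemon "$daemon">
    SocketPath "$socket"
  </Daemon>
</Plugin>
EOF
}

# Usage: append the fragment to the configuration, then restart collectd
# as described in the steps above.
#   print_ceph_plugin_conf osd.2 | sudo tee -a /etc/collectd/collectd.conf
```

One Daemon block is needed per Ceph daemon (MDS, MON, OSD) that collectd should poll on the host.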

[POTENTIAL REGRESSION]

 * The oldest Ceph version supported on Bionic is Luminous, so the patch's backward incompatibility with earlier Ceph versions is not a problem here.

 * Upstream hit a segfault in the Ceph plugin with Mimic (issue https://github.com/collectd/collectd/issues/2572). This problem is also addressed in the current SRU (d/p/ceph-plugin-Fix-2572.patch).

 * The new Ceph support is already part of Debian and of Ubuntu Cosmic and Disco.

 * Right now the Ceph plugin does not seem to work at all, so the situation cannot get worse after this SRU.

 * A test package with the fixes was made available pre-SRU to impacted users. The feedback was positive, and the package was tested against different Ceph daemons (MDS, MON, OSD). See comments #5, #15, and #18.
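Because the Luminous patch is not backward compatible, it can be useful to confirm that a cluster really runs Ceph 12 or later before relying on the patched plugin. The sketch below parses the major version from `ceph --version` output. The function names are hypothetical, and the expected input format ("ceph version X.Y.Z (...) <codename> (stable)") is an assumption about how the ceph CLI reports its version.

```shell
# Sketch: extract the major version from `ceph --version`-style text on stdin.
# Assumes input of the form "ceph version 12.2.4 (...) luminous (stable)".
ceph_major_version() {
    sed -n 's/^ceph version \([0-9][0-9]*\)\..*/\1/p'
}

# Succeeds when the major version read from stdin is 12 (Luminous) or newer.
ceph_is_luminous_or_later() {
    major=$(ceph_major_version)
    [ -n "$major" ] && [ "$major" -ge 12 ]
}

# Usage:
#   ceph --version | ceph_is_luminous_or_later && echo "Luminous or later"
```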

[OTHER INFORMATION]

# Collectd Plugin:Ceph information:
https://collectd.org/wiki/index.php/Plugin:Ceph

# Upstream commits:
647ac31b Add support for ceph version luminous:
https://github.com/collectd/collectd/commit/647ac31bf9db60b1685d6d8d25be65375ba85891

de05fb53 ceph plugin: Fix #2572:
https://github.com/collectd/collectd/commit/de05fb53fad6bc998f585b704ca0caeadc14a035

$ git describe --contains 647ac31b
collectd-5.8.0~29^2~9

$ git describe --contains de05fb53
collectd-5.8.1~38

# rmadison
==> collectd | 5.7.2-2ubuntu1 | bionic/universe
 collectd | 5.8.0-5.2 | cosmic/universe
 collectd | 5.8.1-1.2 | disco/universe

[ORIGINAL DESCRIPTION]

The version of collectd shipped with Ubuntu 18.04 (Bionic) provides a ceph plugin that is incompatible with the version of Ceph shipped in the same distribution.

The version of collectd is 5.7.2-2ubuntu1
The version of ceph is 12.2.4-0ubuntu1.1

This patch for collectd is required for correct interoperation with Ceph 12+:

 https://github.com/collectd/collectd/pull/2464

The first version of collectd to contain this patch is 5.8.0.

Without this patch, errors of the following form will be logged by collectd, and many ceph-specific metrics will not be collected:

May 28 09:31:56 stor-a collectd[2141]: ceph plugin: cconn_handle_event(name=osd.2,i=0,st=4): error 1
May 28 09:31:56 stor-a collectd[2141]: ceph plugin: ds Bluestore.kvFlushLat.avgtime was not properly initialized.
May 28 09:31:56 stor-a collectd[2141]: ceph plugin: JSON handler failed with status -1.

summary: - Bionic: CollectD ceph plugin is version incompatible
+ Bionic: CollectD ceph plugin is incompatible with Ceph 12+ (Luminous)
description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in collectd (Ubuntu):
status: New → Confirmed
Jonas Jelten (jonas-jelten) wrote :

Additionally, this patch is required to avoid crashes with Ceph mimic: https://github.com/collectd/collectd/commit/01cf776f74c6364f0dc3ad07efb0850bc22a03ea

Eric Desrochers (slashd)
tags: added: sts
Eric Desrochers (slashd)
Changed in collectd (Ubuntu):
status: Confirmed → Fix Released
Changed in collectd (Ubuntu Bionic):
assignee: nobody → Eric Desrochers (slashd)
importance: Undecided → Medium
status: New → In Progress
Eric Desrochers (slashd) wrote :

To avoid any potential confusion, I have marked Xenial as 'Won't Fix' with the following rationale:

It is true that Xenial offers both Jewel and Luminous support, but the 'Add support for ceph version luminous' commit is not backward compatible with previous Ceph versions, so we can't backport that change to Xenial/16.04 LTS.

============================
From 647ac31bf9db60b1685d6d8d25be65375ba85891 Mon Sep 17 00:00:00 2001
From: Aleksei Zakharov <email address hidden>
Date: Wed, 4 Oct 2017 11:12:23 +0300
Subject: [PATCH] Add support for ceph version luminous

This patch is not backward compatible with previous ceph versions.
============================

Regards,
Eric

Changed in collectd (Ubuntu Xenial):
status: New → Won't Fix
Eric Desrochers (slashd) wrote :

Hi David McBride and/or anyone else impacted,

Would you be willing to try the test package I have made available in my PPA[1]?
I strongly recommend not testing in production if possible, since this is a test package.

Any feedback will be appreciated.

[1] - Adding this PPA to your system
sudo add-apt-repository ppa:slashd/lp1774032
sudo apt-get update
sudo apt install collectd -y

Regards,
Eric

David McBride (david-mcbride) wrote :

Hello Eric,

Thank you so much for working on this issue, it's going to help with some maintainability issues here!

With the existing package, I continue to see errors as previously reported of the form:

 Mar 26 21:32:37 stor-a collectd[25914]: ceph plugin: cconn_handle_event(name=mon.stor-a,i=0,st=4): error 1
 Mar 26 21:32:37 stor-a collectd[25914]: ceph plugin: ds ThrottleMsgrDispatchThrottlerMds.wait.avgtime was not properly initialized.
 Mar 26 21:32:37 stor-a collectd[25914]: ceph plugin: JSON handler failed with status -1.
 Mar 26 21:32:37 stor-a collectd[25914]: ceph plugin: cconn_handle_event(name=mds.stor-a,i=1,st=4): error 1

Additionally, a subset of Ceph metrics is no longer logged.

After installing the test packages provided (version 5.7.2-2ubuntu1+testpkg20190326b1), these errors no longer occur, and (by inspection) the full set of Ceph metrics is being logged again.

(I say 'again' because I have been running a manually compiled build of collectd 5.8 with these fixes, which has been running happily for some considerable time.)

So this looks like a good fix! Thanks again for your help.

Kind regards,
David

Eric Desrochers (slashd) wrote :

Thanks David, your feedback is appreciated!

I'm waiting for other feedback from the field, and I'll start the SRU to update the package in the archive soon.

I may request another round of testing when the package lands in bionic-proposed.

Regards,
Eric

Eric Desrochers (slashd) wrote :

David, which Ceph version did you run the test against? Still Luminous (v12), or Mimic (v13)?

David McBride (david-mcbride) wrote :

My test was against Luminous, using Ceph packages supplied in Bionic. I can look up the exact version when I'm back in the office in the morning.

Eric Desrochers (slashd) wrote :

David, it's fine, I was just curious to see if it was luminous or mimic.

Thanks, I'll contact you later when the package enters the testing phase, if you don't mind.

- Eric

David McBride (david-mcbride) wrote :

Not a problem, ask away. :)

For reference, this was testing against Ceph 12.2.8-0ubuntu0.18.04.2. (Which is not the latest in Bionic; I shall likely undertake a quick upgrade later today.)

Eric Desrochers (slashd) wrote :

debdiff for Bionic: lp1774032-bionic.debdiff

Eric Desrochers (slashd) wrote :

David McBride,

Did you run collectd against different types of Ceph daemons in your cluster,
such as Ceph Monitor, Ceph OSD, ...?

- Eric

David McBride (david-mcbride) wrote :

Hi Eric,

I only updated collectd on a node with just an MDS and MON, not an OSD.

I'm about to update the version of Ceph on this (small) cluster to the latest current in Bionic (12.2.11-0ubuntu0.18.04.1); I'll take this opportunity to apply the collectd update test on one of the OSD hosts.

Kind regards,
David

Eric Desrochers (slashd) wrote :

Great!

It would be nice to have the package tested against all the daemons (MDS, MON, OSD) you have in your cluster, while I'm still waiting for other feedback from the field.

I appreciate the time you spend testing.

- Eric

David McBride (david-mcbride) wrote :

Quick testing update:

* I have updated my Ceph cluster to 12.2.11-0ubuntu0.18.04.1, everything still works as expected.

* Running the provided test version of collectd (5.7.2-2ubuntu1+testpkg20190326b1) against the updated Ceph instance does seem to collect all metrics successfully, without logging errors.

* I have run the provided test version of collectd against a second host which also runs OSDs; this likewise appears to work correctly.

Thanks!

Kind regards,
David

Eric Desrochers (slashd) wrote :

David McBride,

Perfect, thanks. I should start the SRU process very soon.

Again, I'll request another round of testing when the package enters its testing phase (bionic-proposed).

Stay tuned .....

- Eric

Eric Desrochers (slashd) wrote :

Uploaded to the Bionic upload queue. It is now waiting for SRU approval, after which the package will start building in bionic-proposed for the testing phase.

- Eric

Eric Desrochers (slashd) wrote :

Another Ubuntu user has reported the following (based on the test package provided in my PPA):

"Thanks for this. Yes, I can confirm that when I use your collectd test packages, they don't segfault, and all the metrics I expect are available in our visualisation system."

Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello David, or anyone else affected,

Accepted collectd into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/collectd/5.7.2-2ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.
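For readers unfamiliar with -proposed, the sketch below shows one way to enable the pocket and install collectd from it. It is an illustration only, and the wiki page linked above remains the authoritative procedure. The function name proposed_source_line and the sources.list.d file name are hypothetical, and the archive mirror URL is assumed to be the default one.

```shell
# Sketch: print the apt source line for a release's -proposed pocket.
# The function name is hypothetical; the mirror URL is the assumed default.
proposed_source_line() {
    codename="${1:-bionic}"
    printf 'deb http://archive.ubuntu.com/ubuntu/ %s-proposed restricted main multiverse universe\n' "$codename"
}

# Usage (the .list file name is an arbitrary choice):
#   proposed_source_line bionic | sudo tee /etc/apt/sources.list.d/bionic-proposed.list
#   sudo apt-get update
#   sudo apt-get install collectd/bionic-proposed
```

Installing with the collectd/bionic-proposed suffix pulls only that package from -proposed rather than upgrading everything available there.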

Changed in collectd (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-bionic
Eric Desrochers (slashd) wrote :

David McBride, we are now entering the testing phase.
Could you please test the package found in bionic-proposed? If no regression is found, this is the package that will be promoted to bionic-updates once the testing phase is over.

- Eric

David McBride (david-mcbride) wrote :

Proposed package (5.7.2-2ubuntu1.1) deployed on host; error messages no longer appear in the journal, and previously missing Ceph metrics are appearing in our metrics visualisation system.

Looks good!

tags: added: verification-done-bionic
removed: verification-needed-bionic
Eric Desrochers (slashd) wrote :

Perfect. Thanks, David, for your willingness to test the packages.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package collectd - 5.7.2-2ubuntu1.1

---------------
collectd (5.7.2-2ubuntu1.1) bionic; urgency=medium

  * d/p/add-support-for-ceph-version-luminous.patch (LP: #1774032)
    - This patch is not backward compatible with previous ceph versions.

  * d/p/ceph-plugin-Fix-2572.patch:
    - ceph plugin causes collectd to segfault.

 -- Eric Desrochers <email address hidden> Tue, 26 Mar 2019 12:18:41 -0400
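To check whether a given Bionic host already carries this fix, the installed collectd version can be compared against 5.7.2-2ubuntu1.1, the version named in the changelog above. The following is a minimal sketch that assumes dpkg is available. The function name collectd_has_ceph_fix is hypothetical, and the check is only meaningful for archive packages: PPA test builds carry a different version string.

```shell
# Sketch: succeed if a collectd package version is at or above the fixed
# Bionic version from the changelog (5.7.2-2ubuntu1.1). Requires dpkg.
# The function name is hypothetical.
collectd_has_ceph_fix() {
    dpkg --compare-versions "$1" ge "5.7.2-2ubuntu1.1"
}

# Usage on an installed system (dpkg-query prints the installed version):
#   collectd_has_ceph_fix "$(dpkg-query -W -f='${Version}' collectd)" \
#       && echo "fix present" || echo "fix missing"
```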

Changed in collectd (Ubuntu Bionic):
status: Fix Committed → Fix Released
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for collectd has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
