Gnocchi

Data corruption with Ceph and gnocchi-metricd leads to deleting the whole Ceph pool and losing all data

Bug #1500646 reported by Alejandro Comisario on 2015-09-28

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Gnocchi	Fix Released	High	Mehdi Abaakouk	Gnocchi 1.3.0

Bug Description

When gnocchi-metricd (on the master branch and stable/1.2) is writing to Ceph and you kill it with:

# pkill -f gnocchi-metricd

metricd apparently leaves corrupted files on Ceph when it is killed. After a restart it never resumes processing and stalls, so all you see is the backlog of measures coming from gnocchi-api growing:

-----------------
2015-09-28 18:39:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 58 measurements bundles across 54 metrics wait to be processed.
2015-09-28 18:49:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 68 measurements bundles across 85 metrics wait to be processed.
2015-09-28 18:53:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 88 measurements bundles across 99 metrics wait to be processed.
-----------------
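
A stall like this can be spotted from the log alone. The following is a minimal sketch, not part of Gnocchi: it assumes the "Metricd reporting" line keeps the exact wording quoted above, and the function names and the `min_reports` threshold are made up for illustration.

```python
import re

# Matches the "Metricd reporting" lines quoted in this report.
REPORT_RE = re.compile(
    r"Metricd reporting: (\d+) measurements bundles across (\d+) metrics"
)


def backlog_sizes(log_lines):
    """Return the (bundles, metrics) pair of every reporting line, in order."""
    sizes = []
    for line in log_lines:
        match = REPORT_RE.search(line)
        if match:
            sizes.append((int(match.group(1)), int(match.group(2))))
    return sizes


def backlog_only_grows(sizes, min_reports=3):
    """True if the bundle count never decreased and grew overall.

    min_reports is an arbitrary threshold (an assumption) to avoid
    flagging a stall from too few samples.
    """
    if len(sizes) < min_reports:
        return False
    bundles = [b for b, _ in sizes]
    never_drops = all(b2 >= b1 for b1, b2 in zip(bundles, bundles[1:]))
    return never_drops and bundles[-1] > bundles[0]


SAMPLE = [
    "2015-09-28 18:39:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "58 measurements bundles across 54 metrics wait to be processed.",
    "2015-09-28 18:49:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "68 measurements bundles across 85 metrics wait to be processed.",
    "2015-09-28 18:53:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "88 measurements bundles across 99 metrics wait to be processed.",
]
```

On the three lines quoted above, `backlog_only_grows(backlog_sizes(SAMPLE))` is true, which is the symptom described here: metricd is running but the backlog never shrinks.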

Debugging a little, we see that when metricd stalls it keeps going in and out of this function (which reads a Ceph xattr):

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L133

It never gets out of this calling function:

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L141

And it never returns to the carbonara caller (which does return when things are working fine while processing measures) at:

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L142

This is critical because there is no way to fix the issue: nothing is logged (by metricd or by gnocchi-api) that identifies which file or files are corrupted, so you have to destroy the whole pool or delete all RADOS objects.
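
One generic way to get at least a log line out of a call that never returns is to run it with a deadline. This is only a sketch under assumptions: `call_with_deadline` and its `label` argument are hypothetical names, nothing here is Gnocchi or Ceph code, and a Python thread cannot be killed, so the stuck call keeps running in the background.

```python
import logging
import threading

LOG = logging.getLogger(__name__)


def call_with_deadline(func, label, timeout):
    """Run func() in a daemon thread and wait at most `timeout` seconds.

    Returns (True, result) if func finished in time and (False, None)
    otherwise. `label` is free text naming what is being worked on
    (for example an object name) so the log says where it got stuck.
    """
    outcome = {}

    def runner():
        try:
            outcome["result"] = func()
        except Exception as exc:  # handed back to the caller below
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout)
    if worker.is_alive():
        # The worker is still blocked; report it instead of stalling silently.
        LOG.error("still waiting on %s after %.1fs", label, timeout)
        return False, None
    if "error" in outcome:
        raise outcome["error"]
    return True, outcome["result"]
```

Wrapping the per-metric processing step this way would not repair anything, but it would name the metric that blocks instead of leaving the operator with no message at all.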

gnocchi.conf example:

[DEFAULT]
debug = True
verbose = True
log_file = /var/log/gnocchi/gnocchi.log
[api]
port = 8041
host = 0.0.0.0
workers = 2
[archive_policy]
[database]
[indexer]
url = mysql://gnocchi:NOTgnocchi@mysql/gnocchi?charset=utf8
[keystone_authtoken]
signing_dir = /var/cache/gnocchi
auth_uri = http://kstn:5000/v2.0
auth_url = http://kstn:35357/v2.0
project_domain_id = default
project_name = service
project_name = admin
password = MYSUPERPASSWD
username = cloudadmin
auth_plugin = password
memcached_servers = memcache2:11211,memcache1:11211
memcache_security_strategy = ENCRYPT
memcache_secret_key = LE9_s0kyh7Z_qNsmljOT
[metricd]
[oslo_policy]
[statsd]
[storage]
driver = ceph
metric_processing_delay = 5
ceph_pool = gnocchi
ceph_username = gnocchi
ceph_keyring = /etc/ceph/ceph.client.gnocchi.keyring
ceph_conffile = /etc/ceph/ceph.conf
file_basepath = /var/lib/gnocchi
file_basepath_tmp = ${file_basepath}/tmp

Alejandro Comisario (alejandro-f) on 2015-09-28

summary:

- data corruption with CEPH & gnocchi-metricd leaves to delete whole CEPH
- pool
+ data corruption with CEPH & gnocchi-metricd leades to delete whole CEPH
+ pool and loose all data

Julien Danjou (jdanjou) on 2015-09-29

Changed in gnocchi:
status:	New → Triaged
importance:	Undecided → High

OpenStack Infra (hudson-openstack) wrote on 2015-10-07: Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/232061

Changed in gnocchi:
assignee:	nobody → Mehdi Abaakouk (sileht)
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) on 2015-10-08

Changed in gnocchi:
assignee:	Mehdi Abaakouk (sileht) → Chris Dent (cdent)

OpenStack Infra (hudson-openstack) on 2015-10-08

Changed in gnocchi:
assignee:	Chris Dent (cdent) → Mehdi Abaakouk (sileht)

OpenStack Infra (hudson-openstack) wrote on 2015-10-09: Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/232061
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=c27929471013604f6e182a0fc864df00a05a1f21
Submitter: Jenkins
Branch: master

commit c27929471013604f6e182a0fc864df00a05a1f21
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Oct 7 17:28:42 2015 +0200

    Removing ceph locking system

    The ceph locking system doesn't work as expected.
    When the client loses the connection to ceph, locks are
    not released, so more code would need to be added to expire
    a lock when it is no longer updated and/or break it when
    we detect that the client holding the lock is dead.

    Instead of this, this change just uses tooz as the locking mechanism.

    A ceph driver will be added to tooz instead. Tooz already
    has test coverage for drivers, so it should be easier to
    detect this kind of race.

    This change has a migration impact: changing the lock mechanism
    means all gnocchi daemons must be stopped before upgrading them.

    Change-Id: I5662817db9cd6bbb7dc220407df42c55f862ef6b
    Closes-bug: #1500646
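
The "expire the lock when it is no longer updated" idea from the commit message can be illustrated with a lease. This is a minimal in-memory sketch, not the removed Gnocchi code and not the tooz API; the `LeaseLock` class, its `ttl` parameter and the injectable `clock` are assumptions made for the example, and it is not thread-safe.

```python
import time


class LeaseLock:
    """Lock whose ownership lapses unless the holder keeps renewing it."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl              # seconds a lease stays valid without renewal
        self._clock = clock         # injectable so expiry can be simulated
        self._owner = None
        self._expires_at = 0.0

    def acquire(self, owner):
        now = self._clock()
        # A lease that was not renewed in time is treated as abandoned,
        # which is what frees a lock held by a killed client.
        if self._owner is not None and now >= self._expires_at:
            self._owner = None
        if self._owner in (None, owner):
            self._owner = owner
            self._expires_at = now + self.ttl
            return True
        return False

    def renew(self, owner):
        """Heartbeat: extend the lease for the current, unexpired holder only."""
        now = self._clock()
        if self._owner == owner and now < self._expires_at:
            self._expires_at = now + self.ttl
            return True
        return False

    def release(self, owner):
        if self._owner == owner:
            self._owner = None
            return True
        return False
```

A holder that dies simply stops calling `renew`, and the next `acquire` after `ttl` seconds succeeds. A lock without such a lease stays held forever when the client is killed, which matches the behaviour reported in this bug.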

Changed in gnocchi:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2015-10-14: Related fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/232576
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=2f408a4d33b5f24b77e856a4664b576b5b6032cf
Submitter: Jenkins
Branch: master

commit 2f408a4d33b5f24b77e856a4664b576b5b6032cf
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Oct 8 16:46:16 2015 +0200

    Don't fail when data are unreadable.

    When the data is corrupted, the metric cannot be processed anymore,
    because msgpack raises a ValueError.

    This change catches this, logs an appropriate message and creates a
    new empty timeserie.

    Related-Bug: #1500646
    Closes-Bug: #1499372
    Change-Id: Ib47f84230a012197a06e615206e0b0f3e3780515
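
The pattern this commit describes (catch the decode error, log it, continue with an empty series) looks roughly like the sketch below. It is not the actual patch: the standard-library `json` module stands in for msgpack so the example has no dependencies, and `load_timeseries` and its arguments are hypothetical names.

```python
import json
import logging

LOG = logging.getLogger(__name__)


def load_timeseries(metric_id, raw):
    """Decode stored points, falling back to an empty series if unreadable.

    json.loads raises json.JSONDecodeError, a subclass of ValueError,
    on a truncated or corrupted payload.
    """
    try:
        points = json.loads(raw)
    except ValueError:
        # Name the metric so the operator knows which object is damaged.
        LOG.error("Data for metric %s is corrupted, starting a new empty series",
                  metric_id)
        return []
    if not isinstance(points, list):
        # Decodable but not the expected shape: treat as corrupted too.
        LOG.error("Data for metric %s has an unexpected format", metric_id)
        return []
    return points
```

The trade-off is that the unreadable history of that one metric is dropped, but processing resumes and the log says which metric was affected, instead of the whole pool having to be deleted.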

Julien Danjou (jdanjou) on 2015-11-03

Changed in gnocchi:
milestone:	none → 1.3.0
status:	Fix Committed → Fix Released
