data corruption with CEPH & gnocchi-metricd leads to deleting the whole CEPH pool and losing all data

Bug #1500646 reported by Alejandro Comisario
This bug affects 1 person
Affects: Gnocchi
Status: Fix Released
Importance: High
Assigned to: Mehdi Abaakouk
Milestone: 1.3.0

Bug Description

When gnocchi-metricd (on the master branch and stable/1.2) is writing to CEPH and you kill it with:

# pkill -f gnocchi-metricd

Metricd apparently leaves corrupted files on CEPH when it is killed. After a restart it never resumes processing and stalls, so you just see unprocessed measures from gnocchi-api accumulating:

-----------------
2015-09-28 18:39:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 58 measurements bundles across 54 metrics wait to be processed.
2015-09-28 18:49:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 68 measurements bundles across 85 metrics wait to be processed.
2015-09-28 18:53:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: 88 measurements bundles across 99 metrics wait to be processed.
-----------------
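
The growing backlog in these log lines can be checked mechanically. The following is a minimal sketch, not part of gnocchi: the helper names `backlog_sizes` and `looks_stalled` and the `min_reports` threshold are hypothetical, and the regular expression only assumes the message format quoted above.

```python
import re

# Matches the "Metricd reporting" status lines quoted in this report.
REPORT_RE = re.compile(
    r"Metricd reporting: (\d+) measurements bundles across (\d+) metrics"
)


def backlog_sizes(log_lines):
    """Return the bundle backlog announced by each metricd status line."""
    sizes = []
    for line in log_lines:
        match = REPORT_RE.search(line)
        if match:
            sizes.append(int(match.group(1)))
    return sizes


def looks_stalled(log_lines, min_reports=3):
    """Heuristic: the backlog strictly grows over at least min_reports reports."""
    sizes = backlog_sizes(log_lines)
    if len(sizes) < min_reports:
        return False
    return all(later > earlier for earlier, later in zip(sizes, sizes[1:]))


SAMPLE = [
    "2015-09-28 18:39:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "58 measurements bundles across 54 metrics wait to be processed.",
    "2015-09-28 18:49:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "68 measurements bundles across 85 metrics wait to be processed.",
    "2015-09-28 18:53:26.376 19169 INFO gnocchi.cli [-] Metricd reporting: "
    "88 measurements bundles across 99 metrics wait to be processed.",
]

print(backlog_sizes(SAMPLE))
print(looks_stalled(SAMPLE))
```

A backlog that only grows is a symptom, not a diagnosis: it would also show up if metricd were simply overloaded.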

Debugging a little, we see that when metricd stalls, it keeps going in and out of this function (reading the CEPH xattr):

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L133

It never gets out of this caller function:

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/ceph.py#L141

And it never returns to the carbonara caller (which does return when things are working fine while processing measures) at:

https://github.com/openstack/gnocchi/blob/master/gnocchi/storage/_carbonara.py#L142

This is critical: there is no way to fix the issue, because no message in the logs (neither metricd nor gnocchi-api) identifies which file or files are corrupted. You need to destroy the whole pool / delete all rados objects.

gnocchi.conf example:

[DEFAULT]
debug = True
verbose = True
log_file = /var/log/gnocchi/gnocchi.log
[api]
port = 8041
host = 0.0.0.0
workers = 2
[archive_policy]
[database]
[indexer]
url = mysql://gnocchi:NOTgnocchi@mysql/gnocchi?charset=utf8
[keystone_authtoken]
signing_dir = /var/cache/gnocchi
auth_uri = http://kstn:5000/v2.0
auth_url = http://kstn:35357/v2.0
project_domain_id = default
project_name = service
project_name = admin
password = MYSUPERPASSWD
username = cloudadmin
auth_plugin = password
memcached_servers = memcache2:11211,memcache1:11211
memcache_security_strategy = ENCRYPT
memcache_secret_key = LE9_s0kyh7Z_qNsmljOT
[metricd]
[oslo_policy]
[statsd]
[storage]
driver = ceph
metric_processing_delay = 5
ceph_pool = gnocchi
ceph_username = gnocchi
ceph_keyring = /etc/ceph/ceph.client.gnocchi.keyring
ceph_conffile = /etc/ceph/ceph.conf
file_basepath = /var/lib/gnocchi
file_basepath_tmp = ${file_basepath}/tmp

summary: - data corruption with CEPH & gnocchi-metricd leaves to delete whole CEPH
- pool
+ data corruption with CEPH & gnocchi-metricd leades to delete whole CEPH
+ pool and loose all data
Julien Danjou (jdanjou)
Changed in gnocchi:
status: New → Triaged
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/232061

Changed in gnocchi:
assignee: nobody → Mehdi Abaakouk (sileht)
status: Triaged → In Progress
Changed in gnocchi:
assignee: Mehdi Abaakouk (sileht) → Chris Dent (cdent)
Changed in gnocchi:
assignee: Chris Dent (cdent) → Mehdi Abaakouk (sileht)
OpenStack Infra (hudson-openstack) wrote : Fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/232061
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=c27929471013604f6e182a0fc864df00a05a1f21
Submitter: Jenkins
Branch: master

commit c27929471013604f6e182a0fc864df00a05a1f21
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Oct 7 17:28:42 2015 +0200

    Removing ceph locking system

    The ceph locking system doesn't work as expected.
    When the client loses the connection to ceph, locks are
    not released, so more code would need to be added to expire
    the lock when it is no longer updated and/or break it when
    we detect that the client holding the lock is dead.

    Instead, this change just uses tooz as the locking mechanism.

    A ceph driver will be added to tooz instead. Tooz already
    has test coverage for drivers, so it should be easier to
    detect this kind of race.

    This change has a migration impact: changing the lock mechanism
    means all gnocchi daemons must be stopped before upgrading them.

    Change-Id: I5662817db9cd6bbb7dc220407df42c55f862ef6b
    Closes-bug: #1500646
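
The failure mode described in this commit message can be illustrated with a small in-memory sketch. It uses neither the Ceph locking API nor tooz, and it is not what the fix does (the fix switches to tooz). It only contrasts a lock with no expiry against the expiring lock the message says would otherwise have been needed. The names `NaiveLock`, `LeaseLock` and `FakeClock` are hypothetical.

```python
import time


class FakeClock:
    """Manually advanced clock so the example needs no real waiting."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class NaiveLock:
    """Lock with no expiry: it stays held until release() is called."""

    def __init__(self):
        self.owner = None

    def acquire(self, owner):
        if self.owner is None:
            self.owner = owner
            return True
        return False

    def release(self, owner):
        if self.owner == owner:
            self.owner = None


class LeaseLock:
    """Lock whose ownership lapses unless the holder re-acquires within ttl."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.owner = None
        self.expires_at = 0.0

    def acquire(self, owner):
        now = self.clock()
        free = self.owner is None or now >= self.expires_at
        if free or self.owner == owner:
            # Taking the lock (or renewing it) starts a fresh lease.
            self.owner = owner
            self.expires_at = now + self.ttl
            return True
        return False

    def release(self, owner):
        if self.owner == owner:
            self.owner = None


clock = FakeClock()
naive = NaiveLock()
lease = LeaseLock(ttl=10.0, clock=clock)

# worker-1 takes both locks and is then killed before releasing them.
naive.acquire("worker-1")
lease.acquire("worker-1")

clock.advance(11.0)  # more than one TTL passes with no renewal

print(naive.acquire("worker-2"))  # False: the dead worker still owns it
print(lease.acquire("worker-2"))  # True: the lease expired
```

With the naive lock, a restarted worker waits forever on a lock nobody will release, which matches the stall reported above.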

Changed in gnocchi:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/232576
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=2f408a4d33b5f24b77e856a4664b576b5b6032cf
Submitter: Jenkins
Branch: master

commit 2f408a4d33b5f24b77e856a4664b576b5b6032cf
Author: Mehdi Abaakouk <email address hidden>
Date: Thu Oct 8 16:46:16 2015 +0200

    Don't fail when data are unreadable.

    When the data is corrupted, the metric cannot be processed anymore,
    because msgpack raises a ValueError.

    This change catches this, logs an appropriate message and creates a
    new empty timeserie.

    Related-Bug: #1500646
    Closes-Bug: #1499372
    Change-Id: Ib47f84230a012197a06e615206e0b0f3e3780515
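
The pattern this commit describes (catch the decoding error, log it, start again from an empty series) can be sketched as follows. This is not the gnocchi code: the standard-library `json` module stands in for msgpack so the example is self-contained, and the function name `load_timeserie` and the dict representation of a timeserie are assumptions made for illustration.

```python
import json
import logging

LOG = logging.getLogger(__name__)


def load_timeserie(metric_id, raw):
    """Decode stored data, falling back to an empty series if unreadable."""
    try:
        return json.loads(raw)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so truncated or
        # otherwise malformed payloads end up here instead of propagating.
        LOG.error(
            "Data for metric %s are corrupted, creating a new empty timeserie",
            metric_id,
        )
        return {}


good = b'{"2015-09-28T18:39:00": 4.2}'
truncated = good[:-5]  # simulates a write interrupted by a kill

print(load_timeserie("metric-a", good))
print(load_timeserie("metric-b", truncated))
```

The trade-off is that the unreadable history of the affected metric is discarded, but the error message names the metric and processing continues for the others.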

Julien Danjou (jdanjou)
Changed in gnocchi:
milestone: none → 1.3.0
status: Fix Committed → Fix Released