The gnocchi wsgi app experiences timeout errors when using influxdb

Bug #1488027 reported by Chris Dent
This bug affects 1 person
Affects: Gnocchi
Status: Invalid
Importance: Medium
Assigned to: Ilya Tyaptin
Milestone: none

Bug Description

In a devstack on a 32GB, 8-core machine, with 4 gnocchi mod_wsgi processes, gnocchi using influxdb as the backend, ceilometer dispatching to gnocchi with two collectors, a poll period of 10 seconds, and 10 nova instances, this error shows up regularly in the log:

    Timeout when reading response headers from daemon process 'gnocchi': /var/www/gnocchi/app.wsgi

Then after some time (as a result of the blocking):

    (11)Resource temporarily unavailable: [client 192.168.2.3:44679] mod_wsgi (pid=5715): Unable to connect to WSGI daemon process 'gnocchi' on '/var/run/httpd.3769.0.2.sock' after multiple attempts as listener backlog limit was exceeded.

After killing the collector the influxdb log showed that the gnocchi processes were still feeding data to influxdb >5 minutes after collector shutdown.

Some kind of tuning is required here to avoid this blocking.
Or influxdb is slow.
Or we need to use metricd with influxdb to put a stronger asynchrony gap in place.

The last POST to gnocchi was a full ten minutes before the last POST from gnocchi to influxdb.

Revision history for this message
Chris Dent (cdent) wrote :

More investigation is required to determine if the problem is that influxdb isn't ingesting data fast enough, or that gnocchi is getting hung up somehow trying to write and blocking.

One possibility is that requests (used by the influxdb client) is pooling connections, but the pool is much too small for the rate at which we are sending writes.
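
If that turns out to be the cause, one way to test it would be to enlarge the per-host connection pool that requests/urllib3 keeps. A minimal sketch, assuming the influxdb client can be handed (or monkey-patched with) a requests.Session; the pool sizes and write_url are illustrative placeholders, not values from gnocchi:

    import requests
    from requests.adapters import HTTPAdapter

    # Enlarge the per-host connection pool that requests/urllib3 maintains.
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=20, pool_maxsize=100)
    session.mount('http://', adapter)
    session.mount('https://', adapter)

    # All writes to influxdb would then need to go through this session,
    # e.g. session.post(write_url, data=payload)  # write_url is hypothetical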

I have to be away for the afternoon, so I can't look now; if someone else can, cool.

Changed in gnocchi:
assignee: nobody → Ilya Tyaptin (ityaptin)
Revision history for this message
Chris Dent (cdent) wrote :

As a note: I was using the default settings. There are multiple backends available for influxdb (LevelDB, RocksDB, etc.). It may be interesting/useful to compare which is better. I don't think that will be useful in the specific case of this bug though: it appears there's just something wrong with the interaction between gnocchi and influxdb, which would be wrong whatever the backend.

Julien Danjou (jdanjou)
Changed in gnocchi:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Chris Dent (cdent) wrote :

As an update: this problem is still present in influxdb 0.9.3.

Watching logs suggests this is a problem with influxdb itself:

* 500 errors (the timeouts) are most prevalent when new metrics are being created (in response to new resources being created on the ceilometer side)
* They also spike briefly just before the logs show:

[retention] 2015/09/02 14:54:59 retention policy shard deletion check commencing
[retention] 2015/09/02 14:54:59 retention policy enforcement check commencing

This suggests that there's some kind of hangup when influxdb is required to think about something other than accepting writes into existing timeseries.

I will see if I can adjust a timeout to make a change in behavior.

Revision history for this message
Chris Dent (cdent) wrote :

This bug https://github.com/influxdb/influxdb/issues/3349 appears to be a related problem.

I'm now using version 0.9.3 and still seeing the problem. It appears to show up most when new time series are being created (in response to new resources on the ceilometer side).

Following the advice on that bug I've made adjustments to the WAL file size and the cluster timeouts. That did lower the frequency of the problem, and the errors now only happen when the defined timeout is exceeded.

This suggests that the problem claimed to be fixed in 0.9.2 is either not fixed or has regressed in 0.9.3.

Revision history for this message
Simon Pasquier (simon-pasquier) wrote :

I haven't deployed Gnocchi + InfluxDB lately, but I suspect that Gnocchi doesn't batch the writes, right?

I work on another project [0] using InfluxDB as a backend and we had the same kind of issue when we were not batching the writes. It has also been mentioned several times on the InfluxDB ML that batching is mandatory to get high throughput [1].

[0] https://github.com/stackforge/fuel-plugin-lma-collector
[1] https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/influxdb/FWX6VQ6lzD0/gQkAphbcKQAJ
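
For illustration, a minimal batching sketch, assuming the influxdb Python client's InfluxDBClient.write_points() API; the database name and the batch size of 500 are arbitrary placeholders, not values taken from gnocchi:

    from influxdb import InfluxDBClient

    client = InfluxDBClient(host='localhost', port=8086, database='gnocchi')

    def write_batched(points, batch_size=500):
        # Send the points in fixed-size chunks rather than issuing one
        # HTTP request per point.
        for i in range(0, len(points), batch_size):
            client.write_points(points[i:i + batch_size])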

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to gnocchi (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/227533

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to gnocchi (master)

Reviewed: https://review.openstack.org/227533
Committed: https://git.openstack.org/cgit/openstack/gnocchi/commit/?id=1d776b937547b2fd4fe7cdafd0b824fb5c79ab8d
Submitter: Jenkins
Branch: master

commit 1d776b937547b2fd4fe7cdafd0b824fb5c79ab8d
Author: Julien Danjou <email address hidden>
Date: Thu Sep 24 22:17:17 2015 +0200

    Mark InfluxDB driver as experimental

    Change-Id: Ic2ca6ee924af83e117c883e02882ae1d4fede005
    Related-Bug: #1488027

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to gnocchi (master)

Fix proposed to branch: master
Review: https://review.openstack.org/230928

Revision history for this message
Sam Morrison (sorrison) wrote :

We're using influxdb 0.9.4.2 and not seeing this issue. The driver has been working well for us in an environment with ~300 gnocchi resources.

Revision history for this message
Julien Danjou (jdanjou) wrote :

I was able to run the tests with 0.9.4.2 without much issue, indeed. However, the results returned are not correct. I'll open another bug for that.

Changed in gnocchi:
status: Triaged → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on gnocchi (master)

Change abandoned by Ilya Tyaptin (<email address hidden>) on branch: master
Review: https://review.openstack.org/230928
Reason: It doesn't work
