gnocchi statsd consumes all overcloud resources when configured with swift backend

Bug #1621164 reported by John Trowbridge
This bug affects 3 people
Affects        Status      Importance   Assigned to      Milestone
Gnocchi        Invalid     High         Julien Danjou
Gnocchi 2.1    Invalid     High         Julien Danjou
Gnocchi 2.2    Invalid     High         Julien Danjou
tripleo        Won't Fix   High         Unassigned

Bug Description

We have gnocchi enabled on the overcloud with the swift backend by default. This is currently broken: the statsd service is unable to start and consumes all available resources on the overcloud while trying. Eventually, this leads to the overcloud deploy timing out.

I have only reproduced it when using the pacemaker config, but I am not sure if that config just has less wiggle room resource-wise.

This is the deploy command that reproduces the issue:

openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --libvirt-type qemu \
    --timeout 90 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

In the multinode scenario jobs we are testing gnocchi with the file backend:
https://github.com/openstack-infra/tripleo-ci/blob/master/test-environments/scenario001-multinode.yaml#L54

We should either make that the default in our user-facing environments, or disable gnocchi by default there.
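For reference, the file backend used in that scenario amounts to a parameter_defaults override along the following lines (a sketch based on the linked scenario file; the exact contents there may differ):

    parameter_defaults:
      GnocchiBackend: file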

John Trowbridge (trown)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
milestone: none → newton-rc1
status: Triaged → In Progress
Revision history for this message
Alan Pevec (apevec) wrote :

This should still at least be looked at in Gnocchi; it's not right that it spins like crazy when a dependent resource is unavailable. It should try to reconnect without eating all the CPU.

Revision history for this message
Julien Danjou (jdanjou) wrote :

AFAIK this is already fixed in Gnocchi with commit a59c759a5473b781ea6645a7c722108eff986e1d

Changed in gnocchi:
status: New → Confirmed
Julien Danjou (jdanjou)
Changed in gnocchi:
importance: Undecided → High
status: Confirmed → Fix Committed
assignee: nobody → Julien Danjou (jdanjou)
Revision history for this message
Julien Danjou (jdanjou) wrote :

Actually, I'm not sure it changes anything. From the log I received at https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-113/overcloud-controller-0/var/log/gnocchi/statsd.log.gz I can't really see what gnocchi-statsd would be doing with all that CPU.

Changed in gnocchi:
status: Fix Committed → Triaged
Revision history for this message
John Trowbridge (trown) wrote :

https://review.openstack.org/#/c/366887/ did fix the issue in tripleo for reasonable hardware, but we are still seeing spikes in CPU usage when gnocchi comes online. This is still a problem when using older hardware as it can make the deploy extremely slow.

In RDO we have some slower machines testing tripleo, and the issue is very noticeable there:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-114/overcloud-controller-0/var/log/messages.gz

CPU usage is high throughout the entire deploy, but things get really bogged down around "Sep 8 02:18:55". It does seem like swift-proxy is being hit pretty hard by gnocchi around that time:

Sep 8 02:18:54 localhost proxy-server: - - 08/Sep/2016/02/18/54 HEAD /v1/AUTH_8f14d89180394919b9a12b64472ade9b HTTP/1.0 204 - Swift - - - - tx525bd3bf466e4c36922aa-0057d0ca8e - 0.0264 RL - 1473301134.110661030 1473301134.137016058 -
Sep 8 02:18:54 localhost proxy-server: 172.16.1.5 172.16.1.5 08/Sep/2016/02/18/54 GET /v1/AUTH_8f14d89180394919b9a12b64472ade9b/measure%3Fformat%3Djson%26limit%3D16%26delimiter%3D/ HTTP/1.0 200 - python-swiftclient-3.0.0 112472f715c34dff... - 2 - tx525bd3bf466e4c36922aa-0057d0ca8e - 0.0611 - - 1473301134.108489037 1473301134.169552088 0

Revision history for this message
Julien Danjou (jdanjou) wrote :

Gnocchi does a GET request to Swift every 5 seconds by default to check if there's any metric to process. Though I doubt a GET every 5 seconds that returns an empty list is what causes high CPU load, unless you see a high peak in swift-proxy too?

A test would be to increase the value of metric_processing_delay in gnocchi.conf to something like 60 or so.
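That would look roughly like this in gnocchi.conf (a sketch; in recent releases the option lives in the [metricd] section, but check the version you have installed):

    [metricd]
    # how long to wait between checks for new measures to process
    metric_processing_delay = 60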

Revision history for this message
Emilien Macchi (emilienm) wrote :

John, do we have any progress on this bug? I'm going to defer it to Ocata if there is no update. Since it works in TripleO CI, I'm reducing the urgency to "High" instead of "Critical".

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
importance: Critical → High
Changed in tripleo:
milestone: newton-rc1 → ocata-1
Revision history for this message
Julien Danjou (jdanjou) wrote :

I'm marking this as Incomplete for Gnocchi since, as far as we know, this is not a problem in Gnocchi. It turns out that recent testing of OSP showed the high IOPS usage was caused by Swift and its account/object auditors doing a lot of verification with a very small delay by default. This has, in theory, been fixed upstream.

Unless we have more data points we can leverage to fix anything, this should be ok.
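For anyone wanting to poke at the Swift auditor side, the knobs in question are the auditor intervals/rates in the Swift server configs. A sketch with purely illustrative values (section and option names assume stock account/object server configs; defaults vary by release):

    # /etc/swift/account-server.conf
    [account-auditor]
    interval = 1800

    # /etc/swift/object-server.conf
    [object-auditor]
    interval = 300
    files_per_second = 20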

Changed in gnocchi:
status: Triaged → Incomplete
Revision history for this message
Steven Hardy (shardy) wrote :

We've been through several iterations of this problem in CI now - can anyone provide an overview of the current status, and what remains (if anything) until we can declare this fixed?

I'm deferring this to ocata-2 as we're planning to release ocata-1 this week.

Changed in tripleo:
milestone: ocata-1 → ocata-2
Changed in tripleo:
milestone: ocata-2 → ocata-3
Changed in tripleo:
milestone: ocata-3 → ocata-rc1
Julien Danjou (jdanjou)
Changed in gnocchi:
status: Incomplete → Invalid
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
milestone: ocata-rc2 → pike-1
Changed in tripleo:
milestone: pike-1 → pike-2
Revision history for this message
Pradeep Kilambi (pkilambi) wrote :

Added a metric_processing_delay param in tripleo to tweak this:

https://review.openstack.org/#/c/458959/

https://review.openstack.org/#/c/458117/

So we have more knobs to tweak if needed; workers is already there.
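For illustration, such a knob could also be pushed through controller extra-config hieradata; this is only a sketch, the proper TripleO parameter name is in the linked reviews, and the hiera key below assumes the puppet-gnocchi option name:

    parameter_defaults:
      ControllerExtraConfig:
        gnocchi::metricd::metric_processing_delay: 60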

Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Emilien Macchi (emilienm) wrote :

There are currently no open reviews on this bug, so I am changing the status back to the previous state and unassigning it. If there are active reviews related to this bug, please include links in the comments.

Changed in tripleo:
status: In Progress → Triaged
Changed in tripleo:
milestone: pike-3 → pike-rc1
Revision history for this message
Ben Nemec (bnemec) wrote :

So, for TripleO we aren't even deploying telemetry in most of the jobs now. Also, it looks like there were patches merged that were intended to help with this, but we probably need a CI change to make use of them?

At this point it's not clear to me that we need this targeted to Pike. Would anyone object to moving it out to Queens?

Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Revision history for this message
Alex Schultz (alex-schultz) wrote :

As we are no longer deploying telemetry in the CI, I'm going to close this bug as Won't Fix. If this pops up again, feel free to reopen it.

Changed in tripleo:
status: Triaged → Won't Fix