gnocchi statsd consumes all overcloud resources when configured with swift backend

Bug #1621164 reported by John Trowbridge
This bug affects 3 people
Affects        Status      Importance   Assigned to      Milestone
Gnocchi        Invalid     High         Julien Danjou
Gnocchi 2.1    Invalid     High         Julien Danjou
Gnocchi 2.2    Invalid     High         Julien Danjou
tripleo        Won't Fix   High         Unassigned

Bug Description

We have gnocchi enabled on the overcloud with the swift backend by default. This is currently broken: the statsd service is unable to start and consumes all available resources on the overcloud while trying. Eventually, this leads to the overcloud deploy timing out.

I have only reproduced it when using the pacemaker config, but I am not sure if that config just has less wiggle room resource-wise.

This is the deploy command that reproduces the issue:

openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates \
    --libvirt-type qemu \
    --timeout 90 \
    -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml

In the multinode scenario jobs we are testing gnocchi with the file backend:
https://github.com/openstack-infra/tripleo-ci/blob/master/test-environments/scenario001-multinode.yaml#L54

We should either make that the default in our user-facing environments, or disable gnocchi by default there.
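For reference, the file backend used in that scenario amounts to a parameter_defaults override along the following lines (a sketch based on the linked scenario file; the exact contents there may differ):

    parameter_defaults:
      GnocchiBackend: file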

John Trowbridge (trown)
Changed in tripleo:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Emilien Macchi (emilienm) wrote :
Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
milestone: none → newton-rc1
status: Triaged → In Progress
Revision history for this message
Alan Pevec (apevec) wrote :

This should still at least be looked at in Gnocchi; it's not right that it spins like crazy when a dependent resource is unavailable. It should try to reconnect without eating all the CPU.

Revision history for this message
Julien Danjou (jdanjou) wrote :

AFAIK this is already fixed in Gnocchi with commit a59c759a5473b781ea6645a7c722108eff986e1d

Changed in gnocchi:
status: New → Confirmed
Julien Danjou (jdanjou)
Changed in gnocchi:
importance: Undecided → High
status: Confirmed → Fix Committed
assignee: nobody → Julien Danjou (jdanjou)
Revision history for this message
Julien Danjou (jdanjou) wrote :

Actually, I'm not sure it changes anything. From the log I received at https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-113/overcloud-controller-0/var/log/gnocchi/statsd.log.gz I can't really see what gnocchi-statsd would be doing with all that CPU.

Changed in gnocchi:
status: Fix Committed → Triaged
Revision history for this message
John Trowbridge (trown) wrote :

https://review.openstack.org/#/c/366887/ did fix the issue in tripleo for reasonable hardware, but we are still seeing spikes in CPU usage when gnocchi comes online. This is still a problem when using older hardware as it can make the deploy extremely slow.

In RDO we have some slower machines testing tripleo, and the issue is very noticeable there:

https://ci.centos.org/artifacts/rdo/jenkins-tripleo-quickstart-promote-master-delorean-minimal_pacemaker-114/overcloud-controller-0/var/log/messages.gz

CPU usage is high throughout the entire deploy, but things get really bogged down around "Sep 8 02:18:55". It does seem like swift-proxy is being hit pretty hard by gnocchi around that time:

Sep 8 02:18:54 localhost proxy-server: - - 08/Sep/2016/02/18/54 HEAD /v1/AUTH_8f14d89180394919b9a12b64472ade9b HTTP/1.0 204 - Swift - - - - tx525bd3bf466e4c36922aa-0057d0ca8e - 0.0264 RL - 1473301134.110661030 1473301134.137016058 -
Sep 8 02:18:54 localhost proxy-server: 172.16.1.5 172.16.1.5 08/Sep/2016/02/18/54 GET /v1/AUTH_8f14d89180394919b9a12b64472ade9b/measure%3Fformat%3Djson%26limit%3D16%26delimiter%3D/ HTTP/1.0 200 - python-swiftclient-3.0.0 112472f715c34dff... - 2 - tx525bd3bf466e4c36922aa-0057d0ca8e - 0.0611 - - 1473301134.108489037 1473301134.169552088 0

Revision history for this message
Julien Danjou (jdanjou) wrote :

Gnocchi does a GET request to Swift every 5 seconds by default to check if there's any metric to process. Though I doubt a GET every 5 seconds that returns an empty list is what causes high CPU load, unless you see a high peak in swift-proxy too?

A test would be to increase the value of metric_processing_delay in gnocchi.conf to something like 60 or so.
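That would look roughly like this in gnocchi.conf (a sketch; in recent releases the option lives in the [metricd] section, but check the version you have installed):

    [metricd]
    # how long to wait between checks for new measures to process
    metric_processing_delay = 60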

Revision history for this message
Emilien Macchi (emilienm) wrote :

John, do we have any progress on this bug? I'm going to defer it to Ocata if there is no update. Since it works in TripleO CI, I'm reducing the urgency to "High" instead of "Critical".

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
importance: Critical → High
Changed in tripleo:
milestone: newton-rc1 → ocata-1
Revision history for this message
Julien Danjou (jdanjou) wrote :

I'm marking this as Incomplete for Gnocchi since, as far as we know, this is not a problem in Gnocchi. It turns out that recent testing of OSP showed the high IOPS usage was caused by Swift and its account/object auditors doing a lot of verification with a very small delay by default. This has, in theory, been fixed upstream.

Unless we have more data points we can leverage to fix anything, this should be ok.
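For anyone wanting to poke at the Swift auditor side, the knobs in question are the auditor intervals/rates in the Swift server configs. A sketch with purely illustrative values (section and option names assume stock account/object server configs; defaults vary by release):

    # /etc/swift/account-server.conf
    [account-auditor]
    interval = 1800

    # /etc/swift/object-server.conf
    [object-auditor]
    interval = 300
    files_per_second = 20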

Changed in gnocchi:
status: Triaged → Incomplete
Revision history for this message
Steven Hardy (shardy) wrote :

We've been through several iterations of this problem in CI now - can anyone provide an overview of the current status, and what remains (if anything) until we can declare this fixed?

I'm deferring this to ocata-2 as we're planning to release ocata-1 this week.

Changed in tripleo:
milestone: ocata-1 → ocata-2
Changed in tripleo:
milestone: ocata-2 → ocata-3
Changed in tripleo:
milestone: ocata-3 → ocata-rc1
Julien Danjou (jdanjou)
Changed in gnocchi:
status: Incomplete → Invalid
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
milestone: ocata-rc2 → pike-1
Changed in tripleo:
milestone: pike-1 → pike-2
Revision history for this message
Pradeep Kilambi (pkilambi) wrote :

Added a metric_processing_delay param in tripleo to tweak this:

https://review.openstack.org/#/c/458959/

https://review.openstack.org/#/c/458117/

So we have more knobs to tweak if needed; workers is already there.
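For illustration, such a knob could also be pushed through controller extra-config hieradata; this is only a sketch, the proper TripleO parameter name is in the linked reviews, and the hiera key below assumes the puppet-gnocchi option name:

    parameter_defaults:
      ControllerExtraConfig:
        gnocchi::metricd::metric_processing_delay: 60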

Changed in tripleo:
milestone: pike-2 → pike-3
Revision history for this message
Emilien Macchi (emilienm) wrote :

There are currently no open reviews on this bug, so I am changing the status back to the previous state and unassigning it. If there are active reviews related to this bug, please include links in the comments.

Changed in tripleo:
status: In Progress → Triaged
Changed in tripleo:
milestone: pike-3 → pike-rc1
Revision history for this message
Ben Nemec (bnemec) wrote :

So, for TripleO we aren't even deploying telemetry in most of the jobs now. Also, it looks like there were patches merged that were intended to help with this, but we probably need a CI change to make use of them?

At this point it's not clear to me that we need this targeted to Pike. Would anyone object to moving it out to Queens?

Changed in tripleo:
milestone: pike-rc1 → pike-rc2
Changed in tripleo:
milestone: pike-rc2 → queens-1
Revision history for this message
Alex Schultz (alex-schultz) wrote :

As we are no longer deploying telemetry in the CI, I'm going to close this bug as Won't Fix. If this pops up again, feel free to reopen it.

Changed in tripleo:
status: Triaged → Won't Fix