Nova collectd plugin timeout with a lot of instances

Bug #1554502 reported by Swann Croiset on 2016-03-08
This bug affects 1 person
Affects    | Status | Importance | Assigned to   | Milestone
StackLight |        | High       | Swann Croiset |
0.10       |        | Undecided  | Unassigned    |
0.9        |        | High       | Swann Croiset |
1.0        |        | High       | Swann Croiset |

Bug Description

Collectd log:
[2016-03-08 12:35:32] Got exception for 'http://192.168.0.2:8774/v2/2ff7a46d198b4f7ebf5511181aa3da56/servers/detail?all_tenants=1': 'HTTPConnectionPool(host='192.168.0.2', port=8774): Max retries exceeded with url: /v2/2ff7a46d198b4f7ebf5511181aa3da56/servers/detail?all_tenants=1 (Caused by ReadTimeoutError("HTTPConnectionPool(host='192.168.0.2', port=8774): Read timed out. (read timeout=5)",))'

# time nova --debug list --all > /dev/null 2>&1
real 0m7.622s
user 0m0.730s
sys 0m0.112s

# nova list --all | wc -l
632

The current timeout is 5 seconds:

<Module "openstack_nova">
  DependsOnResource "vip__management"
  KeystoneUrl "http://192.168.0.2:5000/v2.0"
  Password "421fSYzxnf5SXwP5pKr9qQmo"
  Tenant "services"
  Timeout "5"
  Username "nova"
</Module>

Swann Croiset (swann-w) wrote :

The other collectd plugins (neutron, cinder, glance) can be affected in the same way.

Changed in lma-toolchain:
milestone: 0.9.0 → 1.0.0
no longer affects: lma-toolchain/1.0

Fix proposed to branch: master
Review: https://review.openstack.org/290380

Changed in lma-toolchain:
assignee: LMA-Toolchain Fuel Plugins (mos-lma-toolchain) → Swann Croiset (swann-w)
status: Confirmed → In Progress
Simon Pasquier (simon-pasquier) wrote :

Increasing the timeout value doesn't cover all situations. By default, the Nova API won't return more than 1,000 items. This can be changed by using the osapi_max_limit configuration parameter but there's still a limit. So once we reach this limit, the collected metrics will be incorrect.

IMO, a more complete fix would at least:
- explicitly set the number of items that the plugin requests per call, and make as many calls as necessary to retrieve all the data.
- keep a local cache and use the Changes-since parameter [1] to reduce the size and the number of subsequent requests.

[1] http://docs.openstack.org/developer/nova/v2/polling_changes-since_parameter.html
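
The first suggestion above can be sketched roughly as follows. This is only an illustration, not the plugin's actual code: it assumes marker-based pagination (the `limit` and `marker` query parameters of the Nova server list), and `fetch_all`, `make_fake_api` and the fake server records are hypothetical names made up for the example. The fake API also caps each page at 1,000 items to mimic osapi_max_limit.

```python
from typing import Callable, Dict, List, Optional

Page = List[Dict[str, str]]
FetchPage = Callable[[Optional[str], int], Page]


def fetch_all(fetch_page: FetchPage, limit: int = 100) -> Page:
    """Collect every item by following marker-based pagination.

    fetch_page(marker, limit) is a hypothetical callable standing in for
    one HTTP request such as GET /servers/detail?limit=<limit>&marker=<marker>.
    """
    items: Page = []
    marker: Optional[str] = None
    while True:
        page = fetch_page(marker, limit)
        if not page:
            # An empty page means there is nothing left to fetch.
            break
        items.extend(page)
        # The next request starts after the last item already received.
        marker = page[-1]["id"]
    return items


def make_fake_api(total: int, server_cap: int = 1000) -> FetchPage:
    """In-memory stand-in for the API (assumption: used for illustration only)."""
    servers = [{"id": "vm-%05d" % i, "status": "ACTIVE"} for i in range(total)]

    def fetch_page(marker: Optional[str], limit: int) -> Page:
        start = 0
        if marker is not None:
            ids = [s["id"] for s in servers]
            start = ids.index(marker) + 1
        # The server never returns more than its own cap, whatever is asked.
        size = min(limit, server_cap)
        return servers[start:start + size]

    return fetch_page


if __name__ == "__main__":
    all_servers = fetch_all(make_fake_api(2500), limit=1000)
    print(len(all_servers))
```

The loop stops only on an empty page rather than on a short page, so it still works when the server silently returns fewer items than requested, at the cost of one extra request.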

Reviewed: https://review.openstack.org/290380
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=9cb06879fe29ec4a66041f14a52ef7d17171d664
Submitter: Jenkins
Branch: master

commit 9cb06879fe29ec4a66041f14a52ef7d17171d664
Author: Swann Croiset <email address hidden>
Date: Wed Mar 9 11:10:27 2016 +0100

    Increase timeout to 20s for Openstack collectd plugins

    And decrease the max_retries from 3 to 2 to stay in the 50 seconds window.
    This change allows to retrieve large number of objects and also avoids to
    overload the system by performing 3 'zombies' requests every 50 seconds
    without any metrics collected.

    Partial-bug: #1554502
    Change-Id: I60a7611bc82598831538da01245b87fb29a15c44
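
The arithmetic behind this commit message can be checked with a small sketch. It rests on two assumptions that the commit does not spell out: that max_retries counts the total number of attempts, and that the attempts run back to back with no delay between them. The function names are hypothetical.

```python
def worst_case_seconds(timeout: float, max_retries: int) -> float:
    """Longest time spent if every attempt runs into the read timeout.

    Assumption: max_retries is the total number of attempts and there is
    no pause between them.
    """
    return timeout * max_retries


def fits_window(timeout: float, max_retries: int, window: float = 50.0) -> bool:
    """True if the worst case still fits in the collection window."""
    return worst_case_seconds(timeout, max_retries) <= window


if __name__ == "__main__":
    for timeout, retries in [(5, 3), (20, 3), (20, 2)]:
        print(timeout, retries, worst_case_seconds(timeout, retries),
              fits_window(timeout, retries))
```

Under these assumptions, a 20-second timeout with 3 attempts could take 60 seconds and overrun the 50-second window, while 2 attempts stay within it at 40 seconds.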

Changed in lma-toolchain:
importance: High → Medium

Reviewed: https://review.openstack.org/292435
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=06356a868745e3e522c6d1adb3f4da1dd47fcd67
Submitter: Jenkins
Branch: stable/0.9

commit 06356a868745e3e522c6d1adb3f4da1dd47fcd67
Author: Swann Croiset <email address hidden>
Date: Wed Mar 9 11:10:27 2016 +0100

    Increase timeout to 20s for Openstack collectd plugins

    And decrease the max_retries from 3 to 2 to stay in the 50 seconds window.
    This change allows to retrieve large number of objects and also avoids to
    overload the system by performing 3 'zombies' requests every 50 seconds
    without any metrics collected.

    Partial-bug: #1554502
    Change-Id: I60a7611bc82598831538da01245b87fb29a15c44
    (cherry picked from commit 9cb06879fe29ec4a66041f14a52ef7d17171d664)

no longer affects: lma-toolchain/1.0
Changed in lma-toolchain:
milestone: 1.0.0 → 0.10.0
Swann Croiset (swann-w) on 2016-06-07
Changed in lma-toolchain:
status: In Progress → Won't Fix
Dmitry Sutyagin (dsutyagin) wrote :

Found by a customer on MOS 8.0 with LMA plugin 0.10.0. The customer has hit the 1,000-VM limit, and their data in Grafana is now inaccurate: the per-state breakdown shows at most 1,000 VMs, while the total number is correct.

tags: added: customer-found support
Andrii Petrenko (aplsms) on 2016-10-27
tags: added: ct1
Dmitry Sutyagin (dsutyagin) wrote :

LMA team, any progress on this?

Changed in lma-toolchain:
status: Won't Fix → In Progress

Reviewed: https://review.openstack.org/427678
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=88d7bb28b31f6807172969522a2a1dedd80e0c53
Submitter: Jenkins
Branch: master

commit 88d7bb28b31f6807172969522a2a1dedd80e0c53
Author: Swann Croiset <email address hidden>
Date: Tue Jan 31 14:11:40 2017 +0100

    Support pagination for OpenStack services

    This concerns:
    - Nova server list
    - Neutron
    - Glance
    - Cinder

    Closes-bug: #1554502
    Closes-bug: #1557455

    Change-Id: Ia8b029080c8a18161441ab9bc13799f26e0941f3

Changed in lma-toolchain:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/434732
Committed: https://git.openstack.org/cgit/openstack/fuel-plugin-lma-collector/commit/?id=1c66e2a93664ef73877b79ca9ecbace14a6903c0
Submitter: Jenkins
Branch: stable/1.0

commit 1c66e2a93664ef73877b79ca9ecbace14a6903c0
Author: Swann Croiset <email address hidden>
Date: Tue Jan 31 14:11:40 2017 +0100

    Support pagination for OpenStack services

    This concerns:
    - Nova server list
    - Neutron
    - Glance
    - Cinder

    Closes-bug: #1554502
    Closes-bug: #1557455

    Change-Id: Ia8b029080c8a18161441ab9bc13799f26e0941f3
    (cherry picked from commit 88d7bb28b31f6807172969522a2a1dedd80e0c53)

Denis Klepikov (dklepikov) wrote :

The large timeout setting (20 seconds) causes false positives or unknown statuses in Nagios: when 2 endpoints are down or misconfigured, the statuses of all the other endpoints cannot be sent to Nagios because the timeout for receiving data is reached.
