available devstack node count should be monitored with alerts

Bug #929021 reported by Monty Taylor
This bug affects 1 person
Affects: OpenStack Core Infrastructure
Status: Fix Released
Importance: High
Assigned to: Bhuvan Arumugam

Bug Description

We should monitor the number of available devstack nodes, and we should send alerts if that falls below a threshold.

Monty Taylor (mordred)
Changed in openstack-ci:
status: New → Triaged
importance: Undecided → High
Bhuvan Arumugam (bhuvan) wrote :

Monty, as you may know, Jenkins exposes an API: we can get the status of each slave by appending "api/" to its URL. For example, to list all slaves, we can use the following URLs:
  * JSON format: https://jenkins.openstack.org/computer/api/json
  * XML format: https://jenkins.openstack.org/computer/api/xml

We can then scan through each slave and count the idle executors. If the count falls below the threshold (default: 5), we raise a warning.
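
A minimal sketch of this idea in Python follows. It assumes that the JSON document returned by the URL above carries top-level "totalExecutors" and "busyExecutors" counters; those field names are assumptions about the Jenkins response and should be checked against the actual output. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# URL taken from the comment above.
COMPUTER_API_URL = "https://jenkins.openstack.org/computer/api/json"


def fetch_computer_status(url=COMPUTER_API_URL):
    """Download and parse the JSON document that describes all slaves."""
    with urlopen(url) as response:
        return json.load(response)


def count_idle_executors(status):
    """Return the number of idle executors.

    Assumption: the document has top-level "totalExecutors" and
    "busyExecutors" integer fields.
    """
    return status["totalExecutors"] - status["busyExecutors"]


def below_threshold(status, threshold=5):
    """Return True when the idle executor count is under the threshold."""
    return count_idle_executors(status) < threshold
```

Calling below_threshold(fetch_computer_status()) would then tell the caller whether to raise a warning.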

I would like to write a (Python) script to perform these tasks. Could you please clarify the following questions?
  1) Do we simply print a warning message, or should we send an alert to a specified email address?
  2) How do we execute the script: on demand from the command line, or from another Jenkins job?
  3) Where do I commit the script: the openstack-ci repository, or elsewhere?

OTOH, there is also a native Python API for interacting with Jenkins, jenkinsapi. I believe it is overkill for this operation.
  http://pypi.python.org/pypi/jenkinsapi

Monty Taylor (mordred) wrote :

Hi!

This is actually about a pool of cloud servers that we maintain as part of the devstack-gate scripts. They aren't actual Jenkins slaves (yet, although there is work in the jclouds plugin to achieve this).

For devstack tests, we create a new cloud server, then run devstack on it, then delete the server when we're done. Creating cloud servers fails frequently though - so rather than tying creation of the server to the running of the job, we have a different process that creates the servers and keeps a spare set of them. Even that breaks sometimes - sometimes we can't create new nodes as fast as we consume them, so the pool gets too small, and then tests start failing because they can't get a server to run on.

If you pull https://github.com/openstack-ci/devstack-gate, you'll see the code that manages this process, as well as vmdatabase.py, which contains a description of the database where information is stored about the pool of slaves.

If there were a script that could check whether the slave count was above a given number, we could run a Jenkins job periodically to run the script and then configure that job to send an alert to the IRC channels. Something like:

./devstack-vm-threshold.py 5

This would exit 0 if there were at least 5 available slaves in the pool, and would print a message about how many slaves there were and exit 1 if there were fewer than that.
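
The exit-code behaviour described above could be sketched roughly as follows. This is not the script that was eventually merged; count_available is a hypothetical callable standing in for whatever query reads the pool database, and printing the message in both cases is a choice made for this sketch.

```python
import sys


def check_threshold(available, threshold):
    """Return (exit_code, message) for the given slave count.

    Exit code 0 means at least `threshold` slaves are available,
    1 means the pool is below the threshold.
    """
    message = "%d slaves available (threshold %d)" % (available, threshold)
    if available >= threshold:
        return 0, message
    return 1, message


def main(argv, count_available):
    """Run the check; `count_available` is a hypothetical callable
    that returns the number of ready slaves in the pool."""
    threshold = int(argv[1])
    code, message = check_threshold(count_available(), threshold)
    print(message)
    return code
```

A wrapper script would finish with sys.exit(main(sys.argv, count_available)), so that the Jenkins job fails whenever the pool is too small.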

As a bonus, if the script also wrote out a file that contained two lines, like this:

slaves
5

Then we can configure the jenkins graphing module to make a graph of the slave count over time.
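
Writing that two-line file is straightforward; a sketch under the assumption that the first line is a fixed label and the second is the current count (the function name and default label are illustrative only):

```python
def write_stat_file(path, count, label="slaves"):
    """Write a two-line file: the series label, then the current count."""
    with open(path, "w") as stat_file:
        stat_file.write("%s\n%d\n" % (label, count))
```

Overwriting the file on every run keeps exactly one data point per build, which is what a per-build graph would read.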

description: updated
summary: - available slave count should be monitored with alerts
+ available devstack node count should be monitored with alerts
OpenStack Infra (hudson-openstack) wrote : Fix proposed to devstack-gate (master)

Fix proposed to branch: master
Review: https://review.openstack.org/5822

Changed in openstack-ci:
status: Triaged → In Progress
Bhuvan Arumugam (bhuvan)
Changed in openstack-ci:
assignee: nobody → Bhuvaneswaran A (bhuvan)
Bhuvan Arumugam (bhuvan) wrote :

Monty, can you please review the proposed patch?

Bhuvan Arumugam (bhuvan) wrote :

For the record, James reviewed the patch and left comments.
I've incorporated his review comments and posted a revised patch.

Bhuvan Arumugam (bhuvan) wrote :

The revised patch has been reviewed by James E. Blair. It needs one more approval to complete the merge.

Bhuvan Arumugam (bhuvan) wrote :

Can someone help review/approve the patch?

OpenStack Infra (hudson-openstack) wrote : Fix merged to devstack-gate (master)

Reviewed: https://review.openstack.org/5822
Committed: http://github.com/openstack-ci/devstack-gate/commit/45413b522b3b34c6c89b206c8422ed42825f73e6
Submitter: Jenkins
Branch: master

commit 45413b522b3b34c6c89b206c8422ed42825f73e6
Author: Bhuvan Arumugam <email address hidden>
Date: Mon Apr 2 12:54:19 2012 -0700

    Bug 929021.

    Monitor ready node count for all providers.

    * devstack-vm-threshold.py
    New script to print count of available nodes, across all providers.
    The threshold is specified in command line. If available node count
    is less than threshold, the script exit with error code (1).

    The available node count is written to a file. File name is
    specified in command line. If file name is not specified, it is
    written to ~/vm-threshold.txt file.

    usage: devstack-vm-threshold.py [-h] -t threshold [-f stat-file]

    Change-Id: Ib5d24b2a81a79c753ede4bc0c59e17808dc75b18
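
For illustration, the command-line interface in the usage line above could be declared with argparse along these lines. This is a sketch reconstructed from the commit message only, not the merged code; the function name, dest names, and help strings are assumptions.

```python
import argparse
import os


def build_parser():
    """Build a parser matching: [-h] -t threshold [-f stat-file]."""
    parser = argparse.ArgumentParser(prog="devstack-vm-threshold.py")
    parser.add_argument(
        "-t", dest="threshold", metavar="threshold", type=int, required=True,
        help="minimum number of available nodes")
    parser.add_argument(
        "-f", dest="stat_file", metavar="stat-file",
        # Default taken from the commit message: ~/vm-threshold.txt
        default=os.path.expanduser("~/vm-threshold.txt"),
        help="file to write the available node count to")
    return parser
```

With this parser, build_parser().parse_args(["-t", "5"]) yields a threshold of 5 and the default stat file path under the user's home directory.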

Changed in openstack-ci:
status: In Progress → Fix Committed
Bhuvan Arumugam (bhuvan)
Changed in openstack-ci:
status: Fix Committed → Fix Released