Add object-server api test monitoring

Bug #1854299 reported by Drew Freiberger
This bug affects 2 people
Affects: OpenStack Swift Storage Charm
Status: Fix Released
Importance: Wishlist
Assigned to: Edin S
Milestone: 20.10

Bug Description

When investigating swift replication alerts, I found that there were some hung object servers returning timeouts when running swift-recon -r:

ubuntu@juju-machine-0-lxc-14:~$ sudo swift-recon -r
sudo: unable to resolve host juju-machine-0-lxc-14
===============================================================================
--> Starting reconnaissance on 12 hosts
===============================================================================
[2019-11-27 20:11:45] Checking on replication
-> http://10.0.0.110:6000/recon/replication/object: timed out
-> http://10.0.0.111:6000/recon/replication/object: timed out

There were no alerts showing that the object server was not functioning on those nodes, because the service was detected as running.

We should ensure that this charm's NRPE monitoring tests the API availability of the swift object/container/account services, not just whether the processes are running.
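To illustrate the gap described above, here is a minimal sketch that separates "the process exists" from "the API answers". The function name diagnose_service, the 5-second timeout, and the process pattern in the usage note are assumptions for illustration only, not something taken from the charm.

```shell
#!/bin/sh
# Sketch only: contrast a process check with an API check.
# $1 = process name pattern (hypothetical), $2 = URL to probe.
diagnose_service() {
    pattern="$1"
    url="$2"
    # Step 1: the kind of check that was already passing in this report.
    if ! pgrep -f "$pattern" >/dev/null 2>&1; then
        echo "DOWN: no process matching '$pattern'"
        return 2
    fi
    # Step 2: ask the API for a reply; --max-time makes a hung server
    # count as a failure instead of blocking the check indefinitely.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
        echo "OK: process running and $url answered"
        return 0
    fi
    echo "HUNG?: process running but $url did not answer"
    return 2
}
```

Something like `diagnose_service swift-object-server http://localhost:6000/recon/replication/object` would then report the hung case separately from a stopped service, using the recon URL that timed out in the output above.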

Andrew McLeod (admcleod)
Changed in charm-swift-storage:
status: New → Triaged
importance: Undecided → Wishlist
Ryan Farrell (whereisrysmind) wrote :

'nc -zv localhost 6000'
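For use under Nagios/NRPE, the nc test above would need to map to plugin exit codes (0 for OK, 2 for CRITICAL). A minimal sketch, assuming a hypothetical helper named check_swift_port and an arbitrary 5-second timeout:

```shell
#!/bin/sh
# Sketch only: wrap a TCP connect test in Nagios-style exit codes.
# $1 = host, $2 = port (6000 is the object server port used in this bug).
check_swift_port() {
    host="$1"
    port="$2"
    # -z: connect without sending data; -w 5: give up after 5 seconds
    if nc -z -w 5 "$host" "$port" >/dev/null 2>&1; then
        echo "OK: $host:$port accepts connections"
        return 0
    fi
    echo "CRITICAL: $host:$port is not accepting connections"
    return 2
}
```

One caveat: a connect-only test can still succeed when a process is hung but its socket is still listening, because the kernel completes the TCP handshake on its behalf. A check that waits for an actual HTTP reply is stricter.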

Ryan Farrell (whereisrysmind) wrote :

We had another incident on a customer's cloud where the object-server service was not responding, yet no alerts were generated for over 48 hours. Left unchecked for that long, this results in very poor Swift performance.

The check can be as simple as what I posted in comment #1.

-Ryan

Andrea Ieri (aieri) wrote :

This would be very easy to implement by reusing the check_http plugin. Something like:

/usr/lib/nagios/plugins/check_http -I localhost -u /recon/version -p 6000 # object server
/usr/lib/nagios/plugins/check_http -I localhost -u /recon/version -p 6001 # container server
/usr/lib/nagios/plugins/check_http -I localhost -u /recon/version -p 6002 # account server

API reference is here: https://docs.openstack.org/swift/latest/admin_guide.html#cluster-telemetry-and-monitoring

But yes, this has caused multiple outages, because a dead object server left unattended will cause a very rapid increase in the number of handoff partitions on a heavily loaded cluster.
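The three check_http invocations quoted above could be run in one pass along these lines. This is a sketch under assumptions: the function name check_swift_apis, the CHECK_HTTP override variable, and the 10-second timeout are illustrative and are not what the charm ships.

```shell
#!/bin/sh
# Sketch only: probe each Swift storage service and return the highest
# exit code seen. Taking the maximum is a simplification, since it ranks
# UNKNOWN (3) above CRITICAL (2).
CHECK_HTTP="${CHECK_HTTP:-/usr/lib/nagios/plugins/check_http}"

check_swift_apis() {
    worst=0
    for entry in object:6000 container:6001 account:6002; do
        name="${entry%%:*}"   # text before the colon
        port="${entry##*:}"   # text after the colon
        # -t 10: a reply slower than 10 seconds counts as a failure
        "$CHECK_HTTP" -I localhost -u /recon/version -p "$port" -t 10 >/dev/null 2>&1
        rc=$?
        echo "$name-server (port $port): exit code $rc"
        if [ "$rc" -gt "$worst" ]; then
            worst=$rc
        fi
    done
    return "$worst"
}
```

In an NRPE setup, separate checks per service are usually preferable so that each alert names the failing server; the combined loop is mainly handy for a quick manual sweep of a node.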

James Page (james-page) wrote :

Please can we consider switching this from field-high -> field-medium and including this work for the bootstack engineering team on rotation for charm work.

Edin S (exsdev) wrote :
Changed in charm-swift-storage:
assignee: nobody → Edin S (exsdev)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-swift-storage (master)

Reviewed: https://review.opendev.org/747105
Committed: https://git.openstack.org/cgit/openstack/charm-swift-storage/commit/?id=9feb5e55964f5619f5d62ab1b64a372d7ad9824d
Submitter: Zuul
Branch: master

commit 9feb5e55964f5619f5d62ab1b64a372d7ad9824d
Author: Edin Sarajlic <email address hidden>
Date: Thu Aug 20 15:29:37 2020 +1000

    Monitor Swift Object/Container/Account API availability

    As per the bug report, it's not enough to simply monitor that the
    appropriate process is alive; there have been instances of the process
    being alive but the port/API being unavailable.

    This patch adds monitoring for Object/Container/Account API
    availability.

    I've tested the fix in my small test environment and I can confirm
    it's working.

    For reference, the following branch/commit was used as a functional
    test (later rejected with the aim of moving the checks to Mojo):
    https://github.com/openstack-charmers/zaza-openstack-tests/pull/395

    Change-Id: I60c5b74279f71ca8f8bc769c93af2eab1f59e002
    Closes-Bug: #1854299

Changed in charm-swift-storage:
status: In Progress → Fix Committed
Changed in charm-swift-storage:
milestone: none → 20.10
Changed in charm-swift-storage:
status: Fix Committed → Fix Released