statsd server can accidentally fail an haproxy node if device is processing a loadbalancer 'DELETE' operation

Bug #1177642 reported by Patrick Crews
This bug affects 1 person
Affects  Status        Importance  Assigned to       Milestone
libra    Fix Released  High        David Shrewsbury

Bug Description

We are seeing the statsd server failing loadbalancer devices that are processing DELETE loadbalancer operations.
While a DELETE is being processed, the haproxy process is down, so the statsd ping (a STATS message) gets a FAIL response and the device is marked as failed.

As already discussed, we need a mechanism for detecting this case.

2013-05-08 02:57:28,391: libra_worker - DEBUG - Return JSON message: {
<snip>
    "hpcs_action": "DELETE",
    "hpcs_device": YYY,
    "hpcs_requestid": NNNN,
    "hpcs_response": "PASS"
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Received JSON message: {
    "hpcs_action": "STATS"
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Entered LBaaSController
2013-05-08 02:57:28,493: libra_worker - INFO - Requested action: STATS
2013-05-08 02:57:28,493: libra_worker - ERROR - STATS failed: <type 'exceptions.Exception'>, HAProxy is not running.
2013-05-08 02:57:28,493: libra_worker - DEBUG - Return JSON message: {
    "hpcs_action": "STATS",
    "hpcs_error": "HAProxy is not running.",
    "hpcs_response": "FAIL"
}

David Shrewsbury (dshrews) wrote :

I see two potential ways to fix this:

1) Add some sort of coordination between statsd and API server. Maybe at the DB level?

2) Allow pings to LBs in the DELETED state. The worker could be changed to recognize that the LB has been deleted and return a PASS message instead of FAIL. Not sure what implications that would have for the current meaning of this ping result or for future uses of the STATS message (true statistics info, etc.).

Andrew Hutchings (linuxjedi) wrote :

1) I added a fix to do that today. Whilst it reduces how often this occurs, it doesn't eliminate it. The problem is that we CREATE/DELETE many times during a Jenkins test run, so an LB can be active during the first probe of the API server, deleted during the ping, and active again during the second check of the API server (this has happened once so far since the fix was deployed).

2) Something like this may be the only option. Maybe a third state such as "DELETED" should be returned?
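
The check/ping/check sequence described in 1) can be sketched as below. This is only an illustration of the race, not libra's code: the function name, the state strings, and the decision rule are all assumptions made for the example.

```python
def device_would_be_failed(state_before, ping_passed, state_after):
    """Hypothetical decision rule: fail the device only if the API server
    reported the LB as active both before and after a failed ping."""
    return (
        state_before == "ACTIVE"
        and state_after == "ACTIVE"
        and not ping_passed
    )


# Timeline from a test run that creates and deletes LBs rapidly:
# the LB is active at the first check, deleted while the ping is in flight
# (haproxy is down, so the ping fails), and re-created before the second check.
timeline = {
    "state_before": "ACTIVE",
    "ping_passed": False,
    "state_after": "ACTIVE",
}

# The double check cannot tell this apart from a genuinely broken device.
false_positive = device_would_be_failed(**timeline)
print(false_positive)  # True
```

Because the LB's state can change between any two probes, checking the API server before and after the ping narrows the window but cannot close it, which is why the worker itself needs to report that the LB was deleted.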

Changed in libra:
assignee: nobody → Andrew Hutchings (linuxjedi)
importance: Undecided → High
Changed in libra:
assignee: Andrew Hutchings (linuxjedi) → David Shrewsbury (dshrews)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to libra (master)

Fix proposed to branch: master
Review: https://review.openstack.org/29411

Changed in libra:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to libra (master)

Reviewed: https://review.openstack.org/29411
Committed: http://github.com/stackforge/libra/commit/45095183ed36cdba12903a4807e1cecd8b9d2f1f
Submitter: Jenkins
Branch: master

commit 45095183ed36cdba12903a4807e1cecd8b9d2f1f
Author: David Shrewsbury <email address hidden>
Date: Thu May 16 13:29:59 2013 -0400

    Return 'status' field for STATS on deleted LB.

    Fixes bug 1177642.

    Due to a race condition in some of our Jenkins tests, it is possible
    that we could send a STATS message to a LB that has just been deleted.
    To recognize this situation, we'll return a FAIL message, but include
    a new 'status' field in the JSON response indicating the LB is deleted.

    Change-Id: I785cfdff526e67f4b55bf3f9bff911052c27ece7
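
A rough sketch of the behaviour the commit message describes, for readers who want to see the shape of the response. The `hpcs_action`, `hpcs_response`, and `hpcs_error` keys are taken from the log above; the `hpcs_status` key name, its "DELETED" value, and both function names are assumptions for illustration and may not match the merged change.

```python
import json


def build_stats_response(haproxy_running, lb_deleted):
    """Build a STATS reply the way a worker might (hypothetical helper)."""
    response = {"hpcs_action": "STATS"}
    if haproxy_running:
        response["hpcs_response"] = "PASS"
        return response
    # HAProxy is down: still a FAIL, as in the original log output.
    response["hpcs_response"] = "FAIL"
    response["hpcs_error"] = "HAProxy is not running."
    if lb_deleted:
        # Assumed field name and value; the commit only says a new
        # 'status' field indicates that the LB is deleted.
        response["hpcs_status"] = "DELETED"
    return response


def should_fail_device(response):
    """Caller-side check (hypothetical): ignore FAILs from deleted LBs."""
    if response.get("hpcs_response") != "FAIL":
        return False
    return response.get("hpcs_status") != "DELETED"


deleted_reply = build_stats_response(haproxy_running=False, lb_deleted=True)
broken_reply = build_stats_response(haproxy_running=False, lb_deleted=False)

print(json.dumps(deleted_reply, indent=4, sort_keys=True))
print(should_fail_device(deleted_reply))  # False
print(should_fail_device(broken_reply))   # True
```

Keeping the response as FAIL and adding a separate field leaves the existing meaning of the ping result unchanged, so only callers that care about deletion need to look at the extra field.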

Changed in libra:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to libra (release-v2)

Fix proposed to branch: release-v2
Review: https://review.openstack.org/29413

OpenStack Infra (hudson-openstack) wrote : Fix merged to libra (release-v2)

Reviewed: https://review.openstack.org/29413
Committed: http://github.com/stackforge/libra/commit/3750ca7c17f7ecd953a891192c23ef022f9f12d0
Submitter: Jenkins
Branch: release-v2

commit 3750ca7c17f7ecd953a891192c23ef022f9f12d0
Author: David Shrewsbury <email address hidden>
Date: Thu May 16 13:29:59 2013 -0400

    Return 'status' field for STATS on deleted LB.

    Fixes bug 1177642.

    Due to a race condition in some of our Jenkins tests, it is possible
    that we could send a STATS message to a LB that has just been deleted.
    To recognize this situation, we'll return a FAIL message, but include
    a new 'status' field in the JSON response indicating the LB is deleted.

    Change-Id: I785cfdff526e67f4b55bf3f9bff911052c27ece7

Changed in libra:
status: Fix Committed → Fix Released