libra

statsd server can accidentally fail an haproxy node if device is processing a loadbalancer 'DELETE' operation

Bug #1177642 reported by Patrick Crews on 2013-05-08

This bug affects 1 person

Affects  Status        Importance  Assigned to       Milestone
libra    Fix Released  High        David Shrewsbury

Bug Description

The statsd server is failing load balancer devices that are in the middle of processing a DELETE load balancer operation.
During a DELETE the haproxy process is down, so the statsd ping (a STATS request) gets a FAIL reply and the device is marked as failed.

As already discussed, we need a mechanism for detecting this case.

2013-05-08 02:57:28,391: libra_worker - DEBUG - Return JSON message: {
<snip>
    "hpcs_action": "DELETE",
    "hpcs_device": YYY,
    "hpcs_requestid": NNNN,
    "hpcs_response": "PASS"
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Received JSON message: {
    "hpcs_action": "STATS"
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Entered LBaaSController
2013-05-08 02:57:28,493: libra_worker - INFO - Requested action: STATS
2013-05-08 02:57:28,493: libra_worker - ERROR - STATS failed: <type 'exceptions.Exception'>, HAProxy is not running.
2013-05-08 02:57:28,493: libra_worker - DEBUG - Return JSON message: {
    "hpcs_action": "STATS",
    "hpcs_error": "HAProxy is not running.",
    "hpcs_response": "FAIL"
}

David Shrewsbury (dshrews) wrote on 2013-05-08:

I see two potential ways to fix this:

1) Add some sort of coordination between statsd and API server. Maybe at the DB level?

2) Allow pings to LBs in the DELETED state. The worker could be changed to recognize that it has been deleted and return a PASS message instead of FAIL. I'm not sure what implications that would have for the current meaning of this ping result or for future uses of the STATS message (true statistics info, etc.).

Andrew Hutchings (linuxjedi) wrote on 2013-05-13:

1) I added a fix to do that today. It reduces the occurrence of this but doesn't eliminate it. The problem is that we CREATE/DELETE many times during a Jenkins test run, so an LB can be active during the first probe of the API server, deleted during the ping, and active again during the second check of the API server (this has happened once so far since the fix was deployed).

2) Something like this may be the only option. Maybe a third state such as "DELETED" should be returned?
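
To illustrate the "third state" idea from the statsd side, here is a minimal sketch of how a ping result could be classified so that a deleted LB is not treated as a failed device. The function names and the "hpcs_status" key are assumptions made for illustration; the thread only says a state such as "DELETED" might be returned, not what the field would be called.

```python
import json


def classify_stats_reply(raw_reply):
    """Map a worker STATS reply to 'ok', 'deleted', or 'failed'.

    The 'hpcs_status' key is hypothetical; the other keys appear in the
    worker log attached to this bug.
    """
    reply = json.loads(raw_reply)
    if reply.get("hpcs_response") == "PASS":
        return "ok"
    # Assumed third state: the worker reports that the LB was deleted.
    if reply.get("hpcs_status") == "DELETED":
        return "deleted"
    return "failed"


def should_fail_device(raw_reply):
    """Only a genuine failure should take the device out of service."""
    return classify_stats_reply(raw_reply) == "failed"
```

With something like this, the reply shown in the bug description (FAIL with no extra field) would still fail the device, while a FAIL that carries the deleted marker would not.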

Changed in libra:
assignee:	nobody → Andrew Hutchings (linuxjedi)
importance:	Undecided → High

David Shrewsbury (dshrews) on 2013-05-16

Changed in libra:
assignee:	Andrew Hutchings (linuxjedi) → David Shrewsbury (dshrews)

OpenStack Infra (hudson-openstack) wrote on 2013-05-16: Fix proposed to libra (master)

Fix proposed to branch: master
Review: https://review.openstack.org/29411

Changed in libra:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2013-05-16: Fix merged to libra (master)

Reviewed: https://review.openstack.org/29411
Committed: http://github.com/stackforge/libra/commit/45095183ed36cdba12903a4807e1cecd8b9d2f1f
Submitter: Jenkins
Branch: master

commit 45095183ed36cdba12903a4807e1cecd8b9d2f1f
Author: David Shrewsbury <email address hidden>
Date: Thu May 16 13:29:59 2013 -0400

Return 'status' field for STATS on deleted LB.

Fixes bug 1177642.

    Due to a race condition in some of our Jenkins tests, it is possible
    that we could send a STATS message to a LB that has just been deleted.
    To recognize this situation, we'll return a FAIL message, but include
    a new 'status' field in the JSON response indicating the LB is deleted.

Change-Id: I785cfdff526e67f4b55bf3f9bff911052c27ece7
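
For readers who want a picture of what the commit message describes, the following is a rough sketch of a worker-side STATS reply that still returns FAIL but adds a status field when the LB has been deleted. It is not the code from the commit: the function name, the two boolean parameters, and the "hpcs_status" key are assumptions (the commit message only says a new 'status' field is included), and how the worker learns that the LB was deleted is not covered here.

```python
def build_stats_reply(haproxy_running, lb_deleted):
    """Build a STATS reply dict in the style of the worker log above.

    haproxy_running and lb_deleted are hypothetical inputs; a real worker
    would have to determine both itself.
    """
    msg = {"hpcs_action": "STATS"}
    if haproxy_running:
        msg["hpcs_response"] = "PASS"
        return msg

    # HAProxy is down: keep reporting FAIL, as before the fix.
    msg["hpcs_response"] = "FAIL"
    msg["hpcs_error"] = "HAProxy is not running."
    if lb_deleted:
        # Assumed field name for the new 'status' field.
        msg["hpcs_status"] = "DELETED"
    return msg
```

Keeping the response as FAIL means existing consumers see no change in behavior, while a consumer that knows about the extra field can tell a deleted LB apart from a broken one.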

Changed in libra:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2013-05-16: Fix proposed to libra (release-v2)

Fix proposed to branch: release-v2
Review: https://review.openstack.org/29413

OpenStack Infra (hudson-openstack) wrote on 2013-05-16: Fix merged to libra (release-v2)

Reviewed: https://review.openstack.org/29413
Committed: http://github.com/stackforge/libra/commit/3750ca7c17f7ecd953a891192c23ef022f9f12d0
Submitter: Jenkins
Branch: release-v2

commit 3750ca7c17f7ecd953a891192c23ef022f9f12d0
Author: David Shrewsbury <email address hidden>
Date: Thu May 16 13:29:59 2013 -0400

Return 'status' field for STATS on deleted LB.

Fixes bug 1177642.

Change-Id: I785cfdff526e67f4b55bf3f9bff911052c27ece7

David Shrewsbury (dshrews) on 2013-05-17

Changed in libra:
status:	Fix Committed → Fix Released
