statsd server can accidentally fail an haproxy node if device is processing a loadbalancer 'DELETE' operation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libra | Fix Released | High | David Shrewsbury |
Bug Description
We are seeing the statsd server fail loadbalancer devices that are processing DELETE loadbalancer operations. While a DELETE is in progress the haproxy process is down, so the statsd ping fails the node. As already discussed, we need a mechanism for detecting this state.
2013-05-08 02:57:28,391: libra_worker - DEBUG - Return JSON message: {
<snip>
"hpcs_action": "DELETE",
"hpcs_device": YYY,
"hpcs_
"hpcs_
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Received JSON message: {
"hpcs_action": "STATS"
}
2013-05-08 02:57:28,493: libra_worker - DEBUG - Entered LBaaSController
2013-05-08 02:57:28,493: libra_worker - INFO - Requested action: STATS
2013-05-08 02:57:28,493: libra_worker - ERROR - STATS failed: <type 'exceptions.
2013-05-08 02:57:28,493: libra_worker - DEBUG - Return JSON message: {
"hpcs_action": "STATS",
"hpcs_error": "HAProxy is not running.",
"hpcs_
}
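The failure path in the log above can be sketched conceptually: the worker's STATS handler checks whether haproxy is alive and, finding it down mid-DELETE, returns a FAIL response. This is a minimal illustration only; the `haproxy_running`/`handle_stats` names, the pid-file path, and the `hpcs_response` field are assumptions, not libra's actual API.

```python
import json
import os


def haproxy_running(pid_file):
    """Best-effort check that the haproxy process is alive (hypothetical helper)."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0 probes for existence without sending a signal
        return True
    except (OSError, IOError, ValueError):
        return False


def handle_stats(msg, pid_file="/var/run/haproxy.pid"):
    """Answer a STATS ping; fail the node when haproxy is not running."""
    if not haproxy_running(pid_file):
        msg["hpcs_error"] = "HAProxy is not running."
        msg["hpcs_response"] = "FAIL"
    else:
        msg["hpcs_response"] = "PASS"
    return json.dumps(msg)
```

During a DELETE the pid check fails by design, which is exactly why the ping reports the node as failed even though nothing is wrong.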
Changed in libra:
assignee: Andrew Hutchings (linuxjedi) → David Shrewsbury (dshrews)
Changed in libra:
status: Fix Committed → Fix Released
I see two potential ways to fix this:
1) Add some sort of coordination between statsd and the API server, perhaps at the DB level.
2) Allow pings to LBs in the DELETED state. The worker could be changed to recognize that the device has been deleted and return a PASS message instead of FAIL. I'm not sure what implications that would have for the current meaning of this ping result, or for future uses of the STATS message (for true statistics info, etc.).
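Option 2 above could be sketched as a guard in the worker's STATS handler: if the device is in the DELETED state, a down haproxy is expected and the ping passes. This is a hedged sketch under assumed names; `handle_stats`, the `device_state`/`haproxy_up` parameters, and the `hpcs_response` field are illustrative, not libra's actual interface.

```python
import json


def handle_stats(msg, device_state, haproxy_up):
    """Answer a STATS ping, treating DELETED devices as healthy (option 2 sketch)."""
    if device_state == "DELETED":
        # haproxy is expected to be down after a DELETE,
        # so do not fail the node for it.
        msg["hpcs_response"] = "PASS"
    elif not haproxy_up:
        msg["hpcs_error"] = "HAProxy is not running."
        msg["hpcs_response"] = "FAIL"
    else:
        msg["hpcs_response"] = "PASS"
    return json.dumps(msg)
```

The trade-off noted above remains: once PASS no longer means "haproxy is up", any future consumer of STATS for real statistics would need a separate signal.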