setup a counter for the size of the retracing index table
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Daisy | Triaged | Medium | Unassigned |
Bug Description
I recently (r459, r460) fixed an issue with submit_core.py where a core file could be written to swift but not written to the rabbit queue. (This happened because the connection to rabbit wasn't established and no attempts were made to retry it, which resulted in an "IOError: Socket closed" traceback; those tracebacks can be found in the OOPS reports for the error tracker.) So it's possible that there are core files in swift that will never be retraced because they aren't in the queue.
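A minimal sketch of the kind of retry that would have avoided the original failure: instead of publishing once and letting the socket error propagate, reconnect and retry a few times. The `publish` and `reconnect` callables are hypothetical stand-ins for the real rabbit operations in submit_core.py, not its actual API.

```python
import time

def publish_with_retry(publish, reconnect, attempts=3, delay=1.0):
    """Attempt to publish to the rabbit queue, reconnecting and
    retrying on socket errors rather than failing after one try.

    `publish` and `reconnect` are hypothetical callables standing in
    for the real connection handling in submit_core.py."""
    for attempt in range(attempts):
        try:
            publish()
            return True
        except IOError:  # e.g. "IOError: Socket closed"
            if attempt == attempts - 1:
                raise  # out of retries; let the caller decide
            time.sleep(delay)
            reconnect()
    return False
```

On the final failed attempt the exception is re-raised, so the caller can still fall back to another strategy (such as the Cassandra CF idea discussed below in the comments).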
We can see 535 of these on the 12th of June. These OOPSes should exist in the OOPS CF and in the retracing index, so we should be able to update them. I think it'd be worthwhile to go through the "IOError: Socket closed" tracebacks since the changeover to the DSE temp ring and check to see if they are still in the retracing index (they may have been removed if a crash with the same SAS was found).
If they are in the retracing index then we should re-add them to the retracing queue for their arch in rabbit.
If they are not in the retracing index and have been bucketed in a problem then we should remove the core file from swift.
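The two rules above can be sketched as a small decision function. This is only an illustration of the proposed cleanup logic; the lookup predicates and action names are assumptions, not Daisy's actual code.

```python
def reconcile_core(in_retracing_index, bucketed):
    """Decide what to do with a core file that is in swift but
    missing from the rabbit retracing queue.

    Sketch only: the boolean inputs stand in for lookups against
    the retracing index and the problem buckets described above."""
    if in_retracing_index:
        # Still awaiting a retrace: put it back on the queue
        # for its architecture.
        return 'requeue'
    if bucketed:
        # Already bucketed into a problem, so the core file in
        # swift is no longer needed.
        return 'delete-core'
    # Neither indexed nor bucketed: leave it for manual review.
    return 'investigate'
```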
summary changed:
- more core files in swift than in the retracing queue
+ setup a counter for the size of the retracing queue

summary changed:
- setup a counter for the size of the retracing queue
+ setup a counter for the size of the retracing index table
That sounds workable. What do you propose for handling Rabbit failures in the future? Should we just eat the exception and discard the core file, knowing we'll get more until we receive and write a complete one? Or perhaps we should write to a CF in Cassandra that the retracers can read from as well?
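The second option might look something like the following: try rabbit first, and on a socket failure record the OOPS id in a fallback column family that the retracers sweep later. Everything here is hypothetical; a dict stands in for the Cassandra CF, and the function names are illustrations of the idea, not an existing interface.

```python
# In-memory stand-in for the proposed "pending retrace" column
# family; a real implementation would write to Cassandra instead.
pending_cf = {}

def submit(oops_id, publish):
    """Try to queue the OOPS in rabbit; on a socket failure, park it
    in the fallback CF instead of discarding the core file."""
    try:
        publish(oops_id)
        return 'queued'
    except IOError:  # e.g. "IOError: Socket closed"
        pending_cf[oops_id] = True
        return 'deferred'

def drain(publish):
    """Retracer-side sweep: re-publish anything left in the
    fallback CF, removing entries once they are queued."""
    for oops_id in list(pending_cf):
        publish(oops_id)
        del pending_cf[oops_id]
```

The trade-off versus simply discarding the core is durability: nothing is lost while rabbit is down, at the cost of an extra Cassandra write path and a periodic sweep on the retracer side.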