setup a counter for the size of the retracing index table

Bug #1331212 reported by Brian Murray
This bug affects 1 person
Affects: Daisy
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

I recently (r459, r460) fixed an issue with submit_core.py where a core file could be written to swift but not to the rabbit queue. (This happened because the connection to rabbit wasn't established and no attempt was made to retry it. This resulted in an "IOError: Socket closed" traceback, which can be found in the OOPS reports for the error tracker.) So it's possible that there are core files in swift that will never be retraced because they aren't in the queue.
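The r459/r460 change itself isn't shown here, so the following is only a generic sketch of the idea it addresses: retrying the queue write instead of giving up after the first "Socket closed" error. `publish_with_retry` and the `publish` callable are hypothetical names; `publish` stands in for whatever opens the rabbit connection and sends the message, and is assumed to raise `IOError` when the connection is unavailable.

```python
import time


def publish_with_retry(publish, message, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call publish(message), retrying on IOError with exponential backoff.

    `publish` is a hypothetical stand-in for the code that (re)connects to
    rabbit and sends the message. Returns True on success; after the last
    failed attempt the IOError is re-raised so the caller can decide what
    to do with the core file that was already written to swift.
    """
    for attempt in range(attempts):
        try:
            publish(message)
            return True
        except IOError:
            if attempt == attempts - 1:
                raise  # out of retries
            # Back off 0.5s, 1s, 2s, ... before trying again.
            sleep(base_delay * (2 ** attempt))
    return False  # only reached when attempts <= 0
```

Passing `sleep` as a parameter keeps the backoff testable without real delays.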

https://oops.canonical.com/static-reports/WHOOPSIE-PROD-2014-06-12.html

We can see 535 of these on the 12th of June. These OOPSes should exist in the OOPS CF and in the retracing index, so we should be able to update them. I think it'd be worthwhile to go through the "IOError: Socket closed" tracebacks since the changeover to the DSE temp ring and check to see if they are still in the retracing index (they may have been removed if a crash with the same SAS was found).

If they are in the retracing index, then we should re-add them to the retracing queue for their arch in rabbit.

If they are not in the retracing index and have been bucketed in a problem, then we should remove the core file from swift.
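The two rules above amount to a small decision per crash. A minimal sketch, assuming the lookups against the OOPS CF, the retracing index, and swift have already been done and their results collected into plain records; `Crash` and `reconcile` are hypothetical names, and the actual rabbit publish and swift delete are left out. The third bucket ("untouched") is an assumption added for crashes that match neither rule, so they aren't silently dropped.

```python
from dataclasses import dataclass


@dataclass
class Crash:
    oops_id: str
    arch: str
    in_retracing_index: bool  # still present in the retracing index?
    bucketed: bool            # already bucketed in a problem?


def reconcile(crashes):
    """Sort "IOError: Socket closed" crashes into follow-up actions.

    Returns (requeue, delete_core, untouched):
      requeue     -- dict arch -> OOPS ids to re-add to that arch's queue
      delete_core -- OOPS ids whose core file can be removed from swift
      untouched   -- OOPS ids matching neither rule (left for review)
    """
    requeue = {}
    delete_core = []
    untouched = []
    for c in crashes:
        if c.in_retracing_index:
            requeue.setdefault(c.arch, []).append(c.oops_id)
        elif c.bucketed:
            delete_core.append(c.oops_id)
        else:
            untouched.append(c.oops_id)
    return requeue, delete_core, untouched
```

Being in the retracing index takes priority: such a crash is re-queued regardless of whether it is also bucketed, matching the order of the two rules.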

Evan (ev) wrote:

That sounds workable. What do you propose for handling Rabbit failures in the future? Should we just eat the exception and discard the core file, knowing we'll get more until we receive and write a complete one? Or perhaps we should write to a CF in Cassandra that can be read from by the retracers as well?

Brian Murray (brian-murray) wrote:

Regardless of whether or not there is a rabbit failure, it is written to the retracing indexes column family. So we could use that as a source instead of the list of core files in swift.

I think it'd be good to set up a counter or metric for the retracing index so we can monitor whether or not it is greater than the length of the retracing queues. If it is greater, then we have some items to re-add to the rabbit retracing queue.
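A rough sketch of that check, under stated assumptions: `RetracingIndexCounter` is a hypothetical in-memory stand-in for a per-arch counter kept alongside the retracing index (in production this might be a Cassandra counter column), and queue lengths are passed in as plain numbers rather than fetched from rabbit. The counter only changes when index membership actually changes, so re-indexing an OOPS doesn't inflate it. A positive difference is a signal, not an exact count: crashes a retracer has already taken off the queue may still be in the index.

```python
class RetracingIndexCounter:
    """Hypothetical per-arch counter maintained next to the retracing index."""

    def __init__(self):
        self._index = {}   # arch -> set of OOPS ids (stands in for the index)
        self.counts = {}   # arch -> number of entries in the index

    def add(self, arch, oops_id):
        ids = self._index.setdefault(arch, set())
        if oops_id not in ids:  # only count new entries
            ids.add(oops_id)
            self.counts[arch] = self.counts.get(arch, 0) + 1

    def remove(self, arch, oops_id):
        ids = self._index.get(arch, set())
        if oops_id in ids:  # ignore removals of unknown ids
            ids.discard(oops_id)
            self.counts[arch] -= 1


def missing_from_queues(index_counts, queue_lengths):
    """Per arch, how many more entries the index has than the queue.

    Only arches where the index is larger are returned; those are the
    candidates for re-adding items to the rabbit retracing queue.
    """
    return {
        arch: count - queue_lengths.get(arch, 0)
        for arch, count in index_counts.items()
        if count > queue_lengths.get(arch, 0)
    }
```

For example, `missing_from_queues({"amd64": 5, "i386": 2}, {"amd64": 3, "i386": 2})` flags only amd64, with a gap of 2.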

Changed in daisy:
importance: Undecided → Medium
status: New → Triaged
Evan (ev) wrote:

Yeah, that sounds entirely sensible. You might be able to get that data out of JMX.

summary: - more core files in swift than in the retracing queue
+ setup a counter for the size of the retracing queue
summary: - setup a counter for the size of the retracing queue
+ setup a counter for the size of the retracing index table