All the evidence suggests that this is a transient problem, so "stopped working" isn't accurate. I've grovelled through the logs, and every branch that had a failure later succeeded, except for failures that happened today (and presumably haven't had a chance to succeed yet).
There's also no reason to believe that the ConcurrentUpdateError is inaccurate. The problem is just how we handle it. I believe we should retry the job instead of oopsing, and only oops if we exceed max_retries.
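To make the proposal concrete, here is a rough sketch of the retry-then-oops behaviour I have in mind. It is not the real job runner: run_with_retries, job and report_oops are hypothetical names, and ConcurrentUpdateError is redefined as a stand-in so the snippet runs on its own. Only max_retries comes from the discussion above.

```python
class ConcurrentUpdateError(Exception):
    """Stand-in for the real exception, defined here so the sketch runs alone."""


def run_with_retries(job, max_retries, report_oops):
    """Run job(), retrying on ConcurrentUpdateError.

    job and report_oops are hypothetical callables: job does the work,
    report_oops records an oops for the given error.
    """
    retries = 0
    while True:
        try:
            return job()
        except ConcurrentUpdateError as error:
            if retries >= max_retries:
                # Retry budget exhausted, so this one is worth an oops.
                report_oops(error)
                raise
            # Transient conflict: try again without oopsing.
            retries += 1
```

With this shape, a job that hits a concurrent update a couple of times and then succeeds never produces an oops; only a job that fails more than max_retries times in a row does. A real runner would presumably reschedule the job rather than loop in place, but the accounting would be the same.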