Comment 5 for bug 1630578

Martin Pitt (pitti) wrote :

This is harder to work around/catch, as in the new case the test does *not* time out; it just kills sshd (or something in the kernel that breaks ssh/networking). In general these are cases that we do want to treat as "tmpfail" and auto-restart; I don't want to treat an auxverb failure as a test failure in general.

Perhaps we need to introduce some kind of retry counter, but this would need to span at least half a day -- three tmpfails on the same worker in a row are usually a sign of a broken cloud or a broken testbed image, not a test failure. So perhaps some logic to check if other tests tmpfail on the same worker/cloud, and if not then call that test a failure.
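
A rough sketch of what such a check could look like, purely for illustration. Everything here is hypothetical: the `TmpfailTracker` class, the threshold of three, and the 12-hour window are assumptions derived from the numbers above, not existing code, and the state is held in memory only, whereas a real implementation would need to persist it somewhere.

```python
from dataclasses import dataclass, field

WINDOW_SECONDS = 12 * 60 * 60  # "at least half a day" (assumed value)
MAX_TMPFAILS = 3               # assumed threshold, from "three tmpfails in a row"


@dataclass
class TmpfailTracker:
    """Hypothetical in-memory record of tmpfail events per worker."""

    # worker name -> list of (timestamp, test name) tmpfail events
    events: dict = field(default_factory=dict)

    def record(self, worker, test, now):
        """Remember that `test` tmpfailed on `worker` at time `now` (seconds)."""
        self.events.setdefault(worker, []).append((now, test))

    def classify(self, worker, test, now):
        """Return "fail" if only this test keeps tmpfailing on the worker,
        otherwise "tmpfail" (i.e. keep auto-restarting)."""
        recent = [
            (stamp, name)
            for stamp, name in self.events.get(worker, [])
            if now - stamp <= WINDOW_SECONDS
        ]
        own = sum(1 for _, name in recent if name == test)
        others = any(name != test for _, name in recent)
        # Other tests tmpfailing on the same worker points at a broken
        # cloud or testbed image, so do not blame this test.
        if own >= MAX_TMPFAILS and not others:
            return "fail"
        return "tmpfail"
```

The same idea could be keyed on the cloud/region instead of (or in addition to) the worker; the sketch only shows the per-worker case.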

This would all require state keeping, which we don't currently do (the only state is the AMQP queue contents).