buildd-slave-scanner.py regularly aborts loudly on transient errors

Bug #287293 reported by Steve McInerney
Affects: Launchpad itself
Status: Won't Fix
Importance: High
Assigned to: Celso Providelo

Bug Description

The LP error reports list receives regular failure reports from buildd-slave-scanner.py, apparently caused by transient connection failures.

These appear to be a case of 'the boy who cried wolf': they generate a lot of email, (potentially) masking more serious problems. For example, I've received over 70 in the past 24 hours. Properly speaking, every email should be examined in detail to see whether the failure is in fact serious. This obviously takes time. :-)

Example: https://pastebin.canonical.com/10415/

It would be greatly appreciated :-) if these could either be quietened, or if a more appropriate action could be recommended.

As the pastebin shows several "WARNING builder is in manual state. Ignored." messages, can these major aborts be dealt with in a similar fashion?
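
A minimal sketch of what "dealt with in a similar fashion" could look like, assuming the scanner loops over builders and that a failed probe surfaces as a socket error. The names `scan_all`, `probe` and the logger name are hypothetical and are not taken from the Launchpad code base; the sketch is written for Python 3:

```python
import logging
import socket

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("slave-scanner-sketch")


def scan_all(builders, probe):
    """Probe every builder and downgrade socket errors to warnings.

    `builders` is any iterable of builder names and `probe` is a callable
    that contacts one builder; both are stand-ins for whatever the real
    script uses. Returns the list of builders that were skipped.
    """
    skipped = []
    for name in builders:
        try:
            probe(name)
        except (socket.error, socket.timeout) as exc:
            # Same shape as the existing "manual state" warning:
            # note the problem, skip this builder, keep scanning.
            logger.warning("%s: socket error (%s). Ignored.", name, exc)
            skipped.append(name)
    return skipped
```

One failing builder then costs a single warning line rather than aborting the whole run. Whether silently skipping is acceptable depends on how often the same builder fails, so the returned list could be used to escalate repeat offenders.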

Tags: lp-soyuz
Julian Edwards (julian-edwards) wrote :

Steve, do you know why there are transient socket errors in the first place? How normal is it?

I am concerned that it's a genuine problem that should be fixed in the infrastructure and any script changes could mask a bigger issue.

Steve McInerney (spm) wrote :

Do I know why there are transient socket errors? No, that was exactly my question to you. :-)
Pretty much the only view I have of that side is what you see in the two messages in the pastebin linked above.

Normal? Normal enough to be irritating. :-)
We get a blast of 6 or so within 2-5 minutes and then nothing for tens of minutes, or hours, or days. There seem to have been a lot more in recent days to weeks.

I get the impression you were not aware of this transient failure?
If it hasn't been done already, could this be hooked into the OOPS system? Would that generate more detailed information, so that between us all we could figure out what's happening? Or at least help you triage the commonalities between the failures?

At this stage I just don't have enough information to even know where the problem could be, let alone what it is. For example: Is the 'socket error' simply the generic "Something broke" error message? Is the connection timeout in the code base set too low, so that waiting an extra 3 seconds would have shown success? Did the buildd code side get a corrupted packet and deal with it by terminating the connection, which was unexpected and results in the error we see? Was the buildd server even available to be built on, or was it still in a "rebuild" phase, i.e. a nice little race condition?
etc :-)
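
One way to start telling these cases apart, as a rough sketch only: retry the connection with progressively longer timeouts and record which exception each attempt raised. `probe_with_retries` and its parameters are hypothetical; only `socket.create_connection` is a real standard-library call, and the sketch assumes Python 3:

```python
import socket
import time


def probe_with_retries(host, port, timeouts=(5, 10, 20),
                       connect=socket.create_connection,
                       sleep=time.sleep, pause=1.0):
    """Try to connect once per timeout value.

    Returns (ok, attempts), where attempts is a list of
    (timeout, outcome) pairs and outcome is "ok" or the name of the
    exception class raised by that attempt.
    """
    attempts = []
    for i, timeout in enumerate(timeouts):
        try:
            conn = connect((host, port), timeout=timeout)
        except OSError as exc:
            # Timeouts, refused connections and resets are all OSError
            # subclasses; the class name hints at which case was hit.
            attempts.append((timeout, type(exc).__name__))
            if i < len(timeouts) - 1:
                sleep(pause)  # brief pause before the next, longer attempt
        else:
            conn.close()
            attempts.append((timeout, "ok"))
            return True, attempts
    return False, attempts
```

If a later attempt succeeds where the first timed out, that points at a timeout set too low; a run of refused connections would point more towards a builder that is still rebuilding.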

Changed in soyuz:
assignee: nobody → al-maisan
importance: Undecided → High
milestone: none → 2.1.12
status: New → Triaged
Changed in soyuz:
milestone: 2.1.12 → pending
Celso Providelo (cprov)
Changed in soyuz:
milestone: pending → 2.2.1
Changed in soyuz:
milestone: 2.2.1 → 2.2.2
Changed in soyuz:
assignee: al-maisan → cprov
milestone: 2.2.2 → 2.2.3
Celso Providelo (cprov) wrote :

With the new `buildd-manager` we won't see any of those emails anymore.

Changed in soyuz:
status: Triaged → Won't Fix