Launchpad itself

buildd-slave-scanner.py regularly aborts loudly on transient errors

Bug #287293 reported by Steve McInerney on 2008-10-22

Affects            Status      Importance   Assigned to       Milestone
Launchpad itself   Won't Fix   High         Celso Providelo   Launchpad itself 2.2.3

Bug Description

The LP error reports list gets regular failure emails from buildd-slave-scanner.py, apparently caused by transient connection failures.

These appear to be a case of 'the boy who cried wolf': they generate a lot of email, (potentially) masking more serious problems. For example, I've received over 70 in the past 24 hours. Properly speaking, each and every email should be examined in detail to see whether the failure is in fact serious. This obviously takes time. :-)

Example: https://pastebin.canonical.com/10415/

It would be greatly appreciated :-) if these could either be quietened, or if a more appropriate way of handling them could be recommended.

As the pastebin shows several "WARNING builder is in manual state. Ignored." messages, can these major aborts be dealt with in a similar fashion?
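To illustrate the kind of handling being asked for, here is a minimal sketch, not the actual buildd-slave-scanner.py code. It assumes a hypothetical `probe` callable that raises a socket error when a builder cannot be reached; the function name, logger name, and message text are all made up for illustration. A socket failure on one builder is logged as a warning and the scan moves on to the next builder instead of aborting.

```python
import logging
import socket

# Hypothetical logger name; the real script's logging setup is not shown in this report.
logger = logging.getLogger("buildd-slave-scanner-sketch")


def scan_builders(builders, probe):
    """Probe every builder, downgrading socket errors to warnings.

    `builders` is an iterable of builder names and `probe` is a callable
    that raises socket.error (or socket.timeout) on a connection failure.
    Both are stand-ins for illustration only.
    Returns the names of the builders that were skipped.
    """
    skipped = []
    for name in builders:
        try:
            probe(name)
        except (socket.error, socket.timeout) as error:
            # Log and carry on, rather than letting one builder abort the run.
            logger.warning("builder %s unreachable (%s). Ignored.", name, error)
            skipped.append(name)
    return skipped
```

Only socket-level errors are caught here, so any other exception would still abort loudly. Whether hiding these failures is actually desirable depends on why the connections fail in the first place.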

Julian Edwards (julian-edwards) wrote on 2008-10-22:

Steve, do you know why there are transient socket errors in the first place? How normal is it?

I am concerned that it's a genuine problem that should be fixed in the infrastructure, and that any script changes could mask a bigger issue.

Steve McInerney (spm) wrote on 2008-10-22:

Do I know why there are transient socket errors? No, that was exactly my question to you. :-)
Pretty much the only view I have of that side is what you see in the two messages in the pastebin linked above.

Normal? Normal enough to be irritating. :-)
We get a blast of 6 or so within 2-5 minutes and then nothing for tens of minutes, or hours, or days. There seem to have been a lot more in recent days and weeks.

I get the impression you were not aware of this transient failure?
If it hasn't been done already, could this be hooked into the OOPS system? Would that generate more detailed information, so that between us all we could figure out what's happening? Or at least help you triage what the failures have in common?

At this stage I just don't have enough information to even know where the problem could be, let alone what it is. For example: Is the 'socket error' simply the generic "Something broke" error message? Is the connection timeout in the code base set too low, so that waiting an extra 3 seconds would have shown success? Did the buildd code side get a corrupted packet and deal with it by terminating the connection, which was unexpected and results in the error we see? Was the buildd server even available to build on, or was it still in a "rebuild" phase, i.e. a nice little race condition?
etc. :-)
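One way to tell a short-lived blip from a persistent failure is to retry with growing delays. This is a minimal sketch under stated assumptions: `call_with_retries` and its parameters are hypothetical, and nothing here reflects how the real scanner talks to the builders. If the call succeeds on a later attempt, the failure was transient; if every attempt fails, the last error is re-raised and can be reported as before.

```python
import socket
import time


def call_with_retries(operation, attempts=3, initial_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying on socket errors with exponential backoff.

    All names and defaults are illustrative assumptions. `sleep` can be
    replaced so the function can be exercised without real waiting.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except socket.error:
            if attempt == attempts:
                # Still failing after every retry: treat it as a real problem.
                raise
            sleep(delay)
            delay *= 2  # wait twice as long before the next attempt
```

Retrying would not explain the failures, but counting how often a retry succeeds could show whether the timeout or a race with a rebuilding builder is involved.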

Julian Edwards (julian-edwards) on 2008-11-19

Changed in soyuz:
assignee:	nobody → al-maisan
importance:	Undecided → High
milestone:	none → 2.1.12
status:	New → Triaged

Julian Edwards (julian-edwards) on 2008-11-24

Changed in soyuz:
milestone:	2.1.12 → pending

Celso Providelo (cprov) on 2008-12-16

Changed in soyuz:
milestone:	pending → 2.2.1

Julian Edwards (julian-edwards) on 2009-01-16

Changed in soyuz:
milestone:	2.2.1 → 2.2.2

Julian Edwards (julian-edwards) on 2009-02-26

Changed in soyuz:
assignee:	al-maisan → cprov
milestone:	2.2.2 → 2.2.3

Celso Providelo (cprov) wrote on 2009-03-04:

With the new `buildd-manager` we won't see any of those emails anymore.

Changed in soyuz:
status:	Triaged → Won't Fix
