Resume trigger hangs buildd-manager
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Launchpad itself | Fix Released | High | Julian Edwards | |
Bug Description
buildd-manager this morning said http://
To reproduce:
1) Tweak vm_resume_command to something like 'sleep 30'.
2) Enable both sampledata builders.
3) Start launchpad-buildd (so connecting to the non-virt bob will succeed, but the virt frog will fail)
4) Start buildd-manager.
5) Watch buildd-manager and launchpad-buildd logs, noting large breaks in activity while a resume attempt is made on the unreachable frog builder.
I've not been able to confirm whether this also affects normal build-start resumes.
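The stall that these steps provoke can be illustrated outside Launchpad. The sketch below is not buildd-manager code and is not how the bug was fixed; `scan`, `resume_blocking`, `resume_with_timeout`, and the stand-in command are hypothetical names chosen for illustration. It assumes a single loop that visits builders in turn and compares waiting on a slow resume command with giving up after a timeout.

```python
import subprocess
import sys
import time

# Stand-in for a slow vm_resume_command such as 'sleep 30' (shortened to 2s).
SLOW_RESUME = [sys.executable, "-c", "import time; time.sleep(2)"]


def resume_blocking(command):
    """Run the resume command and wait for it, however long it takes."""
    return subprocess.call(command) == 0


def resume_with_timeout(command, timeout):
    """Run the resume command but give up after `timeout` seconds."""
    try:
        return subprocess.run(command, timeout=timeout).returncode == 0
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        return False


def scan(builders, resume):
    """Visit each builder in order; record how long after the start it was reached."""
    start = time.monotonic()
    reached = {}
    for name, needs_resume in builders:
        reached[name] = time.monotonic() - start
        if needs_resume:
            resume()
    return reached


if __name__ == "__main__":
    # 'frog' needs a resume, 'bob' does not (mirroring the sampledata builders).
    builders = [("frog", True), ("bob", False)]
    blocked = scan(builders, lambda: resume_blocking(SLOW_RESUME))
    bounded = scan(builders, lambda: resume_with_timeout(SLOW_RESUME, 0.2))
    print("bob reached after %.2fs (blocking resume)" % blocked["bob"])
    print("bob reached after %.2fs (resume with timeout)" % bounded["bob"])
```

In the blocking variant, the second builder is not looked at until the resume command for the first one returns, which is the kind of break in activity step 5 asks you to watch for.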
Related branches
- Jonathan Lange (community): Approve
Diff: 7192 lines (+2211/-3509), 24 files modified
lib/lp/buildmaster/doc/builder.txt (+2/-118)
lib/lp/buildmaster/interfaces/builder.py (+83/-62)
lib/lp/buildmaster/manager.py (+205/-469)
lib/lp/buildmaster/model/builder.py (+240/-224)
lib/lp/buildmaster/model/buildfarmjobbehavior.py (+60/-52)
lib/lp/buildmaster/model/packagebuild.py (+6/-0)
lib/lp/buildmaster/tests/mock_slaves.py (+157/-32)
lib/lp/buildmaster/tests/test_builder.py (+582/-154)
lib/lp/buildmaster/tests/test_manager.py (+248/-782)
lib/lp/buildmaster/tests/test_packagebuild.py (+12/-0)
lib/lp/code/model/recipebuilder.py (+32/-28)
lib/lp/soyuz/browser/tests/test_builder_views.py (+1/-1)
lib/lp/soyuz/doc/buildd-dispatching.txt (+0/-371)
lib/lp/soyuz/doc/buildd-slavescanner.txt (+0/-876)
lib/lp/soyuz/model/binarypackagebuildbehavior.py (+59/-41)
lib/lp/soyuz/tests/test_binarypackagebuildbehavior.py (+290/-8)
lib/lp/soyuz/tests/test_doc.py (+0/-6)
lib/lp/testing/factory.py (+8/-2)
lib/lp/translations/doc/translationtemplatesbuildbehavior.txt (+0/-114)
lib/lp/translations/model/translationtemplatesbuildbehavior.py (+20/-14)
lib/lp/translations/stories/buildfarm/xx-build-summary.txt (+1/-1)
lib/lp/translations/tests/test_translationtemplatesbuildbehavior.py (+202/-153)
lib/lp_sitecustomize.py (+3/-0)
utilities/migrater/file-ownership.txt (+0/-1)
- Brad Crittenden (community): Approve (code)
tags: | added: canonical-losa-lp |
tags: | added: buildd-manager |
Changed in soyuz: | |
status: | New → Triaged |
importance: | Undecided → High |
Changed in soyuz: | |
status: | Triaged → In Progress |
assignee: | nobody → Julian Edwards (julian-edwards) |
tags: | added: bad-commit-11801 |
tags: | added: qa-ok removed: qa-needstesting |
Changed in soyuz: | |
status: | Fix Committed → Fix Released |
tags: | added: buildd-scalability |
Actually I saw this happening on dogfood but it always came back to life. There was also no break in activity on other builders.
The traceback shows the call stack to be updateBuilderStatus -> checkSlaveAlive -> self.slave.echo.
Can you explain what you mean by "timeout code doesn't work when a resume is triggered from within Builder"?