PPA ssh reset trigger can hang the buildd-manager indefinitely

Bug #404693 reported by James Troup
This bug affects 2 people
Affects           Status        Importance  Assigned to     Milestone
Launchpad itself  Fix Released  High        Julian Edwards

Bug Description

1000 21864 5.8 7.0 532052 145280 ? Sl Jul22 253:45 /usr/bin/python2.4 /srv/launchpad.net/codelines/current//bin/twistd --pidfile /srv/launchpad.net/var/buildd-manager.pid --python /srv/launchpad.net/codelines/current//daemons/buildd-manager.tac --logfile /srv/launchpad.net/production-logs/buildd-manager.log
1000 16442 0.0 0.1 49820 2876 ? S 18:59 0:00 \_ ssh -i /home/lp_buildd/.ssh/ppa-reset-builder <email address hidden>

i.e. it's been blocked on that ssh for almost 5 hours

Probably wants a pretty low (minutes) timeout?
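
A minimal sketch of what such a timeout could look like, assuming the reset trigger is run as a child process. This is not the buildd-manager's actual code: `run_reset_command` and the 120-second default are hypothetical, chosen only for illustration. It uses the standard library's `subprocess.run`, which kills the child when the timeout expires.

```python
import subprocess


def run_reset_command(argv, timeout_seconds=120):
    """Run a reset command, giving up after timeout_seconds.

    Hypothetical helper, not part of Launchpad. Returns a tuple
    (returncode, stdout, stderr); returncode is None on timeout.
    """
    try:
        completed = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,  # the child is killed once this expires
        )
    except subprocess.TimeoutExpired:
        # The command never finished: report the timeout instead of blocking.
        return None, "", "timed out after %s seconds" % timeout_seconds
    return completed.returncode, completed.stdout, completed.stderr
```

If the command is an `ssh` invocation, OpenSSH's own `-o ConnectTimeout=...`, `-o ServerAliveInterval=...` and `-o BatchMode=yes` options could be added to `argv` as a second line of defence, so that the client itself gives up on a dead connection.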

Celso Providelo (cprov)
Changed in soyuz:
assignee: nobody → Celso Providelo (cprov)
importance: Undecided → High
milestone: none → 2.2.8
status: New → In Progress
Celso Providelo (cprov) wrote :

r8324 (devel)

Changed in soyuz:
status: In Progress → Fix Committed
tags: added: soyuz-build
Celso Providelo (cprov)
Changed in soyuz:
status: Fix Committed → Fix Released
LaMont Jones (lamont) wrote :

Still happens, e.g. at 11:38 London time today, against doubah.

Changed in soyuz:
status: Fix Released → Confirmed
Changed in soyuz:
status: Confirmed → Triaged
assignee: Celso Providelo (cprov) → Julian Edwards (julian-edwards)
milestone: 2.2.8 → 3.1.12
Julian Edwards (julian-edwards) wrote :

The timeout is currently set to 180 seconds. I've no idea why it doesn't trigger; it might be a different problem in the buildd-manager. Is this still happening, and if so, how often?

Changed in soyuz:
milestone: 3.1.12 → 10.01
Julian Edwards (julian-edwards) wrote :

And bang on cue this happens:

https://pastebin.canonical.com/26206/

Julian Edwards (julian-edwards) wrote :

Oddly, it's not honouring the timeout; it appears to bail out immediately after trying to send data down the socket.

tags: added: buildd-manager
Changed in soyuz:
assignee: Julian Edwards (julian-edwards) → Jelmer Vernooij (jelmer)
Michael Hudson-Doyle (mwhudson) wrote :

Jelmer and I looked at this and it seemed impossible, but I don't remember if that was before or after we found that there are two implementations of resetting virtual builders -- lp.buildmaster.model.BuilderSlave.resume and buildmaster.manager.RecordingSlave.resumeSlave -- and only one of them honours the timeout (the RecordingSlave one).

Now I know a bit more about how the manager works, it seems likely that this is indeed the problem -- but I don't know why and when the non-recording slave's resume method is called.
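
One way to avoid this kind of divergence is to route both entry points through a single helper that owns the timeout. The sketch below only illustrates that idea and is not the Launchpad code: `resume_with_timeout`, `DirectSlaveSketch` and `RecordingSlaveSketch` are hypothetical stand-ins, and it uses a blocking `subprocess.run` rather than whatever mechanism the real manager uses. The 180-second default is the value quoted earlier in this report.

```python
import subprocess

RESUME_TIMEOUT_SECONDS = 180  # value quoted earlier in this report


def resume_with_timeout(argv, timeout=RESUME_TIMEOUT_SECONDS):
    """Run the resume command (hypothetical helper).

    Returns True if it exits with status 0, False if it fails or
    does not finish within `timeout` seconds.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


class DirectSlaveSketch:
    """Stand-in for the code path that previously had no timeout."""

    def __init__(self, resume_argv):
        self.resume_argv = resume_argv

    def resume(self, timeout=RESUME_TIMEOUT_SECONDS):
        return resume_with_timeout(self.resume_argv, timeout)


class RecordingSlaveSketch:
    """Stand-in for the code path that already honoured the timeout."""

    def __init__(self, resume_argv):
        self.resume_argv = resume_argv

    def resumeSlave(self, timeout=RESUME_TIMEOUT_SECONDS):
        return resume_with_timeout(self.resume_argv, timeout)
```

With this shape, whichever method ends up being called, a hung resume command is bounded by the same limit.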

Julian Edwards (julian-edwards) wrote :

Me neither without analysing it further :(

Changed in soyuz:
milestone: 10.01 → none
Julian Edwards (julian-edwards) wrote :

This should be fixed in 10.04. See bug 563353 as well.

Changed in soyuz:
assignee: Jelmer Vernooij (jelmer) → Julian Edwards (julian-edwards)
milestone: none → 10.04
status: Triaged → Fix Released
status: Fix Released → Fix Committed
Changed in soyuz:
status: Fix Committed → Fix Released