debconf-apt-progress hangs sometimes on SMP systems due to racy debconf protocol handling

Bug #62986 reported by JosefK on 2006-09-29
Affects | Status | Importance | Assigned to | Milestone
debconf (Debian) | Fix Released | Unknown | |
debconf (Ubuntu) | | High | Colin Watson |
Dapper | | High | Colin Watson |

Bug Description

debconf-apt-progress is a helper program used by the installer to run apt-get and turn its progress output into debconf progress bar commands, while passing through any genuine debconf interaction by the packages being installed. Ever since I originally wrote it, that passthrough function has been handled incorrectly: it simply passes through its debconf file descriptors to the child process, which means that child processes may collide with debconf-apt-progress itself while communicating with debconf. This bug is present in all releases since Ubuntu 6.06.

The effect of this is that users installing on SMP systems experience sporadic hangs during the pkgsel step of d-i ("Select and install software"). Since it is a race, frequencies vary; I've heard 30% recently from one person installing on a large number of machines at once. There is absolutely no indication of what has gone wrong, so it gives a very bad impression, and I think therefore it is worth fixing this in Ubuntu 6.06.2.

I believe that this has been fixed in debconf 1.5.14 in Debian unstable and Ubuntu gutsy. While the patch is substantial as it reorganises debconf-apt-progress' main loop, it operates by opening two additional pipes for the command and reply ends of the debconf protocol, and serialising apt status messages and debconf protocol messages using a select loop, ensuring that collisions cannot occur. I am still actively soliciting testing to confirm that this fixes the bug experienced by users and does not regress. https://bugs.launchpad.net/ubuntu/+source/debconf/+bug/62986/comments/16 links to the patch I applied to the test CD images linked to from this bug.
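The serialisation idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not the actual patch: the function name `serialise`, the labels, and the sample messages are assumptions made for the example, and only the reading side is shown (the real fix also routes replies back to the child over a second pipe). A single select loop reads from both the apt status pipe and the debconf command pipe and hands over one complete line at a time, so messages from the two sources can never be interleaved mid-line.

```python
import os
import select


def serialise(sources, handle):
    """Feed complete lines from several pipes to `handle`, one at a time.

    sources: dict mapping a readable file descriptor to a label (hypothetical).
    handle:  callable(label, line) called once per complete line.
    """
    buffers = {fd: b"" for fd in sources}
    open_fds = set(sources)
    while open_fds:
        # Block until at least one pipe has data or has reached EOF.
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            chunk = os.read(fd, 4096)
            if chunk:
                buffers[fd] += chunk
                # Keep any trailing partial line buffered for the next read.
                *lines, buffers[fd] = buffers[fd].split(b"\n")
                for line in lines:
                    handle(sources[fd], line.decode())
            else:
                # EOF: flush whatever is left and stop watching this pipe.
                open_fds.discard(fd)
                if buffers[fd]:
                    handle(sources[fd], buffers[fd].decode())


def demo():
    # Two pipes stand in for the apt status channel and the debconf
    # command channel; the messages are illustrative samples only.
    status_r, status_w = os.pipe()
    command_r, command_w = os.pipe()
    os.write(status_w, b"pmstatus:zip:50.0:Unpacking zip\n")
    os.write(command_w, b"INPUT high example/question\nGO\n")
    os.close(status_w)
    os.close(command_w)

    seen = []
    serialise(
        {status_r: "apt-status", command_r: "debconf"},
        lambda label, line: seen.append((label, line)),
    )
    os.close(status_r)
    os.close(command_r)
    return seen


if __name__ == "__main__":
    for label, line in demo():
        print(f"{label}: {line}")
```

Because every message passes through the one loop, only a single writer ever talks to the debconf frontend. The order between the two sources is not fixed, but each line arrives whole and lines from the same source stay in order.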

Since this is a race, I cannot provide cast-iron instructions on reproducing the bug. However, installing several times (ideally automated with a fresh disk) on a multi-processor system should be sufficient to hit it eventually.

The only code outside the installer that uses debconf-apt-progress is tasksel. Thus, unless one is using tasksel regularly (which is relatively rare), this patch is not likely to have any effect on upgraders. If it goes wrong, the consequences are likely to be similar to the original bug, i.e. hangs in the pkgsel step of d-i.

Kevin Kubasik (kkubasik) wrote :

Assigning to the ubuntu installer, hopefully someone there can help you!

JosefK (josefk) wrote :

Just tried with the daily (30th) install CD. The installation hung at "Retrieving file 813 of 813". These are the tails of dpkg related logs, without timestamps:

/target/var/log/dpkg.log:
install xresprobe <none> 0.4.24
status half-installed xresprobe 0.4.24
status unpacked xresprobe 0.4.24
status unpacked xresprobe 0.4.24
status unpacked xresprobe 0.4.24
status half-configured xresprobe 0.4.24
status installed xresprobe 0.4.24

/target/var/log/aptitude
[INSTALL] zip
========================================

Log complete.

----

I'm going to try an install from the desktop CD. The 'nv' drivers on there are giving me grief, but I think I can get to a fullscreen terminal before X hangs the system.

Colin Watson (cjwatson) wrote :

For future reference, Kevin, bugs about the alternate install CD should go to 'debian-installer' (not 'ubiquity') unless you have more accurate information. I'll set this to pkgsel for now, as that's the component responsible for this stage of the installation.

Josef, I've had reports of this stage being very slow, but getting there eventually. How long did you leave it?

JosefK (josefk) wrote :

About 10 minutes the first time, without checking from the terminal.

The last run, which hung at "Retrieving file 813 of 813", stayed there for about 15 minutes. `ps aux` showed all the installer-related processes to be sleeping, but without strace on the installer CD I can't figure out why.

I have similar problems here. Alternate install CD hangs when starting to install/load packages. It hangs for a while and then gives me an error about downloading libgcrypt.

The edgy desktop amd64 cd just hangs at a black screen after usplash has timed out or something (usplash is black/white but that is a known bug). Maybe this is unrelated...

Using i386 desktop cd works fine and boots up to the desktop with networking working fine.

Hardware: AMD64 3200+, ATI graphics

I just tried three times. The i386 desktop cd also hangs on the installer.

At some point X just becomes utterly unresponsive. I can move the mouse around but nothing responds. Keyboard is dead too. Only option is a hard reboot. I also tried using the vesa driver (default is "ati"), but it eventually crashed anyway.

I opened a separate bug, 68867, for my issue. I guessed it was unrelated when I realised that "Core 2" was referring to the new dual-core Intel processor while I have a single-core amd64. Feel free to bash me if I'm a block head :-)

JosefK (josefk) wrote :

Mikkel - I assumed this problem would be caused by Core 2 Duo brokenness, I'll happily amend the title of this bug to make it more general if needed.

I imagine the causes of our problems are similar.

JosefK: Yeah, our symptoms are similar, but I am not sure they are from the same issue. I think we should leave things as they are until someone with more expertise can triage the bugs... At least I know _very_ little about these issues.

Anders Olsson (anders-anderso) wrote :

I also had problems using the installer on my Core 2 Duo, Dell D620. I don't remember the exact symptoms because it was a while ago. One interesting thing to note is that I think I tried to install four times and on the fourth time it succeeded. The first three attempts all failed at different stages and with different outcomes. I think the machine froze completely at least once, but I don't remember the details.

This was the release candidate of Edgy and on a Core 2 Duo 2.00 GHz, Nvidia Quadro graphics, bluetooth and Intel wireless.

Colin Watson (cjwatson) wrote :

From the linked Debian bug, it sounds like this may be specific to multi-CPU machines. Could somebody who can reproduce this do so with DEBCONF_DEBUG=developer added to the kernel command line, and post the resulting /var/log/syslog from after the hang? Doing this with Ubuntu 7.04 would be ideal, but Ubuntu 6.10 will do if that's all that's possible. Thanks!

Changed in pkgsel:
assignee: nobody → kamion
importance: Undecided → High
status: Unconfirmed → Needs Info
Colin Watson (cjwatson) wrote :

I *believe* that I've finally got this fixed. However, I never could reproduce it myself, and the change to debconf-apt-progress verges on a rewrite of its main loop, so I would very much appreciate testing by anyone who could reproduce this. Tomorrow's daily build of Gutsy's alternate images, or Tribe 4 if you prefer, should include this fix.

debconf (1.5.14) unstable; urgency=low

  [ Colin Watson ]
  * Retry flock() on EINTR. Failing that, print the errno if flock() fails
    so that we have a better chance of working out why.
  * Install Python confmodule for python2.5 as well.
  * Add confmodule bindings for the DATA command.
  * Somebody looking at confmodule(3) probably actually wants
    debconf-devel(7). Add a reference in SEE ALSO.
  * Make sure that apt status commands and debconf protocol commands under
    debconf-apt-progress are properly interleaved. Closes: #425397

  [ Debconf Translations ]
  * Marathi added. Closes: #416805
  * Basque updated. Closes: #418897

  [ Programs Translations ]
  * Marathi added. Closes: #416805
  * Punjabi added. Closes: #427327
  * Basque updated. Closes: #418902
  * Esperanto added. Closes: #428275

  [ Joey Hess ]
  * Increase selectspacer to 13 for dialog. May be needed due to changes in
    new versions of dialog.
  * Update url to web site in README.

  [ Trent Buck ]
  * Fix bash_completion syntax. Closes: #425676

 -- Colin Watson <email address hidden> Wed, 25 Jul 2007 14:58:39 +0100

Changed in pkgsel:
status: Incomplete → Fix Released
Changed in debconf:
status: New → Fix Released
Colin Watson (cjwatson) wrote :

This causes frequent hangs during installation. If the fix is confirmed to work well in Gutsy, I'd like to propose it for Ubuntu 6.06.2.

Changed in debconf:
assignee: nobody → kamion
importance: Undecided → High
status: New → Confirmed
Colin Watson (cjwatson) wrote :

I have posted modified versions of the i386 alternate install CDs for Dapper, Edgy, and Feisty here:

  http://cdimage.ubuntu.com/custom/20070805-bug62986/

I would very much appreciate it if anyone affected by this bug could attempt installations from these and report whether they fix the hangs you reported. I would *particularly* appreciate testing of the Dapper image, as my ability to get this fix into Ubuntu 6.06.2 (due in about a month's time, but the fix needs to go into the archive this week) is directly dependent on successful testing by affected users.

Please report successes or failures at fixing the specific problem in this bug here. Please report any other problems as separate bugs.

Thanks in advance!

Changed in debconf:
status: Confirmed → In Progress
Martin Pitt (pitti) wrote :

Just for the record:

Without having scrutinized the patch in detail, this looks promising to me; we'll test it for UP during the tribe4 release, and since it does not change the behaviour of existing dapper installations, the "breaks the world" potential is very low.

I propose to upload this to dapper-proposed once the Tribe 4 tests have succeeded, so that we can test both the new debconf in -proposed and the test ISOs in parallel.

Colin Watson (cjwatson) wrote :

Uploaded to dapper-proposed.

Colin Watson (cjwatson) wrote :

(I expect Martin will only actually let it in after Tribe 4.)

Eli Collins (elicollins) wrote :

The patch works for me. I tested automated 2-way SMP installations of the new dapper and feisty server ISOs: on dapper, 88 installs have completed successfully; on feisty, 137. I've started 4-way installs of both and have not seen any issues yet.
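
As a rough sanity check on these numbers, the short Python sketch below works out how likely such clean runs would be if the bug were still present. It assumes the anecdotal 30% hang rate quoted in the bug description applies per install and that installs are independent. Both are assumptions for illustration, not measurements from these test runs, and the function name `chance_of_clean_run` is made up for the example.

```python
def chance_of_clean_run(hang_rate, installs):
    """Probability that `installs` independent installs all complete,
    given that each one hangs with probability `hang_rate`."""
    return (1.0 - hang_rate) ** installs


if __name__ == "__main__":
    # Install counts are the ones reported above; 0.30 is the assumed rate.
    for release, installs in (("dapper", 88), ("feisty", 137)):
        p = chance_of_clean_run(0.30, installs)
        print(f"{release}: {installs} clean installs, probability {p:.1e}")
```

Under those assumptions the chance of seeing this many successes with the race still present is vanishingly small, which is consistent with the fix working. A much lower real hang rate would weaken that conclusion.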

Martin Pitt (pitti) wrote :

Accepted into dapper-proposed, please go ahead with QA testing.

Tests should include:
 * Verify that the installation hang is fixed (with the test ISOs that Colin provided)
 * Install a few packages in dapper which use debconf, to check for regressions.

Changed in debconf:
status: In Progress → Fix Committed
Colin Watson (cjwatson) wrote :

debconf (1.4.72ubuntu10) dapper-proposed; urgency=low

  * Backport from trunk, fixing a race leading to occasional hangs during
    installation (LP: #62986):
    - Make sure that apt status commands and debconf protocol commands under
      debconf-apt-progress are properly interleaved. Closes: #425397

 -- Colin Watson <email address hidden> Tue, 07 Aug 2007 10:09:04 +0100

Changed in debconf:
status: Fix Committed → Fix Released
Martin Pitt (pitti) on 2007-08-10
Changed in debconf:
status: Fix Released → Fix Committed
Martin Pitt (pitti) wrote :

I tested the dapper-proposed debconf on a single-processor machine with some packages which make heavy use of debconf (postfix and libnss-ldap) and did not see any regression.

Martin Pitt (pitti) wrote :

Considering this verification-done after Eli's extensive test series and my own regression tests.

Martin Pitt (pitti) wrote :

Copied to dapper-updates.

Changed in debconf:
status: Fix Committed → Fix Released
ChrisKelley (ckelley) wrote :

Hi all, I just found this via Google. Was attempting a net install to a Pentium 4 I just got. The first failure bricked the box. Now I set up pxe server and it has frozen again. First time was at 5%, now at 6%. Was using ubuntu gutsy alt cd and this time, xubuntu alt cd from the pxe server. Not real sure how to go forward. Will keep trying. Next, will try to tell it to read from the CD-ROM (which the bios/mo-board cannot boot from). If there is a way I can give more info, please reply.

Oh, wait... Wow, 40~50 minutes after it froze at 6%, it picked up again. (literally, just this instant) The first time, I selected LAMP server and some stuff, this time only Xubuntu desktop. Looks like it's running now.
