Possible local DoS: load increases abnormally when running BOINC client

Bug #245458 reported by T-Bone
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description: Ubuntu 8.04.1
Release: 8.04

Linux dogma 2.6.24-19-server #1 SMP Wed Jun 18 14:44:47 UTC 2008 x86_64 GNU/Linux

boinc-client:
  Installed: 5.10.45-1ubuntu1
  Candidate: 5.10.45-1ubuntu1
  Version table:
 *** 5.10.45-1ubuntu1 0
        500 http://se.archive.ubuntu.com hardy/universe Packages
        100 /var/lib/dpkg/status

linux-image-2.6.24-19-server:
  Installed: 2.6.24-19.41
  Candidate: 2.6.24-19.41
  Version table:
 *** 2.6.24-19.41 0
        500 http://fr.archive.ubuntu.com hardy-updates/main Packages
        100 /var/lib/dpkg/status
     2.6.24-19.36 0
        500 http://security.ubuntu.com hardy-security/main Packages

Not sure which package is at fault (boinc or maybe the kernel?). Here are the symptoms:

I upgraded from dapper to hardy. In dapper, boinc had been running Rosetta and WCG just fine for years. After the upgrade, I realized that the load was going crazily high (I stopped it at 5+; it's a 2-way machine) and the machine became unresponsive when boinc was running. I performed a reset of all projects and started again with just WCG first. At first things went OK, but after a little while (~5-10 minutes) the load average would suddenly go nuts again (i.e. way over the expected 2.00). Here's a capture of top at this time:

top - 11:23:43 up 3 days, 17:42, 2 users, load average: 2.95, 2.40, 1.30
Tasks: 105 total, 4 running, 101 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 12.0%sy, 63.0%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 75.0%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1028352k total, 1008036k used, 20316k free, 63276k buffers
Swap: 779144k total, 128k used, 779016k free, 552288k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21262 boinc 39 19 277m 86m 1076 R 100 8.6 4:06.32 wcg_dddt_autodo
21265 boinc 39 19 34316 29m 1736 R 100 2.9 4:06.72 wcg_hcc1_img_6.
21502 root 20 0 10652 1100 824 R 0 0.1 0:00.05 top
    1 root 20 0 4000 928 660 S 0 0.1 0:01.02 init
    2 root 15 -5 0 0 0 S 0 0.0 0:00.00 kthreadd
    3 root RT -5 0 0 0 S 0 0.0 0:00.27 migration/0
    4 root 15 -5 0 0 0 S 0 0.0 0:00.27 ksoftirqd/0
    5 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/0
    6 root RT -5 0 0 0 S 0 0.0 0:00.28 migration/1
    7 root 15 -5 0 0 0 S 0 0.0 0:00.20 ksoftirqd/1

Note the incoherent "25% idle" on both CPUs by the way.
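That 25% idle figure can be cross-checked against the kernel's raw counters in /proc/stat, which is where top gets its numbers anyway. A minimal sketch (Linux-only; the helper name `cpu_idle_fractions` is mine, not part of any tool mentioned above):

```python
import time

def cpu_idle_fractions(interval=1.0):
    """Sample /proc/stat twice and return the per-CPU idle fraction
    over the interval, computed from raw jiffy counters."""
    def snapshot():
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                # per-CPU lines look like "cpu0 ...", "cpu1 ..."; skip the
                # aggregate "cpu " line
                if line.startswith("cpu") and line[3].isdigit():
                    fields = line.split()
                    vals = list(map(int, fields[1:]))
                    # the 4th counter (index 3) is idle time
                    stats[fields[0]] = (vals[3], sum(vals))
        return stats

    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    return {
        cpu: (after[cpu][0] - before[cpu][0])
             / max(1, after[cpu][1] - before[cpu][1])
        for cpu in after
    }

if __name__ == "__main__":
    with open("/proc/loadavg") as f:
        load1 = f.read().split()[0]
    print("1-min load:", load1)
    for cpu, frac in sorted(cpu_idle_fractions().items()):
        print(f"{cpu}: {frac:.0%} idle")
```

If two compute-bound tasks are genuinely pinned at 100% CPU each, the idle fractions from the raw counters should be near zero; a persistent 25% here would implicate the kernel's accounting rather than top's arithmetic.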

Then I stopped, and tried with Rosetta, running only one thread. There again, everything went fine for a little while, until the load exploded again (above the expected 1.00):

top - 12:02:19 up 3 days, 18:20, 2 users, load average: 2.63, 1.56, 0.90
Tasks: 107 total, 5 running, 102 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 75.0%ni, 25.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 1028352k total, 1013568k used, 14784k free, 62624k buffers
Swap: 779144k total, 136k used, 779008k free, 326464k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
22553 boinc 39 19 312m 209m 12 R 100 20.9 8:08.59 rosetta_beta_5.
    1 root 20 0 4000 928 660 S 0 0.1 0:01.02 init
    2 root 15 -5 0 0 0 S 0 0.0 0:00.00 kthreadd
    3 root RT -5 0 0 0 S 0 0.0 0:00.27 migration/0
    4 root 15 -5 0 0 0 S 0 0.0 0:00.27 ksoftirqd/0
    5 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/0
    6 root RT -5 0 0 0 S 0 0.0 0:00.28 migration/1
    7 root 15 -5 0 0 0 S 0 0.0 0:00.21 ksoftirqd/1
    8 root RT -5 0 0 0 S 0 0.0 0:00.00 watchdog/1
    9 root 15 -5 0 0 0 R 0 0.0 0:00.98 events/0
   10 root 15 -5 0 0 0 S 0 0.0 0:00.61 events/1
   11 root 15 -5 0 0 0 S 0 0.0 0:00.00 khelper
   44 root 15 -5 0 0 0 S 0 0.0 0:00.27 kblockd/0

Note again the weird split in the Cpu0 line.

ia32-libs can probably be ruled out as the cause of the problem: rosetta is native 64-bit (WCG is not, afaicr).

I also tried running the boinc client with "SCHEDULE=0", in case this was a bug in the IDLEPRIO policy. Same symptoms.
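Whatever the SCHEDULE knob ends up selecting (presumably the IDLEPRIO policy above maps to the kernel's SCHED_IDLE class, though that's an assumption about the init script), the policy a running task actually got can be queried from the kernel directly. A minimal Python sketch (Linux-only; `policy_of` is a made-up helper name):

```python
import os

# Map the common Linux policy constants to readable names.
POLICIES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_BATCH: "SCHED_BATCH",
    os.SCHED_IDLE:  "SCHED_IDLE",   # what IDLEPRIO presumably selects
    os.SCHED_FIFO:  "SCHED_FIFO",
    os.SCHED_RR:    "SCHED_RR",
}

def policy_of(pid):
    """Return the scheduling policy name for the given PID."""
    return POLICIES.get(os.sched_getscheduler(pid), "unknown")

if __name__ == "__main__":
    # Query our own process; substitute a boinc worker PID (e.g. from top)
    # to check what policy the science apps are really running under.
    print(policy_of(os.getpid()))
```

Running this against the wcg/rosetta PIDs from the top captures would confirm whether SCHEDULE=0 actually moved them off SCHED_IDLE, i.e. whether the policy can be ruled out as a factor.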

HTH

Revision history for this message
T-Bone (varenet) wrote :

I could reproduce this with vanilla 6.2.14 x86_64. It arguably took a little longer (about 25 minutes) for the issue to show up, but it eventually did, with the same symptoms (including the bogus 25% idle report on the working cpu(s) in top).

I would thus suspect a kernel bug (considering that I'm not aware of any such severe b0rkenness in the vanilla boinc client, which I'm reliably using elsewhere, and also considering that the bogus report from "top" points toward some kernel breakage).

HTH

Revision history for this message
T-Bone (varenet) wrote :

More details:
- still happening with linux-image-2.6.24-19-server 2.6.24-19.41 x86_64 and boinc-client 5.10.45-1ubuntu1
- not reproducible on ia64 linux-image-2.6.24-19-mckinley 2.6.24-19.41 and boinc-client 5.10.45-1ubuntu1
Both tests with the same app (simap).

I unfortunately don't have another x86_64 machine running Ubuntu hardy to test this, but I'm fairly convinced that all of this points toward an x86_64 kernel bug. I'll try booting a vanilla kernel sometime soon.

Revision history for this message
T-Bone (varenet) wrote :

Confirmed kernel bug.

Not reproducible with Debian kernel 2.6.26-1-amd64_2.6.26-4, although the "top" glitch (showing 25% idle on each CPU when they're (hopefully) not) is still there.

Revision history for this message
T-Bone (varenet) wrote :

I have confirmed that this bug is a kernel one. It doesn't occur when running a different kernel.

T-Bone (varenet)
description: updated
Revision history for this message
kernel-janitor (kernel-janitor) wrote :

Hi T-Bone,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal)? It will automatically gather and attach updated debug information to this report.

apport-collect -p linux-image-`uname -r` 245458

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
T-Bone (varenet) wrote :

This box (a deployed server, fwiw) is running Ubuntu Hardy LTS. Trying a development release is clearly out of the question. I've already given up on using Ubuntu kernels anyway, so never mind.

BTW, you have my congrats: one year before the first reply to a bug report, that's a record I wouldn't be proud of.

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi T-Bone,

Indeed the delayed response to your bug report was not acceptable and is something we're actively trying to get better at. I'm sorry to hear you are no longer running the Ubuntu kernels. For now I'll go ahead and close this bug. Thanks.

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix