cpu usage does not match sum of per-process usage

Bug #1193073 reported by Bryan Quigley
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Dave Chiluk

Bug Description

Ubuntu 12.04. Already tested with kernel 3.10 as well.

Steps:
1. Get on a system with 2 CPUs that has netperf and top installed (need m1.medium)
2. Stop any netserver that is already running
3. sudo taskset -c 0 netserver
4. taskset -c 1 netperf -H localhost -l 3600 -t TCP_RR & (start netperf pinned to cpu1)
5. Run top, press 1 for multiple CPUs to be separated
6. Observe that the per-process numbers and the per-CPU % do not add up (they are off by roughly 3-7%)

The numbers also do not add up correctly when viewed directly from /proc
Bonus. Run htop to see a completely different set of seemingly wrong results.
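
For anyone wanting to check the /proc side by hand, here is a minimal sketch of how a per-CPU busy percentage can be derived from two samples of a `cpuN` line in /proc/stat. The field order follows proc(5); the helper names and the sample numbers are made up for illustration, and top's own calculation may differ in detail.

```python
# Field order of a "cpuN" line in /proc/stat, as documented in proc(5).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")

def parse_cpu_line(line):
    """Turn one 'cpuN ...' line from /proc/stat into (name, {field: ticks})."""
    name, *values = line.split()
    return name, dict(zip(FIELDS, map(int, values)))

def busy_percent(before, after):
    """Percentage of elapsed ticks that were neither idle nor iowait."""
    delta = {k: after[k] - before[k] for k in before}
    # Guest time is already counted inside user/nice, so keep it out of the total.
    total = sum(v for k, v in delta.items() if k not in ("guest", "guest_nice"))
    idle = delta["idle"] + delta.get("iowait", 0)
    return 100.0 * (total - idle) / total if total else 0.0

# Hypothetical samples taken a couple of seconds apart (made-up numbers).
_, t0 = parse_cpu_line("cpu1 1000 0 500 8000 0 0 300 0 0 0")
_, t1 = parse_cpu_line("cpu1 1040 0 560 8070 0 0 330 0 0 0")
print(busy_percent(t0, t1))
```

Note that the softirq column counts toward the busy figure for the CPU even though none of it shows up in any process's utime or stime.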

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Dave Chiluk (chiluk)
Changed in linux (Ubuntu):
assignee: nobody → Dave Chiluk (chiluk)
Dave Chiluk (chiluk) wrote :

I have verified the above results in a virtual machine. netperf and netserver were both showing 41-42% CPU usage, while the per-CPU usage was showing 63-65%.

I also verified this with 3.10-rc6.

I also ran a test with
taskset -c 1 nice -n 19 stress -c 1
running alongside the above netperf. The CPU time reported for the netperf task dropped to 30%, while the CPU usage of netserver stayed at 50%. This leads me to believe that the per-process usage is being reported higher than it should be.

I also ran a test to verify that top correctly reports the values exposed in /proc/:
while true; do cat /proc/stat >> /tmp/stat; cat /proc/[pid of netserver]/stat >> /tmp/stat; cat /proc/[pid of netperf]/stat >> /tmp/stat; sleep 2s; done
Attached is the output from running the above
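
As a rough sketch of how the per-process lines in such a log can be turned into a percentage: utime and stime are fields 14 and 15 of /proc/[pid]/stat (per proc(5)), counted in clock ticks. The function names, the `hz` parameter, and the sample lines below are illustrative assumptions, not taken from the attachment.

```python
import os

def parse_pid_stat(text):
    """Return (utime, stime) in clock ticks from the text of /proc/[pid]/stat."""
    # The comm field is wrapped in parentheses and may itself contain spaces
    # or ')', so cut at the last ')' before splitting on whitespace.
    rest = text[text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]), int(rest[12])

def process_percent(before, after, interval_s, hz=None):
    """CPU% of one process between two (utime, stime) samples."""
    hz = hz or os.sysconf("SC_CLK_TCK")  # ticks per second, commonly 100
    used = (after[0] - before[0]) + (after[1] - before[1])
    return 100.0 * used / (hz * interval_s)

# Hypothetical samples two seconds apart (made-up numbers).
s0 = parse_pid_stat("1234 (net server) S 1 1234 1234 0 -1 4194560 100 0 0 0 500 900 0 0 20 0 1 0")
s1 = parse_pid_stat("1234 (net server) S 1 1234 1234 0 -1 4194560 100 0 0 0 510 974 0 0 20 0 1 0")
print(process_percent(s0, s1, 2.0, hz=100))
```

Summing this figure over the processes pinned to a CPU and comparing it with the busy percentage of the matching `cpuN` line in /proc/stat is one way to reproduce the mismatch described in this report.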

Dave Chiluk (chiluk) wrote :

The raw /proc data show that top is reporting the same numbers the kernel exposes, so the issue lies somewhere in the scheduler accounting.

Chris J Arges (arges)
Changed in linux (Ubuntu):
importance: Undecided → Medium
Dave Chiluk (chiluk) wrote :

This issue was brought up on the kernel mailing list:

https://lkml.org/lkml/2013/6/20/581

The issue is partly due to the massive number of soft interrupts generated by this job and the frequency at which top samples. Increasing the sampling rate should improve accuracy, but more frequent sampling steals time from the processes themselves. Unfortunately, this type of job exposes a limitation of the kernel's accounting, which cannot be fixed without hurting performance or significantly rewriting the kernel accounting subsystem in a way that has not yet been invented.
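
To illustrate why sampling-based accounting can drift for a job like this, here is a toy model (not the kernel's actual code). It assumes a workload that repeats a short task burst, a softirq burst, and some idle time, and an accounting tick that charges the whole tick to whatever is on the CPU at that instant. All durations are invented for the example.

```python
def state_at(t_us, task_us, softirq_us, period_us):
    """What the CPU is doing at time t_us in the repeating toy workload."""
    phase = t_us % period_us
    if phase < task_us:
        return "task"
    if phase < task_us + softirq_us:
        return "softirq"
    return "idle"

def sampled_shares(task_us, softirq_us, period_us, tick_us, duration_us):
    """Share of ticks charged to each state when only tick instants are observed."""
    counts = {"task": 0, "softirq": 0, "idle": 0}
    for t in range(0, duration_us, tick_us):
        counts[state_at(t, task_us, softirq_us, period_us)] += 1
    n = sum(counts.values())
    return {k: 100.0 * v / n for k, v in counts.items()}

# True split in this toy workload: 40% task, 25% softirq, 35% idle.
# A tick that lines up with the workload period always lands in the task burst:
print(sampled_shares(40, 25, 100, tick_us=4000, duration_us=400_000))
# A tick that drifts across the period recovers the true split:
print(sampled_shares(40, 25, 100, tick_us=4003, duration_us=400_300))
```

In the aligned case every tick is charged to the task even though it only runs 40% of the time, which is the kind of over-reporting a sampling scheme cannot rule out when bursts are much shorter than the tick.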

As a result we will be closing this bug within the next 4 days if there are no objections.

Changed in linux (Ubuntu):
status: Confirmed → In Progress
Bryan Quigley (bryanquigley) wrote :

Closing as Invalid (really a Won't Fix, though; see the comment above)

Changed in linux (Ubuntu):
status: In Progress → Invalid