Comment 8 for bug 1023214

Stephen Croll (stephen-d-croll) wrote :

FYI: There is a pair of kernel commits that *might* solve this
issue, at least for newer kernels. The first is in Linus's tree
(as well as linux-next); the second is currently only in
linux-next:

[scroll@scroll-kernel-vm linux-next]$ git remote -v
origin http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (fetch)
origin http://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git (push)

[scroll@scroll-kernel-vm linux-next]$ git log -n1 bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
commit bea6832cc8c4a0a9a65dd17da6aaa657fe27bc3e
Author: Stanislaw Gruszka <email address hidden>
Date: Wed Aug 8 11:27:15 2012 +0200

    sched: fix divide by zero at {thread_group,task}_times

    On architectures where cputime_t is 64 bit type, is possible to trigger
    divide by zero on do_div(temp, (__force u32) total) line, if total is a
    non zero number but has lower 32 bit's zeroed. Removing casting is not
    a good solution since some do_div() implementations do cast to u32
    internally.

    This problem can be triggered in practice on very long lived processes:

      PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
       #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
       #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
       #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
       #3 [ffff880472a51cd0] die at ffffffff8100f26b
       #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
       #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
       #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
          [exception RIP: thread_group_times+0x56]
          RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
          RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
          RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
          RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
          R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
          R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
          ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
       #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
       #8 [ffff880472a51f40] sys_times at ffffffff81088524
       #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
          RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
          RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
          RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
          RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
          R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
          R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
          ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: <email address hidden>
    Signed-off-by: Stanislaw Gruszka <email address hidden>
    Signed-off-by: Peter Zijlstra <email address hidden>
    Link: http://<email address hidden>
    Signed-off-by: Thomas Gleixner <email address hidden>
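
The failure condition described in the commit above can be modelled in a few lines. This is only a sketch in Python, not kernel code: the helper names are made up for illustration, and the u32 cast is approximated by masking off the upper 32 bits.

```python
# Sketch only: models the u32 truncation described in the commit message.
# The names below are illustrative and do not appear in the kernel source.
U32_MASK = 0xFFFFFFFF


def truncated_divisor(total):
    """Model the (__force u32) cast: keep only the low 32 bits of total."""
    return total & U32_MASK


def would_divide_by_zero(total):
    """True when total is nonzero but its low 32 bits are all zero."""
    return total != 0 and truncated_divisor(total) == 0


if __name__ == "__main__":
    # Show the truncated divisor for a few 64-bit totals.
    for total in (0x49970000, 0x100000000, 0x200000000):
        print(hex(total), hex(truncated_divisor(total)), would_divide_by_zero(total))
```

Any total that is an exact multiple of 2^32 hits the bad case, which fits the commit's remark that this shows up on very long lived processes.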

[scroll@scroll-kernel-vm linux-next]$ git log -n1 62188451f0d63add7ad0cd2a1ae269d600c1663d
commit 62188451f0d63add7ad0cd2a1ae269d600c1663d
Author: Frederic Weisbecker <email address hidden>
Date: Sat Jan 26 17:19:42 2013 +0100

    cputime: Avoid multiplication overflow on utime scaling

    We scale stime, utime values based on rtime (sum_exec_runtime
    converted to jiffies). During scaling we multiple rtime * utime,
    which seems to be fine, since both values are converted to u64,
    but it's not.

    Let assume HZ is 1000 - 1ms tick. Process consist of 64 threads,
    run for 1 day, threads utilize 100% cpu on user space. Machine
    has 64 cpus.

    Process rtime = utime will be 64 * 24 * 60 * 60 * 1000 jiffies,
    which is 0x149970000. Multiplication rtime * utime result is
    0x1a855771100000000, which can not be covered in 64 bits.

    Result of overflow is stall of utime values visible in user
    space (prev_utime in kernel), even if application still consume
    lot of CPU time.

    A solution to solve this is to perform the multiplication on
    stime instead of utime. It's easy to grow the utime value fast
    with a CPU bound thread in userspace for example. Now we assume
    that doing so with stime is much harder. In most cases a task
    shouldn't ever spend much time in kernel space as it tends to
    sleep waiting for jobs completion when they take long to
    achieve. IO is the typical example of that.

    Hence scaling the cputime by performing the multiplication on
    stime instead of utime should considerably reduce the chances of
    an overflow on most workloads.

    This is largely inspired by a patch from Stanislaw Gruszka:
    http://<email address hidden>

    Inspired-by: Stanislaw Gruszka <email address hidden>
    Reported-by: Stanislaw Gruszka <email address hidden>
    Acked-by: Stanislaw Gruszka <email address hidden>
    Signed-off-by: Frederic Weisbecker <email address hidden>
    Cc: Oleg Nesterov <email address hidden>
    Cc: Peter Zijlstra <email address hidden>
    Cc: Andrew Morton <email address hidden>
    Link: http://<email address hidden>
    Signed-off-by: Ingo Molnar <email address hidden>
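
The arithmetic in the second commit message can be checked directly. Again a sketch in Python rather than kernel code: the 64-bit wraparound is modelled with a mask, and scale_via_stime() is a hypothetical helper that only illustrates the idea of multiplying by stime instead of utime. It is an assumption about the shape of the calculation, not the patch itself.

```python
# Sketch only: reproduces the numbers from the commit message.
U64_MASK = (1 << 64) - 1

HZ = 1000                        # 1 ms tick, as assumed in the commit message
THREADS = 64
SECONDS_PER_DAY = 24 * 60 * 60

rtime = utime = THREADS * SECONDS_PER_DAY * HZ   # jiffies after one day
product = rtime * utime                          # exact (Python ints are unbounded)
wrapped = product & U64_MASK                     # what a u64 would actually hold


def fits_in_u64(value):
    """True when value can be stored in 64 bits without wrapping."""
    return 0 <= value <= U64_MASK


def scale_via_stime(utime, stime, rtime):
    """Hypothetical helper: scale stime by rtime / (utime + stime) and
    take utime as the remainder, so the multiplication uses stime."""
    total = utime + stime
    if total == 0:
        raise ValueError("total must be nonzero")
    scaled_stime = (stime * rtime) // total
    return rtime - scaled_stime, scaled_stime


if __name__ == "__main__":
    print(hex(rtime), hex(product), product.bit_length(), hex(wrapped))
```

With the commit's numbers the exact product needs 65 bits, so the top bit is lost on a 64-bit multiply. For the same workload with a small stime (say 1000 jiffies), stime * rtime stays well within 64 bits.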