[Hyper-V] Race condition in SMP bootup

Bug #1508609 reported by Joshua R. Poulson
This bug affects 2 people
Affects                      Status        Importance  Assigned to       Milestone
linux (Ubuntu)               Fix Released  Medium      Unassigned
  Trusty                     Invalid       Medium      Joseph Salisbury
  Vivid                      Fix Released  Medium      Joseph Salisbury
  Wily                       Fix Released  Medium      Joseph Salisbury
linux-lts-utopic (Ubuntu)    Fix Released  Medium      Joseph Salisbury
  Trusty                     Fix Released  Medium      Joseph Salisbury

Bug Description

Please integrate the following upstream commit.

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=dd9d3843755da95f63dd3a376f62b3e45c011210

sched: Fix cpu_active_mask/cpu_online_mask race
There is a race condition in SMP bootup code, which may result
in

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
or
    kernel BUG at kernel/smpboot.c:135!

It can be triggered with a bit of luck in Linux guests running
on busy hosts.

    CPU0                          CPUn
    ====                          ====

    _cpu_up()
      __cpu_up()
                                  start_secondary()
                                    set_cpu_online()
                                      cpumask_set_cpu(cpu,
                                          to_cpumask(cpu_online_bits));
      cpu_notify(CPU_ONLINE)
        <do stuff, see below>
                                      cpumask_set_cpu(cpu,
                                          to_cpumask(cpu_active_bits));

During the various CPU_ONLINE callbacks CPUn is online but not
active. Several things can go wrong at that point, depending on
the scheduling of tasks on CPU0.
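
The window described above can be illustrated with a small userspace model. This is only a sketch, not kernel code: `struct cpu_masks` and the `model_*` helpers are hypothetical stand-ins for `cpu_online_mask`, `cpu_active_mask` and the kernel's setters, and the model assumes CPU numbers below 64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for cpu_online_mask / cpu_active_mask:
 * one bit per CPU, CPU numbers must be below 64. */
struct cpu_masks {
    uint64_t online;
    uint64_t active;
};

/* Models the secondary CPU marking itself online. */
static void model_set_online(struct cpu_masks *m, unsigned cpu)
{
    m->online |= UINT64_C(1) << cpu;
}

/* Models the later step that marks the CPU active. */
static void model_set_active(struct cpu_masks *m, unsigned cpu)
{
    m->active |= UINT64_C(1) << cpu;
}

/* True while the CPU is online but not yet active, which is the
 * state the CPU_ONLINE callbacks can observe. */
static bool model_in_race_window(const struct cpu_masks *m, unsigned cpu)
{
    uint64_t bit = UINT64_C(1) << cpu;

    return (m->online & bit) != 0 && (m->active & bit) == 0;
}
```

Calling `model_set_online()` for CPU 1 without the matching `model_set_active()` leaves `model_in_race_window()` returning true, which corresponds to the state CPU0 may see while it runs the callbacks.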

Variant 1:

  cpu_notify(CPU_ONLINE)
    workqueue_cpu_up_callback()
      rebind_workers()
        set_cpus_allowed_ptr()

  This call fails because it requires an active CPU; rebind_workers()
  ends with a warning:

    WARNING: CPU: 0 PID: 1 at kernel/workqueue.c:4418
    workqueue_cpu_up_callback()
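
  A minimal sketch of why the call fails, assuming the affinity change is rejected whenever the requested mask contains no active CPU. `model_set_cpus_allowed()` is a hypothetical userspace stand-in, not the kernel function, and masks are plain 64-bit words:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the check described above: the new affinity
 * mask must contain at least one active CPU, otherwise the request is
 * rejected and the old mask is kept. */
static int model_set_cpus_allowed(uint64_t *allowed, uint64_t new_mask,
                                  uint64_t active_mask)
{
    if ((new_mask & active_mask) == 0)
        return -EINVAL; /* no active CPU in the requested mask */

    *allowed = new_mask;
    return 0;
}
```

  With only CPU0 active, asking to bind a worker to CPU 1 returns `-EINVAL` in this model; once CPU 1 is also active the same request succeeds.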

Variant 2:

  cpu_notify(CPU_ONLINE)
    smpboot_thread_call()
      smpboot_unpark_threads()
       ..
        __kthread_unpark()
          __kthread_bind()
          wake_up_state()
           ..
            select_task_rq()
              select_fallback_rq()

  The ->wake_cpu of the unparked thread is not allowed, making a call
  to select_fallback_rq() necessary. Then, select_fallback_rq() cannot
  find an allowed, active CPU and promptly resets the allowed CPUs, so
  that the task in question ends up on CPU0.

  When those unparked tasks are eventually executed, they run
  immediately into a BUG:

    kernel BUG at kernel/smpboot.c:135!
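
  The fallback path can be sketched in userspace as follows. This is a simplified model under stated assumptions, not the scheduler's code: `MODEL_NR_CPUS`, `model_first_cpu()` and `model_select_fallback_rq()` are hypothetical names, and "resetting the allowed CPUs" is modeled as widening the mask to every CPU in the model.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4 /* hypothetical CPU count for this model */

/* Returns the lowest CPU set in mask, or -1 if none is set. */
static int model_first_cpu(uint64_t mask)
{
    for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
        if (mask & (UINT64_C(1) << cpu))
            return cpu;
    return -1;
}

/* Picks a CPU that is allowed, online and active. If there is none,
 * the allowed mask is reset to all CPUs of the model and the search
 * is repeated, so a task bound to CPUn can end up on CPU0. */
static int model_select_fallback_rq(uint64_t *allowed, uint64_t online,
                                    uint64_t active)
{
    int cpu = model_first_cpu(*allowed & online & active);

    if (cpu >= 0)
        return cpu;

    *allowed = (UINT64_C(1) << MODEL_NR_CPUS) - 1; /* reset affinity */
    return model_first_cpu(*allowed & online & active);
}
```

  For a thread bound to CPU 1 while CPU 1 is online but not active, the model returns CPU 0 and leaves the thread with a widened mask. The BUG is presumably the per-CPU thread noticing that it runs on the wrong CPU; that reading is an assumption, since the report does not quote the check itself.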

Just changing the order in which the online/active bits are set
(and adding some memory barriers), would solve the two issues
above. However, it would change the order of operations back to
the one before commit 6acbfb96976f ("sched: Fix hotplug vs.
set_cpus_allowed_ptr()"), thus, reintroducing that particular
problem.

Going further back into history, we have at least the following
commits touching this topic:
- commit 2baab4e90495 ("sched: Fix select_fallback_rq() vs cpu_active/cpu_online")
- commit 5fbd036b552f ("sched: Cleanup cpu_active madness")

Together, these give us the following non-working solutions:

  - secondary CPU sets active before online, because active is assumed to
    be a subset of online;

  - secondary CPU sets online before active, because the primary CPU
    assumes that an online CPU is also active;

  - secondary CPU sets online and waits for primary CPU to set active,
    because it might deadlock.

Commit 875ebe940d77 ("powerpc/smp: Wait until secondaries are
active & online") introduces an arch-specific solution to this
arch-independent problem.

Now, go for a more general solution without explicit waiting and
simply set active twice: once on the secondary CPU after online
was set and once on the primary CPU after online was seen.
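
The idea of setting active twice can be sketched with a single worst-case interleaving. This is a toy model, not the patch itself: `model_callbacks_see_active()` is a hypothetical helper, and the flag selects whether the primary CPU also sets the active bit once it has seen the online bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Replays the worst-case ordering: the secondary CPU has set online
 * but not yet active when the primary CPU runs the CPU_ONLINE
 * callbacks. Returns whether those callbacks see the CPU as both
 * online and active. */
static bool model_callbacks_see_active(bool set_active_on_primary)
{
    const uint64_t bit = UINT64_C(1) << 1; /* CPUn is CPU 1 here */
    uint64_t online = 1, active = 1;       /* CPU0 is already up */
    bool seen;

    online |= bit;                         /* CPUn: set online */

    if (set_active_on_primary)
        active |= bit;                     /* CPU0: set active after seeing online */

    /* CPU0: state observed by the CPU_ONLINE callbacks */
    seen = (online & bit) != 0 && (active & bit) != 0;

    active |= bit;                         /* CPUn: set active (harmless repeat) */
    return seen;
}
```

Because setting a bit is idempotent, the second write is harmless whichever CPU gets there first; in this model the callbacks see an active CPU only when the flag is true.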

Fixes: 6acbfb96976f ("sched: Fix hotplug vs. set_cpus_allowed_ptr()")

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1508609

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joshua R. Poulson (jrp) wrote :

This is a request to pick up an upstream submission, and does not require log files.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-da-key kernel-hyper-v
Changed in linux (Ubuntu):
importance: Undecided → Medium
Joseph Salisbury (jsalisbury) wrote :

The patch is already in Wily as the following commit:
0a3b19c sched: Fix cpu_active_mask/cpu_online_mask race

The commit was cc'd to stable. I'll check the progress of it in the upstream stable trees.

Changed in linux (Ubuntu Vivid):
status: New → Confirmed
importance: Undecided → Medium
Changed in linux (Ubuntu Wily):
status: Confirmed → Fix Released
Joseph Salisbury (jsalisbury) wrote :

I confirmed the patch is in the linux-3.19.y-queue branch of the upstream 3.19 kernel. That means it should be in the next upstream 3.19 release, v3.19.8-ckt8. It will then make its way into Vivid when the 3.19.8-ckt8 updates are applied to the Vivid kernel.

This is the commit in upstream 3.19:

commit 74686cbfa06190b821424133d0852815cbe338ad
Author: Jan H. Schönherr <email address hidden>
Date: Wed Aug 12 21:35:56 2015 +0200

    sched: Fix cpu_active_mask/cpu_online_mask race

Joshua R. Poulson (jrp) wrote :

Can we get this backported to lts-trusty as well?

Thanks! --jrp

Changed in linux (Ubuntu Trusty):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Vivid):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Wily):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Vivid):
status: Confirmed → Fix Committed
Joseph Salisbury (jsalisbury) wrote :

The mainline commit that fixes this bug (dd9d3843) states that the bug was introduced by upstream commit 6acbfb969, which was added to mainline in v3.15-rc8.

Can you test the current Trusty (3.13) based kernel to see if commit dd9d3843 actually needs to be included?

Joshua R. Poulson (jrp) wrote :

I don't think this is needed for trusty (by inspection).

Changed in linux (Ubuntu Trusty):
status: In Progress → Invalid
Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released
no longer affects: linux-lts-utopic (Ubuntu Wily)
no longer affects: linux-lts-utopic (Ubuntu Vivid)
Changed in linux-lts-utopic (Ubuntu):
status: New → In Progress
Changed in linux-lts-utopic (Ubuntu Trusty):
status: New → In Progress
Changed in linux-lts-utopic (Ubuntu):
importance: Undecided → Medium
Changed in linux-lts-utopic (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux-lts-utopic (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux-lts-utopic (Ubuntu Trusty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux-lts-utopic (Ubuntu):
status: In Progress → Fix Released
Changed in linux-lts-utopic (Ubuntu Trusty):
status: In Progress → Fix Released