virtio_scsi race can corrupt memory, panic kernel
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Confirmed | Undecided | Jay Vosburgh |
linux (Ubuntu Xenial) | Fix Released | Medium | Unassigned |
Bug Description
There's a race in the virtio_scsi driver (present in all kernels).
That race is inadvertently avoided on most kernels due to a
synchronize_rcu call coincidentally placed in one of the racing code paths.
By happenstance, the set of patches backported to the Ubuntu
4.4 kernel ended up without a synchronize_rcu in the relevant place. The
issue first manifests with:
commit be2a20802abbde0
Author: Jan Kara <email address hidden>
Date: Wed Feb 8 08:05:56 2017 +0100
block: Move bdi_unregister() to del_gendisk()
BugLink: http://
The race can cause a kernel panic due to corruption of a freelist
pointer in a slab cache. The panics occur in arbitrary places as
the failure occurs at an allocation after the corruption of the
pointer. However, the most common failure observed has been within
virtio_scsi itself during probe processing, e.g.:
[ panic backtrace elided: only truncated frame addresses (ffffffff811b0..., ffffffff815aa..., etc.) survive in the report ]
Details on the race:
CPU A:
virtscsi_probe
[...]
__scsi_scan_target
scsi_probe_and_add_lun
scsi_probe_lun
[...]
blk_execute_rq
[ sleeps until the request completes, then returns up to scsi_probe_and_add_lun ]
In order for the race to occur, the wakeup must occur on a CPU other than
CPU B.
After being woken up by the completion of the request, the call
returns up the stack to scsi_probe_and_add_lun, which continues:
__scsi_remove_device
blk_cleanup_queue
[ no longer calls bdi_unregister ]
[ returns up the scan path ]
scsi_target_reap
scsi_target_reap_ref_put
kref_put
kref_sub
scsi_target_reap_ref_release
scsi_target_destroy
->target_destroy() = virtscsi_target_destroy
kfree(tgt) <=== FREE TGT
Note that the removal of the call to bdi_unregister in the Xenial
commit be2a20802abbde ("block: Move bdi_unregister() to del_gendisk()"),
annotated above, is the change that gates whether the issue
manifests or not. The other code change from be2a20802abbde has no effect
on the race.
CPU B:
vring_interrupt
virtscsi_req_done
virtscsi_complete_cmd
scsi_mq_done (via ->scsi_done())
blk_mq_complete_request
__blk_mq_complete_request
[...]
blk_end_sync_rq
complete
[ wake up the task from CPU A ]
After waking the CPU A task, execution returns up the stack, and
virtscsi_complete_cmd then calls atomic_dec(&tgt->reqs)
after returning from the call to ->scsi_done.
If the activity on CPU A after it is woken up (starting at
__scsi_remove_device) completes before CPU B performs its decrement, the
atomic_dec in virtscsi_complete_cmd then modifies a freelist
pointer in the freed slab object that contained tgt. This causes the
system to panic on a subsequent allocation from the per-cpu slab cache.
The call path on CPU B is significantly shorter than that on CPU A
after wakeup, so it is likely that an external event delays CPU B. This
could be either an interrupt within the VM or scheduling delays for the
virtual cpu process on the hypervisor. Whatever the delay is, it is not
the root cause but merely the triggering event.
The virtscsi race window described above exists in all kernels
that have been checked (4.4 upstream LTS, Ubuntu 4.13 and 4.15, and
current mainline as of this writing). However, none of those kernels
exhibit the panic in testing, only the Ubuntu 4.4 kernel after commit
be2a20802abbde.
The reason none of those kernels panic is they all have one thing
in common: an incidental call to synchronize_rcu somewhere in the call
path of CPU A after it is woken up. This causes CPU A to wait for CPU B's
activity, as CPU A will not go on to free the "tgt" memory until after the
RCU grace period passes, which requires waiting for CPU B's activity to
finish. Note that the specific RCU sync call is different between some of
those kernel versions, but all of them have one somewhere deep inside
blk_cleanup_queue. The bdi_unregister function has one (in the call to
bdi_remove_from_list), and its removal from this path is what opens the race
window on the Ubuntu 4.4 kernel.
Resolving the issue can be accomplished by adding an RCU sync
to virtscsi_target_destroy. An alternative is
to use a loop of the format:
+ while (atomic_read(&tgt->reqs))
+         cpu_relax();
but this is higher risk, as the loop is non-terminating in the case
of some other failure.
Changed in linux (Ubuntu):
assignee: nobody → Jay Vosburgh (jvosburgh)
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-done-xenial; removed: verification-needed-xenial
SRU Justification:
Impact:
This issue can cause kernel panics on systems using the
virtio_scsi driver with the affected Ubuntu kernels. The issue manifests
irregularly, as it is timing dependent.
Fix:
The issue is resolved by adding synchronization between the two
code paths that race with one another. The most straightforward fix
is to have the code wait for any outstanding
requests to drain prior to freeing the target structure, e.g.,
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
 	struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+	/* we can race with concurrent virtscsi_complete_cmd */
+	while (atomic_read(&tgt->reqs))
+		cpu_relax();
 	kfree(tgt);
 }
An alternative fix that was considered is to use a synchronize_rcu_expedited
call, as that is the functionality that blocks the race in unaffected kernels.
However, some call paths into virtscsi_target_destroy may hold mutexes that
are not held by the upstream RCU sync calls (which enter via the block layer).
For this reason the more confined fix described above was chosen.
Testcase:
This reproduces on Google Cloud, using the current, unmodified
ubuntu-1404-lts image (with the Ubuntu 4.4 kernel). Using the two attached
scripts, run e.g.
./create_shutdown_instance.sh 100
to create 100 instances. If an instance runs its startup script
successfully, it'll shut itself down right away. So instances that are
still running after a few minutes likely demonstrate this problem.
The issue reproduces easily with n1-standard-4.
create_shutdown_instance.sh:
#!/bin/bash -e
ZONE=us-central1-a
for i in $(seq -w $1); do
  gcloud compute instances create shutdown-experiment-$i \
    --zone="${ZONE}" \
    --image-family=ubuntu-1404-lts \
    --image-project=ubuntu-os-cloud \
    --machine-type=n1-standard-4 \
    --scopes compute-rw \
    --metadata-from-file startup-script=immediate_shutdown.sh &
done
wait
immediate_shutdown.sh:
#!/bin/bash -x
function get_metadata_value() {
  curl -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/$1"
}
readonly ZONE="$(get_metadata_value zone | awk -F'/' '{print $NF}')"
gcloud compute instances delete "$(hostname)" --zone="${ZONE}" --quiet