Comment 2 for bug 1765241

Revision history for this message
Jay Vosburgh (jvosburgh) wrote :

SRU Justification:

Impact:

 This issue can cause system panics of systems using the
virtio_scsi driver with the affected Ubuntu kernels. The issue manifests
irregularly, as it is timing dependent.

Fix:

 The issue is resolved by adding synchronization between the two
code paths that race with one another. The most straightforward fix
is to have the code wait for any outstanding
requests to drain prior to freeing the target structure, e.g.,

--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -762,6 +762,10 @@ static int virtscsi_target_alloc(struct scsi_target *starget)
 static void virtscsi_target_destroy(struct scsi_target *starget)
 {
        struct virtio_scsi_target_state *tgt = starget->hostdata;
+
+ /* we can race with concurrent virtscsi_complete_cmd */
+ while (atomic_read(&tgt->reqs))
+ cpu_relax();
        kfree(tgt);
 }

An alternative fix that was considered is to use a synchronize_rcu_expedited
call, as that is the functionality that blocks the race in unaffected kernels.
However, some call paths into virtscsi_target_destroy may hold mutexes that
are not held by the upstream RCU sync calls (which enter via the block layer).
For this reason the more confined fix described above was chosen.

Testcase:

This reproduces on Google Cloud, using the current, unmodified
ubuntu-1404-lts image (with the Ubuntu 4.4 kernel). Using the two attached
scripts, run e.g.

  ./create_shutdown_instance.sh 100

to create 100 instances. If an instance runs its startup script
successfully, it'll shut itself down right away. So instances that are
still running after a few minutes likely demonstrate this problem.

The issue reproduces easily with n1-standard-4.

create_shutdown_instance.sh:

#!/bin/bash -e

ZONE=us-central1-a

for i in $(seq -w $1); do
  gcloud compute instances create shutdown-experiment-$i \
    --zone="${ZONE}" \
    --image-family=ubuntu-1404-lts \
    --image-project=ubuntu-os-cloud \
    --machine-type=n1-standard-4 \
    --scopes compute-rw \
    --metadata-from-file startup-script=immediate_shutdown.sh &
done

wait

immediate_shutdown.sh:

#!/bin/bash -x

function get_metadata_value() {
  curl -H 'Metadata-Flavor: Google' \
    "http://metadata.google.internal/computeMetadata/v1/instance/$1"
}

readonly ZONE="$(get_metadata_value zone | awk -F'/' '{print $NF}')"
gcloud compute instances delete "$(hostname)" --zone="${ZONE}" --quiet