blkfront driver race, can not attach new volumes

Bug #1326870 reported by Robert C Jennings on 2014-06-05
This bug affects 2 people
Affects              Status         Importance   Assigned to
linux-ec2 (Ubuntu)   Fix Released   Undecided    Unassigned
Lucid                Fix Released   Medium       Stefan Bader

Bug Description

[Impact]

 * If a user detaches a volume before unmounting it, a race in the blkfront driver is hit (the kernel gets stuck detaching the volume) and newly attached volumes are not recognized.

 * Stefan Bader suggested the following patch set to resolve the issue:
   * 0e34582699392d67910bd3919bc8fd9bedce115e
     blkfront: fixes for 'xm block-detach ... --force'
   * 5d7ed20e822ef82117a4d9928b030fa0247b789d
      blkfront: don't access freed struct xenbus_device
   * a66b5aebb7dc9e695dcb4b528906fd398b63f3d9
      blkfront: Clean up vbd release
   * b70f5fa043b318659c936d8c3c696250e6528944
      blkfront: Lock blkfront_info when closing

[Test Case]

This was originally seen with AMI ami-bffa6fd6[0] doing the following:
1. Launch an instance.
2. Attach a new volume to the instance using the API.
3. Mount the volume on the instance.
4. Detach the volume using the API.
5. Wait a few seconds (30 seconds? 60 seconds?).
6. Unmount the volume on the instance.
7. Wait for volume to become available.
8. Delete the volume once it is available and go to step 2.
The problem can be reproduced after about 135 iterations of these steps.
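
The loop above could be automated roughly as follows. This is only a sketch: the `api` and `guest` objects and all of their method names are hypothetical stand-ins for whatever API client and in-instance command runner is used, and the 30 second default wait is a guess, since the report itself is unsure about the delay in step 5.

```python
import time


def reproduce(api, guest, max_iterations=200, detach_wait_s=30.0, sleep=time.sleep):
    """Run the attach/detach loop from the test case.

    `api` and `guest` are hypothetical helper objects (not a real library):
      api.attach_new_volume() -> volume id, api.detach(id),
      api.wait_available(id), api.delete(id)
      guest.device_visible(id) -> bool, guest.mount(id), guest.unmount(id)

    Returns the 1-based iteration in which the instance failed to
    recognize a newly attached volume, or None if that never happened.
    """
    for iteration in range(1, max_iterations + 1):
        volume_id = api.attach_new_volume()   # step 2
        if not guest.device_visible(volume_id):
            # Symptom of the bug: the new volume is not recognized.
            return iteration
        guest.mount(volume_id)                # step 3
        api.detach(volume_id)                 # step 4: detach while still mounted
        sleep(detach_wait_s)                  # step 5
        guest.unmount(volume_id)              # step 6
        api.wait_available(volume_id)         # step 7
        api.delete(volume_id)                 # step 8
    return None
```

A real `guest.device_visible` would probably need to poll for a while before giving up, because a healthy kernel also takes a moment to show a new device.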

[0] That AMI is "lucid server release 20130124 instance-store amd64 us-east-1 ami-bffa6fd6 aki-88aa75e1 paravirtual"

[Regression Potential]

 * Since EC2 images in Lucid are based on a separate branch,
   we can rule out regressions on the generic/server images.
 * Code changes are limited to the xen-blkfront driver and
   to hot-adding/-removing disk images. Only Xen guests
   using PV disks can be affected.
 * So there is potential for introducing new bugs into that
   process, but those should be detected during verification
   testing.
 * I would consider the risk of regressions as low.

[Other Info]

The root causes of the problem are:
 1. User error: the user force-detaches the volume first and unmounts it afterwards.
     Correct usage is to unmount first, then detach.
 2. This 2.6.32 kernel has a race bug in the blkfront driver.

When the race is hit, the instance kernel gets stuck in the detach code and therefore does not recognize newly attached volumes.

In an Ubuntu instance running a self-compiled 2.6.32 kernel with these patches applied, the kernel behaves as expected even with the user error.
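
The user-error half of the root cause can be guarded against by checking that the device is no longer mounted before requesting the detach. The sketch below reads `/proc/mounts`-style text; the function names and the device name in the usage note are illustrative only and do not come from this report.

```python
from typing import List


def mounted_at(device: str, mounts_text: str) -> List[str]:
    """Return the mount points listed for `device` in /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line is: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 2 and fields[0] == device:
            points.append(fields[1])
    return points


def safe_to_detach(device: str, mounts_path: str = "/proc/mounts") -> bool:
    """True if `device` has no entry in the mount table at `mounts_path`."""
    with open(mounts_path) as f:
        return not mounted_at(device, f.read())
```

For example, a wrapper script might call `safe_to_detach("/dev/xvdf")` (device name assumed) and refuse to issue the detach request while it returns `False`. Note that this only matches exact device names; partitions and device aliases would need extra handling.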

$ lsb_release -rd
Description: Ubuntu 10.04.4 LTS
Release: 10.04

$ apt-cache policy linux-ec2
linux-ec2:
  Installed: 2.6.32.350.31
  Candidate: 2.6.32.364.45
  Version table:
     2.6.32.364.45 0
        500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        500 http://security.ubuntu.com/ubuntu/ lucid-security/main Packages
 *** 2.6.32.350.31 0
        100 /var/lib/dpkg/status
     2.6.32.305.6 0
        500 http://us-east-1.ec2.archive.ubuntu.com/ubuntu/ lucid/main Packages


Robert C Jennings (rcj) wrote :

Stefan, could you help me outline the regression potential for these patches? Thank you.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-ec2 (Ubuntu):
status: New → Confirmed
Stefan Bader (smb) on 2014-06-11
Changed in linux-ec2 (Ubuntu):
assignee: nobody → Stefan Bader (smb)
Changed in linux-ec2 (Ubuntu Lucid):
assignee: nobody → Stefan Bader (smb)
Changed in linux-ec2 (Ubuntu):
assignee: Stefan Bader (smb) → nobody
Changed in linux-ec2 (Ubuntu Lucid):
importance: Undecided → Medium
Stefan Bader (smb) on 2014-06-11
Changed in linux-ec2 (Ubuntu Lucid):
status: New → Confirmed
Changed in linux-ec2 (Ubuntu):
status: Confirmed → Fix Released
Stefan Bader (smb) on 2014-06-11
description: updated
Stefan Bader (smb) on 2014-06-11
Changed in linux-ec2 (Ubuntu Lucid):
status: Confirmed → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-ec2 - 2.6.32-366.80

---------------
linux-ec2 (2.6.32-366.80) lucid; urgency=low

  [ Stefan Bader ]

  * Rebased to Ubuntu-2.6.32-62.125
  * Release Tracking Bug
    - LP: #1328287

  [ Upstream Kernel Changes ]

  * blkfront: fixes for 'xm block-detach ... --force'
    - LP: #1326870
  * blkfront: don't access freed struct xenbus_device
    - LP: #1326870
  * blkfront: Clean up vbd release
    - LP: #1326870
  * blkfront: Lock blkfront_info when closing
    - LP: #1326870

  [ Ubuntu: 2.6.32-62.125 ]

  * SAUCE: (no-up) Fix regression introduced by patch, for CVE-2014-3153
    - LP: #1327300
  * [Config] add debian/gbp.conf
  * filter: prevent nla extensions to peek beyond the end of the message
    - LP: #1319561, #1319563
    - CVE-2014-3145
 -- Stefan Bader <email address hidden> Wed, 11 Jun 2014 11:13:31 +0200

Changed in linux-ec2 (Ubuntu Lucid):
status: Fix Committed → Fix Released
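
To check whether a running Lucid EC2 kernel is at or past the fixed package version, the ABI number in the kernel release string can be compared with 366 (from 2.6.32-366.80 above). This is a rough sketch that assumes `uname -r` reports a string of the form `2.6.32-<abi>-ec2`; it returns `None` for anything else rather than guessing.

```python
import re
from typing import Optional

FIXED_ABI = 366  # ABI number of linux-ec2 2.6.32-366.80, taken from this report


def has_blkfront_fix(release: str) -> Optional[bool]:
    """Compare the ABI number of a Lucid EC2 kernel release with FIXED_ABI.

    Returns None when `release` does not look like "2.6.32-<abi>-ec2"
    (the assumed format), because no conclusion can be drawn then.
    """
    match = re.match(r"^2\.6\.32-(\d+)-ec2$", release)
    if match is None:
        return None
    return int(match.group(1)) >= FIXED_ABI
```

On an instance, `has_blkfront_fix(platform.release())` (after `import platform`) applies the check to the running kernel. A self-compiled kernel with the patches applied would not be detected this way.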

The verification of the Stable Release Update for linux-ec2 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
