Race condition in network/deallocate_for_instance() leads to security issue

Bug #1021340 reported by Phil Day on 2012-07-05
258
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Phil Day
Essex
High
Unassigned
nova (Ubuntu)
Undecided
Unassigned
Precise
Undecided
Unassigned

Bug Description

During compute/terminate_instance networking is de-allocated by a call to network/api/deallocate_for_instance(), which is implemented as an rpc.cast to the network manager. This in turn will eventually call network/manager/deallocate_fixed_ip(), which in turn call the compute API to trigger a security group refresh, which will get the instance from the database

However because original call to the network manager is a cast there is a chance that the compute manager will delete the instance record in the DB before the compute API (in the network manager) tries to retrieve the instance. At this point the security group refresh fails (leaving rules in place which are a security risk when the IP is reused), and potentially stopping other aspects of the network deallocation from completing.

This is one of the few places left in the network api where a cast is used, and it seems to me that an rpc.call would be better – the compute manager really needs to know that the network has been properly deallocated and cleaned up before it marks the VM as deleted.

In addition to this race condition, in general if there is an error during network clean up it is better that the VM remains in an error condition so that delete can be retried than for the VM to be deleted with the network left in some indeterminate state with no trigger available to retry the clean up. Hence some exception handling for the call to deallocate_for_instance() is required.

Note that aside from this specific use of a cast there are 4 other casts in the network API:
  add_fixed_ip_to_instance
  remove_fixed_ip_from_instance
  add_network_to_project
  release_floating_ip

Maybe these should all be considered for conversion to rpc.call to avoid race conditions and inconsistencies between the compute and network manager.

Related branches

Vish Ishaya (vishvananda) wrote :

Agreed, lets convert everything to call. We may be sacrificing a little performance, but It will prevent a lot of annoying race conditions like this one.

Changed in nova:
status: New → Triaged
importance: Undecided → High
Changed in nova:
assignee: nobody → Phil Day (philip-day)
status: Triaged → In Progress
Russell Bryant (russellb) wrote :

In the future, when a bug has security implications, please mark it as a security issue so that the vulnerability management team can evaluate it and determine whether the patch should go through embargo and a coordinated release.

security vulnerability: no → yes

Reviewed: https://review.openstack.org/9630
Committed: http://github.com/openstack/nova/commit/34f9d7e974d0e09c723e0a04ed6eecd1b482e77d
Submitter: Jenkins
Branch: master

commit 34f9d7e974d0e09c723e0a04ed6eecd1b482e77d
Author: Phil Day <email address hidden>
Date: Wed Jul 11 09:26:11 2012 +0100

    Convert remaining network API casts to calls

    Fix for Bug 1021340

    During compute/terminate_instance networking is de-allocated by a call to
    network/api/deallocate_for_instance(), which is implemented as an rpc.cast to
    the network manager. This in turn will eventually call
    network/manager/deallocate_fixed_ip(), which in turn call the compute API to
    trigger a security group refresh, which will get the instance from the database

    However because original call to the network manager is a cast there is a chance
    that the compute manager will delete the instance record in the DB before the
    compute API (in the network manager) tries to retrieve the instance. At this
    point the security group refresh fails (leaving rules in place which are a
    security risk when the IP is reused), and potentially stopping otheraspects
    of the network deallocation from completing.

    Changing this from rpc.call to rpc.cast will fix this issue.

    Aside from this specific use of a cast there are 4 other casts in the
    network API:
      add_fixed_ip_to_instance
      remove_fixed_ip_from_instance
      add_network_to_project
      release_floating_ip

    and to avoid other timing issues these will also be converted to calls.

    Change-Id: I5cdcc628293d3e7cf165c5ffe4883f138783f73f

Changed in nova:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/10387
Committed: http://github.com/openstack/nova/commit/219c5ca3bed950b7f36e57e942c91d177ec788d5
Submitter: Jenkins
Branch: stable/essex

commit 219c5ca3bed950b7f36e57e942c91d177ec788d5
Author: Phil Day <email address hidden>
Date: Wed Jul 11 09:26:11 2012 +0100

    Convert remaining network API casts to calls

    Fix for Bug 1021340

    During compute/terminate_instance networking is de-allocated by a call to
    network/api/deallocate_for_instance(), which is implemented as an rpc.cast to
    the network manager. This in turn will eventually call
    network/manager/deallocate_fixed_ip(), which in turn call the compute API to
    trigger a security group refresh, which will get the instance from the database

    However because original call to the network manager is a cast there is a chance
    that the compute manager will delete the instance record in the DB before the
    compute API (in the network manager) tries to retrieve the instance. At this
    point the security group refresh fails (leaving rules in place which are a
    security risk when the IP is reused), and potentially stopping otheraspects
    of the network deallocation from completing.

    Changing this from rpc.call to rpc.cast will fix this issue.

    Aside from this specific use of a cast there are 4 other casts in the
    network API:
      add_fixed_ip_to_instance
      remove_fixed_ip_from_instance
      add_network_to_project
      release_floating_ip

    and to avoid other timing issues these will also be converted to calls.

    Change-Id: I5cdcc628293d3e7cf165c5ffe4883f138783f73f
    (cherry picked from commit 34f9d7e974d0e09c723e0a04ed6eecd1b482e77d)

Thierry Carrez (ttx) on 2012-08-16
Changed in nova:
milestone: none → folsom-3
status: Fix Committed → Fix Released
Dave Walker (davewalker) on 2012-08-24
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/9630
Stable review: https://review.openstack.org/10387

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Launchpad Janitor (janitor) wrote :
Download full text (5.4 KiB)

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Read more...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Thierry Carrez (ttx) on 2012-09-27
Changed in nova:
milestone: folsom-3 → 2012.2
To post a comment you must log in.
This report contains Public Security information  Edit
Everyone can see this security related information.

Other bug subscribers