Launching with source groups under load produces lazy load error

Bug #1018721 reported by Vish Ishaya
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
High
Vish Ishaya
Essex
Fix Released
High
Vish Ishaya
nova (Ubuntu)
Fix Released
Undecided
Unassigned
Precise
Fix Released
Undecided
Unassigned

Bug Description

If I attempt to run many concurrent instances with source security groups I get the following traceback:

2012-06-28 01:34:49 ERROR nova.compute.manager [req-6961da6f-2576-43a6-8f3b-d66f6ade61e9 vish vish] [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] Instance failed to spawn
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] Traceback (most recent call last):
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/compute/manager.py", line 637, in _spawn
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] self._legacy_nw_info(network_info), block_device_info)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/exception.py", line 117, in wrapped
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] temp_level, payload)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/usr/lib/python2.7/contextlib.py", line 24, in __exit__
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] self.gen.next()
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/exception.py", line 92, in wrapped
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] return f(*args, **kw)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 976, in spawn
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] self._create_domain_and_network(xml, instance, network_info)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/virt/libvirt/connection.py", line 1737, in _create_do
main_and_network
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] self.firewall_driver.prepare_instance_filter(instance, network_info)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/virt/firewall.py", line 150, in prepare_instance_filt
er
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] self.add_filters_for_instance(instance)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/virt/firewall.py", line 195, in add_filters_for_insta
nce
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] ipv4_rules, ipv6_rules = self.instance_rules(instance, network_info)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/virt/firewall.py", line 365, in instance_rules
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] instance)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/network/api.py", line 44, in wrapper
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] res = f(self, context, *args, **kwargs)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/network/api.py", line 265, in get_instance_nw_info
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] return self._get_instance_nw_info(context, instance)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/network/api.py", line 271, in _get_instance_nw_info
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] 'rxtx_factor': instance['instance_type']['rxtx_factor'],
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/opt/stack/nova/nova/db/sqlalchemy/models.py", line 74, in __getitem__
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] return getattr(self, key)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/attributes.py", line 168, in __get__
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] return self.impl.get(instance_state(instance),dict_)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/attributes.py", line 453, in get
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] value = self.callable_(state, passive)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] File "/usr/lib/python2.7/dist-packages/sqlalchemy/orm/strategies.py", line 485, in _load_for_state
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] (mapperutil.state_str(state), key)
2012-06-28 01:34:49 TRACE nova.compute.manager [instance: 51720089-48bb-4db7-84c2-a5c2a7a12354] DetachedInstanceError: Parent instance <Instance at 0x47bc8d0> is not bound to a Session; lazy load operation of attribute 'instance_type' cannot proceed

Related branches

Revision history for this message
Vish Ishaya (vishvananda) wrote :

Example script to reproduce issue (on 3 compute host multihost flatdhcp kvm deployment):

#!/bin/bash

for d in $(nova list | cut -f3 -d'|' | grep source); do nova delete ${d}; done
for d in $(nova list | cut -f3 -d'|' | grep rules); do nova delete ${d}; done

while $(nova list | grep source); do
    sleep 5
done

while $(nova list | grep rules); do
    sleep 5
done

nova secgroup-delete sources
nova secgroup-delete rules

nova secgroup-create source source
nova secgroup-add-rule source tcp 22 22 0.0.0.0/0
nova secgroup-add-rule source icmp -1 -1 0.0.0.0/0

nova secgroup-create rules rules

nova secgroup-add-group-rule rules --ip_proto tcp --from_port 22 --to_port 22 source

for d in {1..2}; do nova boot --flavor 2 --image <centos-image-id> --security_groups=source source-${d}; done
for d in {1..6}; do nova boot --flavor 2 --image <centos-image-id> --security_groups=rules rules-${d}; done

Changed in nova:
importance: Undecided → High
Revision history for this message
Vish Ishaya (vishvananda) wrote :

This seems to be some kind of odd interaction with eventlet and sqlalchemy. We have seen it surface before, but something is causing lazy loads to be occasionally lost when there are concurrent db requests going on in greenthreads.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

This patch works around the issue:

diff --git a/nova/network/api.py b/nova/network/api.py
index 96a7a2f..a189a72 100644
--- a/nova/network/api.py
+++ b/nova/network/api.py
@@ -216,9 +216,14 @@ class API(base.Base):

     def get_instance_nw_info(self, context, instance):
         """Returns all network info related to an instance."""
+ try:
+ instance_type = instance['instance_type']
+ except Exception:
+ instance_type = self.db.instance_type_get(
+ context, instance['instance_type_id'])
         args = {'instance_id': instance['id'],
                 'instance_uuid': instance['uuid'],
- 'rxtx_factor': instance['instance_type']['rxtx_factor'],
+ 'rxtx_factor': instance_type['rxtx_factor'],
                 'host': instance['host'],
                 'project_id': instance['project_id']}
         try:

Revision history for this message
Vish Ishaya (vishvananda) wrote :

I really would like to find out what could possibly be causing this though, because I don't see any way for the instance object to be loaded without the join only sometimes.

Changed in nova:
status: New → Triaged
Revision history for this message
Ron Pedde (ron-pedde) wrote :

As a point of interest, this behavior wasn't present in Diablo.

Revision history for this message
Thierry Carrez (ttx) wrote :

Maybe push the workaround in F2 and open a new bug to track the more permanent fix

Changed in nova:
milestone: none → folsom-2
assignee: nobody → Vish Ishaya (vishvananda)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/9297

Changed in nova:
status: Triaged → In Progress
Revision history for this message
Thierry Carrez (ttx) wrote :

Moved to F3 since this is a bit of a corner case, not a regression, and we need F2 out

Changed in nova:
milestone: folsom-2 → folsom-3
Revision history for this message
Mark McLoughlin (markmc) wrote :
Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/essex)

Reviewed: https://review.openstack.org/9642
Committed: http://github.com/openstack/nova/commit/4d74631659b1b6e5b29b763cd808528946cb3721
Submitter: Jenkins
Branch: stable/essex

commit 4d74631659b1b6e5b29b763cd808528946cb3721
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Jul 3 14:50:11 2012 -0700

    Avoid lazy-loading errors on instance_type

    When instances are launched with a source group, the instance_type
    is not lazy_loaded in properly.

    The issue is intermittent because if another query has already loaded
    the instance with a joined instance_type or the instance has not
    yet been expunged from the session, then the load succeeds.

    Therefore this failure only appears under heavy load. Adding an
    explicit joindload fixes the issue.

    fixes bug 1018721

    Change-Id: I450c598c31c416c7be341cbb8cd14e33c6bb2aec
    (cherry picked from commit 2ef0dcfac78468551aac8492269ad856c312dcb5)

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Dave Walker (davewalker)
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed
Revision history for this message
Adam Gandelman (gandelman-a) wrote : Verification report.

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Stable review: https://review.openstack.org/9642

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Revision history for this message
Launchpad Janitor (janitor) wrote :
Download full text (5.4 KiB)

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Read more...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Update Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-3 → 2012.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.