NoMoreFixedIps: Zero fixed ips available. Nova seems leaking them.

Bug #1014769 reported by Jaroslav Henner on 2012-06-18
18
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Critical
Vish Ishaya
Essex
Critical
Vish Ishaya
nova (Ubuntu)
Undecided
Unassigned
Precise
Undecided
Unassigned

Bug Description

I ran Tempest tests several times (about 5 full runs) on setup with 128 IPs:

# nova-manage network list
1 192.168.0.0/25 None 192.168.0.2 8.8.4.4 None None None 843689b3-bd29-4a3b-8354-671f028873b3

Instances fails when building network:

==> /net/srh-07/var/log/nova/network.log <==
2012-06-18 19:16:55 DEBUG nova.rpc.amqp [-] received {u'_context_roles': [u'admin', u'KeystoneServiceAdmin', u'KeystoneAdmin'], u'_msg_id': u'0f8308ec31074423b936f7a0183fe6ff', u'_context_read_deleted': u'no', u'_context_request_id': u'req-1151ed40-a179-4f2c-83f6-98e65c4e06be', u'args': {u'instance_uuid': u'f4bca554-94c3-4890-8be0-863689809348', u'vpn': False, u'requested_networks': None, u'instance_id': 238, u'host': u'srh-07.rhev.lab.eng.brq.redhat.com', u'rxtx_factor': 1.0, u'project_id': u'b037486aa9b2455eb8d071d902ecefbe'}, u'_context_auth_token': '<SANITIZED>', u'_context_is_admin': True, u'_context_project_id': u'b037486aa9b2455eb8d071d902ecefbe', u'_context_timestamp': u'2012-06-18T17:16:54.836088', u'_context_user_id': u'f022923dee7b48bc8d50ba64b304a8ee', u'method': u'allocate_for_instance', u'_context_remote_address': u'10.34.65.212'} from (pid=23756) _safe_log /usr/lib/python2.6/site-packages/nova/rpc/common.py:160
2012-06-18 19:16:55 DEBUG nova.rpc.amqp [req-1151ed40-a179-4f2c-83f6-98e65c4e06be f022923dee7b48bc8d50ba64b304a8ee b037486aa9b2455eb8d071d902ecefbe] unpacked context: {'user_id': u'f022923dee7b48bc8d50ba64b304a8ee', 'roles': [u'admin', u'KeystoneServiceAdmin', u'KeystoneAdmin'], 'timestamp': '2012-06-18T17:16:54.836088', 'auth_token': '<SANITIZED>', 'remote_address': u'10.34.65.212', 'is_admin': True, 'request_id': u'req-1151ed40-a179-4f2c-83f6-98e65c4e06be', 'project_id': u'b037486aa9b2455eb8d071d902ecefbe', 'read_deleted': u'no'} from (pid=23756) _safe_log /usr/lib/python2.6/site-packages/nova/rpc/common.py:160
2012-06-18 19:16:55 DEBUG nova.network.manager [req-1151ed40-a179-4f2c-83f6-98e65c4e06be f022923dee7b48bc8d50ba64b304a8ee b037486aa9b2455eb8d071d902ecefbe] floating IP allocation for instance |238| from (pid=23756) allocate_for_instance /usr/lib/python2.6/site-packages/nova/network/manager.py:315
2012-06-18 19:16:55 DEBUG nova.network.manager [req-1151ed40-a179-4f2c-83f6-98e65c4e06be f022923dee7b48bc8d50ba64b304a8ee b037486aa9b2455eb8d071d902ecefbe] network allocations for instance |238| from (pid=23756) allocate_for_instance /usr/lib/python2.6/site-packages/nova/network/manager.py:896
2012-06-18 19:16:55 DEBUG nova.network.manager [req-1151ed40-a179-4f2c-83f6-98e65c4e06be f022923dee7b48bc8d50ba64b304a8ee b037486aa9b2455eb8d071d902ecefbe] networks retrieved for instance |238|: |[<nova.db.sqlalchemy.models.Network object at 0x41c9790>]| from (pid=23756) allocate_for_instance /usr/lib/python2.6/site-packages/nova/network/manager.py:902
2012-06-18 19:16:56 ERROR nova.rpc.amqp [req-1151ed40-a179-4f2c-83f6-98e65c4e06be f022923dee7b48bc8d50ba64b304a8ee b037486aa9b2455eb8d071d902ecefbe] Exception during message handling
2012-06-18 19:16:56 TRACE nova.rpc.amqp Traceback (most recent call last):
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/rpc/amqp.py", line 253, in _process_data
2012-06-18 19:16:56 TRACE nova.rpc.amqp rval = node_func(context=ctxt, **node_args)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 257, in wrapped
2012-06-18 19:16:56 TRACE nova.rpc.amqp return func(self, context, *args, **kwargs)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 320, in allocate_for_instance
2012-06-18 19:16:56 TRACE nova.rpc.amqp **kwargs)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 257, in wrapped
2012-06-18 19:16:56 TRACE nova.rpc.amqp return func(self, context, *args, **kwargs)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 906, in allocate_for_instance
2012-06-18 19:16:56 TRACE nova.rpc.amqp requested_networks=requested_networks)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 213, in _allocate_fixed_ips
2012-06-18 19:16:56 TRACE nova.rpc.amqp vpn=vpn, address=address)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/network/manager.py", line 1177, in allocate_fixed_ip
2012-06-18 19:16:56 TRACE nova.rpc.amqp instance_id)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/db/api.py", line 402, in fixed_ip_associate_pool
2012-06-18 19:16:56 TRACE nova.rpc.amqp instance_id, host)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 102, in wrapper
2012-06-18 19:16:56 TRACE nova.rpc.amqp return f(*args, **kwargs)
2012-06-18 19:16:56 TRACE nova.rpc.amqp File "/usr/lib/python2.6/site-packages/nova/db/sqlalchemy/api.py", line 943, in fixed_ip_associate_pool
2012-06-18 19:16:56 TRACE nova.rpc.amqp raise exception.NoMoreFixedIps()
2012-06-18 19:16:56 TRACE nova.rpc.amqp NoMoreFixedIps: Zero fixed ips available.
2012-06-18 19:16:56 TRACE nova.rpc.amqp

select count(*) from fixed_ips where instance_id in (select uuid from instances where deleted = '1');
32

Related branches

Warik (warik) wrote :

Same problem for me...
It's also happening with the floating ips range!

Warik

Ryan Tidwell (ryan-tidwell) wrote :

I'm seeing that both fixed IP's and floating IP's are being leaked. We are running with --auto_assign_floating_ip=true, so leaking floating IP's blocks provisioning new instances as well.

Ryan Tidwell (ryan-tidwell) wrote :

Looks like what is happening with floating IP's is that a reference to the project (or tenant) gets left on the DB entry for the floating IP. When I null out the project_id column in the floating IP's table, the auto-assignment of floating IP's succeeds. Looks like nulling out this reference fixes the problem. How is this supposed to work in auto-assign mode? Should the reference to a tenant be removed when a floating IP is released in auto-assign mode? Looks to me like it should.

Warik (warik) wrote :

Ryan,

Does update the entry instance_id in the fixed_ips table to NULL will release the ip ? Just like project_ip does for the floating ips?

Jaroslav Henner (jhenner) wrote :

Warik, yes it does.After NULLing the fixed_ips.instance_id, I was able to create machines again.

Ryan Tidwell (ryan-tidwell) wrote :

Jaroslav,

The leaking of fixed IP's seems to be more sporadic than the leaking of floating IP's. It appears that every time I delete an instance with auto_assign_floating_ip enabled, the floating IP is leaked. Not so with fixed IP's, I'm seeing it happen ~10% of the time, so it's a trickier to reproduce.

Vish Ishaya (vishvananda) wrote :

This is odd, as there is specific code to remove the tenant from the floating ip if auto_assign is set in nova/network/manager.py:

 375 # deallocate if auto_assigned
 376 if floating_ip['auto_assigned']:
 377 self.deallocate_floating_ip(read_deleted_context, address,
 378 affect_auto_assigned=True)

cany you check if auto_assigned is set to true in the database?

Warik (warik) wrote :

Vish,

where is it in nova DB ?

Warik (warik) wrote :

I have a trace:

Jun 21 17:51:16 sf1r003s015 2012-06-21 17:51:16 ERROR nova.rpc.amqp [req-5c973f18-87cb-4854-9d6c-8df922c51e7c f75661ae18a14bfdb77cdd07bcc8500c d62a6773fdaa48b1a64eab0e589285d7] Exception during message handling#0122012-06-21 17:51:16 TRACE nova.rpc.amqp Traceback (most recent call last):#0122012-06-21 17:51:16 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/rpc/amqp.py", line 252, in _process_data#0122012-06-21 17:51:16 TRACE nova.rpc.amqp rval = node_func(context=ctxt, **node_args)#0122012-06-21 17:51:16 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 258, in wrapped#0122012-06-21 17:51:16 TRACE nova.rpc.amqp return func(self, context, *args, **kwargs)#0122012-06-21 17:51:16 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 375, in deallocate_for_instance#0122012-06-21 17:51:16 TRACE nova.rpc.amqp affect_auto_assigned=True)#0122012-06-21 17:51:16 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 258, in wrapped#0122012-06-21 17:51:16 TRACE nova.rpc.amqp return func(self, context, *args, **kwargs)#0122012-06-21 17:51:16 TRACE nova.rpc.amqp File "/usr/lib/python2.7/dist-packages/nova/network/manager.py", line 426, in deallocate_floating_ip#0122012-06-21 17:51:16 TRACE nova.rpc.amqp raise exception.FloatingIpAssociated(address=floating_address)#0122012-06-21 17:51:16 TRACE nova.rpc.amqp FloatingIpAssociated: Floating ip 10.38.12.88 is associated.#0122012-06-21 17:51:16 TRACE nova.rpc.amqp

Ryan Tidwell (ryan-tidwell) wrote :

from network/manager.py:

@wrap_check_policy
    def deallocate_for_instance(self, context, **kwargs):
        """Handles deallocating floating IP resources for an instance.

        calls super class deallocate_for_instance() as well.

        rpc.called by network_api
        """
        instance_id = kwargs.get('instance_id')

        # NOTE(francois.charlier): in some cases the instance might be
        # deleted before the IPs are released, so we need to get deleted
        # instances too
        read_deleted_context = context.elevated(read_deleted='yes')
        LOG.debug(_("floating IP deallocation for instance |%s|"), instance_id,
                                                  context=read_deleted_context)

        try:
            fixed_ips = self.db.fixed_ip_get_by_instance(read_deleted_context,
                                                         instance_id)
        except exception.FixedIpNotFoundForInstance:
            fixed_ips = []
        # add to kwargs so we can pass to super to save a db lookup there
        kwargs['fixed_ips'] = fixed_ips
        for fixed_ip in fixed_ips:
            fixed_id = fixed_ip['id']
            floating_ips = self.db.floating_ip_get_by_fixed_ip_id(context,
                                                                  fixed_id)
            # disassociate floating ips related to fixed_ip
            for floating_ip in floating_ips:
                address = floating_ip['address']
                self.disassociate_floating_ip(read_deleted_context, address,
                                              affect_auto_assigned=True)
                # deallocate if auto_assigned
                if floating_ip['auto_assigned']:
                    self.deallocate_floating_ip(read_deleted_context, address,
                                                affect_auto_assigned=True)

This looks like it is trying to do the right thing, perhaps something is broken in db/sqlalchemy/api.py? I've written some quick and dirty unit tests that should massage this behavior out, but I can't get it to happen in unit tests. I seem to only see it on a live system

Vish Ishaya (vishvananda) wrote :

found the source of the fixed ips not being timed out:

This commit:

627522525e0268a1bc54695ce3dedf16d6fb413d

https://review.openstack.org/#/c/6949/

Fix coming

Changed in nova:
importance: Undecided → Critical
status: New → In Progress
assignee: nobody → Vish Ishaya (vishvananda)
Vish Ishaya (vishvananda) wrote :

bug specific to floating ips here: https://bugs.launchpad.net/nova/+bug/1017418

Thierry Carrez (ttx) on 2012-06-27
Changed in nova:
milestone: none → folsom-2

Reviewed: https://review.openstack.org/9026
Committed: http://github.com/openstack/nova/commit/45e98d415ec661ba7772ed2513f2a24ebe68f1dd
Submitter: Jenkins
Branch: master

commit 45e98d415ec661ba7772ed2513f2a24ebe68f1dd
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Jun 26 20:37:02 2012 +0000

    Stop nova_ipam_lib from changing the timeout setting

     * Fixes incorrect logic for when timeout should be used. We
       want to timeout along with dhcp, and release immediately
       without dhcp
     * Also makes ipam_lib release fixed_ips after floating ips, as
       this is the way that the internal network managers do it.
     * Fixes bug 1014769

    Change-Id: I687fac314264b8a89da128ed77c02d9b1ca140ff

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2012-07-04
Changed in nova:
status: Fix Committed → Fix Released

Reviewed: https://review.openstack.org/9030
Committed: http://github.com/openstack/nova/commit/3cb6e57587b58864ce70e99cef9f8c87c6082882
Submitter: Jenkins
Branch: stable/essex

commit 3cb6e57587b58864ce70e99cef9f8c87c6082882
Author: Vishvananda Ishaya <email address hidden>
Date: Tue Jun 26 20:37:02 2012 +0000

    Stop nova_ipam_lib from changing the timeout setting

     * Fixes incorrect logic for when timeout should be used. We
       want to timeout along with dhcp, and release immediately
       without dhcp
     * Also makes ipam_lib release fixed_ips after floating ips, as
       this is the way that the internal network managers do it.
     * Fixes bug 1014769

    Change-Id: I687fac314264b8a89da128ed77c02d9b1ca140ff
    (cherry picked from commit 45e98d415ec661ba7772ed2513f2a24ebe68f1dd)

Dave Walker (davewalker) on 2012-08-24
Changed in nova (Ubuntu):
status: New → Fix Released
Changed in nova (Ubuntu Precise):
status: New → Confirmed

Please find the attached test log from the Ubuntu Server Team's CI infrastructure. As part of the verification process for this bug, Nova has been deployed and configured across multiple nodes using precise-proposed as an installation source. After successful bring-up and configuration of the cluster, a number of exercises and smoke tests have be invoked to ensure the updated package did not introduce any regressions. A number of test iterations were carried out to catch any possible transient errors.

Please Note the list of installed packages at the top and bottom of the report.

For records of upstream test coverage of this update, please see the Jenkins links in the comments of the relevant upstream code-review(s):

Trunk review: https://review.openstack.org/9026
Stable review: https://review.openstack.org/9030

As per the provisional Micro Release Exception granted to this package by the Technical Board, we hope this contributes toward verification of this update.

Adam Gandelman (gandelman-a) wrote :

Test coverage log.

tags: added: verification-done
Launchpad Janitor (janitor) wrote :
Download full text (5.4 KiB)

This bug was fixed in the package nova - 2012.1.3+stable-20120827-4d2a4afe-0ubuntu1

---------------
nova (2012.1.3+stable-20120827-4d2a4afe-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot, fixes FTBFS in -proposed. (LP: #1041120)
  * Resynchronize with stable/essex (4d2a4afe):
    - [5d63601] Inappropriate exception handling on kvm live/block migration
      (LP: #917615)
    - [ae280ca] Deleted floating ips can cause instance delete to fail
      (LP: #1038266)

nova (2012.1.3+stable-20120824-86fb7362-0ubuntu1) precise-proposed; urgency=low

  * New upstream snapshot. (LP: #1041120)
  * Dropped, superseded by new snapshot:
    - debian/patches/CVE-2012-3447.patch: [d9577ce]
    - debian/patches/CVE-2012-3371.patch: [25f5bd3]
    - debian/patches/CVE-2012-3360+3361.patch: [b0feaff]
  * Resynchronize with stable/essex (86fb7362):
    - [86fb736] Libvirt driver reports incorrect error when volume-detach fails
      (LP: #1029463)
    - [272b98d] nova delete lxc-instance umounts the wrong rootfs (LP: #971621)
    - [09217ab] Block storage connections are NOT restored on system reboot
      (LP: #1036902)
    - [d9577ce] CVE-2012-3361 not fully addressed (LP: #1031311)
    - [e8ef050] pycrypto is unused and the existing code is potentially insecure
      to use (LP: #1033178)
    - [3b4ac31] cannot umount guestfs (LP: #1013689)
    - [f8255f3] qpid_heartbeat setting in ineffective (LP: #1030430)
    - [413c641] Deallocation of fixed IP occurs before security group refresh
      leading to potential security issue in error / race conditions
      (LP: #1021352)
    - [219c5ca] Race condition in network/deallocate_for_instance() leads to
      security issue (LP: #1021340)
    - [f2bc403] cleanup_file_locks does not remove stale sentinel files
      (LP: #1018586)
    - [4c7d671] Deleting Flavor currently in use by instance creates error
      (LP: #994935)
    - [7e88e39] nova testsuite errors on newer versions of python-boto (e.g.
      2.5.2) (LP: #1027984)
    - [80d3026] NoMoreFloatingIps: Zero floating ips available after repeatedly
      creating and destroying instances over time (LP: #1017418)
    - [4d74631] Launching with source groups under load produces lazy load error
      (LP: #1018721)
    - [08e5128] API 'v1.1/{tenant_id}/os-hosts' does not return a list of hosts
      (LP: #1014925)
    - [801b94a] Restarting nova-compute removes ip packet filters (LP: #1027105)
    - [f6d1f55] instance live migration should create virtual_size disk image
      (LP: #977007)
    - [4b89b4f] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [6e873bc] [nova][volumes] Exceeding volumes, gigabytes and floating_ips
      quotas returns general uninformative HTTP 500 error (LP: #1021373)
    - [7b215ed] Use default qemu-img cluster size in libvirt connection driver
    - [d3a87a2] Listing flavors with marker set returns 400 (LP: #956096)
    - [cf6a85a] nova-rootwrap hardcodes paths instead of using
      /sbin:/usr/sbin:/usr/bin:/bin (LP: #1013147)
    - [2efc87c] affinity filters don't work if scheduler_hints is None
      (LP: #1007573)
  ...

Read more...

Changed in nova (Ubuntu Precise):
status: Confirmed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regresssions.

Thierry Carrez (ttx) on 2012-09-27
Changed in nova:
milestone: folsom-2 → 2012.2
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers