Test job fails with FixedIpLimitExceeded with nova network

Bug #1353962 reported by Attila Fazekas
This bug affects 4 people
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Expired   Undecided   Unassigned
tempest                    Invalid   Undecided   Unassigned
description: updated
Tracy Jones (tjones-i)
tags: added: testing
Revision history for this message
Joe Gordon (jogo) wrote :

Is it possible tempest is leaking servers?

Revision history for this message
Joe Gordon (jogo) wrote :

https://review.openstack.org/#/c/112581/ should help sort this one out

Changed in nova:
status: New → Incomplete
Revision history for this message
Joe Gordon (jogo) wrote :

Marking as incomplete in nova, since we're unable to confirm this is a nova bug.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This looks like a duplicate of bug 1338844.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Never mind, 1338844 was fixed in tempest (a quota issue), so this is different. There are 26 hits in 10 days.

Revision history for this message
Matthew Treinish (treinish) wrote :

Even if tempest leaked every server it created (which is definitely not happening), a network that allocates a /24 should still have plenty of IPs available. From the logs above, tempest only allocated a total of 171 servers:

http://logs.openstack.org/23/112523/1/check/check-tempest-dsvm-postgres-full/acac6d9/logs/qemu/
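
(For scale: a /24 provides 256 addresses; even after subtracting the network, broadcast, and a handful of reserved gateway/DHCP addresses, roughly 250 usable IPs remain, so even 171 leaked servers would fit comfortably.)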

But the exception is a quota one, and the default tenant quota for fixed IPs is definitely more than 2 (otherwise we'd have much larger issues). I'm pretty sure we set it to -1, but I'm not sure. Looking at the test class in question:

http://git.openstack.org/cgit/openstack/tempest/tree/tempest/api/compute/admin/test_servers_negative.py

The tenant for that class will only make 2 server create calls: one in setUpClass and one in this test. I have a hard time believing 2 fixed IPs exceeded the quota limit. Looking at the tests running around the same time, starting at:

http://logs.openstack.org/23/112523/1/check/check-tempest-dsvm-postgres-full/acac6d9/console.html.gz#_2014-08-07_09_42_03_634

which was the last test from a previous class on the same thread, up until the same point after the failed test:

http://logs.openstack.org/23/112523/1/check/check-tempest-dsvm-postgres-full/acac6d9/console.html.gz#_2014-08-07_09_42_27_778

I don't see anything that should have accidentally changed the quota to be < 2. I guess there could be a hidden cross-interaction, because there are a bunch of admin API tests running around the same time (the joys of a default alphabetical test order), but I don't see one yet.

Revision history for this message
David Kranz (david-kranz) wrote :

This happened 26 times between 9/10 and 9/13, and zero times between 9/13 and 9/16. Perhaps it is fixed but we should keep watching.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Just hit it again:

http://logs.openstack.org/98/125398/11/check/check-tempest-dsvm-postgres-full/f0d9495/logs/screen-n-net.txt.gz?level=TRACE

There are 17 hits in the last 7 days, check and gate, all failures.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Taking a look at the code in this stacktrace:

2014-10-06 08:00:23.848 DEBUG nova.network.manager [req-bf60b9d9-452c-4fbc-9f7a-cc90ed762536 ServersAdminTestJSON-1220859943 ServersAdminTestJSON-233825707] [instance: 50156a3a-8b14-40ea-affe-97e5a510ec32] Allocate fixed ip on network cbcee1f9-13d7-4ec6-a450-d777a3d11120 allocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:859
2014-10-06 08:00:23.874 DEBUG nova.network.manager [req-bf60b9d9-452c-4fbc-9f7a-cc90ed762536 ServersAdminTestJSON-1220859943 ServersAdminTestJSON-233825707] Quota exceeded for 3292f7b35abf4565a53d099ad878a335, tried to allocate fixed IP allocate_fixed_ip /opt/stack/new/nova/nova/network/manager.py:875
2014-10-06 08:00:23.874 ERROR oslo.messaging.rpc.dispatcher [req-bf60b9d9-452c-4fbc-9f7a-cc90ed762536 ServersAdminTestJSON-1220859943 ServersAdminTestJSON-233825707] Exception during message handling: Maximum number of fixed ips exceeded
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 134, in _dispatch_and_reply
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 177, in _dispatch
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 123, in _do_dispatch
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/network/floating_ips.py", line 114, in allocate_for_instance
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher **kwargs)
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/network/manager.py", line 511, in allocate_for_instance
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher requested_networks=requested_networks)
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/network/manager.py", line 192, in _allocate_fixed_ips
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher vpn=vpn, address=address)
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/new/nova/nova/network/manager.py", line 876, in allocate_fixed_ip
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher raise exception.FixedIpLimitExceeded()
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher FixedIpLimitExceeded: Maximum number of fixed ips exceeded
2014-10-06 08:00:23.874 25609 TRACE oslo.messaging.rpc.dispatcher

We're using the FlatDHCPManager:

http://logs.openstack.org/98/125398/11/check/check-tempest-dsvm-postgr...
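
For reference, the flow the trace lands in can be sketched roughly as follows. This is a simplified, self-contained paraphrase of what the trace and log lines above show (a quota reservation is attempted and an over-quota answer is translated into FixedIpLimitExceeded); the helper names and arguments are illustrative, not the verbatim nova source:

    class OverQuota(Exception):
        """Stand-in for the over-quota signal from the quota layer."""


    class FixedIpLimitExceeded(Exception):
        """Maximum number of fixed ips exceeded."""


    def reserve_fixed_ip(quota_limit, in_use, delta=1):
        # A limit of -1 means unlimited; anything else is enforced.
        if 0 <= quota_limit < in_use + delta:
            raise OverQuota()


    def allocate_fixed_ip(quota_limit, in_use, project_id):
        try:
            reserve_fixed_ip(quota_limit, in_use)
        except OverQuota:
            # Mirrors the "Quota exceeded for <project>, tried to allocate
            # fixed IP" warning and the raise at manager.py:876 in the trace.
            print("Quota exceeded for %s, tried to allocate fixed IP"
                  % project_id)
            raise FixedIpLimitExceeded()
        # ... a free address would be picked from the network and
        # associated with the instance here ...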


Matt Riedemann (mriedem)
tags: added: gate-failure
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looking at tests that extend BaseV2ComputeAdminTest: tempest.api.compute.admin.test_servers creates 2 test instances for the JSON run alone (and there is an XML version also). Those would be on the same admin tenant, so if those run concurrently we could have at least 4 instances running at the same time under that tenant.

We have 4 test workers, so we could have 2 other tests running concurrently.

There are 36 test classes that extend BaseV2ComputeAdminTest, not all of them create test servers though.

Looking at the console log when our test fails:

http://logs.openstack.org/98/125398/11/check/check-tempest-dsvm-postgres-full/f0d9495/console.html#_2014-10-06_08_00_28_464

And looking at what's running before it, we see:

tempest.api.compute.admin.test_quotas_negative.QuotasAdminNegativeTestXML

and

tempest.api.compute.admin.test_quotas.QuotaClassesAdminTestJSON

We've had race issues with QuotaClassesAdminTestJSON in the past where it would set the default quota class values to 0, so we'd hit over-quota. That was changed a while back to take the default quotas and add 100 to the limit, so that shouldn't be causing issues.

QuotasAdminNegativeTestXML is running with force_tenant_isolation=False

So the quota admin tests should be running in tenant isolation and not impact ServersAdminTestJSON.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Chatted with Matt Treinish, and the JSON and XML tests should have separate tenants because in the gate we're running with allow_tenant_isolation=True and the credentials are created in setUpClass of each test class, which is different for the JSON and XML runs.

So I guess I'm back to wondering what's going on with the fixed_ips over-quota if it's a per-project resource quota and the test is only creating at most 4 test servers.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/126401

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/126402

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/126401
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=044b18e7f02c588738bc21c43ad8066c9fcd7426
Submitter: Jenkins
Branch: master

commit 044b18e7f02c588738bc21c43ad8066c9fcd7426
Author: Matt Riedemann <email address hidden>
Date: Mon Oct 6 08:04:01 2014 -0700

    nova-net: add more useful logging before raising FixedIpLimitExceeded

    The OverQuota exception that we get back from the DB API contains some
    useful information about the quota limits for the project and resource
    and the current usage, so we can log that before raising
    FixedIpLimitExceeded. Also logs the actual project ID passed to
    quotas.reserve rather than context.project_id since it's possible those
    are different.

    This should help us figure out if we're leaking resources in gate runs.

    Related-Bug: #1353962

    Change-Id: Ia8a6d2336ddbbe28178be8d3ce4d97e6d4ab3787

Revision history for this message
Joe Gordon (jogo) wrote : Re: Test job failes with FixedIpLimitExceeded with nova network

After further investigation with Matt Riedemann, it looks like there are some issues in the quotas code, although it's still not clear whether nova is the root cause of this bug or not.

Changed in nova:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/126628

Revision history for this message
Matt Riedemann (mriedem) wrote :

Bug 1284930 might be related, i.e. if we're not rolling back quota reservations on a fixed IP deallocate failure we could be leaking quotas that way.
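
To make the suspected leak mechanism concrete, here is a toy, self-contained illustration of a reserve/commit/rollback flow and what happens when the rollback is skipped on failure; TinyQuota and allocate_fixed_ip are invented for this example and are not nova code:

    class TinyQuota(object):
        """Toy reserve/commit/rollback accounting for a single resource."""

        def __init__(self, limit):
            self.limit = limit
            self.in_use = 0
            self.reserved = 0

        def reserve(self, delta=1):
            if self.in_use + self.reserved + delta > self.limit:
                raise RuntimeError("OverQuota")
            self.reserved += delta
            return delta  # acts as the reservation handle

        def commit(self, reservation):
            self.reserved -= reservation
            self.in_use += reservation

        def rollback(self, reservation):
            self.reserved -= reservation


    fixed_ip_quota = TinyQuota(limit=2)


    def allocate_fixed_ip(fail=False):
        reservation = fixed_ip_quota.reserve(1)
        try:
            if fail:
                raise IOError("allocation blew up")  # simulated failure
        except IOError:
            # Skip this rollback and the reservation is leaked.
            fixed_ip_quota.rollback(reservation)
            raise
        else:
            fixed_ip_quota.commit(reservation)

    # Without the rollback, two failed calls leave reserved == 2 while
    # in_use is still 0, and every later reserve() raises "OverQuota" even
    # though nothing is really allocated -- the same kind of phantom usage
    # a missed rollback on a deallocate failure would produce.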

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/126402
Reason: I was confusing this with nova.quota.DbQuotaDriver.rollback but it's actually nova.objects.quotas.Quotas.rollback.

You were all right for not believing me!

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/126628
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ff2748990389deee62bed97456bf7a81cc43a47c
Submitter: Jenkins
Branch: master

commit ff2748990389deee62bed97456bf7a81cc43a47c
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 8 08:28:32 2014 -0700

    Log quota refresh in_use message at INFO level for logstash

    Commit 5aef26496786c02a1ed324bbd2eb0b9994a2018d added the log message to
    give some idea when we're unnecessarily refreshing quota usage.

    Logstash only collects INFO+ level messages, so bump this to an INFO
    level message so we can track gate runs to see if this is actually a
    problem.

    Related-Bug: #1353962

    Change-Id: I60ffa9dbda22e252f290bf316fa2e472d3e92f2f

Revision history for this message
jichenjc (jichenjc) wrote : Re: Test job failes with FixedIpLimitExceeded with nova network

I hit it again here today

http://logs.openstack.org/60/111660/13/check/check-tempest-dsvm-full/d0f4d9b/logs/screen-n-net.txt.gz

What's really confusing me is the following log:

WARNING nova.network.manager [req-49960ae4-3327-4770-8ff1-e1002786ee79 ServersAdminNegativeTestXML-863236903 ServersAdminNegativeTestXML-1636622346] [instance: 7f330ee3-90fb-463f-a325-f4a8bd7c3a02] Quota exceeded for project 5e8c727ff61142d9a7ad424cdb8ca8b5, tried to allocate fixed IP. 0 of 99 are in use or are already reserved.

From code analysis, it should not report an exception because the used count is 0. I checked the quota-related code but I can't see any useful log to indicate why the quota allocation failed.

So I will add more logs to _raise_overquota_exception and see whether we can figure out what happened.

Revision history for this message
jichenjc (jichenjc) wrote :

@matt, can you guide me on how to view the logs in the gate? I am not familiar with it. Thanks.

Revision history for this message
jichenjc (jichenjc) wrote :

BTW: I didn't see the logs changed at https://review.openstack.org/126628,
so maybe there is another cause of the problem?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/131737

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/131737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5c9537b40092e701615e33e0dcf90a082d172f72
Submitter: Jenkins
Branch: master

commit 5c9537b40092e701615e33e0dcf90a082d172f72
Author: jichenjc <email address hidden>
Date: Tue Oct 28 04:55:50 2014 +0800

    Add debug log when over quota exception occurs

    There are some weird logs shown in bug 1353962; with the
    current logs we can't decide which part is wrong.
    Adding more logs will help operators know the real reason
    the quota is exceeded.

    If this change produces too much log output, we can talk about
    removing those logs after bug 1353962 is fixed, but for now
    it will be helpful.

    Change-Id: I2ab1fb62efc06b1cee213920539980e331870bf5
    Related-Bug: #1353962

Revision history for this message
Matthew Treinish (treinish) wrote : Re: Test job failes with FixedIpLimitExceeded with nova network

I'm not seeing anything tempest related in this bug. It looks like nova is just falling down and enforcing a quota even when the limits aren't even close.

Changed in tempest:
status: New → Incomplete
Changed in nova:
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/167814

Revision history for this message
Matt Riedemann (mriedem) wrote : Re: Test job failes with FixedIpLimitExceeded with nova network

Here we go, we have better logging in the DB API when this fails now:

http://logs.openstack.org/01/167101/5/check/check-tempest-dsvm-full/9f94b8c/logs/screen-n-cond.txt.gz#_2015-03-25_19_19_37_678

Raise OverQuota exception because: project_quotas: {'fixed_ips': -1}, user_quotas: {'fixed_ips': 99L}, deltas: {u'fixed_ips': 1}, overs: [u'fixed_ips'], project_usages: {u'ram': {'total': 128L, 'reserved': 0L, 'in_use': 128L}, u'floating_ips': {'total': 1L, 'reserved': 0L, 'in_use': 1L}, u'instances': {'total': 2L, 'reserved': 0L, 'in_use': 2L}, u'cores': {'total': 2L, 'reserved': 0L, 'in_use': 2L}, u'fixed_ips': {'total': 1L, 'reserved': 0L, 'in_use': 1L}, u'security_groups': {'total': 1L, 'reserved': 0L, 'in_use': 1L}}, user_usages: {u'ram': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedba50>, u'floating_ips': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedb5d0>, u'instances': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedbc90>, u'cores': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedb750>, u'fixed_ips': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedb510>, u'security_groups': <nova.db.sqlalchemy.models.QuotaUsage object at 0x7faaccedbf90>}

So project_quotas and user_quotas are different:

project_quotas: {'fixed_ips': -1}, user_quotas: {'fixed_ips': 99L}

The user_quotas are what's passed back to the network manager, which is why the log message is confusing:

http://git.openstack.org/cgit/openstack/nova/tree/nova/db/sqlalchemy/api.py?id=2015.1.0b3#n3578

So we see this in the n-net logs:

http://logs.openstack.org/01/167101/5/check/check-tempest-dsvm-full/9f94b8c/logs/screen-n-net.txt.gz?level=TRACE#_2015-03-25_19_19_37_685

2015-03-25 19:19:37.685 WARNING nova.network.manager [req-0f01e070-e2ac-4058-a7cb-9d619bb557d6 FloatingIPsTestJSON-1260806534 FloatingIPsTestJSON-109816172] [instance: c90d1a08-4be7-4f88-8a73-36424d70acd6] Quota exceeded for project e3bba3eb15c844cda054b8c73ec68c0b, tried to allocate fixed IP. 1 of 99 are in use or are already reserved.

So from project_usages, we have this:

u'fixed_ips': {'total': 1L, 'reserved': 0L, 'in_use': 1L}

That is why we go over-quota: it thinks we have 1 total fixed IP to use and it's already in use.
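
To see how those logged numbers can add up to an over-quota, here is one plausible reading, written as a simplified, self-contained stand-in for the comparison; the variable names mirror the log above, but this is illustrative, not the verbatim nova check:

    project_quotas = {'fixed_ips': -1}            # -1 means "unlimited"
    user_quotas = {'fixed_ips': 99}
    project_usages = {'fixed_ips': {'total': 1}}
    user_usages = {'fixed_ips': {'total': 1}}
    deltas = {'fixed_ips': 1}

    overs = []
    for resource, delta in deltas.items():
        if delta <= 0 or user_quotas[resource] < 0:
            continue
        # If -1 is not special-cased here, -1 < 1 + 1 is True, so the
        # "unlimited" project quota still flags an over-quota.
        if project_quotas[resource] < delta + project_usages[resource]['total']:
            overs.append(resource)
        elif user_quotas[resource] < delta + user_usages[resource]['total']:
            overs.append(resource)

    print(overs)  # ['fixed_ips'] -> surfaces as FixedIpLimitExceeded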

Revision history for this message
Matt Riedemann (mriedem) wrote :

This might also be useful: we are being forced to refresh quotas in the gate, but only for fixed_ips:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwicXVvdGFfdXNhZ2VzIG91dCBvZiBzeW5jLCB1cGRhdGluZ1wiIEFORCB0YWdzOlwic2NyZWVuLW4tY29uZC50eHRcIiIsImZpZWxkcyI6W10sIm9mZnNldCI6MCwidGltZWZyYW1lIjoiNjA0ODAwIiwiZ3JhcGhtb2RlIjoiY291bnQiLCJ0aW1lIjp7InVzZXJfaW50ZXJ2YWwiOjB9LCJzdGFtcCI6MTQyNzM4ODk2NDUyOCwibW9kZSI6IiIsImFuYWx5emVfZmllbGQiOiIifQ==

e.g.

[req-962680b8-2147-462e-a3bc-13d942b8989f MultipleCreateTestJSON-1247330694 MultipleCreateTestJSON-1141436516] quota_usages out of sync, updating. project_id: 5b4715def338466c9a6eea5886fb8f38, user_id: None, resource: fixed_ips, tracked usage: 0, actual usage: 1

[req-75c73b51-9f2a-4275-8ab7-1e8c270a3848 MultipleCreateTestJSON-1809221276 MultipleCreateTestJSON-267241457] quota_usages out of sync, updating. project_id: 647717c69bba4377b9421b2c832f4991, user_id: None, resource: fixed_ips, tracked usage: -1, actual usage: 1
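
For context, those messages come from a desync check along these lines; this is a simplified, self-contained sketch rather than the actual nova code: when the tracked counter no longer matches the real count in the database, the usage is resynced and the warning above is logged.

    def refresh_if_out_of_sync(tracked_in_use, actual_in_use,
                               project_id, resource):
        """Resync a tracked usage counter with the actual count."""
        if tracked_in_use != actual_in_use:
            print("quota_usages out of sync, updating. project_id: %s, "
                  "resource: %s, tracked usage: %d, actual usage: %d"
                  % (project_id, resource, tracked_in_use, actual_in_use))
            tracked_in_use = actual_in_use
        return tracked_in_use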

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/167814
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fe5003bc1400e4e467699ef857071717e75146e7
Submitter: Jenkins
Branch: master

commit fe5003bc1400e4e467699ef857071717e75146e7
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 25 14:58:04 2015 -0700

    Add some logging in the quota.reserve flow

    Try to figure out what's going on with a race in the gate
    when we go over-quota trying to allocate a fixed IP.

    Change-Id: Ib23450826a5b4a995360d1958e6070b23a3e732c
    Related-Bug: #1353962

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/168158

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/168158
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=af4ce3ee63e4e4af9113405396c3201eb3efe3c6
Submitter: Jenkins
Branch: master

commit af4ce3ee63e4e4af9113405396c3201eb3efe3c6
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 26 13:46:40 2015 -0700

    Add debug logging to quota_reserve flow

    There are some places in the quota_reserve flow that could use some
    debug logging for helping to figure out the problem when we over quota:

    1. Log when we detect a desync in _is_quota_refresh_needed and force a
       refresh.
    2. There is a comment in quota_reserve after we call
       _refresh_quota_usages saying we assume the sync was OK and
       essentially everything is hunky dory. Log if we actually refreshed
       without a change to usage.
    3. Log the details from project/user quotas/usages when we go over
       quota in _calculate_overquota.
    4. Log if we update project_usages using values from user_usages in
       quota_reserve.
    5. Finally, before we raise the OverQuota exception in quota_reserve,
       the user_usages dict values are UserQuota objects which are sqla
       dict-like objects and don't stringify nicely for the logs, so
       convert those to actual dict values for logging.

    Related-Bug: #1353962

    Change-Id: Ic6e3d5fdf51c0a13c97509d0888a3bba00fb3193

Matt Riedemann (mriedem)
summary: - Test job failes with FixedIpLimitExceeded with nova network
+ Test job fails with FixedIpLimitExceeded with nova network
Revision history for this message
Yaroslav Lobankov (ylobankov) wrote :

It looks like this bug has nothing to do with Tempest, so I'm moving the status of the bug to "Invalid".

Changed in tempest:
status: Incomplete → Invalid
tags: added: quotas
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote : Cleanup EOL bug report

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which led to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (LIBERTY, MITAKA, OCATA, NEWTON).
  Valid example: CONFIRMED FOR: LIBERTY

Changed in nova:
importance: Medium → Undecided
status: Confirmed → Expired