nova will try to create unlimited instances concurrently and timeout when resources are depleted

Bug #1418155 reported by Joe Talerico
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Dan Smith
Milestone: 2015.1.0

Bug Description

Running with --num-instances=16, I saw a couple of instances go into ERROR state. On the hypervisor side, I saw the following failure:

2015-02-04 09:03:02.840 5077 ERROR nova.compute.manager [-] [instance: e277cf66-167f-4e81-a141-8dec12290015] Instance failed to spawn
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] Traceback (most recent call last):
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2243, in _build_resources
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] yield resources
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2113, in _build_and_run_instance
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] block_device_info=block_device_info)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2622, in spawn
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] block_device_info, disk_info=disk_info)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4508, in _create_domain_and_network
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] power_on=power_on)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4432, in _create_domain
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] LOG.error(err)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/openstack/common/excutils.py", line 82, in __exit__
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] six.reraise(self.type_, self.value, self.tb)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4423, in _create_domain
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] domain.createWithFlags(launch_flags)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 183, in doit
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] result = proxy_call(self._autowrap, f, *args, **kwargs)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 141, in proxy_call
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] rv = execute(f, *args, **kwargs)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 122, in execute
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] six.reraise(c, e, tb)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 80, in tworker
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] rv = meth(*args, **kwargs)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] File "/usr/lib64/python2.7/site-packages/libvirt.py", line 993, in createWithFlags
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015] libvirtError: error from service: CreateMachine: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
2015-02-04 09:03:02.840 5077 TRACE nova.compute.manager [instance: e277cf66-167f-4e81-a141-8dec12290015]
2015-02-04 09:03:02.843 5077 AUDIT nova.compute.manager [req-663bcedd-8f56-4a84-81b1-4e7321a5f30e None] [instance: e277cf66-167f-4e81-a141-8dec12290015] Terminating instance

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: New → In Progress
Dan Smith (danms)
Changed in nova:
importance: Undecided → Medium
milestone: none → kilo-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/153004
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5a542e770648469b0fbb638f6ba53f95424252ec
Submitter: Jenkins
Branch: master

commit 5a542e770648469b0fbb638f6ba53f95424252ec
Author: Dan Smith <email address hidden>
Date: Wed Feb 4 10:10:25 2015 -0800

    Add max_concurrent_builds limit configuration

    Right now, nova-compute will attempt to build an infinite number of
    instances, if asked to do so. This won't work on any machine, regardless
    of the resources, if the number of instances is too large.

    We could default this to zero to retain the current behavior, but
    the current behavior is really not sane in any case, so I think we
    should default to something. Ten instances for a single compute node
    seems like a reasonable default. If you can do more than ten at a
    time, you're definitely not running a cloud based on default config.

    DocImpact: Adds a new configuration variable

    Closes-Bug: #1418155

    Change-Id: I412d2849fd16430e6926fc983c031babb7ad04d0

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
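For operators, the new limit is adjustable in nova.conf. A minimal sketch, using the default of 10 chosen by the patch (per the commit message, 0 retains the old unlimited behavior):

```
[DEFAULT]
# Maximum number of instance builds to run concurrently on this
# compute node; 0 disables the limit.
max_concurrent_builds = 10
```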
Revision history for this message
Solganik Alexander (solganik) wrote :

I experienced a similar issue running CentOS 7 as a hypervisor. I found that the real cause was a bug in systemd-machined's D-Bus message handling. The bug is filed here: https://bugs.centos.org/view.php?id=8564.

Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/219637

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/juno)

Change abandoned by Radomir Dopieralski (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/219637
Reason: Sorry everyone, this wasn't supposed to go to this branch.
