OpenStack Compute (nova)

Server booted into a server group gets stuck in scheduling state if the policy filter is not configured

Bug #1408326 reported by Balazs Gibizer on 2015-01-07

This bug affects 1 person

Affects                     Status         Importance   Assigned to     Milestone
OpenStack Compute (nova)    Fix Released   Low          Ildiko Vancsa   OpenStack Compute (nova) 2015.1.0 "kilo"

Bug Description

To reproduce the problem:

Remove ServerGroupAntiAffinityFilter from scheduler_default_filters in nova.conf

$ nova server-group-create my-anti-affinity anti-affinity
$ nova boot --flavor 42 --image 1b0ef685-2a7a-4c11-8e2d-a549215c1b3a --hint group=<uuid of the group created above> vm23

The following exception is visible in the conductor log:
2015-01-07 15:33:44.988 ERROR nova.scheduler.utils [req-675e5f28-59c4-4406-952c-182202aae805 admin admin] ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 ERROR oslo.messaging.rpc.dispatcher [req-675e5f28-59c4-4406-952c-182202aae805 admin admin] Exception during message handling: No valid host was found. ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher Traceback (most recent call last):
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 137, in _dispatch_and_reply
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher incoming.message))
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 180, in _dispatch
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher return self._do_dispatch(endpoint, method, ctxt, args)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/usr/local/lib/python2.7/dist-packages/oslo/messaging/rpc/dispatcher.py", line 126, in _do_dispatch
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher result = getattr(endpoint, method)(ctxt, **new_args)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/conductor/manager.py", line 626, in build_instances
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher filter_properties)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/scheduler/utils.py", line 311, in setup_instance_group
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher group_info = _get_group_details(context, instance_uuid, group_hosts)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher File "/opt/stack/nova/nova/scheduler/utils.py", line 291, in _get_group_details
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher raise exception.NoValidHost(reason=msg)
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher NoValidHost: No valid host was found. ServerGroupAntiAffinityFilter not configured
2015-01-07 15:33:44.988 TRACE oslo.messaging.rpc.dispatcher

Balazs Gibizer (balazs-gibizer) on 2015-01-07

Changed in nova:
assignee:	nobody → Balazs Gibizer (balazs-gibizer)

Pasquale Porreca (pasquale-porreca) wrote on 2015-01-07:

Can you share your nova.conf (at least the part relative to Filter Scheduler)?

Balazs Gibizer (balazs-gibizer) wrote on 2015-01-08:

The only thing I changed from the default devstack nova.conf is that I removed ServerGroupAntiAffinityFilter from scheduler_default_filters. This triggers the code path that rejects scheduling a server into a group with the anti-affinity policy because the filter is not available.

So basically I added the following line to the default devstack nova.conf:
scheduler_default_filters = RetryFilter, AvailabilityZoneFilter, RamFilter, ComputeFilter, ComputeCapabilitiesFilter, ImagePropertiesFilter, ServerGroupAffinityFilter

I checked the conductor code; the unhandled exception comes from here: https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L625. The solution is straightforward: move the scheduler_utils.setup_instance_group() call into the try/except block below it.
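
The effect of relocating the call can be illustrated with a minimal, self-contained sketch. It is not nova's code: NoValidHost, setup_instance_group() and the two build_instances_* functions are simplified stand-ins that only borrow names from the traceback, and the instance is a plain dict with assumed keys.

```python
class NoValidHost(Exception):
    """Stand-in for nova.exception.NoValidHost (simplified for illustration)."""


def setup_instance_group(filter_properties):
    # Stand-in for scheduler_utils.setup_instance_group(): here it always
    # fails, as it does when the anti-affinity filter is not configured.
    raise NoValidHost("ServerGroupAntiAffinityFilter not configured")


def build_instances_before(instance, filter_properties):
    # The call sits outside the try block, so the exception escapes and the
    # instance is never moved out of its scheduling state.
    setup_instance_group(filter_properties)
    try:
        instance["vm_state"] = "active"  # placeholder for scheduling + build
    except NoValidHost:
        instance["vm_state"] = "error"


def build_instances_after(instance, filter_properties):
    # The call sits inside the try block, so the existing handler runs and
    # the instance ends up in an error state.
    try:
        setup_instance_group(filter_properties)
        instance["vm_state"] = "active"  # placeholder for scheduling + build
    except NoValidHost:
        instance["vm_state"] = "error"


def demo():
    stuck = {"vm_state": "building", "task_state": "scheduling"}
    try:
        build_instances_before(stuck, {})
    except NoValidHost:
        pass  # in the real service the RPC dispatcher logs the traceback
    fixed = {"vm_state": "building", "task_state": "scheduling"}
    build_instances_after(fixed, {})
    return stuck["vm_state"], fixed["vm_state"]
```

In this sketch the first instance keeps its initial state because nothing handles the exception, while the second one is set to an error state by the handler that was already there.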

I'm working on a patch that will fix the problem.

OpenStack Infra (hudson-openstack) wrote on 2015-01-08: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/145761

Changed in nova:
status:	New → In Progress

Claudiu Belu (cbelu) on 2015-01-08

tags:

added: scheduler

Claudiu Belu (cbelu) wrote on 2015-01-08:

Is this bug also present in Juno? If so, can you add the appropriate juno-backport-potential tag? Thanks.

Balazs Gibizer (balazs-gibizer) wrote on 2015-01-08:

I will test it on Juno to see if it needs to be backported.

Balazs Gibizer (balazs-gibizer) wrote on 2015-01-09:

This bug is not present in Juno (or Icehouse): the problem was introduced by the change https://review.openstack.org/#/c/128058/, which merged only in Kilo-1. Therefore no backport is needed.

OpenStack Infra (hudson-openstack) on 2015-01-20

Changed in nova:
assignee:	Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)

Jay Pipes (jaypipes) wrote on 2015-01-25:

IMO, the bug is that setup_instance_group() is raising NoValidHost when the anti-affinity filter is not enabled. The exception raised should be more descriptive of the problem... for example something like FilterConfigurationError or UnsupportedPolicyException.
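
A rough sketch of that idea, assuming a dedicated exception type: the class names follow the suggestion above, while the policy-to-filter mapping and the check_group_policy() and describe_failure() helpers are hypothetical and not taken from nova.

```python
class NoValidHost(Exception):
    """Stand-in for the generic scheduling failure."""


class UnsupportedPolicyException(Exception):
    """Hypothetical: the group policy needs a filter that is not enabled."""


# Assumed mapping from server group policy to the filter that enforces it.
REQUIRED_FILTER = {
    "affinity": "ServerGroupAffinityFilter",
    "anti-affinity": "ServerGroupAntiAffinityFilter",
}


def check_group_policy(policy, enabled_filters):
    """Raise UnsupportedPolicyException if the policy's filter is missing."""
    required = REQUIRED_FILTER.get(policy)
    if required is not None and required not in enabled_filters:
        raise UnsupportedPolicyException("%s not configured" % required)


def describe_failure(policy, enabled_filters):
    # A caller can now tell a configuration problem from a capacity problem.
    try:
        check_group_policy(policy, enabled_filters)
    except UnsupportedPolicyException as exc:
        return "configuration problem: %s" % exc
    except NoValidHost as exc:
        return "no capacity: %s" % exc
    return "ok"
```

With the filter list from the reproduction (no ServerGroupAntiAffinityFilter), an anti-affinity group would be reported as a configuration problem, while an affinity group would pass the check.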

Changed in nova:
importance:	Undecided → Low

Jay Pipes (jaypipes) wrote on 2015-01-25:

Setting to Low because this is partly a configuration issue, and not something that is common to run into.

OpenStack Infra (hudson-openstack) on 2015-01-26

Changed in nova:
assignee:	Ildiko Vancsa (ildiko-vancsa) → Balazs Gibizer (balazs-gibizer)

OpenStack Infra (hudson-openstack) on 2015-01-27

Changed in nova:
assignee:	Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)

OpenStack Infra (hudson-openstack) on 2015-01-27

Changed in nova:
assignee:	Ildiko Vancsa (ildiko-vancsa) → Balazs Gibizer (balazs-gibizer)

OpenStack Infra (hudson-openstack) on 2015-01-30

Changed in nova:
assignee:	Balazs Gibizer (balazs-gibizer) → Ildiko Vancsa (ildiko-vancsa)

OpenStack Infra (hudson-openstack) wrote on 2015-02-02: Fix merged to nova (master)

Reviewed: https://review.openstack.org/145761
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=28aedc604d993be4b1b38822fd08500df94b6ce3
Submitter: Jenkins
Branch: master

commit 28aedc604d993be4b1b38822fd08500df94b6ce3
Author: Balazs Gibizer <email address hidden>
Date: Wed Jan 7 16:01:38 2015 +0100

Fix leaking exceptions from scheduler utils

    scheduler_utils.setup_instance_group() can raise NoValidHost
    exception and the conductor does not handle that properly which
    causes the server to get stuck in a scheduling state instead of
    going to ERROR.

    The function is called several times in conductor/manager.py. This
    patch moves the function calls in the try blocks already present.
    It handles the missing Affinity filters as UnsupportedPolicyException,
    which is a new exception type added here. This way the exception
    raised by scheduler_utils.setup_instance_group() will cause the
    VM to end up in ERROR state instead of NOSTATE, as it is expected.
    Quota rollback is also handled properly, when it is needed.

Closes-bug: #1408326
Change-Id: I3a8ba37ded8ac3fa9b5f3f35b09feb82d822d583

Changed in nova:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2015-02-05

Changed in nova:
milestone:	none → kilo-2
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-04-30

Changed in nova:
milestone:	kilo-2 → 2015.1.0
