Resize instance fails after creating host aggregate

Bug #1444841 reported by Qin Zhao
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Medium
Assigned to: Unassigned

Bug Description

Latest Kilo code

Reproduce steps:

1. Do not define any host aggregate, so the host's AZ is 'nova'. Boot one instance named 'zhaoqin-nova' with AZ 'nova'.

2. Create host aggregate 'zhaoqin' with AZ 'zhaoqin-az' and add the host to it. The AZ of instance 'zhaoqin-nova' in the db is still 'nova', but 'nova list' displays its AZ as 'zhaoqin-az'.

3. Resizing 'zhaoqin-nova' fails: no valid host.

4. Boot another instance, 'zhaoqin-my-az', with AZ 'zhaoqin-az'. Resizing 'zhaoqin-my-az' succeeds.

5. Remove the host from aggregate 'zhaoqin'.

6. Resizing 'zhaoqin-nova' succeeds. Resizing 'zhaoqin-my-az' fails: no valid host.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

nova-scheduler reports 'no valid host' because AvailabilityZoneFilter gets the AZ from the instance property. If the instance's availability_zone column in the db is inconsistent with the host AZ, this filter will not select the host.

When we add a host to a host aggregate, should we update the AZ of the instances running on that host? Or should we make the instance AZ obsolete and have AvailabilityZoneFilter get the AZ from the host, as 'nova list' does?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/174269

Changed in nova:
assignee: nobody → Qin Zhao (zhaoqin)
status: New → In Progress
Revision history for this message
jichenjc (jichenjc) wrote :

I guess we need to make the instance update its AZ; otherwise there might be other potential inconsistency bugs.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Is this a duplicate of bug 1431194?

tags: added: resize
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Lingxian Kong (kong) wrote :

Yes, this bug really is a duplicate of bug 1431194; at least the root cause is the same. The workaround I submitted is not perfect, and according to the discussion with Sylvain, maybe we should work together to solve related issues like this after the Vancouver Summit.

btw, please refer to https://etherpad.openstack.org/p/YVR-nova-scheduler-in-liberty if you're interested.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

@kong, are you in Vancouver now?

Revision history for this message
Lingxian Kong (kong) wrote :

hi qin,

Unfortunately, I'm not in Vancouver. You can talk to Rui Chen if you want to discuss this problem.

BTW, you could join #openstack-chinese if you don't mind :)

Matt Riedemann (mriedem)
tags: added: kilo-backport-potential
tags: added: juno-backport-potential
Revision history for this message
Alexis Lee (alexisl) wrote :

So having talked to bauzas about this a bit, it seems like instance.az should go away. Instead, all code should refer to instance.host.az (accepting that this may not be set immediately) and/or the instance's request_spec.az. Sylvain is still working on getting request spec committed and has a giant tail of patches on that already. He says alaski will provide a means of looking up the request spec for an instance. Once that is available we can get rid of the get_instance_availability_zone method entirely.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Okay, so I feel the bug should be marked as Invalid. Why? Let me explain:

While any instance can be shown with an AZ, that doesn't mean the instance.az field is set to that value; when the field is left blank, what is shown is the value of CONF.default_availability_zone.
How is the instance.az field set? It is populated once in the instance lifetime, at the Compute API level, here:
https://github.com/openstack/nova/blob/79fe4d8e076c9c7bb76f0afb1b2787d51b2c5037/nova/compute/api.py#L1147-L1161

As you can see, it calls _handle_availability_zone which reads what the API received and defaults to CONF.default_schedule_zone :
https://github.com/openstack/nova/blob/79fe4d8e076c9c7bb76f0afb1b2787d51b2c5037/nova/compute/api.py#L596-L597

Since CONF.default_schedule_zone defaults to None ( https://github.com/openstack/nova/blob/79fe4d8e076c9c7bb76f0afb1b2787d51b2c5037/nova/compute/api.py#L92-L93 ), a default nova boot command (without the --availability_zone flag) will create an Instance entry in the table with the AZ field equal to NULL.
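
As a rough illustration of the boot-time behaviour described above, here is a minimal Python sketch. It is not nova's code: `stored_availability_zone` and `DEFAULT_SCHEDULE_ZONE` are hypothetical names standing in for `_handle_availability_zone` and `CONF.default_schedule_zone`, and anything else the real method may do is ignored.

```python
from typing import Optional

# Assumed stand-in for CONF.default_schedule_zone, which defaults to None.
DEFAULT_SCHEDULE_ZONE: Optional[str] = None


def stored_availability_zone(
    requested_az: Optional[str],
    default_schedule_zone: Optional[str] = DEFAULT_SCHEDULE_ZONE,
) -> Optional[str]:
    """Return the value that would end up in the instance's AZ column.

    Hypothetical simplification: the value the API received wins,
    otherwise the default schedule zone (None unless configured) is used.
    """
    if requested_az is not None:
        return requested_az
    return default_schedule_zone


# Booting without the flag leaves the column NULL (None here).
print(stored_availability_zone(None))
# Booting with an explicit AZ stores it, even if it is the default name.
print(stored_availability_zone("nova"))
```

The point of the sketch is only that an explicit "nova" and an omitted AZ are stored differently, which is what matters for the filter behaviour discussed next.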

When it comes to the AZ filter, if the instance.az field is set to None, then the filter always returns True (which makes sense because the user didn't specify an AZ to stick with).

So, now that I've explained how it works, let me explain the error here: specifying an AZ in the boot command does the exact opposite; it pins the instance being created to the AZ provided. Since the bug reporter provided a value (even though it was the default value of "nova"), the instance.az field became "nova".

For the original boot, the AZ filter checked whether the host was in an aggregate. Since it was not, it checked whether the instance AZ (here "nova") was equal to CONF.default_availability_zone (which defaults to "nova") https://github.com/openstack/nova/blob/3aff2d7bff7f6e9edb5fa8b688287265722c27fb/nova/scheduler/filters/availability_zone_filter.py#L54 Yay, it worked.

Now, what happened once the host was part of the aggregate? The instance.az field didn't change, since that field stays the same for the whole lifetime of the instance (it is kept as a record of what the user requested). But the AZ filter now sees that the host belongs to an aggregate and consequently compares the host AZ with instance.az, and this time the comparison is False https://github.com/openstack/nova/blob/3aff2d7bff7f6e9edb5fa8b688287265722c27fb/nova/scheduler/filters/availability_zone_filter.py#L51
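
The filter logic described in the last few paragraphs can be sketched as follows. This is a simplified model under stated assumptions, not the actual AvailabilityZoneFilter: `host_passes` here is a plain function, `host_aggregate_azs` is a hypothetical list of the AZ names set on the host's aggregates, and `DEFAULT_AVAILABILITY_ZONE` stands in for `CONF.default_availability_zone`.

```python
from typing import Iterable, Optional

# Assumed stand-in for CONF.default_availability_zone.
DEFAULT_AVAILABILITY_ZONE = "nova"


def host_passes(
    instance_az: Optional[str],
    host_aggregate_azs: Iterable[str],
    default_az: str = DEFAULT_AVAILABILITY_ZONE,
) -> bool:
    """Simplified model of the AZ check described in this comment."""
    host_azs = set(host_aggregate_azs)
    if instance_az is None:
        # The user did not pin the instance to an AZ: any host is fine.
        return True
    if host_azs:
        # The host is in an aggregate with an AZ: compare against that.
        return instance_az in host_azs
    # The host is in no aggregate: compare against the default AZ name.
    return instance_az == default_az


# The reporter's scenario: instance pinned to "nova".
before_aggregate = host_passes("nova", [])            # original boot
after_aggregate = host_passes("nova", ["zhaoqin-az"])  # resize after step 2
unpinned = host_passes(None, ["zhaoqin-az"])           # booted without an AZ
print(before_aggregate, after_aggregate, unpinned)     # True False True
```

In this model an instance booted without an AZ keeps passing the check after the host joins an aggregate, while one pinned to "nova" does not.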

To be honest, rule of thumb: never ever explicitly name an AZ "nova", either when booting an instance or when setting the AZ of an aggregate. That will just prevent the default behaviour from working unless you modify CONF.default_schedule_zone.

Changed in nova:
status: In Progress → Invalid
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm trying to pursue options to prevent a user from getting into this situation in the first place. It seems if we can detect that the user is requesting an instance in the CONF.default_availability_zone explicitly but that's not actually in an aggregate, we could return a 400 (maybe that would have to take into account CONF.default_schedule_zone if it's set).
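
A minimal sketch of what such a check might look like, purely as an illustration of the idea above. `validate_requested_az` and `BadRequest` are hypothetical names, `aggregate_azs` is assumed to be the set of AZ names defined on aggregates, and the CONF.default_schedule_zone caveat is not modelled.

```python
from typing import Iterable, Optional


class BadRequest(Exception):
    """Hypothetical stand-in for an HTTP 400 response."""


def validate_requested_az(
    requested_az: Optional[str],
    aggregate_azs: Iterable[str],
    default_az: str = "nova",
) -> None:
    """Reject an explicit request for the default AZ name when no
    aggregate actually defines an AZ with that name."""
    if requested_az is None:
        return  # nothing was requested, so nothing to validate
    if requested_az == default_az and default_az not in set(aggregate_azs):
        raise BadRequest(
            "AZ %r is only the default name and is not backed by an "
            "aggregate; omit the AZ instead" % requested_az
        )


# Omitting the AZ, or asking for a real aggregate AZ, is accepted.
validate_requested_az(None, [])
validate_requested_az("zhaoqin-az", ["zhaoqin-az"])

# Explicitly asking for "nova" with no such aggregate is rejected.
try:
    validate_requested_az("nova", ["zhaoqin-az"])
except BadRequest as exc:
    print("400:", exc)
```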

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/223802

Revision history for this message
Matt Riedemann (mriedem) wrote :

By the way, isn't it a bug that the admin can add the host to the AZ when there are instances running on that host which have a different AZ? Since that's going to essentially break those instances from migrating somewhere?

Revision history for this message
Qin Zhao (zhaoqin) wrote :

Suppose I do not specify 'nova' as the instance's AZ name. Assume I create two aggregates whose AZ names are "abc" and "xyz", and host A is in AZ abc. If I boot an instance 'zhaoqin' with "--availability_zone abc", its AZ property will be abc. If I then move host A to AZ xyz and attempt to resize instance 'zhaoqin', I think the resize will still fail.

I think this should be a valid bug.

Revision history for this message
lee jian (leejian0612) wrote :

As I understand it, the availability zone of a VM should either be the same as that of its host (compute node) or be none, which means the VM can be scheduled to any host, even one that belongs to an AZ. From this bug, we know the host's AZ may change from the default (nova) to the aggregate's AZ when the host is added to an aggregate; conversely, when the host is removed from the aggregate, its AZ returns to the default. This is where the bug comes from: when the host's AZ changes, the VMs on that host keep the old AZ. For example, after the host is added to an aggregate, the VMs on it still have the default AZ (none would be reasonable), and after it is removed from the aggregate, the VMs on it may keep the aggregate's AZ. That leads to AZ inconsistency between VMs and the host, and can cause issues like this one. So we should add some sync mechanism to keep each VM's AZ the same as its host's when aggregates are used.
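
To make the proposed sync mechanism concrete, here is a small sketch under stated assumptions. It describes the proposal in this comment, not anything nova does; `Instance` and `sync_instance_azs` are hypothetical names, and instances with no stored AZ are deliberately left alone since none is treated as reasonable above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical, minimal view of an instance record."""
    name: str
    host: str
    availability_zone: Optional[str]


def sync_instance_azs(instances: List[Instance], host: str, new_az: str) -> List[str]:
    """Set the stored AZ of pinned instances on `host` to `new_az`.

    Returns the names of the instances that were updated.
    """
    updated = []
    for inst in instances:
        if inst.host != host:
            continue  # instance lives on another host
        if inst.availability_zone is None:
            continue  # not pinned to an AZ, nothing to sync
        if inst.availability_zone != new_az:
            inst.availability_zone = new_az
            updated.append(inst.name)
    return updated


# Host A moves from AZ "abc" to AZ "xyz".
instances = [
    Instance("zhaoqin", "host-a", "abc"),
    Instance("unpinned", "host-a", None),
    Instance("elsewhere", "host-b", "abc"),
]
print(sync_instance_azs(instances, "host-a", "xyz"))  # ['zhaoqin']
```

Note that this conflicts with the earlier explanation that the stored AZ records what the user requested, which is the trade-off the thread is debating.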

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/226683

Changed in nova:
assignee: Qin Zhao (zhaoqin) → lee jian (leejian0612)
status: Invalid → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/223802
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=74abd14baf0f664a94d88272d43283ac7e6bbc1f
Submitter: Jenkins
Branch: master

commit 74abd14baf0f664a94d88272d43283ac7e6bbc1f
Author: Sylvain Bauza <email address hidden>
Date: Tue Sep 15 22:33:24 2015 +0200

    Add some devref for AZs

    Since the AZ knowledge is mostly tribal and can have some corner cases, we could
    help the operators by giving more visibility on how it's made and what to prevent.

    The related ticket mentioned below is one example of a common mistake that is
    quite not easily fixable from the Nova standpoint since the design is mostly broken.

    Change-Id: I092c8caa9e450a68a7a952940b0bb288b8fe6fb0
    Related-Bug: #1444841

Alan Pevec (apevec)
tags: removed: juno-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/251788

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/174269
Reason: No updates since July, considering this abandoned.

Changed in nova:
assignee: lee jian (leejian0612) → Qin Zhao (zhaoqin)
Changed in nova:
assignee: Qin Zhao (zhaoqin) → nobody
status: In Progress → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/251788
Reason: This patch has been idle for a long time, so I am abandoning it to keep the review clean sane. If you're interested in still working on this patch, then please unabandon it and upload a new patchset.

Revision history for this message
Charlotte Han (hanrong) wrote :

@Sylvain Bauza (sylvain-bauza)
@Matt Riedemann (mriedem)
@Qin Zhao (zhaoqin)
The availability_zone field in the instance table is set when an instance is booted with the --availability_zone flag; it shows which physical resources the instance may use.

When an instance's host is moved to a host aggregate with a different availability_zone, the admin should do some maintenance work for the affected instances. Either:

1. The admin changes the affected instances' availability_zone value to the new AZ name.
2. The admin does nothing, and the instances' host is not consistent with the instances' availability_zone. That's OK, because it means these instances should stay in their original physical AZ. When any migration occurs, nova will select a new host from the original physical AZ, and if there is no valid host, the user should contact the administrator.

But one question: does nova have an API to update an instance's availability_zone field?

Revision history for this message
Qin Zhao (zhaoqin) wrote :

@hanrong, I have not looked at this bug for a long while... I see in the api doc that the "Update server" api can have AZ in the request, but I have never tested it.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/226683
Reason: This code hasn't been updated in a long time, and is in merge conflict. I am going to abandon this review, but feel free to restore it if you're still working on this.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Marking the bug as invalid as it was explained in c#12 and also as the doc describes the problem which is a configuration issue https://docs.openstack.org/developer/nova/aggregates.html#availability-zones-azs

Changed in nova:
status: Confirmed → Invalid
Revision history for this message
Diana Clarke (diana-clarke) wrote :

I spent some time on this today (downstream), so I figured I should probably add my notes here.

I was indeed able to reproduce the issue described in this bug, but only if I used the Horizon dashboard to create instances before creating my first real availability zone. My manual testing notes can be found here on GitHub (using devstack, stable/ocata).

    https://github.com/dianaclarke/openstack-notes/wiki/host%20aggregates

A quick summary:

    0. On a fresh install without instances or availability zones
    1. Create an instance via the Horizon dashboard (instance-1)
    2. Create an availability zone and host aggregate for the same host
    3. Attempt to resize instance-1 (and note the following error)

    ERROR (BadRequest): No valid host was found. No valid host found for resize (HTTP 400)

The trouble appears to be that the Horizon dashboard doesn't allow you to not specify an availability zone when you create an instance, and if you haven't yet created any "real" availability zones, it will send the name of the default availability zone ("nova").

This is a documented, known, no-no: "it is highly recommended for users to never ever ask for booting an instance by specifying an explicit AZ named 'nova'"

    https://docs.openstack.org/developer/nova/aggregates.html#availability-zones-azs

If you use the command line interface to create instances (nova boot) and don't specify an availability zone, you will get a NULL in the database for 'availability_zone' (this is correct). If you use Horizon, you will get the string "nova" which will result in resize issues once you create other "real" availability zones.

Perhaps reassign this to Horizon?
