cross_az_attach=False doesn't honor BDMs with source=image and dest=volume

Bug #2018318 reported by Rafael Weingartner
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The cross_az_attach config option allows an instance to be pinned to the AZ of its related volume(s) when the option is set to False.

We fixed the case of a volume-backed instance with https://review.opendev.org/c/openstack/nova/+/469675/ if the volume was created before the instance, but we haven't yet resolved the case of a BFV instance created from an image (the BDM shortcut that lets the compute service create the volume late).
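
For illustration, a minimal sketch of the kind of boot request this is about (all UUIDs and the volume size are placeholders): the BDM asks Nova to create the boot volume from an image later, on the compute host, and no AZ is requested.

    # Sketch of a server-create payload using the image -> volume BDM shortcut.
    # The UUIDs and the volume size are placeholders.
    server_create_body = {
        "server": {
            "name": "bfv-from-image",
            "flavorRef": "<flavor-uuid>",
            "networks": [{"uuid": "<network-uuid>"}],
            # No availability_zone is requested, so the RequestSpec does not
            # pin the instance to any AZ.
            "block_device_mapping_v2": [{
                "boot_index": 0,
                "source_type": "image",
                "destination_type": "volume",
                "uuid": "<image-uuid>",
                "volume_size": 10,
                "delete_on_termination": True,
            }],
        },
    }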

Since the volume is created based on the instance's current AZ, it does respect the current AZ, but since the instance isn't pinned to that AZ, the instance can move from one AZ to another while the volume continues to exist in the original AZ.

As a consequence, the problem is only observed after a move operation, but it has existed since the instance was created.

=== ORIGINAL BUG REPORT BELOW ===

Before I start, let me describe the agents involved in the migration and/or resize flow of OpenStack (in this case, the Nova component). This is the mapping and interpretation I created while troubleshooting the reported problem.

- Nova-API: the agent responsible for receiving the HTTP requests (create/resize/migrate) from the OpenStack end user. It does some basic validation, and then sends a message with the requested command via RPC call to the other agents.
- Nova-conductor: the agent responsible for "conducting/guiding" the workflow. Nova-conductor reads the commands from the RPC queue and then processes the requests from Nova-API. It does some extra validation, and for every command (create/resize/migrate), it asks the scheduler to define the target host for the operation (if the target host was not defined by the user).
- Nova-scheduler: the agent responsible for "scheduling" VMs on hosts. It defines where a VM must reside. It receives the "select host request" and runs its algorithms to determine where the VM can be allocated. Before applying the scheduling algorithms, it calls/queries the Placement system to get the possible hosts where VMs might be allocated; that is, hosts that fit the requested parameters, such as being in a given Cell or availability zone (AZ) and having available/free computing resources to support the VM. The call from Nova-scheduler to Placement is an HTTP request.
- Placement: behaves as an inventory system. It tracks where resources are allocated, their characteristics, and the providers (hosts/storage/network systems) where resources are (or can be) allocated. It also has functions to return the possible hosts where a "request spec" can be fulfilled.
- Nova (i.e., nova-compute): the agent responsible for executing/processing the commands and implementing the actions on the hypervisor.

Then, we have the following workflows for the different processes.

- migrate: Nova-API -> (via RPC call -- nova.conductor.manager.ComputeTaskManager.live_migrate_instance) Nova-conductor (loads the request spec) -> (via RPC call) Nova-scheduler -> (via HTTP) Placement -> (after Placement returns) Nova-scheduler executes the filtering of the hosts, based on the active filters -> (return to the other processes in the conductor) -> (via RPC call) Nova (compute) to execute the migration.

- resize: Nova-API -> (via RPC call -- nova.conductor.manager.ComputeTaskManager.migrate_server -- _cold_migrate) Nova-conductor (loads the request spec) -> (via RPC call) Nova-scheduler -> (via HTTP) Placement -> (after Placement returns) Nova-scheduler executes the filtering of the hosts, based on the active filters -> (return to the other processes in Nova-conductor) -> (via RPC call) Nova (compute) to execute the cold migration and start the VM again with the new computing resource definition.

As a side note, this mapping also explains why "resize" does not execute the CPU compatibility check that "migration" executes (this is something else I was checking, but it is worth mentioning here). A resize is basically a cold migration to a new host, where a new flavor (VM definition) is applied; thus, it does not need to evaluate CPU feature set compatibility.

The problem we are reporting happens with both "migrate" and "resize" operations. To understand it, I had to add some logs to see what was going on (the whole process is/was "logless"). The issue happens because Placement always returns all hosts of the environment for a given VM being migrated (resize is a migration process); this only happens if the VM was deployed without defining its availability zone in the request spec.

To be more precise, Nova-conductor in `nova.conductor.tasks.live_migrate.LiveMigrationTask._get_request_spec_for_select_destinations` (https://github.com/openstack/nova/blob/3d83bb3356e10355437851919e161f258cebf761/nova/conductor/tasks/live_migrate.py#L460) always uses the original request specification, the one used to deploy the VM, to find a new host to migrate it to. Therefore, if the VM is deployed to a specific AZ, it will always send this AZ to Placement (because the AZ is in the request spec), and Placement will filter out hosts that are not from that AZ. However, if the VM is deployed without defining the AZ, Nova will select a host (from some AZ) to deploy the VM, and when migrating the VM, Nova does not try to find another host in the same AZ where the VM is already running. It always behaves like a new deployment when selecting the host.

That raised a question: how is it possible that the create (deploy VM) process works? It works because of the "cross_az_attach" parameter configured in Nova. As we can see in https://github.com/openstack/nova/blob/3d83bb3356e10355437851919e161f258cebf761/nova/virt/block_device.py#L53, if this parameter is False, Nova uses the AZ where the VM was scheduled when creating the volume in Cinder. Everything works because the host selection process is executed before the volume is created in Cinder.
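
For readers following along, a rough paraphrase of the logic behind that line (simplified; not a verbatim copy of nova/virt/block_device.py):

    from oslo_config import cfg

    CONF = cfg.CONF
    # In Nova this option is registered by the [cinder] config group; it is
    # registered here only so the sketch is self-contained.
    CONF.register_opts([cfg.BoolOpt('cross_az_attach', default=True)],
                       group='cinder')

    def volume_create_az(instance):
        # With [cinder]/cross_az_attach = False, the new volume is created in
        # the AZ of the host the instance was scheduled to; otherwise Cinder
        # is free to pick its own AZ.
        if not CONF.cinder.cross_az_attach:
            return instance.availability_zone
        return None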

After discovering all that, we were under the impression that OpenStack was designed to have (require) different Cells to implement multiple AZs. Therefore, we assumed that the problem was caused by this code/line (https://github.com/openstack/nova/blob/3d83bb3356e10355437851919e161f258cebf761/nova/conductor/tasks/live_migrate.py#L495). Whenever a request is made to Nova-scheduler, Nova-conductor always sends the current Cell where the VM resides to Placement. Therefore, if we had multiple AZs, each with a different Cell configuration, we would never have hit this situation; that is why we were thinking the problem might be a setup issue.

However, while discussing, and after checking the documentation that describes the use of AZs (https://docs.openstack.org/nova/latest/admin/availability-zones.html), we concluded that there is an issue with the code. It should be possible to have multiple AZs sharing the same Cell. We concluded that, similar to what happens when "cross_az_attach" is False at deploy time (Nova allocates the Cinder volume in a specific AZ), the "cross_az_attach" parameter should also be evaluated when executing migrations, and the current AZ of the VM should be added to the request spec sent to Placement to list the possible hosts the VM can be moved to.
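
To make the proposal concrete, here is a minimal sketch of the idea (a hypothetical helper, not the actual code in review #864760): before asking the scheduler for move targets, keep the VM in its current AZ when cross-AZ attach is disabled and the user never pinned an AZ.

    def constrain_move_to_current_az(request_spec, instance, conf):
        # Hypothetical helper illustrating the proposal: only applies when
        # cross-AZ attach is disabled and no AZ was requested at boot time.
        if not conf.cinder.cross_az_attach and not request_spec.availability_zone:
            request_spec.availability_zone = instance.availability_zone
        return request_spec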

We also discussed whether Placement should be the one doing this check before returning the possible hosts to migrate the VM to. However, this does not seem to fit Placement's context/goal/design. Therefore, the place where we need a patch/fix is Nova.

Furthermore, the solution proposed in https://review.opendev.org/c/openstack/nova/+/469675/12/nova/compute/api.py#1173 only addresses the case where the VM is created from existing volumes; it then sets the AZ of the volumes in the VM's request spec (even though the user did not set it in the request spec). That is why everything works in setups where cross_az_attach=False. However, if we create a VM from an image and Nova then creates a new volume in Cinder, the AZ is not set in the request spec, yet the request spec is used to execute the first call to Placement to select the hosts, as described above.
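
For comparison, a loose sketch of what the merged change does for the existing-volume case described above (illustrative names only, not the real Nova code):

    def az_from_boot_volume(cinder_client, bdms, conf):
        # Only relevant when cross-AZ attach is disabled.
        if conf.cinder.cross_az_attach:
            return None
        for bdm in bdms:
            if bdm.get('source_type') == 'volume' and bdm.get('boot_index') == 0:
                volume = cinder_client.volumes.get(bdm['uuid'])
                # The pre-existing volume's AZ is copied into the request
                # spec, which pins the scheduler to hosts in that AZ.
                return volume.availability_zone
        return None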

Following the same process that is used with Nova cells, we propose a solution for this situation in https://review.opendev.org/c/openstack/nova/+/864760.

Any other comments and reviews are welcome!

Changed in nova:
status: New → In Progress
Changed in nova:
status: In Progress → Invalid
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Setting this to invalid.

We should discuss this more; however, it is perfectly valid for a resize to move a VM to a different AZ as long as the VM was not created with an AZ in the original request.

AZs are not expected to align to cells, and you can have multiple cells in the same AZ and multiple AZs in the same cell concurrently.

If an instance does not request an AZ in the VM create request, there is no expectation that the VM should be tied to that AZ for its lifetime when migrated, resized, evacuated, or otherwise moved.

Adding scheduler support for cross_az_attach would be a feature, not a bug, which is why I marked this as invalid.

The current behaviour is expected.

The option https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.default_schedule_zone

can be used to force VMs to have an AZ when one is not requested.

If that is not set and no AZ is provided, VMs are expected to be able to float across AZs.

Revision history for this message
Rafael Weingartner (rafaelweingartner) wrote :

Sean, you are not considering the parameter/option "cross_az_attach". That feature defines a constraint in the cloud environment. Therefore, it should be considered when executing migration processes. I see that you want to call it a feature, but it is clearly a bug (at least under our understanding). The code is there, valid, and resolves the problem with a simple solution. If you would not like it to be accepted, that is fine... ¯\_(ツ)_/¯

Anyway, it seems that this is not going to move further. At least, we have all of this mapped.

Revision history for this message
Florian Engelmann (engelmann) wrote :

Our OpenStack cloud is affected by this issue as well. Regarding "cross_az_attach", the documentation is clear:

https://docs.openstack.org/nova/latest/admin/availability-zones.html

"cross_az_attach=False is not widely used nor tested extensively and thus suffers from some known issues:"

So there are known issues with this parameter, and the one described here is another bug/issue, not a feature.

And the parameter is described as follows:
https://docs.openstack.org/nova/latest/configuration/config.html

"If False, volumes attached to an instance must be in the same availability zone in Cinder as the instance availability zone in Nova."

So by moving the VM to another AZ (e.g. by a resize), we "break" this description.

I can definitely confirm that this issue is a major problem for operations.

Revision history for this message
s.peters (dr-ev1l) wrote :

This bug affects me as a simple user too.

I expect the VM to resize and not "also" move across AZs. Why should the VM move to a different AZ if I did not explicitly tell it to move? That does not make sense.

From the user's perspective, this is buggy. It reminds me of CloudStack, where unexpected things happened all the time. :-/

Revision history for this message
Rafael Weingartner (rafaelweingartner) wrote :

Hello guys, I built a matrix that may help us see the impact of the patch I am proposing. The formatting here is not great, but you can copy and paste it into a text editor and everything should be fine.

I do agree that VMs without an AZ can float around the cloud. However, the `cross_az_attach` constraint has to be considered when a migration/resize happens. This parameter is already considered in other parts of Nova, but not in the migrate/resize workflow.

| OpenStack VM (server) Action | cross_az_attach | Patch applied #864760 | Result |
| start  | false | no  | Use the host where the VM was previously created. |
| start  | true  | no  | Use the host where the VM was previously created. |
| start  | false | yes | Use the host where the VM was previously created. |
| start  | true  | yes | Use the host where the VM was previously created. |
| resize | false | no  | Resize will trigger a cold migration. If the setup has multiple AZs in the same region, and if these AZs are in the same Cell, any host of any AZ is going to be used. Therefore, an error will happen when the system (randomly) selects a host in another AZ. |
| resize | true  | no  | Resize will trigger a cold migration. If the setup has multiple AZs in the same region, and if these AZs are in the same Cell, any host of any AZ is going to be used. No apparent error will happen; however, latency issues might start to happen as soon as the VM is resized, and the root cause might be hard for operators to identify. |
| resize | false | yes | Resize will trigger a cold migration. Therefore, if the setup has multiple AZs in the same region, and if these AZs are in the same Cell, a host ... |


Revision history for this message
Florian Engelmann (engelmann) wrote :

Hi,

I also do agree that VMs without an AZ specified can float around the cloud - BUT - only if this "is possible"/"does make sense" from a technical point of view. AZs can be designed and implemented in many different ways, but if a provider chooses to use AZs as distinct locations within a region that are engineered to be isolated from failures in other AZs, there are VERY good reasons to not allow "cross_az_attach"-ments. I am happy to explain all those reasons if needed.

I took the liberty of adding this aspect to Rafael's matrix. I hope that's okay.
I attached the matrix in a file like Rafael did, because Launchpad does not support tables.

| OpenStack VM (server) Action | cross_az_attach | Patch applied #864760 | Result | Expectations as an operator |
| start  | false | no  | Use the host where the VM was previously created. | Use the host where the VM was previously created. |
| start  | true  | no  | Use the host where the VM was previously created. | Use the host where the VM was previously created. |
| start  | false | yes | Use the host where the VM was previously created. | Use the host where the VM was previously created. |
| start  | true  | yes | Use the host where the VM was previously created. | Use the host where the VM was previously created. |
| resize | false | no  | Resize will trigger a cold migration. If the setup has multiple AZs in the same region, and if these AZs are in the same Cell, any host of any AZ is going to be used. Therefore, an error will happen when the system (randomly) selects a host in another AZ. | An operator would not(!) expect the VM to migrate to a host in a different cell OR AZ. This is true for all VMs having any(!) "non local" volume attached. So even VMs having their nova root volume on ceph should NOT move to another AZ. |
| resize | true  | no  | Resize will trigger a cold migration. If the setup has multiple AZs in the same region, and if these AZs are in the same Cell, any host ... | An operator would expect the VM to migrate to any host. |


Revision history for this message
Francois Scheurer (scheuref) wrote :

Dear OpenStack Compute (nova) Code Maintainer(s),

I hope this message finds you well.

The current behavior, when migrating or flavor resizing VMs that were created without an explicit availability zone (AZ), leads to the following problems:

    1) Despite having the 'cinder_cross_az_attach' configuration set to 'false', cross-AZ attached Cinder volumes still occur.
        This results in volume-based VMs being restarted in the incorrect AZ, leading to degraded disk latency.

    2) Image-based VMs come to a halt and become stuck in an error status.
        This is because their Ceph RBD images cannot be found in the Nova pool of the wrong AZ.
        Recovering such VMs involves administrator intervention and changing the database directly.

We have received multiple customer tickets related to this issue, wherein they have uploaded images, created VMs, and encountered difficulties when attempting to resize them.

By ensuring that the 'cinder_cross_az_attach' configuration is honored during migration or resizing, we can significantly improve user experience and reduce related issues.

We kindly request you to review and consider merging the proposed bug fix into the upstream code.

Thank you for your time and attention to this matter. We believe that this fix will greatly benefit the OpenStack Nova community.

Best regards,
Francois Scheurer

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

I left a very large comment on Gerrit but I'll add it here for better visibility.

FWIW, I think the problem is legit and needs to be addressed. I'm gonna change the title and the subject to make it clearer but I also think that the solution isn't simple at first and requires some design discussion, hence the Wishlist status.

Now, the comment I wrote explaining my -1 (you can find it here https://review.opendev.org/c/openstack/nova/+/864760/comment/b2b03637_f15d6dd2/ )

=================
> Just because you say so? =)

> Can you provide a more technical explanation on why not? I mean, why would that be wrong? Or, what alternative would be better, and why?

Sorry, that's kind of an undocumented design consensus (or tribal knowledge, if you prefer).
We, as the Nova community, want to keep the RequestSpec.availability_zone record as an immutable object, that is only set when creating the RequestSpec, so then we know whether the user wanted to pin the instance to a specific AZ or not.

> What is your proposal? We see the following two different alternatives so far. [...]

Maybe you haven't seen my proposal before, but I was talking of https://review.opendev.org/c/openstack/nova/+/469675/12/nova/compute/api.py#1173 that was merged.
See again my comment https://review.opendev.org/c/openstack/nova/+/864760/comments/4a302ce3_9805e7c6
To be clear, let me explain the problem and what we need to fix: if a user creates an instance from an image and asks for a volume to be created from that image, then we need to modify the AZ for the related request if and only if cross_az_attach=False.

Now, let's discuss the implementation:
1/ we know that volumes are created much later in the instance boot by the compute service, but we do pass the instance.az information to Cinder to tell it to create the volume within that AZ if cross_az_attach=False:
https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/virt/block_device.py#L427
https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/virt/block_device.py#L53-L78

2/ unfortunately, instance.availability_zone is only trustworthy if the instance is pinned to an AZ

3/ we know that at the API level, we're able to know whether we will create a volume based on an image, since we have the BDMs and we do check them:
https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/compute/api.py#L1460
https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/compute/api.py#L1866
https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/compute/api.py#L1960-L1965C43

4/ Accordingly, we are able to follow the same logic as in https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/compute/api.py#L1396-L1397 by checking the BDMs and seeing whether we are going to create a volume. If so, we SHALL pin the AZ exactly like https://github.com/openstack/nova/blob/b3fdd7ccf01bafb68e37a457f703b79119dbfa86/nova/compute/api.py#L1264
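
(Roughly, the kind of BDM check steps 3/ and 4/ refer to; illustrative only, not the actual nova.compute.api code.)

    def will_create_volume_from_image(bdms):
        # True when at least one BDM asks Nova to create a volume from an
        # image, i.e. the case this bug report is about.
        return any(bdm.get('source_type') == 'image' and
                   bdm.get('destination_type') == 'volume'
                   for bdm in bdms)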

Unfortunately, since the user didn't specify an AZ, Nova doesn't know which AZ to pin the instance to. Consequently, we have multiple options:

1/ we could return ...


Changed in nova:
status: Invalid → Confirmed
importance: Undecided → Wishlist
summary: - 'openstack server resize --flavor' should not migrate VMs to another AZ
+ cross_az_attach=False doesn't honor BDMs with source=image and dest=volume
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Based on the above comments, I also see a lot of questions about whether cross_az_attach works with the current releases. To be clear, yes, it does if a volume is created *before* an instance, and I added some other functional tests to prove it:

https://review.opendev.org/c/openstack/nova/+/878948

description: updated
Revision history for this message
Khoi (khoinh5) wrote :

This happened to me too. It makes the system hard to operate. We cannot control it when a lot of VMs are moved/migrated. I think we should have a solution for this: the current AZ should be checked before a VM is rescheduled by a resize/migration.
