resize error on the same current host with enough vcpu resource

Bug #1609193 reported by Charlotte Han
This bug affects 6 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Confirmed
Wishlist
Unassigned

Bug Description

Steps to reproduce
==================
A chronological list of steps that reproduces the issue:
* I had a single compute node, with allow_resize_to_same_host=true set.

* I booted an instance with flavor m1.tiny:
  | 1 | m1.tiny | 512 | 1 | 0 | | 1 | 1.0 | True |

* I then booted more instances, so that the compute node had 3 free vCPUs.
* I then resized the first instance to flavor 1-1, which has 4 vCPUs:
  | 1-1 | hanrong_cpu_4 | 512 | 1 | 0 | | 4 | 1.0 | True |

* The instance then went into the error state:
{"message": "Insufficient compute resources: Free vcpu 3.00 VCPU < requested 4 VCPU.", "code": 400, "created": "2016-08-02T12:12:36Z"}

Expected result
===============

I expected the instance to resize successfully, leaving 0 free vCPUs on the compute node after the resize.

Actual result
=============
The instance went into the error state, and the compute node still has 3 free vCPUs after the resize.

Environment
===========
Nova is from git:
    $ git log -1
commit 2b0557e4ee6737f44cb6fa845d7f59446bca90bf
Merge: 40913fe 15a9458
Author: Jenkins <email address hidden>
Date: Thu Jun 16 01:51:48 2016 +0000

    Merge "Added missed response to test_server_tags"

Tags: compute resize
Charlotte Han (hanrong)
tags: added: scheduler
description: updated
Changed in nova:
assignee: nobody → Charlotte Han (hanrong)
Charlotte Han (hanrong)
Changed in nova:
assignee: Charlotte Han (hanrong) → nobody
tags: added: resize
Changed in nova:
assignee: nobody → Maciej Szankin (mszankin)
status: New → In Progress
description: updated
Revision history for this message
Matt Riedemann (mriedem) wrote :

What's probably happening is it's hitting ComputeResourcesUnavailable, which triggers a reschedule, but since there is nowhere to reschedule to, it fails and the instance is set to error.

Yeah, it gets into rt.resize_claim, which does the claim test and raises the ComputeResourcesUnavailable exception. It can't reschedule because it's a resize to the same host / single node, and that all happens within an _error_out_instance_on_exception context manager, so the instance is put in the error state.

So I guess you'd have to handle ComputeResourcesUnavailable in _error_out_instance_on_exception and not set the instance to the error state.
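The failure mode Matt describes can be sketched in miniature. This is a hypothetical model, not nova's real code: the names mirror _error_out_instance_on_exception, rt.resize_claim, and ComputeResourcesUnavailable, but the bodies are simplified stand-ins.

```python
from contextlib import contextmanager


class ComputeResourcesUnavailable(Exception):
    """Stand-in for nova.exception.ComputeResourcesUnavailable."""


@contextmanager
def error_out_instance_on_exception(instance):
    # Simplified model of nova's _error_out_instance_on_exception:
    # any exception escaping the block puts the instance in ERROR.
    try:
        yield
    except Exception:
        instance["vm_state"] = "error"
        raise


def resize_claim(free_vcpus, requested_vcpus):
    # Simplified claim test: the resize claim does NOT credit the
    # vCPUs the instance already holds on this same host.
    if requested_vcpus > free_vcpus:
        raise ComputeResourcesUnavailable(
            "Free vcpu %.2f VCPU < requested %d VCPU"
            % (free_vcpus, requested_vcpus))


instance = {"vm_state": "active"}
try:
    with error_out_instance_on_exception(instance):
        resize_claim(free_vcpus=3, requested_vcpus=4)
except ComputeResourcesUnavailable:
    pass  # single host: nowhere to reschedule to

print(instance["vm_state"])  # the failed claim left the instance in "error"
```

With a single host there is no reschedule target, so the exception escapes the context manager and the instance lands in ERROR, matching the reporter's observation.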

Revision history for this message
Sarafraj Singh (sarafraj-singh) wrote :

So this is not going to work in Nova, as resources currently used by an instance are not considered free. But if the instance goes to error, that might qualify as a bug.
According to Matt: "it gets into rt.resize_claim which does the claim test and raises the ComputeResourcesUnavailable exception"

Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
tags: added: compute
removed: scheduler
Revision history for this message
Sarafraj Singh (sarafraj-singh) wrote :

mriedem "b/c it's resize to same host / single node
 and that all happens within a _error_out_instance_on_exception context manager
 so the instance is put in error state
 so i guess you'd have to handle ComputeResourcesUnavailable in _error_out_instance_on_exception and not set the instance to error state"

Revision history for this message
Charlotte Han (hanrong) wrote :

Thank you.

The resize action for this instance failed because the scheduler's CoreFilter returned 0 hosts. So I think this is a scheduler problem.

One host has 3 free vCPUs, and an instance with 1 vCPU is on this host. When this instance is resized to 4 vCPUs, I think the resize should succeed, because host_free_vcpus + instance_current_used_vcpus == resized_instance_required_vcpus (3 + 1 == 4).
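The accounting difference being proposed can be made concrete. Both helpers below are hypothetical illustrations, not nova functions: one applies the reporter's suggested same-host arithmetic, the other applies the check nova effectively performs.

```python
def same_host_resize_fits(free_vcpus, instance_vcpus, new_vcpus):
    # Proposed accounting: on a same-host resize, the vCPUs the
    # instance already holds should count as available again.
    return free_vcpus + instance_vcpus >= new_vcpus


def current_resize_fits(free_vcpus, new_vcpus):
    # What the claim effectively checks today: only genuinely
    # free vCPUs count; the instance's own usage is not credited.
    return free_vcpus >= new_vcpus


# The scenario from this bug: 3 free vCPUs, instance uses 1, resize to 4.
print(same_host_resize_fits(3, 1, 4))  # True  (3 + 1 == 4)
print(current_resize_fits(3, 4))       # False (3 < 4)
```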

Revision history for this message
Maciej Szankin (mszankin) wrote :

But this would mean that resize gets a special treatment. It was designed this way, I do not think of it as a bug, rather as an inconvenience. Plus, counting a difference between flavors leaves us with more things to reconsider, not only CPU, but also RAM and so on.

Revision history for this message
Charlotte Han (hanrong) wrote :

Yes, resize was designed this way. But as a cloud user, I hope resources can be used as fully as possible, especially when I need them. So I think this is a bug. Or is there another API that can resolve this problem?

Revision history for this message
John Garbutt (johngarbutt) wrote :

So interestingly, we are going to hit this with live-resize as well, as that will always resize on the same host.

The problem here is that we are currently re-writing the scheduler. Once we are done, an instance will actually have a claim in the scheduler, so we should be able to tell the scheduler to increase the size of the claim, rather than just give us a new claim of the new size.

It feels like this is blocked until that scheduler work is complete. It has been a priority feature for the last few cycles, and we are making some progress.
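John's suggestion of growing an existing claim rather than taking out a fresh one can be sketched as follows. The Claim class and resize_to method are hypothetical, purely to show why claiming only the delta fits where a full new claim does not.

```python
class Claim:
    """Hypothetical scheduler claim held by an instance (illustration only)."""

    def __init__(self, host_free, vcpus):
        self.host_free = host_free  # free vCPUs remaining on the host
        self.vcpus = vcpus          # vCPUs this claim currently covers

    def resize_to(self, new_vcpus):
        # Grow the existing claim by the difference instead of
        # requesting a brand-new claim of the full new size.
        delta = new_vcpus - self.vcpus
        if delta > self.host_free:
            raise ValueError("not enough free vCPUs for the delta")
        self.host_free -= delta
        self.vcpus = new_vcpus


# Bug scenario: 3 vCPUs free, the instance's claim covers 1, resize to 4.
claim = Claim(host_free=3, vcpus=1)
claim.resize_to(4)      # only the delta of 3 vCPUs is claimed, so it fits
print(claim.host_free)  # 0
```

A fresh claim of 4 would have failed against the 3 free vCPUs; growing the claim by the delta of 3 succeeds and leaves the host fully used, which is the outcome the reporter expected.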

Changed in nova:
importance: Medium → Wishlist
Revision history for this message
Charlotte Han (hanrong) wrote :

Thank you

I will wait for the live-resize function, but I am worried about NUMA topology. For a NUMA-topology change, resizing a shut-down instance is easier.

Changed in nova:
assignee: Maciej Szankin (mszankin) → nobody
status: In Progress → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

Per comment 7, the scheduler claims resources in placement since Pike, but that still doesn't resolve this issue; see bug 1790204. The resource tracking code is not smart enough to consider that the instance is being resized on the same host where it's already consuming resources and count those as available. So in this case you start with VCPU=1 and want to go to VCPU=4, and there are 4 total VCPUs on the host. nova/placement is going to try to claim 4 when there are only 3 available and fail. See my attempts at fixing bug 1790204 - this isn't going to be fixed anytime soon, and I think it is best documented as a known limitation.
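The placement-side version of the same accounting gap can be sketched like this. placement_claim is a hypothetical helper, not placement's real API: it models a claim for the full new allocation made without first releasing the instance's existing allocation on the same resource provider.

```python
def placement_claim(total, used_by_others, instance_usage, requested):
    # Hypothetical model of the placement claim during a same-host
    # resize: the full new allocation is requested while the
    # instance's old allocation still counts against the provider.
    free = total - used_by_others - instance_usage
    return requested <= free


# Host: 4 VCPU total, the instance holds 1, nothing else is used.
# Resizing to 4 VCPUs tries to claim 4 while only 3 are free -> fails,
# even though releasing the old allocation first would make it fit.
print(placement_claim(total=4, used_by_others=0, instance_usage=1,
                      requested=4))  # False
```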
