Live migration should use the same memory oversubscription logic as instance boot

Bug #1452009 reported by Michael Petersen
Affects              Status     Importance  Assigned to      Milestone
Mirantis OpenStack   Won't Fix  Wishlist    Timofey Durakov
5.1.x                Won't Fix  Medium      Timofey Durakov
6.0.x                Won't Fix  Medium      Timofey Durakov
6.1.x                Won't Fix  Medium      Timofey Durakov
7.0.x                Won't Fix  Medium      Timofey Durakov
8.0.x                Won't Fix  Wishlist    MOS Nova

Bug Description

This bug is copied over from:

https://bugs.launchpad.net/nova/+bug/1214943

I encountered an issue when live migrating an instance to a specified target host. I expected the operation to succeed, but it failed for the reason below:

MigrationPreCheckError: Migration pre-check error: Unable to migrate a34f9b88-1e07-4798-af46-ca3b3dbaceda to hchenos2: Lack of memory(host:336 <= instance:512)

1. My OpenStack cluster information:

1) There are two compute nodes in my cluster, and I created 4 instances (1 vCPU / 512 MB memory) on these hosts:

-----------
mysql> select hypervisor_hostname,vcpus,vcpus_used,running_vms,memory_mb,memory_mb_used,free_ram_mb,deleted from compute_nodes where deleted=0;
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
| hypervisor_hostname              | vcpus | vcpus_used | running_vms | memory_mb | memory_mb_used | free_ram_mb | deleted |
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
| hchenos1.eng.platformlab.ibm.com |     2 |          2 |           2 |      1872 |           1536 |         336 |       0 |
| hchenos2.eng.platformlab.ibm.com |     2 |          2 |           2 |      1872 |           1536 |         336 |       0 |
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
2 rows in set (0.00 sec)

mysql>
------------------------
[root@hchenos ~]# nova list
+--------------------------------------+------+--------+----------+
| ID                                   | Name | Status | Networks |
+--------------------------------------+------+--------+----------+
| a34f9b88-1e07-4798-af46-ca3b3dbaceda | vm1  | ACTIVE |          | >>>> on host 'hchenos1'
| f6aaeff9-2220-4693-8e5a-710f4c52b774 | vm2  | ACTIVE |          | >>>> on host 'hchenos2'
| bbee57a2-81cd-4933-a943-1c2272f7f550 | vm4  | ACTIVE |          | >>>> on host 'hchenos1'
| 74fe26ec-919c-4fa7-890f-f59abe09ef4f | vm5  | ACTIVE |          | >>>> on host 'hchenos2'
+--------------------------------------+------+--------+----------+
[root@hchenos ~]#

2) I also enabled ComputeFilter, RamFilter and CoreFilter in nova.conf, but did not configure overcommit ratios for either vCPU or memory, so the default ratios are used.
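
For reference, the relevant configuration looks roughly like this (a sketch; the commented values are the upstream defaults of that era, which apply when the options are left unset):

    [DEFAULT]
    scheduler_default_filters = ComputeFilter,RamFilter,CoreFilter
    # Upstream defaults when unset (the values assumed in this report):
    # ram_allocation_ratio = 1.5
    # cpu_allocation_ratio = 16.0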

2. Under the above conditions, live migrating instance vm1 to hchenos2 failed:

[root@hchenos ~]# nova live-migration vm1 hchenos2
ERROR: Live migration of instance a34f9b88-1e07-4798-af46-ca3b3dbaceda to host hchenos2 failed (HTTP 400) (Request-ID: req-68244b99-e438-4000-8bdb-cc43b275c018)

 conductor log:
...
ckages/nova/conductor/tasks/live_migrate.py", line 87, in _check_requested_destination\n self._check_destination_has_enough_memory()\n\n File "/usr/lib/python2.6/site-packages/nova/conductor/tasks/live_migrate.py", line 108, in _check_destination_has_enough_memory\n mem_inst=mem_inst))\n\nMigrationPreCheckError: Migration pre-check error: Unable to migrate a34f9b88-1e07-4798-af46-ca3b3dbaceda to hchenos2: Lack of memory(host:336 <= instance:512)\n\n']

I think the reason for the above is as follows:

the free_ram_mb for 'hchenos2' is 336 MB, while the requested memory is 512 MB, so the operation fails.

free_ram_mb = memory_mb (1872) - reserved_host_memory_mb (512) - instance consumption (2 * 512) = 336
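
A quick Python sketch of this conductor-side arithmetic (values from this deployment; reserved_host_memory_mb defaults to 512):

    # Conductor-side pre-check: raw free_ram_mb, no allocation ratio applied.
    memory_mb = 1872                     # total host RAM reported by the node
    reserved_host_memory_mb = 512        # nova's default reservation
    instances_ram_mb = 2 * 512           # two running 512 MB instances

    free_ram_mb = memory_mb - reserved_host_memory_mb - instances_ram_mb
    print(free_ram_mb)                   # 336
    print(free_ram_mb <= 512)            # True -> MigrationPreCheckError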

3. Yet booting an instance on 'hchenos2' succeeds:

[root@hchenos ~]# nova boot --image cirros-0.3.0-x86_64 --flavor 1 --availability-zone nova:hchenos2 xhu

[root@hchenos ~]# nova list
+--------------------------------------+------+--------+----------+
| ID                                   | Name | Status | Networks |
+--------------------------------------+------+--------+----------+
| a34f9b88-1e07-4798-af46-ca3b3dbaceda | vm1  | ACTIVE |          |
| f6aaeff9-2220-4693-8e5a-710f4c52b774 | vm2  | ACTIVE |          |
| bbee57a2-81cd-4933-a943-1c2272f7f550 | vm4  | ACTIVE |          |
| 74fe26ec-919c-4fa7-890f-f59abe09ef4f | vm5  | ACTIVE |          |
| 364d1a01-67ed-4966-bbfd-d21b6bc3067c | xhu  | ACTIVE |          | >>>> is active
+--------------------------------------+------+--------+----------+
[root@hchenos ~]#

mysql> select hypervisor_hostname,vcpus,vcpus_used,running_vms,memory_mb,memory_mb_used,free_ram_mb,deleted from compute_nodes where deleted=0;
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
| hypervisor_hostname              | vcpus | vcpus_used | running_vms | memory_mb | memory_mb_used | free_ram_mb | deleted |
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
| hchenos1.eng.platformlab.ibm.com |     2 |          2 |           2 |      1872 |           1536 |         336 |       0 |
| hchenos2.eng.platformlab.ibm.com |     2 |          3 |           3 |      1872 |           2048 |        -176 |       0 |
+----------------------------------+-------+------------+-------------+-----------+----------------+-------------+---------+
2 rows in set (0.00 sec)

mysql>

So I'm very confused by the above test results: why does booting an instance on 'hchenos2' succeed, while live migrating an instance to this host fails due to "not enough memory"?

After carefully going through the Nova source code (live_migrate.py: execute()), I think the following causes this issue:

1) The function '_check_destination_has_enough_memory' doesn't consider the RAM allocation ratio (default value 1.5) when calculating the host's free memory ('free_ram_mb'), which is inconsistent with the 'RamFilter' memory check used when booting an instance.

I think the free memory of host 'hchenos2' should be:

free_ram_mb = memory_mb (1872) * ram_allocation_ratio (1.5) - memory_mb_used (1536) = 1272
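
For comparison, a simplified sketch of the scheduler-side check that RamFilter performs (not the verbatim filter code):

    # Scheduler-side check: total RAM is scaled by ram_allocation_ratio
    # before subtracting what instances already consume.
    memory_mb = 1872
    memory_mb_used = 1536
    ram_allocation_ratio = 1.5
    requested_ram = 512

    memory_mb_limit = memory_mb * ram_allocation_ratio   # 2808
    usable_ram = memory_mb_limit - memory_mb_used        # 1272
    print(usable_ram >= requested_ram)                   # True -> host passes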

2) Why does the pre-check for the live migration target host verify only memory? Why not check vCPUs as well?

live_migrate.py: execute

        self._check_instance_is_running()
        self._check_host_is_up(self.source)

        if not self.destination:
            self.destination = self._find_destination()
        else:
            self._check_requested_destination() >>>>

    def _check_requested_destination(self):
        self._check_destination_is_not_source()
        self._check_host_is_up(self.destination)
        self._check_destination_has_enough_memory() >>>> Only checks memory; why not check vCPUs as well?
        self._check_compatible_with_source_hypervisor(self.destination)
        self._call_livem_checks_on_host(self.destination)
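
A minimal sketch of what an oversubscription-aware pre-check could look like (a hypothetical fix, not the actual Nova code; the function name and dict shape are assumptions for illustration):

    # Hypothetical variant: apply ram_allocation_ratio the way RamFilter does,
    # instead of comparing against the raw free_ram_mb value.
    def check_destination_has_enough_memory(compute, mem_inst,
                                            ram_allocation_ratio=1.5):
        """compute: dict with 'memory_mb' and 'memory_mb_used' (assumed shape)."""
        limit_mb = compute['memory_mb'] * ram_allocation_ratio
        avail = limit_mb - compute['memory_mb_used']
        if not mem_inst or avail <= mem_inst:
            raise RuntimeError('Lack of memory(host:%s <= instance:%s)'
                               % (avail, mem_inst))

    # hchenos2 would now pass: 1872 * 1.5 - 1536 = 1272 > 512
    check_destination_has_enough_memory(
        {'memory_mb': 1872, 'memory_mb_used': 1536}, mem_inst=512)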

3) The VM power state needs to be considered as well. For example, a powered-off instance no longer consumes compute node resources on the KVM platform (unlike IBM PowerVM), but in resource_tracker.py:_update_usage_from_instances() only the instance's 'deleted' flag is taken into account when calculating resource usage.
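
A toy illustration of that distinction (the real _update_usage_from_instances() tracks much more state; the field names here are assumptions):

    # Today's accounting counts every non-deleted instance; a power-state-aware
    # variant would also skip stopped instances on hypervisors like KVM.
    def used_memory_mb(instances, count_stopped=True):
        return sum(inst['memory_mb'] for inst in instances
                   if not inst['deleted']
                   and (count_stopped or inst['power_state'] == 'running'))

    instances = [
        {'memory_mb': 512, 'deleted': False, 'power_state': 'running'},
        {'memory_mb': 512, 'deleted': False, 'power_state': 'shutdown'},
    ]
    print(used_memory_mb(instances))                       # 1024 (current logic)
    print(used_memory_mb(instances, count_stopped=False))  # 512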

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Prioritized as High to align with the upstream bug, although based on our triage criteria this looks more like a Medium:
https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Confirm_and_triage_bugs

tags: added: customer-found
no longer affects: mos/7.0.x
no longer affects: fuel
no longer affects: fuel/5.1.x
no longer affects: fuel/6.0.x
no longer affects: fuel/6.1.x
tags: added: nova
Revision history for this message
Timofey Durakov (tdurakov) wrote :

There is a difference in the logic of the RAM checks during instance boot and live migration with a specified destination host.
In the first case the scheduler makes the checks and selects an appropriate host; in the second case the nova conductor checks the destination host by itself.
The scheduler uses ram_allocation_ratio from nova.conf. This difference should be eliminated in Liberty. Tested in MOS 6.1: it is not reproduced on a default deployment, as ram_allocation_ratio is set to 1.0 there (1.5 is the upstream default).

Revision history for this message
Timofey Durakov (tdurakov) wrote :

If the customer wants a patch right now, please run an escalation, so that I can provide a quick patch.

tags: added: release-notes
Revision history for this message
Miroslav Anashkin (manashkin) wrote :

OK, sent the escalation mail.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-5.1/2014.1.1)

Fix proposed to branch: openstack-ci/fuel-5.1/2014.1.1
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/6905

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-5.1-updates/2014.1.1)

Fix proposed to branch: openstack-ci/fuel-5.1-updates/2014.1.1
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/6910

tags: added: release-notes-done
removed: release-notes
Revision history for this message
Timofey Durakov (tdurakov) wrote :

There is upstream activity to refactor the scheduler/conductor, so to keep the MOS code aligned with upstream this issue will be fixed in the 8.0 release.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

To clarify Timofey's comment a bit:

the proper fix would require a major refactoring of Nova internals, which is currently ongoing upstream. We could possibly add a workaround for this (Timofey prepared one earlier; there's a link to a Gerrit review above), but that would introduce a divergence from upstream Nova code, which is something we'd really like to avoid.

We won't merge this fix into 7.0, but will wait till it lands upstream first. Still, if someone needs it, it's available in the form of a custom patch.

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I understand this is marked as customer-found, but it's more of an enhancement request, which we can't implement without introducing a divergence from upstream.

There is a spec for Mitaka, which should address the upstream bug.

tags: added: enhacement
tags: added: enhancement
removed: enhacement
Changed in mos:
milestone: 6.1 → 8.0
importance: High → Wishlist
status: Triaged → Won't Fix
tags: added: wontfix-feature
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (openstack-ci/fuel-6.1/2014.2)

Fix proposed to branch: openstack-ci/fuel-6.1/2014.2
Change author: Timofey Durakov <email address hidden>
Review: https://review.fuel-infra.org/32177
