Nova compute unintentionally stops monitoring live migration

Bug #1745073 reported by Yuki Nishiwaki
Affects: OpenStack Compute (nova) | Status: Invalid | Importance: Wishlist | Assigned to: Unassigned

Bug Description

Description
===========

There are cases where nova-compute unintentionally stops monitoring a live migration even though the live migration operation thread (_live_migration_operation) is still running.
As a result, nova-compute reports "migration succeeded" to nova-conductor, and a nova-compute periodic task then deletes all instance-related data under /var/lib/nova/instances/<instance-id>, because from nova's point of view the live migration succeeded.
This can break the live migration itself, and it also misleads the operator about the actual status of the live migration operation.

"So nova-compute should, at a minimum, keep monitoring the live migration for as long as the _live_migration_operation thread is running."

The above case does not normally happen: as long as libvirtd correctly maintains the domain job information and correctly cleans up after the job completes, it does not matter that nova-compute never checks whether the live migration operation thread has finished.
But if libvirtd fails to maintain the domain job information correctly, or something goes wrong during the cleanup phase, nova-compute can mistake a live migration for successful while it is obviously still in progress, because the _live_migration_operation thread is still running.

One could argue that this is purely a libvirtd matter and nova does not have to care about it.
But I think it is better to implement a safer approach where we can. I actually hit this situation with libvirtd 3.2.0, and it took some time to notice from the logs and from the migration status in the database that the live migration operation thread never finished.

More specifically, I think finish_event should always be checked here, not only when the job type is VIR_DOMAIN_JOB_NONE:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6871
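The proposed rule can be sketched as follows. This is a minimal illustration, not nova's actual code: nova's monitor loop uses an eventlet Event (checked via .ready()), for which threading.Event with .is_set() stands in here, and the helper name migration_is_complete is hypothetical. The VIR_DOMAIN_JOB_* values match libvirt's virDomainJobType enum.

```python
import threading

# libvirt virDomainJobType constants
VIR_DOMAIN_JOB_NONE = 0       # no job active
VIR_DOMAIN_JOB_COMPLETED = 3  # job finished, info available

def migration_is_complete(job_type, finish_event):
    """Treat the migration as complete only when libvirt reports no
    active job AND the operation thread has signalled completion."""
    # Proposed behaviour: consult finish_event for *every* job type,
    # not only VIR_DOMAIN_JOB_NONE, so a hung operation thread keeps
    # the monitor alive even if libvirtd loses the job information.
    if not finish_event.is_set():
        return False
    return job_type in (VIR_DOMAIN_JOB_NONE, VIR_DOMAIN_JOB_COMPLETED)

# Buggy-libvirtd scenario: job info says "no job", but the operation
# thread is blocked forever in virDomainMigrateToURI3.
hung = threading.Event()  # never set
print(migration_is_complete(VIR_DOMAIN_JOB_NONE, hung))  # False: keep monitoring

done = threading.Event()
done.set()                # operation thread returned normally
print(migration_is_complete(VIR_DOMAIN_JOB_NONE, done))  # True: safe to report success
```

With this rule, the hung-thread case above keeps the monitor running instead of letting nova report the migration as completed.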

The libvirtd-side problem has already been fixed by https://www.redhat.com/archives/libvir-list/2017-April/msg00387.html, and the fix is included in 3.3.0, but I still think nova-compute should change its behaviour to guard against similar problems in the future.

Steps to reproduce
==================

* *Use libvirtd 3.2.0, which has a bug related to live migration*
   -> This version of libvirtd often (though not always) blocks forever in the virDomainMigrateToURI3 method, causing the _live_migration_operation thread to run forever

* Create test vm with swap disk

   ```
   $ openstack flavor create --ram 1024 --disk 20 --swap 4048 --vcpus 1 test
   +----------------------------+--------------------------------------+
   | Field                      | Value                                |
   +----------------------------+--------------------------------------+
   | disk                       | 20                                   |
   | id                         | d4e400a7-fd10-4c18-9dbc-f89f24e668af |
   | name                       | test                                 |
   | os-flavor-access:is_public | True                                 |
   | ram                        | 1024                                 |
   | rxtx_factor                | 1.0                                  |
   | swap                       | 4048                                 |
   | vcpus                      | 1                                    |
   +----------------------------+--------------------------------------+
   ```

   ```
   $ openstack server create --flavor test --image <something image> --nic net-id=<something network> test_server
   ```

* Block-live-migrate the test VM from HV1 to HV2

   ```
   $ nova live-migration --block-migrate test_server HV2
   ```

* Check migration status

   ```
   $ nova migration-list
   +----+-------------+-----------+----------------+--------------+-----------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   | Id | Source Node | Dest Node | Source Compute | Dest Compute | Dest Host | Status    | Instance UUID                        | Old Flavor | New Flavor | Created At                 | Updated At                 | Type           |
   +----+-------------+-----------+----------------+--------------+-----------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   | 1  | -           | -         | HV1            | HV2          | -         | completed | e484eb18-2794-4651-a357-d2070940ed32 | 6          | 6          | 2018-01-09T03:02:10.000000 | 2018-01-09T03:02:20.000000 | live-migration |
   +----+-------------+-----------+----------------+--------------+-----------+-----------+--------------------------------------+------------+------------+----------------------------+----------------------------+----------------+
   ```

* Check vm status

```
$ nova list
+--------------------------------------+-------------+--------+------------+-------------+--------------------+
| ID                                   | Name        | Status | Task State | Power State | Networks           |
+--------------------------------------+-------------+--------+------------+-------------+--------------------+
| a221c19b-4d4e-46d4-8888-10c14ca0fe27 | test_server | ACTIVE | -          | Paused      | net1=192.168.11.11 |
+--------------------------------------+-------------+--------+------------+-------------+--------------------+
```

Expected result
===============

The host running nova-api/nova-conductor
 * migration status should not change to "completed" until the _live_migration_operation thread finishes
 * VM status should not change to "ACTIVE" until the _live_migration_operation thread finishes

The host running nova-compute
 * should keep monitoring the live migration while the _live_migration_operation thread is running

Actual result
=============

The host running nova-api
* migration status was changed to "completed" in the nova database (checked with nova migration-list)
* VM status was changed to "ACTIVE" (checked with nova list)

The host running nova-compute
* stopped monitoring the live migration even though the _live_migration_operation thread was still running (shown in the log by "Live migration monitoring is all done")

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/
   * 13.1.0-1.el7 (Centos7)

2. Which hypervisor did you use?
   * libvirt + KVM
       * libvirt-daemon: 3.2.0-14.el7_4.7
       * qemu-kvm: 2.6.0-28.el7.10.1

3. Which storage type did you use?
   * local storage (just ephemeral disk)

tags: added: libvirt live-migration
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

While I understand your concern about eventually making sure that the live migration is done, guessing whether libvirtd has finished migrating is really difficult. Nova tries to look at how many memory bytes are left to migrate, but sometimes that can be wrong.

There are also a lot of config options for live migration that could help you, like enabling post-copy.

Leaving this bug as Wontfix/Wishlist, but please provide a blueprint if you really think Nova should be modified, and describe how.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :
Changed in nova:
status: New → Won't Fix
importance: Undecided → Wishlist
status: Won't Fix → Invalid