commit 7dd6a4a19311136c02d89cd2afd97236b0f4cc27
Author: Daniel P. Berrange <email address hidden>
Date: Thu Jan 29 14:33:32 2015 +0000
libvirt: proper monitoring of live migration progress
The current live migration code simply invokes migrateToURI
and waits for it to finish, or raise an exception. It considers
all exceptions to mean the live migration aborted and the VM is
still running on the source host. This is totally bogus, as there
are a number of reasons why an error could be raised from the
migrateToURI call. There are at least 5 different scenarios for
what the VM might be doing on source + dest host upon error.
The migration might even still be in progress after the
error has occurred.
A more reliable way to deal with this is to actively query
libvirt for the domain job status. This gives an indication
of whether the job is completed, failed or cancelled. Even
with that though, there is a need for a few heuristics to
distinguish some of the possible error scenarios.
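The polling approach described above could look roughly like the
following sketch. It is not the code from this change: classify_job
and wait_for_migration are hypothetical names, the job type numbers
mirror libvirt's virDomainJobType enum, and dom is assumed to be any
object offering jobInfo() (a list whose first element is the job type)
and isActive(), as the libvirt Python binding's virDomain does. The
handling of the "no job" case is one simplified heuristic chosen for
illustration; a real driver needs more cases and must also cope with
libvirt errors raised by these calls.

```python
import time

# Values mirror libvirt's virDomainJobType enum.
JOB_NONE, JOB_BOUNDED, JOB_UNBOUNDED = 0, 1, 2
JOB_COMPLETED, JOB_FAILED, JOB_CANCELLED = 3, 4, 5

_DIRECT = {
    JOB_BOUNDED: "running",
    JOB_UNBOUNDED: "running",
    JOB_COMPLETED: "completed",
    JOB_FAILED: "failed",
    JOB_CANCELLED: "cancelled",
}


def classify_job(job_type, migrate_call_returned, domain_active):
    """Map a job type onto a coarse migration state (illustrative only)."""
    if job_type in _DIRECT:
        return _DIRECT[job_type]
    # "No job" is ambiguous, so fall back to a heuristic (an assumption):
    # - the migrate call is still in flight: the job has not started yet
    # - the call returned and the guest still runs here: treat as failed
    # - the call returned and the guest is gone: treat as completed
    if not migrate_call_returned:
        return "starting"
    return "failed" if domain_active else "completed"


def wait_for_migration(dom, migrate_call_returned, interval=0.5,
                       sleep=time.sleep):
    """Poll the domain job until it reaches a final state."""
    while True:
        job_type = dom.jobInfo()[0]
        state = classify_job(job_type, migrate_call_returned(),
                             bool(dom.isActive()))
        if state in ("completed", "failed", "cancelled"):
            return state
        sleep(interval)
```

Here migrate_call_returned is a callable reporting whether the thread
that invoked migrateToURI has finished, so the monitor can tell "not
started yet" apart from "already over".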
This change to do active monitoring of the live migration process
also opens the door for being able to tune live migration on the
fly to adjust max downtime or bandwidth to improve chances of
getting convergence, or to automatically abort it after too much
time has elapsed instead of letting it carry on until the end of
the universe. This change merely records memory transfer progress
and leaves tuning improvements to a later date.
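As a rough illustration of the progress recording and the possible
later timeout handling, consider the sketch below. The helper names
and the max_seconds setting are hypothetical and not part of this
change; the inputs are assumed to come from the memTotal, memRemaining
and timeElapsed fields of libvirt's job info (timeElapsed converted
from milliseconds to seconds), and an actual abort would go through
the domain's abortJob() call.

```python
def memory_progress(mem_total, mem_remaining):
    """Return the percentage of guest memory already transferred."""
    if mem_total <= 0:
        # No data yet (job not started): report zero instead of dividing.
        return 0
    done = mem_total - mem_remaining
    return max(0, min(100, done * 100 // mem_total))


def should_abort(elapsed_seconds, max_seconds):
    """Decide whether a migration has run too long.

    A max_seconds of 0 (an assumed convention) disables the timeout.
    """
    return max_seconds > 0 and elapsed_seconds > max_seconds
```

Because memory can be dirtied again while the guest runs, the
remaining figure may go up between polls, so the percentage is not
guaranteed to increase monotonically.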
Reviewed: https://review.openstack.org/151664
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7dd6a4a19311136c02d89cd2afd97236b0f4cc27
Submitter: Jenkins
Branch: master
Closes-bug: #1414065
Change-Id: I6fcbfa31a79c7808c861bb3a84b56bd096882004