live-migrate left in migrating as domain not found
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | | Medium | John Garbutt |
Newton | | Medium | Shane Peters |
Ocata | | Medium | Matt Riedemann |
Ubuntu Cloud Archive | | Medium | Shane Peters |
Mitaka | | Medium | Unassigned |
Newton | | Medium | Unassigned |
Ocata | | Medium | Unassigned |
Pike | | Medium | Shane Peters |
nova (Ubuntu) | | Medium | Unassigned |
Xenial | | Medium | Unassigned |
Zesty | | Medium | Unassigned |
Artful | | Medium | Unassigned |
Bug Description
A live-migration stress test was working fine when suddenly a VM stopped migrating. It failed with this error:
ERROR nova.virt.
The full stack trace:
2017-02-05 02:33:41.787 19770 INFO nova.virt.
2017-02-05 02:33:45.795 19770 INFO nova.compute.
2017-02-05 02:33:45.870 19770 INFO nova.compute.
2017-02-05 02:33:45.883 19770 INFO nova.virt.
2017-02-05 02:33:45.884 19770 INFO nova.compute.
2017-02-05 02:33:46.156 19770 INFO os_vif [req-df91ac40-
2017-02-05 02:33:46.189 19770 INFO nova.virt.
2017-02-05 02:33:46.195 19770 INFO nova.virt.
2017-02-05 02:33:46.334 19770 ERROR nova.virt.
2017-02-05 02:33:46.363 19770 WARNING nova.virt.
2017-02-05 02:33:46.363 19770 ERROR nova.virt.
2017-02-05 02:33:46.364 19770 ERROR nova.compute.
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
status: | New → In Progress |
Related fix proposed to branch: master
Review: https:/
John Garbutt (johngarbutt) wrote : | #4 |
So there are really two problems here. The first is a race in undefining the domain, which leads to occasional live-migration failures when you run lots and lots of live-migrations.
On top of that, when errors happen in that part of the code, we don't set the instance to the error state, so the instance just stays in the migrating state.
Changed in nova: | |
importance: | Undecided → Medium |
tags: | added: ocata-rc-potential |
Matt Riedemann (mriedem) wrote : | #5 |
Is this really an ocata release candidate potential bug? It sounds pretty latent and something we can backport to stable/ocata after 15.0.0 is released. Or was this the result of a regression in Ocata itself?
tags: | added: ocata-backport-potential; removed: ocata-rc-potential |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit b706155888d7408
Author: John Garbutt <email address hidden>
Date: Tue Feb 7 18:55:26 2017 +0000
Stop _undefine_domain erroring if domain not found
During live-migration stress testing we are seeing the following log:
Error from libvirt during undefine. Code=42 Error=Domain not found
There appears to be a race while trying to undefine the domain, and
something else is possibly also doing some kind of clean up. While this
does paper over that race, it stops otherwise completed live-migrations
from failing. It also matches similar error handling done for when
deleting the domain.
The next part of the bug fix is to ensure if we have any similar
unexpected errors during this later phase of the live-migration we don't
leave the instance stuck in the migrating state, it should move to an
ERROR state. This is covered in a follow on patch.
Partial-Bug: #1662626
Change-Id: I23ed9819061bfa
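The error handling described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual nova patch: `FakeLibvirtError`, `FakeDomain` and `undefine_domain` are hypothetical stand-ins so the example runs without libvirt installed. The only detail taken from the report is the error code 42 ("Domain not found").

```python
# Illustrative sketch only: FakeLibvirtError and FakeDomain are hypothetical
# stand-ins, not the real libvirt bindings or nova's driver code.
VIR_ERR_NO_DOMAIN = 42  # the "Code=42" reported in the log message above


class FakeLibvirtError(Exception):
    """Stand-in for a libvirt error that carries a numeric error code."""

    def __init__(self, code, message):
        super().__init__(message)
        self._code = code

    def get_error_code(self):
        return self._code


class FakeDomain:
    """Stand-in domain; optionally raises a given error on undefine()."""

    def __init__(self, error=None):
        self._error = error
        self.undefined = False

    def undefine(self):
        if self._error is not None:
            raise self._error
        self.undefined = True


def undefine_domain(domain):
    """Undefine a domain, treating 'domain not found' as already done.

    Returns True if this call undefined the domain, False if it was
    already gone. Any other error is re-raised unchanged.
    """
    try:
        domain.undefine()
        return True
    except FakeLibvirtError as exc:
        if exc.get_error_code() == VIR_ERR_NO_DOMAIN:
            # Something else removed the domain first; the end state is
            # the one we wanted, so do not fail the migration over it.
            return False
        raise
```

The design point is that only the one expected error code is swallowed; every other failure still propagates, which is why the follow-on patch is needed to handle those cases.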
OpenStack Infra (hudson-openstack) wrote : | #7 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit b56f8fc2d1392f4
Author: John Garbutt <email address hidden>
Date: Tue Feb 7 19:12:50 2017 +0000
Stop failed live-migrates getting stuck migrating
When there are failures in driver.cleanup, we are seeing live-migrations
that get stuck in the live-migrating state. While there has been a patch
to stop the cause listed in the bug this closes, there are other
failures (such as a token timeout when talking to cinder or neutron)
that could trigger this same failure mode.
When we hit an error this late in live-migration, it should be a very
rare event, so it's best to just put the instance and migration into an
error state, and help alert both the operator and API user to the
failure that has occurred.
Closes-Bug: #1662626
Change-Id: Idfdce9e7dd8106
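The behaviour this commit message describes might look something like the sketch below. It is an assumption-laden illustration rather than the real nova code: the `Instance` and `Migration` classes, their attribute names and values, and `finish_live_migration` are all hypothetical, and `cleanup` stands for whatever late-stage step (such as `driver.cleanup`) can fail.

```python
# Illustrative sketch only: these classes and field values are hypothetical
# stand-ins and do not reproduce nova's object model.
class Instance:
    def __init__(self):
        self.vm_state = "active"
        self.task_state = "migrating"


class Migration:
    def __init__(self):
        self.status = "running"


def finish_live_migration(instance, migration, cleanup):
    """Run the late cleanup step; on any failure, mark both records as error.

    The exception is re-raised so the failure is still logged and reported,
    but the instance is no longer left stuck in the migrating state.
    """
    try:
        cleanup()
    except Exception:
        instance.vm_state = "error"
        instance.task_state = None
        migration.status = "error"
        raise
    # Normal path: clear the task state and mark the migration finished.
    instance.task_state = None
    migration.status = "completed"
```

Catching the broad `Exception` is deliberate in this sketch, since the commit message notes that unrelated failures (for example a token timeout when talking to cinder or neutron) can trigger the same stuck state.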
Changed in nova: | |
status: | In Progress → Fix Released |
This issue was fixed in the openstack/nova 16.0.0.0b1 development milestone.
Fix proposed to branch: stable/ocata
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/ocata
commit 012fa9353ff18d3
Author: John Garbutt <email address hidden>
Date: Tue Feb 7 19:12:50 2017 +0000
Stop failed live-migrates getting stuck migrating
When there are failures in driver.cleanup, we are seeing live-migrations
that get stuck in the live-migrating state. While there has been a patch
to stop the cause listed in the bug this closes, there are other
failures (such as a token timeout when talking to cinder or neutron)
that could trigger this same failure mode.
When we hit an error this late in live-migration, it should be a very
rare event, so it's best to just put the instance and migration into an
error state, and help alert both the operator and API user to the
failure that has occurred.
Closes-Bug: #1662626
Change-Id: Idfdce9e7dd8106
(cherry picked from commit b56f8fc2d1392f4
This issue was fixed in the openstack/nova 15.0.4 release.
Fix proposed to branch: stable/newton
Review: https:/
Fix proposed to branch: stable/mitaka
Review: https:/
Change abandoned by Joshua Hesketh (<email address hidden>) on branch: stable/mitaka
Review: https:/
Reason: This branch (stable/mitaka) is at End Of Life
tags: | added: libvirt; removed: ocata-backport-potential |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/newton
commit 017e853b950ddc1
Author: John Garbutt <email address hidden>
Date: Tue Feb 7 19:12:50 2017 +0000
Stop failed live-migrates getting stuck migrating
When there are failures in driver.cleanup, we are seeing live-migrations
that get stuck in the live-migrating state. While there has been a patch
to stop the cause listed in the bug this closes, there are other
failures (such as a token timeout when talking to cinder or neutron)
that could trigger this same failure mode.
When we hit an error this late in live-migration, it should be a very
rare event, so it's best to just put the instance and migration into an
error state, and help alert both the operator and API user to the
failure that has occurred.
For backport into Newton, 'migrate_
in the unit test (nova/tests/
Closes-Bug: #1662626
Change-Id: Idfdce9e7dd8106
(cherry picked from commit b56f8fc2d1392f4
(cherry picked from commit 012fa9353ff18d3
This issue was fixed in the openstack/nova 14.0.8 release.
Fix proposed to branch: stable/ocata
Review: https:/
Fix proposed to branch: stable/newton
Review: https:/
Changed in cloud-archive: | |
importance: | Undecided → Medium |
assignee: | nobody → Shane Peters (shaner) |
status: | New → Confirmed |
Change abandoned by Liping Mao (<email address hidden>) on branch: stable/ocata
Review: https:/
Change abandoned by Liping Mao (<email address hidden>) on branch: stable/newton
Review: https:/
Changed in cloud-archive: | |
status: | Confirmed → Fix Released |
Changed in nova (Ubuntu): | |
status: | New → Fix Released |
importance: | Undecided → Medium |
Changed in nova (Ubuntu Zesty): | |
status: | New → Fix Released |
Changed in nova (Ubuntu Xenial): | |
status: | New → Triaged |
importance: | Undecided → Medium |
Changed in nova (Ubuntu Zesty): | |
importance: | Undecided → Medium |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit a0525f650af1fa7
Author: John Garbutt <email address hidden>
Date: Tue Feb 7 18:55:26 2017 +0000
Stop _undefine_domain erroring if domain not found
During live-migration stress testing we are seeing the following log:
Error from libvirt during undefine. Code=42 Error=Domain not found
There appears to be a race while trying to undefine the domain, and
something else is possibly also doing some kind of clean up. While this
does paper over that race, it stops otherwise completed live-migrations
from failing. It also matches similar error handling done for when
deleting the domain.
The next part of the bug fix is to ensure if we have any similar
unexpected errors during this later phase of the live-migration we don't
leave the instance stuck in the migrating state, it should move to an
ERROR state. This is covered in a follow on patch.
Partial-Bug: #1662626
Change-Id: I23ed9819061bfa
(cherry picked from commit b706155888d7408
tags: | added: in-stable-ocata |
Still digging into the details of this bug. Checking whether we keep seeing the same thing repeatedly.