2019-03-25 13:56:09 |
Rodrigo Barbieri |
bug |
|
|
added bug |
2019-03-25 13:59:42 |
Matt Riedemann |
nova: status |
New |
Triaged |
|
2019-03-25 13:59:44 |
Matt Riedemann |
nova: importance |
Undecided |
Medium |
|
2019-03-25 14:00:28 |
Matt Riedemann |
tags |
|
compute placement resize |
|
2019-03-25 14:22:56 |
Matt Riedemann |
nova: assignee |
|
Matt Riedemann (mriedem) |
|
2019-03-25 15:55:16 |
Matt Riedemann |
nominated for series |
|
nova/rocky |
|
2019-03-25 15:55:16 |
Matt Riedemann |
bug task added |
|
nova/rocky |
|
2019-03-25 15:55:16 |
Matt Riedemann |
nominated for series |
|
nova/pike |
|
2019-03-25 15:55:16 |
Matt Riedemann |
bug task added |
|
nova/pike |
|
2019-03-25 15:55:16 |
Matt Riedemann |
nominated for series |
|
nova/stein |
|
2019-03-25 15:55:16 |
Matt Riedemann |
bug task added |
|
nova/stein |
|
2019-03-25 15:55:16 |
Matt Riedemann |
nominated for series |
|
nova/queens |
|
2019-03-25 15:55:16 |
Matt Riedemann |
bug task added |
|
nova/queens |
|
2019-03-25 15:55:24 |
Matt Riedemann |
nova/pike: status |
New |
Triaged |
|
2019-03-25 15:55:28 |
Matt Riedemann |
nova/rocky: status |
New |
Triaged |
|
2019-03-25 15:55:31 |
Matt Riedemann |
nova/stein: status |
New |
Triaged |
|
2019-03-25 15:55:47 |
Matt Riedemann |
nova/rocky: importance |
Undecided |
Medium |
|
2019-03-25 15:55:53 |
Matt Riedemann |
nova/pike: importance |
Undecided |
Medium |
|
2019-03-25 15:56:36 |
Matt Riedemann |
nova/queens: status |
New |
Triaged |
|
2019-03-25 15:56:38 |
Matt Riedemann |
nova/queens: importance |
Undecided |
Medium |
|
2019-03-25 15:56:43 |
Matt Riedemann |
nova/stein: importance |
Undecided |
Medium |
|
2019-03-25 17:20:33 |
OpenStack Infra |
nova: status |
Triaged |
In Progress |
|
2019-03-28 19:35:05 |
OpenStack Infra |
nova: assignee |
Matt Riedemann (mriedem) |
Eric Fried (efried) |
|
2019-03-29 13:29:25 |
Matt Riedemann |
nova: assignee |
Eric Fried (efried) |
Matt Riedemann (mriedem) |
|
2019-04-02 18:33:19 |
OpenStack Infra |
nova/stein: status |
Triaged |
In Progress |
|
2019-04-02 18:33:19 |
OpenStack Infra |
nova/stein: assignee |
|
Matt Riedemann (mriedem) |
|
2019-04-05 01:02:20 |
OpenStack Infra |
nova: status |
In Progress |
Fix Released |
|
2019-04-10 07:39:22 |
David Negreira |
bug |
|
|
added subscriber David Negreira |
2019-04-12 18:58:12 |
OpenStack Infra |
nova/rocky: status |
Triaged |
In Progress |
|
2019-04-12 18:58:12 |
OpenStack Infra |
nova/rocky: assignee |
|
Matt Riedemann (mriedem) |
|
2019-04-12 20:56:26 |
OpenStack Infra |
nova/queens: status |
Triaged |
In Progress |
|
2019-04-12 20:56:26 |
OpenStack Infra |
nova/queens: assignee |
|
Matt Riedemann (mriedem) |
|
2019-04-12 23:20:38 |
OpenStack Infra |
tags |
compute placement resize |
compute in-stable-stein placement resize |
|
2019-04-12 23:20:48 |
OpenStack Infra |
nova/stein: status |
In Progress |
Fix Committed |
|
2019-04-16 15:17:20 |
OpenStack Infra |
tags |
compute in-stable-stein placement resize |
compute in-stable-rocky in-stable-stein placement resize |
|
2019-04-16 15:37:05 |
OpenStack Infra |
nova/rocky: status |
In Progress |
Fix Committed |
|
2019-05-13 21:30:38 |
Matt Riedemann |
nova/rocky: status |
Fix Committed |
In Progress |
|
2019-05-15 00:13:40 |
s10 |
bug |
|
|
added subscriber s10 |
2019-05-24 10:24:10 |
OpenStack Infra |
tags |
compute in-stable-rocky in-stable-stein placement resize |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize |
|
2019-05-24 19:11:10 |
OpenStack Infra |
nova/rocky: status |
In Progress |
Fix Committed |
|
2019-05-29 00:51:07 |
OpenStack Infra |
nova/queens: assignee |
Matt Riedemann (mriedem) |
Tony Breeds (o-tony) |
|
2019-05-30 10:13:24 |
OpenStack Infra |
nova/queens: status |
In Progress |
Fix Committed |
|
2019-06-07 17:03:08 |
Rodrigo Barbieri |
description |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver. |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
The patches have been cherry-picked from upstream, which helps reduce the regression potential of these fixes.
[Other Info]
None |
|
2019-06-07 17:11:00 |
Rodrigo Barbieri |
attachment added |
|
lp_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269378/+files/lp_1821594_bionic_queens.debdiff |
|
2019-06-07 17:11:26 |
Rodrigo Barbieri |
attachment added |
|
lp_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269379/+files/lp_1821594_cosmic_rocky.debdiff |
|
2019-06-07 17:16:11 |
Rodrigo Barbieri |
description |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
The patches have been cherry-picked from upstream, which helps reduce the regression potential of these fixes.
[Other Info]
None |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
A new functional test, https://review.opendev.org/#/c/657870/, validated the fix and was backported all the way to queens. The backported fix caused no functional tests to fail.
[Other Info]
None |
|
2019-06-10 13:42:13 |
Rodrigo Barbieri |
attachment removed |
lp_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269379/+files/lp_1821594_cosmic_rocky.debdiff |
|
|
2019-06-10 13:42:22 |
Rodrigo Barbieri |
attachment removed |
lp_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269378/+files/lp_1821594_bionic_queens.debdiff |
|
|
2019-06-10 13:42:44 |
Rodrigo Barbieri |
attachment added |
|
bug_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269929/+files/bug_1821594_cosmic_rocky.debdiff |
|
2019-06-10 13:43:07 |
Rodrigo Barbieri |
attachment added |
|
bug_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269930/+files/bug_1821594_bionic_queens.debdiff |
|
2019-06-11 11:50:17 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts |
|
2019-06-12 16:14:09 |
Edward Hope-Morley |
bug task added |
|
cloud-archive |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/queens |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/queens |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/stein |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/stein |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/train |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/train |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
nominated for series |
|
cloud-archive/rocky |
|
2019-06-12 16:14:31 |
Edward Hope-Morley |
bug task added |
|
cloud-archive/rocky |
|
2019-06-12 16:15:21 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu) |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Bionic |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Bionic) |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Eoan |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Eoan) |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Cosmic |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Cosmic) |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
nominated for series |
|
Ubuntu Disco |
|
2019-06-12 16:15:34 |
Edward Hope-Morley |
bug task added |
|
nova (Ubuntu Disco) |
|
2019-06-12 16:15:58 |
Edward Hope-Morley |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed |
|
2019-06-12 16:17:07 |
Rodrigo Barbieri |
description |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
A new functional test, https://review.opendev.org/#/c/657870/, validated the fix and was backported all the way to queens. The backported fix caused no functional tests to fail.
[Other Info]
None |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
=======================================================================
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
A new functional test, https://review.opendev.org/#/c/657870/, validated the fix and was backported all the way to queens. The backported fix caused no functional tests to fail.
[Other Info]
None |
|
2019-06-12 16:19:15 |
Edward Hope-Morley |
summary |
Error in confirm_migration leaves stale allocations and 'confirming' migration state |
[SRU] Error in confirm_migration leaves stale allocations and 'confirming' migration state |
|
2019-06-12 16:20:47 |
Edward Hope-Morley |
nova (Ubuntu Eoan): status |
New |
Fix Committed |
|
2019-06-12 16:20:58 |
Edward Hope-Morley |
cloud-archive/train: status |
New |
Fix Committed |
|
2019-06-17 18:51:39 |
Corey Bryant |
cloud-archive/stein: importance |
Undecided |
Medium |
|
2019-06-17 18:51:39 |
Corey Bryant |
cloud-archive/stein: status |
New |
Triaged |
|
2019-06-17 18:51:57 |
Corey Bryant |
cloud-archive/rocky: importance |
Undecided |
Medium |
|
2019-06-17 18:51:57 |
Corey Bryant |
cloud-archive/rocky: status |
New |
Triaged |
|
2019-06-17 18:52:12 |
Corey Bryant |
cloud-archive/queens: importance |
Undecided |
Medium |
|
2019-06-17 18:52:12 |
Corey Bryant |
cloud-archive/queens: status |
New |
Triaged |
|
2019-06-17 18:52:37 |
Corey Bryant |
nova (Ubuntu Disco): importance |
Undecided |
Medium |
|
2019-06-17 18:52:37 |
Corey Bryant |
nova (Ubuntu Disco): status |
New |
Triaged |
|
2019-06-17 18:52:52 |
Corey Bryant |
nova (Ubuntu Cosmic): importance |
Undecided |
Medium |
|
2019-06-17 18:52:52 |
Corey Bryant |
nova (Ubuntu Cosmic): status |
New |
Triaged |
|
2019-06-17 18:53:06 |
Corey Bryant |
nova (Ubuntu Bionic): importance |
Undecided |
Medium |
|
2019-06-17 18:53:06 |
Corey Bryant |
nova (Ubuntu Bionic): status |
New |
Triaged |
|
2019-06-17 18:55:28 |
Corey Bryant |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2019-06-18 20:37:42 |
Brian Murray |
nova (Ubuntu Cosmic): status |
Triaged |
Fix Committed |
|
2019-06-18 20:37:48 |
Brian Murray |
bug |
|
|
added subscriber SRU Verification |
2019-06-18 20:37:52 |
Brian Murray |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic |
|
2019-06-26 13:08:54 |
Corey Bryant |
cloud-archive/rocky: status |
Triaged |
Fix Committed |
|
2019-06-26 13:08:56 |
Corey Bryant |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic verification-rocky-needed |
|
2019-06-26 18:59:46 |
Rodrigo Barbieri |
description |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
=======================================================================
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
[Test Case]
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service.
Then, invoke a cold migration through "openstack server migrate {id}", wait for the VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in the "confirming" state and the stale allocations for the source host are still present in the "allocations" database table.
[Regression Potential]
A new functional test, https://review.opendev.org/#/c/657870/, validated the fix and was backported all the way to queens. The backported fix caused no functional tests to fail.
[Other Info]
None |
Description:
When performing a cold migration, if an exception is raised by the driver during confirm_migration (which runs on the source node), the migration record is stuck in the "confirming" state and the allocations against the source node are not removed.
The instance is fine at the destination at this stage, but the source host has allocations that cannot be cleaned up without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node fills up with these allocations, which prevent new instances from being created on, or migrated to, this node.
When confirm_migration fails at this stage, the migrating instance can be recovered through a hard reboot or a reset-state to active.
Steps to reproduce:
Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration when running the libvirt driver. However, the stale-allocation and migration-status problem can easily be reproduced by raising an exception in the libvirt driver's confirm_migration method, and it would affect any driver.
Expected results:
Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug.
Actual results:
Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state.
Environment:
I verified this bug on the pike, queens and stein branches, running the libvirt KVM driver.
=======================================================================
[Impact]
If users attempting to perform cold migrations face any issue while the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and the migration record stuck in the "confirming" state. The stale allocations are not cleaned up by nova and consume the user's quota indefinitely.
This bug was confirmed from the pike to stein releases, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from recurring.
The fix prevents new stale allocations from being left over by cleaning them up immediately when the failure occurs. At the moment, users affected by this bug have to clean up their existing stale allocations manually.
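As a rough illustration of the pattern (a simplified sketch, not the literal upstream patch; the method and attribute names here are assumptions), the fix wraps the driver call and cleans up on failure:

def _confirm_resize(self, context, instance, migration):
    try:
        network_info = self.network_api.get_instance_nw_info(context, instance)
        self.driver.confirm_migration(context, migration, instance, network_info)
    except Exception:
        # Move the migration record out of "confirming" so it is not stuck.
        migration.status = 'error'
        migration.save()
        # Drop the source-node allocations held by the migration consumer
        # (in placement, the consumer uuid is migration.uuid) so they stop
        # consuming quota.
        self.reportclient.delete_allocation_for_instance(context, migration.uuid)
        raise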
[Test Case]
1. Reproducing the bug
1a. Inject failure
The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised.
For example, with the libvirt driver, add the line:
raise Exception("TEST")
in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012
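For reference, the injected failure sits at the top of that method, roughly like this (the signature is abbreviated and illustrative, not copied from the linked file):

def confirm_migration(self, context, migration, instance, network_info):
    # Injected failure to reproduce the bug -- remove after testing.
    raise Exception("TEST")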
1b. Restart nova-compute service: systemctl restart nova-compute
1c. Create a VM
1d. Then, invoke a cold migration: "openstack server migrate {id}"
1e. Wait for instance status: VERIFY_RESIZE
1f. Invoke "openstack server resize {id} --confirm"
1g. Wait for instance status: ERROR
1h. Check migration stuck in "confirming" status: nova migration-list
1i. Check allocations; you should see 2 allocations: one with the VM ID, the other with the migration UUID
export ENDPOINT=<placement_endpoint>
export TOKEN=`openstack token issue| grep ' id '| awk '{print $4}'`
for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done
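If curl and jq are inconvenient, the same check can be scripted in Python (a sketch; ENDPOINT and TOKEN are placeholders you must fill in as above):

import requests

ENDPOINT = "<placement_endpoint>"  # same value as the ENDPOINT variable above
TOKEN = "<token>"                  # from: openstack token issue
HEADERS = {
    "Accept": "application/json",
    "X-Auth-Token": TOKEN,
    "OpenStack-API-Version": "placement 1.17",
}

# List every resource provider, then print the allocation consumers on each.
providers = requests.get(ENDPOINT + "/resource_providers",
                         headers=HEADERS, verify=False).json()
for rp in providers["resource_providers"]:
    url = ENDPOINT + "/resource_providers/%s/allocations" % rp["uuid"]
    allocations = requests.get(url, headers=HEADERS, verify=False).json()["allocations"]
    # Each key is a consumer uuid; seeing both the instance uuid and the
    # migration uuid on the source node indicates the stale allocation.
    print(rp["uuid"], sorted(allocations))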
2. Cleanup
2a. Delete the VM
2b. Delete the stale allocation:
export ID=<migration_uuid>
curl -k -s -X DELETE $ENDPOINT/allocations/$ID -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17"
3. Install package that contains the fixed code
4. Confirm bug is fixed
4a. Repeat steps 1a through 1g
4b. Check migration with "error" status: nova migration-list
4c. Check allocations; you should see only 1 allocation, with the VM ID
for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done
5. Cleanup
5a. Delete the VM
[Regression Potential]
A new functional test, https://review.opendev.org/#/c/657870/, validated the fix and was backported all the way to queens. The backported fix caused no functional tests to fail.
[Other Info]
None |
|
2019-06-27 00:34:01 |
Rodrigo Barbieri |
attachment added |
|
cosmic-rocky-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5273645/+files/cosmic-rocky-validation.txt |
|
2019-06-27 00:34:32 |
Rodrigo Barbieri |
attachment added |
|
bionic-rocky-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5273646/+files/bionic-rocky-validation.txt |
|
2019-07-01 13:53:04 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic verification-rocky-needed |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-needed |
|
2019-07-01 15:17:36 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-needed |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-done |
|
2019-07-01 15:17:48 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-cosmic verification-rocky-done |
|
2019-07-02 16:37:04 |
Brian Murray |
removed subscriber Ubuntu Stable Release Updates Team |
|
|
|
2019-07-02 16:37:21 |
Launchpad Janitor |
nova (Ubuntu Cosmic): status |
Fix Committed |
Fix Released |
|
2019-07-03 12:53:46 |
James Page |
cloud-archive/rocky: status |
Fix Committed |
Fix Released |
|
2019-07-03 16:25:24 |
Brian Murray |
nova (Ubuntu Bionic): status |
Triaged |
Fix Committed |
|
2019-07-03 16:25:28 |
Brian Murray |
bug |
|
|
added subscriber Ubuntu Stable Release Updates Team |
2019-07-03 16:26:57 |
Brian Murray |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-cosmic verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-needed-bionic verification-rocky-done |
|
2019-07-05 15:16:00 |
Rodrigo Barbieri |
attachment added |
|
bionic-queens-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5275280/+files/bionic-queens-validation.txt |
|
2019-07-05 15:17:11 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-needed-bionic verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-rocky-done |
|
2019-07-08 09:29:04 |
Edward Hope-Morley |
cloud-archive/stein: status |
Triaged |
Fix Committed |
|
2019-07-08 09:29:21 |
Edward Hope-Morley |
nova (Ubuntu Disco): status |
Triaged |
Fix Committed |
|
2019-07-11 15:57:51 |
Corey Bryant |
cloud-archive/queens: status |
Triaged |
Fix Committed |
|
2019-07-11 15:57:55 |
Corey Bryant |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-needed verification-rocky-done |
|
2019-07-11 19:32:45 |
Rodrigo Barbieri |
attachment added |
|
xenial-queens-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5276539/+files/xenial-queens-validation.txt |
|
2019-07-11 19:33:22 |
Rodrigo Barbieri |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-needed verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done |
|
2019-07-15 13:44:57 |
Edward Hope-Morley |
nova/stein: status |
Fix Committed |
Fix Released |
|
2019-07-15 13:45:19 |
Edward Hope-Morley |
cloud-archive/stein: status |
Fix Committed |
Fix Released |
|
2019-07-15 13:45:30 |
Edward Hope-Morley |
nova (Ubuntu Disco): status |
Fix Committed |
Fix Released |
|
2019-08-06 17:39:36 |
Launchpad Janitor |
nova (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2019-08-29 13:01:13 |
Corey Bryant |
cloud-archive/queens: status |
Fix Committed |
Fix Released |
|
2019-08-29 13:02:30 |
Corey Bryant |
cloud-archive/train: status |
Fix Committed |
Fix Released |
|
2019-08-29 13:02:47 |
Corey Bryant |
nova (Ubuntu Eoan): status |
Fix Committed |
Fix Released |
|
2019-09-09 13:25:09 |
Edward Hope-Morley |
tags |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done |
compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-done verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done |
|
2019-10-09 06:24:31 |
Alvaro Uria |
bug |
|
|
added subscriber Canonical IS BootStack |