Activity log for bug #1821594

Date Who What changed Old value New value Message
2019-03-25 13:56:09 Rodrigo Barbieri bug added bug
2019-03-25 13:59:42 Matt Riedemann nova: status New Triaged
2019-03-25 13:59:44 Matt Riedemann nova: importance Undecided Medium
2019-03-25 14:00:28 Matt Riedemann tags compute placement resize
2019-03-25 14:22:56 Matt Riedemann nova: assignee Matt Riedemann (mriedem)
2019-03-25 15:55:16 Matt Riedemann nominated for series nova/rocky
2019-03-25 15:55:16 Matt Riedemann bug task added nova/rocky
2019-03-25 15:55:16 Matt Riedemann nominated for series nova/pike
2019-03-25 15:55:16 Matt Riedemann bug task added nova/pike
2019-03-25 15:55:16 Matt Riedemann nominated for series nova/stein
2019-03-25 15:55:16 Matt Riedemann bug task added nova/stein
2019-03-25 15:55:16 Matt Riedemann nominated for series nova/queens
2019-03-25 15:55:16 Matt Riedemann bug task added nova/queens
2019-03-25 15:55:24 Matt Riedemann nova/pike: status New Triaged
2019-03-25 15:55:28 Matt Riedemann nova/rocky: status New Triaged
2019-03-25 15:55:31 Matt Riedemann nova/stein: status New Triaged
2019-03-25 15:55:47 Matt Riedemann nova/rocky: importance Undecided Medium
2019-03-25 15:55:53 Matt Riedemann nova/pike: importance Undecided Medium
2019-03-25 15:56:36 Matt Riedemann nova/queens: status New Triaged
2019-03-25 15:56:38 Matt Riedemann nova/queens: importance Undecided Medium
2019-03-25 15:56:43 Matt Riedemann nova/stein: importance Undecided Medium
2019-03-25 17:20:33 OpenStack Infra nova: status Triaged In Progress
2019-03-28 19:35:05 OpenStack Infra nova: assignee Matt Riedemann (mriedem) Eric Fried (efried)
2019-03-29 13:29:25 Matt Riedemann nova: assignee Eric Fried (efried) Matt Riedemann (mriedem)
2019-04-02 18:33:19 OpenStack Infra nova/stein: status Triaged In Progress
2019-04-02 18:33:19 OpenStack Infra nova/stein: assignee Matt Riedemann (mriedem)
2019-04-05 01:02:20 OpenStack Infra nova: status In Progress Fix Released
2019-04-10 07:39:22 David Negreira bug added subscriber David Negreira
2019-04-12 18:58:12 OpenStack Infra nova/rocky: status Triaged In Progress
2019-04-12 18:58:12 OpenStack Infra nova/rocky: assignee Matt Riedemann (mriedem)
2019-04-12 20:56:26 OpenStack Infra nova/queens: status Triaged In Progress
2019-04-12 20:56:26 OpenStack Infra nova/queens: assignee Matt Riedemann (mriedem)
2019-04-12 23:20:38 OpenStack Infra tags compute placement resize compute in-stable-stein placement resize
2019-04-12 23:20:48 OpenStack Infra nova/stein: status In Progress Fix Committed
2019-04-16 15:17:20 OpenStack Infra tags compute in-stable-stein placement resize compute in-stable-rocky in-stable-stein placement resize
2019-04-16 15:37:05 OpenStack Infra nova/rocky: status In Progress Fix Committed
2019-05-13 21:30:38 Matt Riedemann nova/rocky: status Fix Committed In Progress
2019-05-15 00:13:40 s10 bug added subscriber s10
2019-05-24 10:24:10 OpenStack Infra tags compute in-stable-rocky in-stable-stein placement resize compute in-stable-queens in-stable-rocky in-stable-stein placement resize
2019-05-24 19:11:10 OpenStack Infra nova/rocky: status In Progress Fix Committed
2019-05-29 00:51:07 OpenStack Infra nova/queens: assignee Matt Riedemann (mriedem) Tony Breeds (o-tony)
2019-05-30 10:13:24 OpenStack Infra nova/queens: status In Progress Fix Committed
2019-06-07 17:03:08 Rodrigo Barbieri description Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state.
The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] The patches have been cherry-picked from upstream, which helps to reduce the regression potential of these fixes. [Other Info] None
2019-06-07 17:11:00 Rodrigo Barbieri attachment added lp_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269378/+files/lp_1821594_bionic_queens.debdiff
2019-06-07 17:11:26 Rodrigo Barbieri attachment added lp_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269379/+files/lp_1821594_cosmic_rocky.debdiff
2019-06-07 17:16:11 Rodrigo Barbieri description Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] The patches have been cherry-picked from upstream, which helps to reduce the regression potential of these fixes.
[Other Info] None Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. Backporting the fix caused no functional test to fail. [Other Info] None
2019-06-10 13:42:13 Rodrigo Barbieri attachment removed lp_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269379/+files/lp_1821594_cosmic_rocky.debdiff
2019-06-10 13:42:22 Rodrigo Barbieri attachment removed lp_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269378/+files/lp_1821594_bionic_queens.debdiff
2019-06-10 13:42:44 Rodrigo Barbieri attachment added bug_1821594_cosmic_rocky.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269929/+files/bug_1821594_cosmic_rocky.debdiff
2019-06-10 13:43:07 Rodrigo Barbieri attachment added bug_1821594_bionic_queens.debdiff https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5269930/+files/bug_1821594_bionic_queens.debdiff
2019-06-11 11:50:17 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts
2019-06-12 16:14:09 Edward Hope-Morley bug task added cloud-archive
2019-06-12 16:14:31 Edward Hope-Morley nominated for series cloud-archive/queens
2019-06-12 16:14:31 Edward Hope-Morley bug task added cloud-archive/queens
2019-06-12 16:14:31 Edward Hope-Morley nominated for series cloud-archive/stein
2019-06-12 16:14:31 Edward Hope-Morley bug task added cloud-archive/stein
2019-06-12 16:14:31 Edward Hope-Morley nominated for series cloud-archive/train
2019-06-12 16:14:31 Edward Hope-Morley bug task added cloud-archive/train
2019-06-12 16:14:31 Edward Hope-Morley nominated for series cloud-archive/rocky
2019-06-12 16:14:31 Edward Hope-Morley bug task added cloud-archive/rocky
2019-06-12 16:15:21 Edward Hope-Morley bug task added nova (Ubuntu)
2019-06-12 16:15:34 Edward Hope-Morley nominated for series Ubuntu Bionic
2019-06-12 16:15:34 Edward Hope-Morley bug task added nova (Ubuntu Bionic)
2019-06-12 16:15:34 Edward Hope-Morley nominated for series Ubuntu Eoan
2019-06-12 16:15:34 Edward Hope-Morley bug task added nova (Ubuntu Eoan)
2019-06-12 16:15:34 Edward Hope-Morley nominated for series Ubuntu Cosmic
2019-06-12 16:15:34 Edward Hope-Morley bug task added nova (Ubuntu Cosmic)
2019-06-12 16:15:34 Edward Hope-Morley nominated for series Ubuntu Disco
2019-06-12 16:15:34 Edward Hope-Morley bug task added nova (Ubuntu Disco)
2019-06-12 16:15:58 Edward Hope-Morley tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed
2019-06-12 16:17:07 Rodrigo Barbieri description Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. Backporting the fix caused no functional test to fail.
[Other Info] None Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. ======================================================================= [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. Backporting the fix caused no functional test to fail. [Other Info] None
2019-06-12 16:19:15 Edward Hope-Morley summary Error in confirm_migration leaves stale allocations and 'confirming' migration state [SRU] Error in confirm_migration leaves stale allocations and 'confirming' migration state
2019-06-12 16:20:47 Edward Hope-Morley nova (Ubuntu Eoan): status New Fix Committed
2019-06-12 16:20:58 Edward Hope-Morley cloud-archive/train: status New Fix Committed
2019-06-17 18:51:39 Corey Bryant cloud-archive/stein: importance Undecided Medium
2019-06-17 18:51:39 Corey Bryant cloud-archive/stein: status New Triaged
2019-06-17 18:51:57 Corey Bryant cloud-archive/rocky: importance Undecided Medium
2019-06-17 18:51:57 Corey Bryant cloud-archive/rocky: status New Triaged
2019-06-17 18:52:12 Corey Bryant cloud-archive/queens: importance Undecided Medium
2019-06-17 18:52:12 Corey Bryant cloud-archive/queens: status New Triaged
2019-06-17 18:52:37 Corey Bryant nova (Ubuntu Disco): importance Undecided Medium
2019-06-17 18:52:37 Corey Bryant nova (Ubuntu Disco): status New Triaged
2019-06-17 18:52:52 Corey Bryant nova (Ubuntu Cosmic): importance Undecided Medium
2019-06-17 18:52:52 Corey Bryant nova (Ubuntu Cosmic): status New Triaged
2019-06-17 18:53:06 Corey Bryant nova (Ubuntu Bionic): importance Undecided Medium
2019-06-17 18:53:06 Corey Bryant nova (Ubuntu Bionic): status New Triaged
2019-06-17 18:55:28 Corey Bryant bug added subscriber Ubuntu Stable Release Updates Team
2019-06-18 20:37:42 Brian Murray nova (Ubuntu Cosmic): status Triaged Fix Committed
2019-06-18 20:37:48 Brian Murray bug added subscriber SRU Verification
2019-06-18 20:37:52 Brian Murray tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic
2019-06-26 13:08:54 Corey Bryant cloud-archive/rocky: status Triaged Fix Committed
2019-06-26 13:08:56 Corey Bryant tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic verification-rocky-needed
2019-06-26 18:59:46 Rodrigo Barbieri description Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. ======================================================================= [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 and then restart the nova-compute service. Then, invoke a cold migration through "openstack server migrate {id}", wait for VERIFY_RESIZE status, and then invoke "openstack server resize {id} --confirm". The confirmation will fail asynchronously and the instance will be in ERROR status, while the migration database record is in "confirming" state and the stale allocations for the source host are still present in the "allocations" database table. [Regression Potential] New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. Backporting the fix caused no functional test to fail.
[Other Info] None Description: When performing a cold migration, if an exception is raised by the driver during confirm_migration (this runs in the source node), the migration record is stuck in "confirming" state and the allocations against the source node are not removed. The instance is fine at the destination in this stage, but the source host has allocations that are not possible to clean without going to the database or invoking the Placement API via curl. After several migration attempts that fail in the same spot, the source node is filled with these allocations that prevent new instances from being created or instances migrated to this node. When confirm_migration fails in this stage, the migrating instance can be saved through a hard reboot or a reset state to active. Steps to reproduce: Unfortunately, I don't have logs of the real root cause of the problem inside driver.confirm_migration running libvirt driver. However, the stale allocations and migration status problem can be easily reproduced by raising an exception in libvirt driver's confirm_migration method, and it would affect any driver. Expected results: Discussed this issue with efried and mriedem over #openstack-nova on March 25th, 2019. They confirmed that allocations not being cleared up is a bug. Actual results: Instance is fine at the destination after a reset-state. Source node has stale allocations that prevent new instances from being created/migrated to the source node. Migration record is stuck in "confirming" state. Environment: I verified this bug on the pike, queens and stein branches. Running libvirt KVM driver. ======================================================================= [Impact] If users attempting to perform cold migrations face any issues when the virt driver is running the "Confirm Migration" step, the failure leaves stale allocation records in the database and migration records in "confirming" state. The stale allocations are not cleaned up by nova, consuming the user's quota indefinitely. This bug was confirmed from pike to stein release, and a fix was implemented for queens, rocky and stein. It should be backported to those releases to prevent the issue from reoccurring. This fix prevents new stale allocations from being left over by cleaning them up immediately when the failures occur. At the moment, the users affected by this bug have to clean their previous stale allocations manually. [Test Case] 1. Reproducing the bug 1a. Inject failure The root cause for this problem may vary for each driver and environment, so to reproduce the bug, it is necessary first to inject a failure in the driver's confirm_migration method to cause an exception to be raised. An example when using libvirt is to add a line: raise Exception("TEST") in https://github.com/openstack/nova/blob/a57b990c6bffa4c7447081b86573972866c696d2/nova/virt/libvirt/driver.py#L9012 1b. Restart nova-compute service: systemctl restart nova-compute 1c. Create a VM 1d. Then, invoke a cold migration: "openstack server migrate {id}" 1e. Wait for instance status: VERIFY_RESIZE 1f. Invoke "openstack server resize {id} --confirm" 1g. Wait for instance status: ERROR 1h. Check migration stuck in "confirming" status: nova migration-list
1i. Check allocations, you should see 2 allocations, one with the VM ID, the other with the migration uuid export ENDPOINT=<placement_endpoint> export TOKEN=`openstack token issue| grep ' id '| awk '{print $4}'` for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done 2. Cleanup 2a. Delete the VM 2b. Delete the stale allocation: export ID=<migration_uuid> curl -k -s -X DELETE $ENDPOINT/allocations/$ID -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" 3. Install package that contains the fixed code 4. Confirm bug is fixed 4a. Repeat steps 1a through 1g 4b. Check migration with "error" status: nova migration-list 4c. Check allocations, you should see only 1 allocation with the VM ID for id in $(curl -k -s -X GET $ENDPOINT/resource_providers -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq -r .resource_providers[].uuid); do curl -k -s -X GET $ENDPOINT/resource_providers/$id/allocations -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" -H "OpenStack-API-Version: placement 1.17" | jq [.allocations]; done 5. Cleanup 5a. Delete the VM [Regression Potential] New functional test https://review.opendev.org/#/c/657870/ validated the fix and was backported all the way to Queens. Backporting the fix caused no functional test to fail. [Other Info] None
2019-06-27 00:34:01 Rodrigo Barbieri attachment added cosmic-rocky-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5273645/+files/cosmic-rocky-validation.txt
2019-06-27 00:34:32 Rodrigo Barbieri attachment added bionic-rocky-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5273646/+files/bionic-rocky-validation.txt
2019-07-01 13:53:04 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-needed verification-needed-cosmic verification-rocky-needed compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-needed
2019-07-01 15:17:36 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-needed compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-done
2019-07-01 15:17:48 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-cosmic verification-rocky-done
2019-07-02 16:37:04 Brian Murray removed subscriber Ubuntu Stable Release Updates Team
2019-07-02 16:37:21 Launchpad Janitor nova (Ubuntu Cosmic): status Fix Committed Fix Released
2019-07-03 12:53:46 James Page cloud-archive/rocky: status Fix Committed Fix Released
2019-07-03 16:25:24 Brian Murray nova (Ubuntu Bionic): status Triaged Fix Committed
2019-07-03 16:25:28 Brian Murray bug added subscriber Ubuntu Stable Release Updates Team
2019-07-03 16:26:57 Brian Murray tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-cosmic verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-needed-bionic verification-rocky-done
2019-07-05 15:16:00 Rodrigo Barbieri attachment added bionic-queens-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5275280/+files/bionic-queens-validation.txt
2019-07-05 15:17:11 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done-cosmic verification-needed verification-needed-bionic verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-rocky-done
2019-07-08 09:29:04 Edward Hope-Morley cloud-archive/stein: status Triaged Fix Committed
2019-07-08 09:29:21 Edward Hope-Morley nova (Ubuntu Disco): status Triaged Fix Committed
2019-07-11 15:57:51 Corey Bryant cloud-archive/queens: status Triaged Fix Committed
2019-07-11 15:57:55 Corey Bryant tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-needed verification-rocky-done
2019-07-11 19:32:45 Rodrigo Barbieri attachment added xenial-queens-validation.txt https://bugs.launchpad.net/nova/+bug/1821594/+attachment/5276539/+files/xenial-queens-validation.txt
2019-07-11 19:33:22 Rodrigo Barbieri tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-needed verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done
2019-07-15 13:44:57 Edward Hope-Morley nova/stein: status Fix Committed Fix Released
2019-07-15 13:45:19 Edward Hope-Morley cloud-archive/stein: status Fix Committed Fix Released
2019-07-15 13:45:30 Edward Hope-Morley nova (Ubuntu Disco): status Fix Committed Fix Released
2019-08-06 17:39:36 Launchpad Janitor nova (Ubuntu Bionic): status Fix Committed Fix Released
2019-08-29 13:01:13 Corey Bryant cloud-archive/queens: status Fix Committed Fix Released
2019-08-29 13:02:30 Corey Bryant cloud-archive/train: status Fix Committed Fix Released
2019-08-29 13:02:47 Corey Bryant nova (Ubuntu Eoan): status Fix Committed Fix Released
2019-09-09 13:25:09 Edward Hope-Morley tags compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-needed verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done compute in-stable-queens in-stable-rocky in-stable-stein placement resize sts sts-sru-done verification-done verification-done-bionic verification-done-cosmic verification-queens-done verification-rocky-done
2019-10-09 06:24:31 Alvaro Uria bug added subscriber Canonical IS BootStack
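
For readability, the allocation check and cleanup commands embedded in the [Test Case] section of the description above can be collected into a small shell sketch. This is not part of the recorded log entries: it restates the same curl calls against the Placement API (microversion 1.17) with quoting cleaned up; the <placement_endpoint> and <migration_uuid> placeholders must be filled in by hand, exactly as in the recorded test case.

    # Minimal sketch of the allocation check/cleanup from the test case above.
    # Assumes the openstack CLI, curl and jq are available and that the token
    # has access to the Placement service.
    export ENDPOINT=<placement_endpoint>
    export TOKEN=$(openstack token issue | grep ' id ' | awk '{print $4}')

    # List allocations per resource provider; after a failed confirm_migration
    # you should see one allocation for the instance UUID and a stale one for
    # the migration UUID on the source node.
    for id in $(curl -k -s -X GET "$ENDPOINT/resource_providers" \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17" | jq -r '.resource_providers[].uuid'); do
      curl -k -s -X GET "$ENDPOINT/resource_providers/$id/allocations" \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17" | jq '.allocations'
    done

    # Manual cleanup of the stale allocation held by the migration record
    # (needed on releases without the fix).
    export ID=<migration_uuid>
    curl -k -s -X DELETE "$ENDPOINT/allocations/$ID" \
        -H "Accept: application/json" -H "X-Auth-Token: $TOKEN" \
        -H "OpenStack-API-Version: placement 1.17"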