Migrate actions fail for big VMs

Bug #2131663 reported by Alfredo Moralejo
This bug affects 1 person
Affects: watcher
Status: Fix Released
Importance: Critical
Assigned to: Alfredo Moralejo
Milestone: (none)

Bug Description

In some cases, especially for larger virtual machines, migrate actions fail.

The root cause is that the migration takes more than 120 seconds in nova, which is a hardcoded timeout in the watcher code.

The helper method in nova_helper sets 120s by default:

https://github.com/openstack/watcher/blob/45cc5b9d8ba9e82d20d21d4b3eabcaf6992b26e2/watcher/common/nova_helper.py#L303

And the migrate action does not expose the timeout as a configurable parameter:

https://github.com/openstack/watcher/blob/45cc5b9d8ba9e82d20d21d4b3eabcaf6992b26e2/watcher/applier/actions/migration.py#L119-L120

Should this be configurable for the whole system? Per strategy? Per audit?
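
To illustrate the failure mode, here is a minimal sketch of a polling loop whose effective timeout is retry * interval. It is not the actual nova_helper code: the names wait_for_migration and FakeMigration, the "ACTIVE"/"MIGRATING" status strings and the one-check-per-second default are assumptions, chosen only to match the 120-second figure described above.

```python
import time
from typing import Callable


def wait_for_migration(
    get_status: Callable[[], str],
    retry: int = 120,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is ACTIVE or the retries run out.

    Hypothetical helper: gives up after roughly retry * interval seconds.
    """
    for _ in range(retry):
        if get_status() == "ACTIVE":
            return True
        sleep(interval)
    return False


class FakeMigration:
    """Simulated migration that completes after `duration` fake seconds."""

    def __init__(self, duration: float) -> None:
        self.duration = duration
        self.elapsed = 0.0

    def sleep(self, seconds: float) -> None:
        # Advance the simulated clock instead of really sleeping.
        self.elapsed += seconds

    def status(self) -> str:
        return "ACTIVE" if self.elapsed >= self.duration else "MIGRATING"


if __name__ == "__main__":
    # A 300-second migration outlives a 120 x 1 s polling budget.
    vm = FakeMigration(duration=300)
    print(wait_for_migration(vm.status, sleep=vm.sleep))

    # The same migration fits in a larger budget such as 180 x 5 s.
    vm = FakeMigration(duration=300)
    print(wait_for_migration(vm.status, retry=180, interval=5, sleep=vm.sleep))
```

In this sketch the first call gives up while the simulated migration is still running, which is the situation reported here for big VMs; only a larger retry count or interval lets the caller see the migration finish.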

OpenStack Infra (hudson-openstack) wrote : Fix proposed to watcher (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/watcher/+/967693

Changed in watcher:
status: New → In Progress
Changed in watcher:
assignee: nobody → Alfredo Moralejo (amoralej)
importance: Undecided → Critical
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to watcher (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/watcher/+/968610

OpenStack Infra (hudson-openstack) wrote : Fix merged to watcher (master)

Reviewed: https://review.opendev.org/c/openstack/watcher/+/967693
Committed: https://opendev.org/openstack/watcher/commit/13d73e9b4e64ac9fea8e301896a4dcabbbab4888
Submitter: "Zuul (22348)"
Branch: master

commit 13d73e9b4e64ac9fea8e301896a4dcabbbab4888
Author: Alfredo Moralejo <email address hidden>
Date: Wed Nov 19 12:37:06 2025 +0100

    Make VM migrations timeout configurable and apply reasonable defaults

    This change adds new 'migration_max_retries' and 'migration_interval'
    configuration parameters to the [nova] section to control timeout
    behavior for VM migration operations.

    Changes:
    - Add a new [nova] section to the configuration file to store parameters
      related to the integration with nova.
    - Add migration_max_retries config option (default: 180) to define the max
      retries to check the result of VM migrations before giving up.
    - Add migration_interval config option (default: 5 seconds) to define the
      polling interval to check the VM status in migrate actions.
    - Update live_migrate_instance() and watcher_non_live_migrate_instance()
      to use configured migration_max_retries when retry parameter is None.
    - Add new parameter interval to the live_migrate_instance() and
      watcher_non_live_migrate_instance() methods.
    - Add comprehensive unit tests for timeout and retry functionality
    - Set migration_max_retries to 120 and migration_interval to 1s in CI jobs.

    Closes-Bug: #2131663

    Assisted-By: claude-code (claude-sonnet-4.5)

    Change-Id: Ifed3c058d821ce3b0741627dcc414fe054eb9dca
    Signed-off-by: Alfredo Moralejo <email address hidden>
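
As a rough illustration of the options described in this commit message, the sketch below reads a [nova] section and computes the resulting polling budget. Watcher reads its configuration through oslo.config; this example swaps in the standard-library configparser to stay self-contained. The helper names (load_nova_options, resolve_retry, effective_timeout), the sample value 360 and the float type for the interval are assumptions for illustration. Only the option names and the defaults (180 and 5 seconds) come from the commit message.

```python
import configparser
from typing import Optional, Tuple

# Defaults quoted from the commit message.
DEFAULT_MAX_RETRIES = 180
DEFAULT_INTERVAL = 5.0

# Example override an operator might set for very large VMs (made-up value).
SAMPLE_CONF = """
[nova]
migration_max_retries = 360
migration_interval = 5
"""


def load_nova_options(text: str) -> Tuple[int, float]:
    """Return (migration_max_retries, migration_interval) from INI text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    retries = parser.getint(
        "nova", "migration_max_retries", fallback=DEFAULT_MAX_RETRIES
    )
    interval = parser.getfloat(
        "nova", "migration_interval", fallback=DEFAULT_INTERVAL
    )
    return retries, interval


def resolve_retry(retry: Optional[int], configured: int) -> int:
    """Use the caller's retry count if given, else the configured value."""
    return configured if retry is None else retry


def effective_timeout(retries: int, interval: float) -> float:
    """Approximate wait in seconds, assuming one sleep per retry."""
    return retries * interval


if __name__ == "__main__":
    retries, interval = load_nova_options("")
    print(effective_timeout(resolve_retry(None, retries), interval))
```

Under the assumption of one sleep per retry, the defaults give a budget of about 180 * 5 = 900 seconds, compared with the 120 seconds that triggered this bug.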

Changed in watcher:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to watcher (stable/2025.2)

Fix proposed to branch: stable/2025.2
Review: https://review.opendev.org/c/openstack/watcher/+/969483

OpenStack Infra (hudson-openstack) wrote : Related fix merged to watcher (master)

Reviewed: https://review.opendev.org/c/openstack/watcher/+/968610
Committed: https://opendev.org/openstack/watcher/commit/e427fa68a379cbff4641c937b71d353d9e6387a0
Submitter: "Zuul (22348)"
Branch: master

commit e427fa68a379cbff4641c937b71d353d9e6387a0
Author: Alfredo Moralejo <email address hidden>
Date: Thu Nov 27 09:35:24 2025 +0100

    Make VM resize timeout configurable with migration defaults

    This patch applies the same approach, configuration parameters and
    values as those applied to VM migrations in [1].

    Note that, internally, nova treats VM resizes very similarly to
    migrations, so it makes sense to reuse the same parameters and default
    values.

    Related-Bug: #2131663

    [1] https://review.opendev.org/c/openstack/watcher/+/967693

    Change-Id: Ic81147e19f86d4a8efbecb539b4b83674e79e646
    Signed-off-by: Alfredo Moralejo <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to watcher (stable/2025.2)

Related fix proposed to branch: stable/2025.2
Review: https://review.opendev.org/c/openstack/watcher/+/969877

OpenStack Infra (hudson-openstack) wrote : Fix merged to watcher (stable/2025.2)

Reviewed: https://review.opendev.org/c/openstack/watcher/+/969483
Committed: https://opendev.org/openstack/watcher/commit/f738ef7d621111df260283fadef78d5d6a7b995c
Submitter: "Zuul (22348)"
Branch: stable/2025.2

commit f738ef7d621111df260283fadef78d5d6a7b995c
Author: Alfredo Moralejo <email address hidden>
Date: Wed Nov 19 12:37:06 2025 +0100

    Make VM migrations timeout configurable and apply reasonable defaults

    This change adds new 'migration_max_retries' and 'migration_interval'
    configuration parameters to the [nova] section to control timeout
    behavior for VM migration operations.

    Changes:
    - Add a new [nova] section to the configuration file to store parameters
      related to the integration with nova.
    - Add migration_max_retries config option (default: 180) to define the max
      retries to check the result of VM migrations before giving up.
    - Add migration_interval config option (default: 5 seconds) to define the
      polling interval to check the VM status in migrate actions.
    - Update live_migrate_instance() and watcher_non_live_migrate_instance()
      to use configured migration_max_retries when retry parameter is None.
    - Add new parameter interval to the live_migrate_instance() and
      watcher_non_live_migrate_instance() methods.
    - Add comprehensive unit tests for timeout and retry functionality
    - Set migration_max_retries to 120 and migration_interval to 1s in CI jobs.

    Closes-Bug: #2131663

    Assisted-By: claude-code (claude-sonnet-4.5)

    Change-Id: Ifed3c058d821ce3b0741627dcc414fe054eb9dca
    Signed-off-by: Alfredo Moralejo <email address hidden>
    (cherry picked from commit 13d73e9b4e64ac9fea8e301896a4dcabbbab4888)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to watcher (stable/2025.2)

Reviewed: https://review.opendev.org/c/openstack/watcher/+/969877
Committed: https://opendev.org/openstack/watcher/commit/70b2a217e8bd4f8ecdd66ddeafa389a8116059fd
Submitter: "Zuul (22348)"
Branch: stable/2025.2

commit 70b2a217e8bd4f8ecdd66ddeafa389a8116059fd
Author: Alfredo Moralejo <email address hidden>
Date: Thu Nov 27 09:35:24 2025 +0100

    Make VM resize timeout configurable with migration defaults

    This patch applies the same approach, configuration parameters and
    values as those applied to VM migrations in [1].

    Note that, internally, nova treats VM resizes very similarly to
    migrations, so it makes sense to reuse the same parameters and default
    values.

    Related-Bug: #2131663

    [1] https://review.opendev.org/c/openstack/watcher/+/967693

    Change-Id: Ic81147e19f86d4a8efbecb539b4b83674e79e646
    Signed-off-by: Alfredo Moralejo <email address hidden>
    (cherry picked from commit e427fa68a379cbff4641c937b71d353d9e6387a0)
