func test test_bug_1938326.TestMigrateFromDownHost.test_migrate_from_disabled_host intermittently fails

Bug #1940741 reported by Balazs Gibizer
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Lee Yarwood

Bug Description

nova.tests.functional.regressions.test_bug_1938326.TestMigrateFromDownHost.test_migrate_from_disabled_host
----------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/root/rtox/nova/functional-py38/nova/tests/functional/regressions/test_bug_1938326.py", line 65, in test_migrate_from_disabled_host
        self._confirm_resize(server)
      File "/root/rtox/nova/functional-py38/nova/tests/functional/integrated_helpers.py", line 492, in _confirm_resize
        self.api.post_server_action(server['id'], {'confirmResize': None})
      File "/root/rtox/nova/functional-py38/nova/tests/functional/api/client.py", line 268, in post_server_action
        return self.api_post(
      File "/root/rtox/nova/functional-py38/nova/tests/functional/api/client.py", line 210, in api_post
        return APIResponse(self.api_request(relative_uri, **kwargs))
      File "/root/rtox/nova/functional-py38/nova/tests/functional/api/client.py", line 186, in api_request
        raise OpenStackApiException(
    nova.tests.functional.api.client.OpenStackApiException: Unexpected status code: {"conflictingRequest": {"code": 409, "message": "Service is unavailable at this time."}}

Bug signature with couple of hits: https://paste.opendev.org/show/808229/

Tags: gate-failure
Balazs Gibizer (balazs-gibizer) wrote :
Lee Yarwood (lyarwood) wrote :

Somehow the src host is being marked down by the servicegroup DB driver shortly after we call confirmResize. I assume this is because the dest compute spends so long executing the original resize that we starve the src?

2021-08-23 12:47:50,585 DEBUG [nova.servicegroup.drivers.db] Seems service nova-compute on host src is down. Last heartbeat was 2021-08-23 12:47:45.500309. Elapsed time is 5.084755

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/805667

Changed in nova:
status: New → In Progress
Lee Yarwood (lyarwood)
Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/805667
Committed: https://opendev.org/openstack/nova/commit/5e8267f703d040b6e8b56d5c0f3ed669170cbb2d
Submitter: "Zuul (22348)"
Branch: master

commit 5e8267f703d040b6e8b56d5c0f3ed669170cbb2d
Author: Lee Yarwood <email address hidden>
Date: Mon Aug 23 18:14:18 2021 +0100

    fup: Increase service_down_time beyond INITIAL_REPORTING_DELAY in test

    With service_down_time set to 5 seconds we could easily race the initial
    report from the service that is controlled by
    nova.servicegroup.api.INITIAL_REPORTING_DELAY which also defaults to 5
    seconds.

    This change simply increases service_down_time to 6 seconds to avoid
    this race, ensuring disabled but alive computes are able to initially
    report in and continue to be marked as up by the servicegroup API.

    Closes-Bug: #1940741
    Change-Id: I61fded6f513c18dbd0248ed54fc58a318dc05416
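
The race described in this commit message can be illustrated with a minimal sketch. It assumes the DB servicegroup driver treats a service as up while the time since its last heartbeat does not exceed service_down_time; the `is_up` function below is a hypothetical stand-in written for illustration, not nova's actual code. The timestamp and elapsed time are taken from the log line quoted earlier in this report.

```python
from datetime import datetime, timedelta


def is_up(last_heartbeat, now, service_down_time):
    """Hypothetical liveness check: up while the last heartbeat is recent.

    Assumption: the real driver compares the elapsed time since the last
    heartbeat against service_down_time in a similar way.
    """
    elapsed = (now - last_heartbeat).total_seconds()
    return abs(elapsed) <= service_down_time


# Values from the log line: last heartbeat at 12:47:45.500309,
# elapsed time 5.084755 seconds when the check ran.
last_heartbeat = datetime(2021, 8, 23, 12, 47, 45, 500309)
now = last_heartbeat + timedelta(seconds=5.084755)

print(is_up(last_heartbeat, now, service_down_time=5))  # False
print(is_up(last_heartbeat, now, service_down_time=6))  # True
```

With a 5 second threshold, a heartbeat that is only about 85 milliseconds late is enough to flag the src compute as down; with 6 seconds the same heartbeat still counts as up.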

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/807714

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/807714
Committed: https://opendev.org/openstack/nova/commit/058d137ac88ecd45bdf4095a31783886cf64f820
Submitter: "Zuul (22348)"
Branch: master

commit 058d137ac88ecd45bdf4095a31783886cf64f820
Author: Balazs Gibizer <email address hidden>
Date: Tue Sep 7 14:40:12 2021 +0200

    Add more retries to TestMigrateFromDownHost tests

    We are still seeing occasional failures in TestMigrateFromDownHost tests
    where the compute is not observed down after stopping the service. An
    earlier fix, I61fded6f513c18dbd0248ed54fc58a318dc05416, increased
    CONF.service_down_time to 6 seconds. However, the
    _wait_for_service_parameter() helper by default uses 10 retries with 0.5
    second delays, roughly 5 seconds of waiting in total. This can be pretty
    close to the 6 seconds of service timeout. This patch increases the
    number of retries to 20 to avoid unstable test runs.

    Change-Id: Iad970fa3aa10aaa2c831a064c085c06fdc88305e
    Closes-Bug: #1940741
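
The arithmetic behind this change can be sketched as follows. The sketch assumes the helper sleeps for the full delay between attempts, so its total polling budget is roughly retries times delay; `polling_budget` is a hypothetical name used for illustration and is not part of nova.

```python
SERVICE_DOWN_TIME = 6.0  # seconds, as set by the earlier fix
DELAY = 0.5              # seconds between retries, per the commit message


def polling_budget(retries, delay):
    """Approximate total time a retry loop waits before giving up.

    Assumption: one sleep of `delay` seconds per retry, ignoring the
    time spent on the API calls themselves.
    """
    return retries * delay


for retries in (10, 20):
    budget = polling_budget(retries, DELAY)
    covers_timeout = budget > SERVICE_DOWN_TIME
    print(retries, budget, covers_timeout)
```

Under these assumptions, 10 retries give a budget of 5.0 seconds, which ends before the 6 second service timeout can elapse, while 20 retries give 10.0 seconds and leave a comfortable margin.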

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.
