Comment 29 for bug 1939023

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I was trying to reproduce that issue today but wasn't able to reproduce and investigate that issue. When I run it on test patch, tempest ended up for me in about 4300 seconds

======
Totals
======
Ran: 1425 tests in 4297.8985 sec.
 - Passed: 1303
 - Skipped: 121
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 1
Sum of execute time for each test: 8342.5592 sec.

I also checked builds history https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby and I see that time of that job varies from about 3h 14min (https://review.rdoproject.org/zuul/build/0f8c934c63864203a669348b086a2882) to more than 4 hours (and sometimes timeout, like e.g. https://review.rdoproject.org/zuul/build/9d2881833e4d4ff3903afa4a3501ecc3/logs)

Next I compared time of the test execution in the fast (https://review.rdoproject.org/zuul/build/0f8c934c63864203a669348b086a2882/logs) run and in the timed out one (https://review.rdoproject.org/zuul/build/9d2881833e4d4ff3903afa4a3501ecc3/logs). Looking at top 20 slowest tests in both run we see:

    fast run

    ('tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_rebuild_server_with_volume_attached', 101.336682)
    ('tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk', 104.953735)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_instance_port_admin_state', 113.117529)
    ('tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume', 113.68504)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops', 115.171857)
    ('tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server', 115.596492)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os', 116.36545)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dhcp6_stateless_from_os', 118.156141)
    ('tempest.api.compute.admin.test_servers.ServersAdminTestJSON.test_rebuild_server_in_error_state', 129.538088)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_slaac_from_os', 130.889002)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_multi_prefix_dhcpv6_stateless', 131.692391)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_connectivity_between_vms_on_different_networks', 133.453555)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_multi_prefix_slaac', 135.368074)
    ('tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port', 143.339424)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_dhcp6_stateless_from_os', 148.011558)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac', 156.358172)
    ('tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_reassign_port_between_servers', 157.718676)
    ('tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest.test_tagged_boot_devices', 161.264332)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_dhcpv6_stateless', 164.44343)
    ('tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest_v242.test_tagged_boot_devices', 164.663563)

    and timed out one:

    ('tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_fixed_ip', 167.733248)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_update_instance_port_admin_state', 176.180869)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dhcp6_stateless_from_os', 186.90833)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops', 191.16799)
    ('tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk', 192.309398)
    ('tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_connectivity_between_vms_on_different_networks', 192.523786)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_slaac_from_os', 194.261026)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_multi_prefix_slaac', 205.108607)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_multi_prefix_dhcpv6_stateless', 209.214145)
    ('tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume', 222.186212)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_slaac_from_os', 222.788042)
    ('tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_detach_volume_shelved_or_offload_server', 223.802673)
    ('tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server', 229.714189)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac', 230.452579)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_dhcp6_stateless_from_os', 230.810108)
    ('tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_create_list_show_delete_interfaces_by_network_port', 245.107367)
    ('tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_dhcpv6_stateless', 246.453286)
    ('tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest.test_tagged_boot_devices', 248.856564)
    ('tempest.api.compute.servers.test_device_tagging.TaggedBootDevicesTest_v242.test_tagged_boot_devices', 266.036853)
    ('tempest.api.compute.servers.test_attach_interfaces.AttachInterfacesTestJSON.test_reassign_port_between_servers', 284.01886)

So it seems that in those slow runs all tests are running much more slower. There is no one strict test (or group of tests) which is breaking things.

While I was comparing those "slow" and "fast" run of tests I checked one of them in the "slow" run to see what is slowing it down as much.
I checked test AttachInterfacesTestJSON:test_reassign_port_between_servers from https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-wallaby/9d28818/logs/undercloud/home/zuul/tempest/tempest.log.txt.gz
One of the longest API requests there was:

2021-09-12 12:12:53.866 313000 INFO tempest.lib.common.rest_client [req-c8d718d1-8c44-4126-9a4c-2e6e66081953 ] Request (AttachInterfacesTestJSON:test_reassign_port_between_servers): 200 GET https://[2001:db8:fd00:1000::5]:13774/v2.1/servers/fa0a5e01-18b7-4d8c-bb5e-5676bd255623 21.053s

But when I looked at the same request in the nova api log I saw:
2021-09-12 12:12:53.858 12 INFO nova.api.openstack.requestlog [req-c8d718d1-8c44-4126-9a4c-2e6e66081953 dc8b761741ad45feaaa79fb89a76ba48 1b908d45bc8a42ce82de2aa2ce61c30f - default default] fd00:fd00:fd00:2000::1c2 "GET /v2.1/servers/fa0a5e01-18b7-4d8c-bb5e-5676bd255623" status: 200 len: 1664 microversion: 2.1 time: 2.400599

So for me the issue looks like it is somewhere in the haproxy maybe? As it seems that query was processed by Nova fast but was "stucked" somewhere in the middle.