Slow tests randomly timing out jobs (which aren't marked slow)

Bug #1783405 reported by Matt Riedemann
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tempest
Fix Released
High
Ghanshyam Mann

Bug Description

Seen here:

http://logs.openstack.org/19/584319/1/gate/tempest-full/f22ef43/job-output.txt.gz#_2018-07-23_00_57_14_222550

2018-07-23 00:37:28.906178 | controller | full runtests: commands[2] | tempest run --combine --serial --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario)' --concurrency=4
2018-07-23 00:39:38.484603 | controller | {0} tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario [113.811021s] ... ok
2018-07-23 00:41:55.964899 | controller | {0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot [123.980534s] ... ok
2018-07-23 00:43:42.914941 | controller | {0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic [102.832513s] ... ok
2018-07-23 00:46:17.235348 | controller | {0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops [154.296725s] ... ok
2018-07-23 00:49:45.105508 | controller | {0} tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac [203.377526s] ... ok
2018-07-23 00:49:49.843548 | controller | {0} tempest.scenario.test_object_storage_basic_ops.TestObjectStorageBasicOps.test_swift_basic_ops [0.687237s] ... ok
2018-07-23 00:52:51.534208 | controller | {0} tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic [176.416681s] ... ok
2018-07-23 00:54:43.948963 | controller | {0} tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic [112.402924s] ... ok
2018-07-23 00:56:08.710792 | controller | {0} tempest.scenario.test_server_basic_ops.TestServerBasicOps.test_server_basic_ops [70.124854s] ... ok
2018-07-23 00:56:20.974437 | controller | {0} setUpClass (tempest.scenario.test_server_multinode.TestServerMultinode) ... SKIPPED: Less than 2 compute nodes, skipping multinode tests.
2018-07-23 00:57:14.222550 | RUN END RESULT_TIMED_OUT: [untrusted : git.openstack.org/openstack/tempest/playbooks/devstack-tempest.yaml@master]

This is after the concurrent API tests have run and the job is running the scenario tests, and some are taking nearly 3 minutes so it seems some of these networking scenario tests need to be audited to determine if they should be marked as 'slow'.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22RESULT_TIMED_OUT%3A%20%5Buntrusted%20%3A%20git.openstack.org%2Fopenstack%2Ftempest%2Fplaybooks%2Fdevstack-tempest.yaml%40master%5D%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that there are some non-scenario API tests which are also taking a long time, and should maybe be changed to scenario tests:

http://status.openstack.org/openstack-health/#/test/tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk

In the run above, that test took nearly 3 minutes:

2018-07-22 23:40:24.653816 | controller | {3} tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk [157.562831s] ... ok

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is another very slow API test:

2018-07-22 23:59:18.070751 | controller | {3} tempest.api.compute.servers.test_server_personality.ServerPersonalityTestJSON.test_can_create_server_with_max_number_personality_files [159.680922s] ... ok

Revision history for this message
Matt Riedemann (mriedem) wrote :

2018-07-23 00:01:51.804344 | controller | {3} tempest.api.compute.servers.test_server_personality.ServerPersonalityTestJSON.test_create_server_with_personality [153.716416s] ... ok

2018-07-23 00:05:49.667178 | controller | {1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_rebuild_server_with_volume_attached [159.306280s] ... ok

2018-07-23 00:16:04.294866 | controller | {3} tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server [161.107350s] ... ok

Changed in tempest:
status: New → Confirmed
summary: - Networking scenario tests are randomly timing out
+ Slow tests randomly timing out jobs (which aren't marked slow)
Changed in tempest:
importance: Undecided → High
Changed in tempest:
assignee: nobody → Ghanshyam Mann (ghanshyammann)
Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

Yes, there are many tests which taking more times since couple of weeks.

I have prepared the ethercalc and collected the tests which are taking > 100 sec runtime. Collected the Avg time taken since last 14 days. Based on that data we all can mark our voting to make them 'slow' test.
 - https://ethercalc.openstack.org/dorupfz6s9qt

As you mentioned in Bug, API test also taking too much time so there is no issue to mark them 'slow' if required.

Revision history for this message
Attila Fazekas (afazekas) wrote :

There were multiple CPU security issue where the mitigation had serious impact to performance this can be the reason.
I wonder can we look into what can be made faster before tempest run.
For ex osc client invoked serially with a relative slow strategy around devstack...

Revision history for this message
Attila Fazekas (afazekas) wrote :

The last hit was https://access.redhat.com/security/vulnerabilities/L1TF-perf the impact varies based on cpu firmware and workload.
I would not expect the CI hardware would be renewed very soon. Having slow machines in the pool still better than not having them and seeing occasionally longer job run is still better than failed job.
We should increase the job timeout until the situation does not gets better (~year).

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

current tempest-full timeout is 7200sec [1] and most of the timeout we observed is around 8000sec[2].

I am ok to increase the job timeout to 9000 sec. any other opinion ?

[1] https://github.com/openstack/tempest/blob/c9eb94f56e7ba639a38abc4278440a07167da2ba/.zuul.yaml#L13

[2] https://review.openstack.org/#/c/601830/

Changed in tempest:
assignee: Ghanshyam Mann (ghanshyammann) → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/603900
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=b5720533035734f95e6fc54f60365d64f6a19256
Submitter: Zuul
Branch: master

commit b5720533035734f95e6fc54f60365d64f6a19256
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 19 16:02:05 2018 -0400

    Mark some slow tests as slow so they move out of the main job

    We've seen a trend of tempest-full jobs timing out because of
    tests taking a long time to complete which are not marked slow:

      http://status.openstack.org/elastic-recheck/#1783405

    Based on some analysis of the slowest tests:

      https://ethercalc.openstack.org/dorupfz6s9qt

    This change marks several tests as slow and gives reasons why
    they are marked as slow - not only because they are slow, but
    also because they are less common scenarios. These will still
    be run in the tempest-slow job so we aren't losing test coverage
    with this change.

    If this makes the tempest-slow job itself take significantly
    longer, we should consider running the tempest.api* slow tests
    concurrently (like they are in tempest-full) and continue to
    run the tempest.scenario* tests in serial. The tempest-slow job
    currently runs all slow tests in serial, so this will make the
    job take longer.

    Partial-Bug: #1783405

    Change-Id: I2a0e154ba38c7407b41b7e986a0bf9b451c65cae

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/615085

Changed in tempest:
assignee: Matt Riedemann (mriedem) → Ghanshyam Mann (ghanshyammann)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/615085
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=bd8cc59dc2e362963e82bc797840dd53d19e0d86
Submitter: Zuul
Branch: master

commit bd8cc59dc2e362963e82bc797840dd53d19e0d86
Author: ghanshyam <email address hidden>
Date: Fri Nov 2 08:00:21 2018 +0000

    Mark network slow test as slow

    We have observed that tempest-full jobs was time out
    many times.

      http://status.openstack.org/elastic-recheck/#1783405

    Based on below ethercalc, we are trying to check the slow
    tests and mark them slow so that those tests will not run
    as part of tempest-full job. There is other job tempest-slow
    which will run these tests.

      https://ethercalc.openstack.org/dorupfz6s9qt

    Compute slow tests have been marked slow in
    - I2a0e154ba38c7407b41b7e986a0bf9b451c65cae

    This commit mark network slow tests based on above ethercalc.

    Change-Id: Ic2b3f5ea5b6885fe727a21810ddd9e17b779a1a0
    Partial-Bug: #1783405

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

I took 3 recent tempest runs[1][2][3] that timeouted from 3 different node providers. Sorted the tests by execution time, took the 10 long running tests from each run: [4].

It seems there are multiple interface attach test that takes long.

[1] http://logs.openstack.org/10/624210/1/check/tempest-full/f24d0fa/job-output.txt
[2] http://logs.openstack.org/10/620010/5/check/tempest-full/7eee897/job-output.txt
[3] http://logs.openstack.org/68/567268/35/check/tempest-full/48b4a7c/job-output.txt
[4] http://paste.openstack.org/show/737238/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Thanks gibi. Based on that, it seems we could consider marking these tests as slow:

1. tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume

http://status.openstack.org/openstack-health/#/test/tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume?duration=P1M

Takes about 2.5 minutes on average.

2. tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops

http://status.openstack.org/openstack-health/#/test/tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops?duration=P1M

Takes about 3 minutes on average.

3. tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic

http://status.openstack.org/openstack-health/#/test/tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic?duration=P1M

Takes about 2 minutes on average.

Revision history for this message
Matthew Treinish (treinish) wrote :

Fwiw, in the future you can use subunit2sql with something like:

http://paste.openstack.org/show/737341/

to get the average test times over the last 300 runs of tempest-full in one go. But, you'll still have to use openstack-health like mriedem did there to get the graphs of the execution time over time for an individual test.

Revision history for this message
Matt Riedemann (mriedem) wrote :

(3:08:28 PM) mtreinish: mriedem: fwiw: http://paste.openstack.org/show/737340/ is the average run time of every test over the last 300 runs of tempest-full in gate
(3:19:42 PM) mriedem: mtreinish: ok 2-4 there in that first paste are the same things i identified from gibi's comments in the bug report
(3:19:44 PM) mriedem: so that's good to know
(3:20:01 PM) mriedem: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern: 161.5385627070707 is the only difference, but that one shouldn't surprise me
(3:20:10 PM) mriedem: kind of surprised that isn't already marked as slow

Revision history for this message
Matthew Treinish (treinish) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Here is an updated version of mtreinish's test times from the last 300 runs of tempest-full in comment 14:

http://paste.openstack.org/show/745114/

These are the tests that take over 3 minutes on average:

tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern: 208.15136922377624
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume: 200.59249721999998
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot: 200.5173152341137
tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario: 198.13566518333334

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/637029

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/637030

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/637033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.openstack.org/637029
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=4b8a7b863854b5047d9c6898348c03a40b041688
Submitter: Zuul
Branch: master

commit 4b8a7b863854b5047d9c6898348c03a40b041688
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:07:11 2019 -0500

    Mark test_volume_boot_pattern as slow

    Based on the average test times taken from the last 300 runs
    of the tempest-full job [1] the test_volume_boot_pattern
    scenario test is the slowest test that is not marked slow at
    208 seconds.

    This change marks that test as slow which means it will still
    be run in the tempest-slow job.

    [1] paste.openstack.org/show/745114/

    Change-Id: Ibc5e5ecaa997744aecea8aff07e7a7e06178f5f3
    Related-Bug: #1783405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/637030
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=b4763ef57e54c4517de88affb983de1de2f0ac64
Submitter: Zuul
Branch: master

commit b4763ef57e54c4517de88affb983de1de2f0ac64
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:09:11 2019 -0500

    Mark test_server_connectivity_reboot as slow

    Based on average test run times from the last 300 runs of
    the tempest-full job [1] the test_server_connectivity_reboot
    test is third slowest at about 200 seconds. This should not
    be a surprise given all of the other tests in the same class
    are already marked slow.

    This change marks that test slow like the others which means
    it will be run in the tempest-slow job.

    [1] http://paste.openstack.org/show/745114/

    Change-Id: Ie8ea1555cc4512bf29dff1e7df592dedfab28c61
    Related-Bug: #1783405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/637033
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=9968315d43bf9fc595d269aafecc78cfdcc06d17
Submitter: Zuul
Branch: master

commit 9968315d43bf9fc595d269aafecc78cfdcc06d17
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:35:38 2019 -0500

    Conditionally mark test_attach_detach_volume as slow

    Based on average test times from the last 300 runs
    of the tempest-full job [1] this change marks the
    test_attach_detach_volume test as slow but only if
    SSH validation is enabled, which it is in tempest-full
    but might not be in other jobs that run it, for example
    in refstack which uses this test for interoperability
    certification.

    [1] http://paste.openstack.org/show/745114/

    Change-Id: I47710321d8dfb0345c793e66bd09a5dd1933557b
    Related-Bug: #1783405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tempest (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/650300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/651865

Revision history for this message
Matt Riedemann (mriedem) wrote :

I noticed today that when shutting down an instance, like for a snapshot test, it's sometimes taking a full 60 seconds to wait for the graceful shutdown before just destroying the guest:

http://logs.openstack.org/56/656656/12/check/nova-multi-cell/6a403ba/compute1/logs/screen-n-cpu.txt.gz#_May_16_19_47_26_017509

May 16 19:47:26.017509 ubuntu-bionic-rax-dfw-0006201136 nova-compute[17635]: INFO nova.virt.libvirt.driver [None req-bde576dc-5de2-4113-982d-025c65b97e8d tempest-MigrationsAdminTest-391674513 tempest-MigrationsAdminTest-391674513] [instance: 21c45b79-b0f4-48a8-8949-9f480beb5864] Instance failed to shutdown in 60 seconds.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Instance%20failed%20to%20shutdown%20in%2060%20seconds.%5C%22%20AND%20tags%3A%5C%22screen-n-cpu.txt%5C%22&from=7d

This could contribute to overall job/test timeouts since a lot of tempest test wait operations are using the build_timeout config and set to 196 seconds. I'm not sure why these guests aren't stopping, but they should be since they shouldn't be doing anything. Seems this started in Stein.

This is with libvirt 4.0.0-1ubuntu8.10 and qemu 1:2.11+dfsg-1ubuntu7.13.

Revision history for this message
Matt Riedemann (mriedem) wrote :

For the instance that fails to shutdown within 60 seconds in comment 25 I don't see any errors in the libvirtd logs, the shutdown request returns with code 0.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Reported nova bug 1829896 for the guest shutdown issue.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tempest (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/650300

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tempest (master)

Reviewed: https://review.opendev.org/651865
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=f66aea04a51d598f29ff12ae9221d61f426c8d7a
Submitter: Zuul
Branch: master

commit f66aea04a51d598f29ff12ae9221d61f426c8d7a
Author: Matt Riedemann <email address hidden>
Date: Thu Apr 11 12:35:30 2019 -0400

    Disable c-bak service in tempest-full* jobs

    Cinder backup tests suffer from a high failure rate and
    given they do not interact with many other services (glance
    and swift for the most part it looks like) we should disable
    them from running in the tempest-full job which is part of
    the integrated-gate project template, which is what this change
    does.

    My recommendation would be to make sure cinder backup test
    coverage is happening in some other non-integrated-gate jobs
    which are running on cinder changes already, like the
    cinder-tempest-dsvm-lvm-lio-barbican job (which is voting but
    not gating on cinder changes).

    This should also help with bug 1783405 where tempest-full
    can timeout due to backup tests which aren't marked slow.

    c-bak was already disabled in tempest-full-py3 because
    swift is disabled but this copies the same note to the
    tempest-full-py3 job definition.

    Change-Id: I9d39d9d7d706ba3fcba5ff352e8c9e31ab4639a2
    Related-Bug: #1483434
    Related-Bug: #1745168
    Related-Bug: #1783405
    Related-Bug: #1813217

Revision history for this message
Martin Kopec (mkopec) wrote :

What's the status here? Can it be considered done as https://review.opendev.org/651865 got merged?

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

this is something we keep monitoring and if we find the slow test then we mark it slow. we did the complete first phase of this but I will say to keep it open for ussuri cycle in case we find more slow tests.

Revision history for this message
Martin Kopec (mkopec) wrote :

IRC openstack-qa meeting:
gmann: this is something we keep monitoring and if we find the slow test then we mark it slow. we did complete first phase of this but I will say to keep it open for ussuri cycle in case we find more slow tests.

Revision history for this message
Ghanshyam Mann (ghanshyammann) wrote :

I think we are ok to close this and if any slow tests occur then we can mark that slow. jobs are pretty stable on timeout now.

Changed in tempest:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers