Slow tests randomly timing out jobs (which aren't marked slow)

Bug #1783405 reported by Matt Riedemann on 2018-07-24
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tempest
High
Ghanshyam Mann

Bug Description

Seen here:

http://logs.openstack.org/19/584319/1/gate/tempest-full/f22ef43/job-output.txt.gz#_2018-07-23_00_57_14_222550

2018-07-23 00:37:28.906178 | controller | full runtests: commands[2] | tempest run --combine --serial --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario)' --concurrency=4
2018-07-23 00:39:38.484603 | controller | {0} tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario [113.811021s] ... ok
2018-07-23 00:41:55.964899 | controller | {0} tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot [123.980534s] ... ok
2018-07-23 00:43:42.914941 | controller | {0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_hotplug_nic [102.832513s] ... ok
2018-07-23 00:46:17.235348 | controller | {0} tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops [154.296725s] ... ok
2018-07-23 00:49:45.105508 | controller | {0} tempest.scenario.test_network_v6.TestGettingAddress.test_dualnet_multi_prefix_slaac [203.377526s] ... ok
2018-07-23 00:49:49.843548 | controller | {0} tempest.scenario.test_object_storage_basic_ops.TestObjectStorageBasicOps.test_swift_basic_ops [0.687237s] ... ok
2018-07-23 00:52:51.534208 | controller | {0} tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic [176.416681s] ... ok
2018-07-23 00:54:43.948963 | controller | {0} tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_in_tenant_traffic [112.402924s] ... ok
2018-07-23 00:56:08.710792 | controller | {0} tempest.scenario.test_server_basic_ops.TestServerBasicOps.test_server_basic_ops [70.124854s] ... ok
2018-07-23 00:56:20.974437 | controller | {0} setUpClass (tempest.scenario.test_server_multinode.TestServerMultinode) ... SKIPPED: Less than 2 compute nodes, skipping multinode tests.
2018-07-23 00:57:14.222550 | RUN END RESULT_TIMED_OUT: [untrusted : git.openstack.org/openstack/tempest/playbooks/devstack-tempest.yaml@master]

This is after the concurrent API tests have run and the job is running the scenario tests, and some are taking nearly 3 minutes so it seems some of these networking scenario tests need to be audited to determine if they should be marked as 'slow'.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22RESULT_TIMED_OUT%3A%20%5Buntrusted%20%3A%20git.openstack.org%2Fopenstack%2Ftempest%2Fplaybooks%2Fdevstack-tempest.yaml%40master%5D%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

Matt Riedemann (mriedem) wrote :

Note that there are some non-scenario API tests which are also taking a long time, and should maybe be changed to scenario tests:

http://status.openstack.org/openstack-health/#/test/tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk

In the run above, that test took nearly 3 minutes:

2018-07-22 23:40:24.653816 | controller | {3} tempest.api.compute.admin.test_create_server.ServersWithSpecificFlavorTestJSON.test_verify_created_server_ephemeral_disk [157.562831s] ... ok

Matt Riedemann (mriedem) wrote :

This is another very slow API test:

2018-07-22 23:59:18.070751 | controller | {3} tempest.api.compute.servers.test_server_personality.ServerPersonalityTestJSON.test_can_create_server_with_max_number_personality_files [159.680922s] ... ok

Matt Riedemann (mriedem) wrote :

2018-07-23 00:01:51.804344 | controller | {3} tempest.api.compute.servers.test_server_personality.ServerPersonalityTestJSON.test_create_server_with_personality [153.716416s] ... ok

2018-07-23 00:05:49.667178 | controller | {1} tempest.api.compute.servers.test_server_actions.ServerActionsTestJSON.test_rebuild_server_with_volume_attached [159.306280s] ... ok

2018-07-23 00:16:04.294866 | controller | {3} tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_attach_volume_shelved_or_offload_server [161.107350s] ... ok

Changed in tempest:
status: New → Confirmed
summary: - Networking scenario tests are randomly timing out
+ Slow tests randomly timing out jobs (which aren't marked slow)
Changed in tempest:
importance: Undecided → High
Changed in tempest:
assignee: nobody → Ghanshyam Mann (ghanshyammann)
Ghanshyam Mann (ghanshyammann) wrote :

Yes, there are many tests which taking more times since couple of weeks.

I have prepared the ethercalc and collected the tests which are taking > 100 sec runtime. Collected the Avg time taken since last 14 days. Based on that data we all can mark our voting to make them 'slow' test.
 - https://ethercalc.openstack.org/dorupfz6s9qt

As you mentioned in Bug, API test also taking too much time so there is no issue to mark them 'slow' if required.

Attila Fazekas (afazekas) wrote :

There were multiple CPU security issue where the mitigation had serious impact to performance this can be the reason.
I wonder can we look into what can be made faster before tempest run.
For ex osc client invoked serially with a relative slow strategy around devstack...

Attila Fazekas (afazekas) wrote :

The last hit was https://access.redhat.com/security/vulnerabilities/L1TF-perf the impact varies based on cpu firmware and workload.
I would not expect the CI hardware would be renewed very soon. Having slow machines in the pool still better than not having them and seeing occasionally longer job run is still better than failed job.
We should increase the job timeout until the situation does not gets better (~year).

Ghanshyam Mann (ghanshyammann) wrote :

current tempest-full timeout is 7200sec [1] and most of the timeout we observed is around 8000sec[2].

I am ok to increase the job timeout to 9000 sec. any other opinion ?

[1] https://github.com/openstack/tempest/blob/c9eb94f56e7ba639a38abc4278440a07167da2ba/.zuul.yaml#L13

[2] https://review.openstack.org/#/c/601830/

Changed in tempest:
assignee: Ghanshyam Mann (ghanshyammann) → Matt Riedemann (mriedem)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/603900
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=b5720533035734f95e6fc54f60365d64f6a19256
Submitter: Zuul
Branch: master

commit b5720533035734f95e6fc54f60365d64f6a19256
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 19 16:02:05 2018 -0400

    Mark some slow tests as slow so they move out of the main job

    We've seen a trend of tempest-full jobs timing out because of
    tests taking a long time to complete which are not marked slow:

      http://status.openstack.org/elastic-recheck/#1783405

    Based on some analysis of the slowest tests:

      https://ethercalc.openstack.org/dorupfz6s9qt

    This change marks several tests as slow and gives reasons why
    they are marked as slow - not only because they are slow, but
    also because they are less common scenarios. These will still
    be run in the tempest-slow job so we aren't losing test coverage
    with this change.

    If this makes the tempest-slow job itself take significantly
    longer, we should consider running the tempest.api* slow tests
    concurrently (like they are in tempest-full) and continue to
    run the tempest.scenario* tests in serial. The tempest-slow job
    currently runs all slow tests in serial, so this will make the
    job take longer.

    Partial-Bug: #1783405

    Change-Id: I2a0e154ba38c7407b41b7e986a0bf9b451c65cae

Fix proposed to branch: master
Review: https://review.openstack.org/615085

Changed in tempest:
assignee: Matt Riedemann (mriedem) → Ghanshyam Mann (ghanshyammann)

Reviewed: https://review.openstack.org/615085
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=bd8cc59dc2e362963e82bc797840dd53d19e0d86
Submitter: Zuul
Branch: master

commit bd8cc59dc2e362963e82bc797840dd53d19e0d86
Author: ghanshyam <email address hidden>
Date: Fri Nov 2 08:00:21 2018 +0000

    Mark network slow test as slow

    We have observed that tempest-full jobs was time out
    many times.

      http://status.openstack.org/elastic-recheck/#1783405

    Based on below ethercalc, we are trying to check the slow
    tests and mark them slow so that those tests will not run
    as part of tempest-full job. There is other job tempest-slow
    which will run these tests.

      https://ethercalc.openstack.org/dorupfz6s9qt

    Compute slow tests have been marked slow in
    - I2a0e154ba38c7407b41b7e986a0bf9b451c65cae

    This commit mark network slow tests based on above ethercalc.

    Change-Id: Ic2b3f5ea5b6885fe727a21810ddd9e17b779a1a0
    Partial-Bug: #1783405

I took 3 recent tempest runs[1][2][3] that timeouted from 3 different node providers. Sorted the tests by execution time, took the 10 long running tests from each run: [4].

It seems there are multiple interface attach test that takes long.

[1] http://logs.openstack.org/10/624210/1/check/tempest-full/f24d0fa/job-output.txt
[2] http://logs.openstack.org/10/620010/5/check/tempest-full/7eee897/job-output.txt
[3] http://logs.openstack.org/68/567268/35/check/tempest-full/48b4a7c/job-output.txt
[4] http://paste.openstack.org/show/737238/

Matt Riedemann (mriedem) wrote :

Thanks gibi. Based on that, it seems we could consider marking these tests as slow:

1. tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume

http://status.openstack.org/openstack-health/#/test/tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume?duration=P1M

Takes about 2.5 minutes on average.

2. tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops

http://status.openstack.org/openstack-health/#/test/tempest.scenario.test_network_basic_ops.TestNetworkBasicOps.test_network_basic_ops?duration=P1M

Takes about 3 minutes on average.

3. tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic

http://status.openstack.org/openstack-health/#/test/tempest.scenario.test_security_groups_basic_ops.TestSecurityGroupsBasicOps.test_cross_tenant_traffic?duration=P1M

Takes about 2 minutes on average.

Matthew Treinish (treinish) wrote :

Fwiw, in the future you can use subunit2sql with something like:

http://paste.openstack.org/show/737341/

to get the average test times over the last 300 runs of tempest-full in one go. But, you'll still have to use openstack-health like mriedem did there to get the graphs of the execution time over time for an individual test.

Matt Riedemann (mriedem) wrote :

(3:08:28 PM) mtreinish: mriedem: fwiw: http://paste.openstack.org/show/737340/ is the average run time of every test over the last 300 runs of tempest-full in gate
(3:19:42 PM) mriedem: mtreinish: ok 2-4 there in that first paste are the same things i identified from gibi's comments in the bug report
(3:19:44 PM) mriedem: so that's good to know
(3:20:01 PM) mriedem: tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern: 161.5385627070707 is the only difference, but that one shouldn't surprise me
(3:20:10 PM) mriedem: kind of surprised that isn't already marked as slow

Matt Riedemann (mriedem) wrote :

Here is an updated version of mtreinish's test times from the last 300 runs of tempest-full in comment 14:

http://paste.openstack.org/show/745114/

These are the tests that take over 3 minutes on average:

tempest.scenario.test_volume_boot_pattern.TestVolumeBootPattern.test_volume_boot_pattern: 208.15136922377624
tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume: 200.59249721999998
tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot: 200.5173152341137
tempest.scenario.test_minimum_basic.TestMinimumBasicScenario.test_minimum_basic_scenario: 198.13566518333334

Related fix proposed to branch: master
Review: https://review.openstack.org/637030

Related fix proposed to branch: master
Review: https://review.openstack.org/637033

Reviewed: https://review.openstack.org/637029
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=4b8a7b863854b5047d9c6898348c03a40b041688
Submitter: Zuul
Branch: master

commit 4b8a7b863854b5047d9c6898348c03a40b041688
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:07:11 2019 -0500

    Mark test_volume_boot_pattern as slow

    Based on the average test times taken from the last 300 runs
    of the tempest-full job [1] the test_volume_boot_pattern
    scenario test is the slowest test that is not marked slow at
    208 seconds.

    This change marks that test as slow which means it will still
    be run in the tempest-slow job.

    [1] paste.openstack.org/show/745114/

    Change-Id: Ibc5e5ecaa997744aecea8aff07e7a7e06178f5f3
    Related-Bug: #1783405

Reviewed: https://review.openstack.org/637030
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=b4763ef57e54c4517de88affb983de1de2f0ac64
Submitter: Zuul
Branch: master

commit b4763ef57e54c4517de88affb983de1de2f0ac64
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:09:11 2019 -0500

    Mark test_server_connectivity_reboot as slow

    Based on average test run times from the last 300 runs of
    the tempest-full job [1] the test_server_connectivity_reboot
    test is third slowest at about 200 seconds. This should not
    be a surprise given all of the other tests in the same class
    are already marked slow.

    This change marks that test slow like the others which means
    it will be run in the tempest-slow job.

    [1] http://paste.openstack.org/show/745114/

    Change-Id: Ie8ea1555cc4512bf29dff1e7df592dedfab28c61
    Related-Bug: #1783405

Reviewed: https://review.openstack.org/637033
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=9968315d43bf9fc595d269aafecc78cfdcc06d17
Submitter: Zuul
Branch: master

commit 9968315d43bf9fc595d269aafecc78cfdcc06d17
Author: Matt Riedemann <email address hidden>
Date: Thu Feb 14 14:35:38 2019 -0500

    Conditionally mark test_attach_detach_volume as slow

    Based on average test times from the last 300 runs
    of the tempest-full job [1] this change marks the
    test_attach_detach_volume test as slow but only if
    SSH validation is enabled, which it is in tempest-full
    but might not be in other jobs that run it, for example
    in refstack which uses this test for interoperability
    certification.

    [1] http://paste.openstack.org/show/745114/

    Change-Id: I47710321d8dfb0345c793e66bd09a5dd1933557b
    Related-Bug: #1783405

Related fix proposed to branch: master
Review: https://review.openstack.org/651865

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers