[CI] gate functional and fullstack timeouts without reports of the causative test case

Bug #1860774 reported by Rodolfo Alonso
This bug affects 1 person
Affects   Status    Importance   Assigned to   Milestone
neutron   Invalid   Medium       Unassigned

Bug Description

A common problem in the CI jobs (mainly in the gate) is that the test suite times out without any information about the offending test case(s). To avoid blocking the whole test suite, a per-test-case execution timeout should be set for both the functional (FT) and fullstack CI jobs. The test suite can still fail because of a non-passing test case, but at least we will know which one it was.

Example: [1]. The test suite fails but no test case is reported as failed. In the logs we can see that worker {3} stops executing tests very early.

The variable OS_TEST_TIMEOUT seems not to be working properly.
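
To make the idea of a per-test-case timeout concrete, here is a minimal sketch using only the Python standard library. It is not neutron's implementation: the class names `TimeoutTestCase` and `ExampleTest` and the helper `run_example` are hypothetical, and reading `OS_TEST_TIMEOUT` from the environment this way is an assumption made for illustration. It relies on `SIGALRM`, so it works only on Unix and only in the main thread.

```python
import os
import signal
import time
import unittest


class TimeoutTestCase(unittest.TestCase):
    """Hypothetical base class: abort a test that exceeds OS_TEST_TIMEOUT."""

    def setUp(self):
        super().setUp()
        try:
            timeout = float(os.environ.get("OS_TEST_TIMEOUT", "0"))
        except ValueError:
            timeout = 0.0  # unparsable value: run without a timeout
        if timeout > 0:
            previous = signal.signal(signal.SIGALRM, self._on_timeout)
            signal.setitimer(signal.ITIMER_REAL, timeout)
            # Cleanups run in reverse order: the timer is disarmed first,
            # then the previous signal handler is restored.
            self.addCleanup(signal.signal, signal.SIGALRM, previous)
            self.addCleanup(signal.setitimer, signal.ITIMER_REAL, 0)

    def _on_timeout(self, signum, frame):
        # Raised inside the running test, so the runner records the test id.
        raise TimeoutError("%s exceeded OS_TEST_TIMEOUT" % self.id())


class ExampleTest(TimeoutTestCase):
    def test_hangs(self):
        time.sleep(5)  # stands in for a test case that never returns


def run_example(timeout="0.2"):
    """Run the hanging example test with a short timeout and return the result."""
    os.environ["OS_TEST_TIMEOUT"] = timeout
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With a timeout like this the hanging test is reported as an error under its own name instead of stalling the worker. Note that a signal-based timeout only fires when the interpreter gets a chance to run the handler, so it may not help if a test blocks in a way that prevents that.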

[1] https://0bfc64d19aa73a2afe50-7b1f5eff599f257f64cbd89748a5b69e.ssl.cf2.rackcdn.com/703299/1/gate/neutron-functional/b855799/job-output.txt

description: updated
Lajos Katona (lajos-katona) wrote :

As far as I can see, OS_TEST_TIMEOUT is set in tox.ini to 180s for the functional tests (see https://opendev.org/openstack/neutron/src/branch/master/tox.ini#L36) and to 600s for fullstack (see https://opendev.org/openstack/neutron/src/branch/master/tox.ini#L74).

I don't know about the Zuul settings; perhaps some common/ancestor YAML files override these.

In the example you linked, as far as I can see, no test's execution time is over 180s:

neutron.tests.functional.agent.l3.extensions.test_gateway_ip_qos_extension.TestRouterGatewayIPQosAgentExtensionDVR.test_dvr_ha_router_failover_with_gw_and_floatingip 97.022904
neutron.tests.functional.agent.l3.extensions.test_port_forwarding_extension.TestL3AgentFipPortForwardingExtensionDVR.test_dvr_ha_router_unbound_from_agents 90.15016
neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_unbound_from_agents 82.2691
neutron.tests.functional.agent.l3.test_dvr_router.TestDvrRouter.test_dvr_ha_router_failover_with_gw_and_floatingip 80.532779
neutron.tests.functional.agent.l3.extensions.test_gateway_ip_qos_extension.TestRouterGatewayIPQosAgentExtensionDVR.test_dvr_ha_router_failover_with_gw 80.438136
neutron.tests.functional.agent.l3.extensions.test_port_forwarding_extension.TestL3AgentFipPortForwardingExtensionDVR.test_dvr_ha_router_failover_with_gw 80.29684
neutron.tests.functional.db.migrations.test_2e0d7a8a1586_add_binding_index_to_routerl3agentbinding.TestHARouterPortMigrationMysql.test_walk_versions 77.645509
neutron.tests.functional.agent.l3.extensions.qos.test_fip_qos_extension.TestL3AgentFipQosExtensionDVR.test_dvr_ha_router_failover_with_gw_and_floatingip 75.95222
neutron.tests.functional.agent.l3.extensions.test_gateway_ip_qos_extension.TestRouterGatewayIPQosAgentExtensionDVR.test_dvr_non_ha_router_update 71.40815
neutron.tests.functional.db.migrations.test_3b935b28e7a0_migrate_to_pluggable_ipam.TestMigrationToPluggableIpamMysql.test_walk_versions 71.041897

The total test execution time is ~11252s (just dumbly adding together the numbers after each test from the job-output)
I hope my counting is correct.

The functional job timeout (here: https://opendev.org/openstack/neutron/src/branch/master/zuul.d/base.yaml#L5) is 7800s, so I suppose it should be increased, or am I missing something?

Changed in neutron:
importance: Undecided → Medium
Lajos Katona (lajos-katona) wrote :

I should change my script to sum the execution time of each worker separately...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/704291

Lajos Katona (lajos-katona) wrote :

Execution times per worker from the example (in seconds):
{ 0 } 1459.9761220000007
{ 1 } 1501.0642839999996
{ 2 } 1502.1730889999997
{ 3 } 769.8994900000002
{ 4 } 1510.364208
{ 5 } 1489.5193370000002
{ 6 } 1514.6869349999993
{ 7 } 1505.1508729999991

So this is again just summing the outputs from job-output.txt, without the extra test execution overhead.
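
A minimal sketch of how such per-worker sums could be computed, in case someone wants to repeat this on another job. It assumes that each test result line in job-output.txt contains the worker number in braces, the test id, and the duration in square brackets (for example `{3} some.test.id [97.022904s] ... ok`); that line shape and the sample lines are assumptions, so adjust the regular expression to the real log.

```python
import re
from collections import defaultdict

# Assumed line shape: "<prefix> {3} some.test.id [97.022904s] ... ok"
RESULT_LINE = re.compile(r"\{(\d+)\}\s+(\S+)\s+\[(\d+(?:\.\d+)?)s\]")


def sum_per_worker(lines):
    """Return {worker number: total reported test seconds}."""
    totals = defaultdict(float)
    for line in lines:
        match = RESULT_LINE.search(line)
        if match:
            worker, _test_id, seconds = match.groups()
            totals[int(worker)] += float(seconds)
    return dict(totals)


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "2020-01-01 00:00:00 | {0} pkg.test_a.TestA.test_one [1.5s] ... ok",
    "2020-01-01 00:00:01 | {3} pkg.test_b.TestB.test_two [2.25s] ... ok",
    "2020-01-01 00:00:02 | {0} pkg.test_a.TestA.test_three [0.5s] ... FAILED",
    "2020-01-01 00:00:03 | unrelated log line",
]

for worker, total in sorted(sum_per_worker(sample).items()):
    print("{ %d } %s" % (worker, total))
```

For a real log, pass the open file object instead of `sample`. A worker whose total is far below the others, like { 3 } above, is the one to look at.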

Changed in neutron:
status: New → Confirmed
Slawek Kaplonski (slaweq) wrote :

I think there are one or more tests that hang under some conditions, and we don't have an easy way to find out which tests those are.
Maybe we could write a simple script that takes the list of executed tests from such a timed-out job and compares it with the list from passed (or even failed) jobs, to see whether it is always the same test(s) that are missing (hung) or whether it is totally random.
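
A rough sketch of such a comparison script. It assumes the same result line shape as the job output (worker number in braces, test id, duration in square brackets); the function names and the sample data are hypothetical.

```python
import re

# Assumed line shape: "<prefix> {3} some.test.id [97.022904s] ... ok"
TEST_LINE = re.compile(r"\{\d+\}\s+(\S+)\s+\[\d+(?:\.\d+)?s\]")


def executed_tests(lines):
    """Return the set of test ids that reported a result."""
    return {m.group(1) for m in map(TEST_LINE.search, lines) if m}


def missing_tests(timed_out_lines, reference_lines):
    """Tests that finished in the reference job but not in the timed-out job."""
    return executed_tests(reference_lines) - executed_tests(timed_out_lines)


def always_missing(timed_out_jobs, reference_lines):
    """Tests missing from every timed-out job (the likely hang candidates)."""
    missing_sets = [missing_tests(job, reference_lines) for job in timed_out_jobs]
    return sorted(set.intersection(*missing_sets)) if missing_sets else []


# Made-up sample data: one passed job and two timed-out jobs.
passed = [
    "{0} pkg.TestA.test_one [1.0s] ... ok",
    "{1} pkg.TestB.test_two [2.0s] ... ok",
    "{1} pkg.TestB.test_three [3.0s] ... ok",
]
timed_out_1 = ["{0} pkg.TestA.test_one [1.0s] ... ok"]
timed_out_2 = [
    "{0} pkg.TestA.test_one [1.1s] ... ok",
    "{1} pkg.TestB.test_three [3.2s] ... ok",
]

print(always_missing([timed_out_1, timed_out_2], passed))
```

The missing set for a single job contains the hung test plus every test that was scheduled after it on the same worker, so intersecting several timed-out jobs is what narrows the list down. If the intersection is empty, the hang is probably not tied to one test.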

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Lajos Katona (<email address hidden>) on branch: master
Review: https://review.opendev.org/704291

Changed in neutron:
status: Confirmed → Invalid