test_instance_returns_to_active_after_resize checking job failed by timeout

Bug #1778837 reported by Zhao Chao
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack DBaaS (Trove)
In Progress
Medium
Zhao Chao

Bug Description

Recently test_instance_returns_to_active_after_resize checking job always failed by timeout. However it looks like we have some time issues for creating Trove instance operations, e.g.

https://review.openstack.org/#/c/573070/ is merged at 2018.06.14, during trove-functional-mysql, we have the slowest jobs as following:

2018-06-14 12:09:09.964672 | primary | 2018-06-14 12:09:09.964 | Slowest 5 tests took 3279.58 secs:
2018-06-14 12:09:09.966844 | primary | 2018-06-14 12:09:09.966 | 761.53 proboscis.case.MethodTest (test_slave_created)
2018-06-14 12:09:09.968909 | primary | 2018-06-14 12:09:09.968 | 717.25 proboscis.case.MethodTest (test_instance_created)
2018-06-14 12:09:09.970781 | primary | 2018-06-14 12:09:09.970 | 636.80 proboscis.case.MethodTest (test_resize_down)
2018-06-14 12:09:09.973725 | primary | 2018-06-14 12:09:09.972 | 593.17 proboscis.case.MethodTest (test_instance_restored_incremental)
2018-06-14 12:09:09.975439 | primary | 2018-06-14 12:09:09.974 | 570.83 proboscis.case.MethodTest (test_instance_returns_to_active_after_resize)
2018-06-14 12:09:10.158818 | primary | + /home/zuul/workspace/devstack-gate/functions.sh:tsfilter:L103: return 0

which seemed reasonable.

the same job ran in my dev enrionment based on a 16G-mem vm had a result of:
Slowest 5 tests took 3115.86 secs:
    733.88 proboscis.case.MethodTest (test_slave_created)
    631.29 proboscis.case.MethodTest (test_instance_restored_incremental)
    620.56 proboscis.case.MethodTest (test_resize_down)
    580.96 proboscis.case.MethodTest (test_instance_returns_to_active_after_resize)
    549.17 proboscis.case.MethodTest (test_instance_created)

We could see the time taked by test_instance_returns_to_active_after_resize and test_instance_created are comparable, and normally both of them could finish within 600 seconds.

But for jobs of https://review.openstack.org/#/c/554156/ at 2018.06.24, we got:

2018-06-24 04:21:06.459761 | primary | 2018-06-24 04:21:06.455 | WaitForGuestInstallationToFinish
2018-06-24 04:41:10.865910 | primary | 2018-06-24 04:41:10.862 | test_instance_created OK 1204.44

2018-06-24 05:04:20.782473 | primary | 2018-06-24 05:04:20.780 | test_status_changed_to_resize OK 12.82
2018-06-24 05:19:20.803160 | primary | 2018-06-24 05:19:20.800 | test_instance_returns_to_active_after_resize ERROR

test_instance_returns_to_active_after_resize lasted for more than 900 seconds.

and a similar case for https://review.openstack.org/#/c/576133/ jobs at 2018.06.23:

2018-06-23 22:19:25.128879 | primary | 2018-06-23 22:19:25.127 | test_instance_created OK 959.40

2018-06-23 22:37:12.583075 | primary | 2018-06-23 22:37:12.582 | test_status_changed_to_resize OK 8.11
2018-06-23 22:52:12.600720 | primary | 2018-06-23 22:52:12.598 | test_instance_returns_to_active_after_resize ERROR

for the same patchset and same job at 2018.06.22, the case was different:

2018-06-22 09:44:04.187805 | primary | 2018-06-22 09:44:04.187 | test_instance_created OK 507.28

2018-06-22 10:05:03.686425 | primary | 2018-06-22 10:05:03.685 | test_instance_returns_to_active_after_resize OK 449.97

Zhao Chao (zhaochao1984)
Changed in trove:
assignee: nobody → Zhao Chao (zhaochao1984)
importance: Undecided → Medium
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/578273

Revision history for this message
Zhao Chao (zhaochao1984) wrote :

There also are other timeouts for the overall job running time, the reason may be in different places, e.g. devstack

for https://review.openstack.org/#/c/573070/ is merged at 2018.06.14, devstack summary was like this(the "trove-scenario-mysql-single" job):
=========================
DevStack Component Timing
 (times are in seconds)
=========================
run_process 23
test_with_retry 3
apt-get-update 2
pip_install 577
osc 322
wait_for_service 27
dbsync 22
apt-get 111
-------------------------
Unaccounted time 408
=========================
Total runtime 1495

and for https://review.openstack.org/#/c/578273/2 (patchset 2, the failed "trove-scenario-mysql-single"):

DevStack Component Timing
 (times are in seconds)
=========================
run_process 37
test_with_retry 4
apt-get-update 3
pip_install 800
osc 438
wait_for_service 39
dbsync 100
apt-get 295
-------------------------
Unaccounted time 638
=========================
Total runtime 2354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/578273
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=ceaaf161e3ea9c0b6bcbe8aa48beda800b867d67
Submitter: Zuul
Branch: master

commit ceaaf161e3ea9c0b6bcbe8aa48beda800b867d67
Author: Zhao Chao <email address hidden>
Date: Wed Jun 27 12:10:42 2018 +0800

    Raise timeout for instance resizing checking job

    Recently the trove-functional-mysql usually failed by timeout at
    the test_instance_returns_to_active_after_resize case, and the cause is
    creating and resizing instances could be quit slow(e.g. 1200+ seconds).
    This may be caused by the pressure of the CI system and should be
    further investigated. The patch is a quick fix for the problem by
    raising the timeout to 30 minutes(it's now 900 seconds by default, but
    instead it's 32 minutes for instance creating, so raising the value to
    30 minutes should be reasonable).

    Partial-Bug: #1778837

    Change-Id: Ib903a72b098499553978ec881843a7bbee25b45f
    Signed-off-by: Zhao Chao <email address hidden>

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers