Fullstack tests fail due to "block_until_boot" timeout

Bug #1750337 reported by Slawek Kaplonski
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

Sometimes in tests such as "neutron.tests.fullstack.test_connectivity.TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigterm(VLANs,openflow-native)" a timeout occurs while waiting for all VMs to boot.
An example of this error can be seen at http://logs.openstack.org/81/545681/1/check/neutron-fullstack/8285bf3/logs/testr_results.html.gz

This example comes from a patch with some additional logging added to debug the tests. What is strange is that the test environment makes the GET /v2.0/ports/{port_id} call properly: http://logs.openstack.org/81/545681/1/check/neutron-fullstack/8285bf3/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigterm_VLANs,openflow-native_.txt.gz#_2018-02-18_20_34_47_950
but this call is not logged in the neutron-server logs. The first GET call for this port in the neutron-server logs comes about 1 minute 30 seconds later: http://logs.openstack.org/81/545681/1/check/neutron-fullstack/8285bf3/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigterm_VLANs,openflow-native_/neutron-server--2018-02-18--20-31-43-830810.txt.gz#_2018-02-18_20_36_18_516 By then it is already too late: the test has reached its timeout and failed.
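
For readers unfamiliar with this kind of wait, here is a minimal sketch of a poll-with-timeout loop, roughly the pattern a "block_until_boot" style helper follows. It is not the code from the neutron tree: `wait_until_true`, `WaitTimeout`, the `get_port` callable and the check for an "ACTIVE" status are all assumptions made for illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition does not become true before the deadline."""


def wait_until_true(predicate, timeout=60.0, sleep=0.5):
    # Poll the predicate until it returns True or the timeout expires.
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(sleep)


def block_until_boot(get_port, port_id, timeout=60.0, sleep=0.5):
    # get_port is a hypothetical stand-in for GET /v2.0/ports/{port_id};
    # treating status == "ACTIVE" as "booted" is an assumption of this sketch.
    def port_is_active():
        return get_port(port_id).get("status") == "ACTIVE"

    wait_until_true(port_is_active, timeout=timeout, sleep=sleep)
```

If the server does not answer the GET call until after the deadline, as in the logs above, a loop like this raises the timeout error even though the port may be fine.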

The failed test run above is just an example. I have seen similar errors on several other occasions.

Slawek Kaplonski (slaweq) wrote :

The example above was spotted on a patch with additional logging: https://review.openstack.org/#/c/545243/

The logs above are from the test neutron.tests.fullstack.test_connectivity.TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigterm(VLANs,openflow-native)

but the same issue can happen in other test scenarios as well.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/545970

Slawek Kaplonski (slaweq) wrote :

I think the problem is caused by neutron-server taking a very long time to process another API call. neutron-server is configured with only 1 API worker, and the logs from the failed test above show that just before (a few milliseconds) the show port call reached neutron-server, there was a show subnet call: http://logs.openstack.org/81/545681/1/check/neutron-fullstack/8285bf3/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetworkOnOvsBridgeControllerStop.test_controller_timeout_does_not_break_connectivity_sigterm_VLANs,openflow-native_/neutron-server--2018-02-18--20-31-43-830810.txt.gz#_2018-02-18_20_34_47_771

and this call took 49(!!!) seconds to process. During that time neutron-server didn't process any other API call.
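
To illustrate the effect described here, a small sketch of how one slow call can hold up a quick one when there is only a single worker. It uses a thread pool purely as a stand-in (neutron-server API workers are separate processes), and `handle`, `run_requests` and the durations are made up for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle(name, duration, started_at):
    # Simulate processing one API call, then report when it finished
    # relative to the moment both calls were submitted.
    time.sleep(duration)
    return name, time.monotonic() - started_at


def run_requests(workers, slow=0.3, fast=0.01):
    # Submit a slow "show subnet" call followed by a quick "show port" call
    # and return a dict mapping each call name to its completion latency.
    started_at = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        slow_call = pool.submit(handle, "show subnet", slow, started_at)
        fast_call = pool.submit(handle, "show port", fast, started_at)
        return dict([slow_call.result(), fast_call.result()])
```

With `run_requests(1)` the "show port" call cannot complete until the slow call is done, while with `run_requests(2)` it completes without waiting for it.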

Slawek Kaplonski (slaweq) wrote :

From the dstat log for this test run: http://logs.openstack.org/81/545681/1/check/neutron-fullstack/8285bf3/logs/dstat-csv_log.txt.gz it looks like at the time this GET /v2.0/ports/port_id "hangs", the host was quite heavily loaded (1-minute load average above 20).

Slawek Kaplonski (slaweq) wrote :

As I discussed with Ihar on IRC, it might be that the host is just overloaded. There are 8 test workers, and each of them spawns a neutron-server with 1 rpc_worker, 1 rpc_state_report_workers and 8 api_workers. So at least 80 workers are spawned together.
I will try limiting the number of api_workers in neutron-server to 2 and check whether that helps.
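
The worker count above can be checked with a few lines of arithmetic. The helper name `total_workers` is only for illustration, and the second call shows what the same estimate would give with the proposed limit of 2 API workers.

```python
def total_workers(test_workers, api_workers,
                  rpc_workers=1, rpc_state_report_workers=1):
    # Each test worker spawns one neutron-server, which in turn starts
    # its API, RPC and RPC state report workers.
    per_server = api_workers + rpc_workers + rpc_state_report_workers
    return test_workers * per_server


print(total_workers(8, 8))  # 80, the figure quoted above
print(total_workers(8, 2))  # 32, with api_workers limited to 2
```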

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/546069

Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/546069
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=465ad6f3197b8591a401a9f0db2fabf6c70fdfce
Submitter: Zuul
Branch: master

commit 465ad6f3197b8591a401a9f0db2fabf6c70fdfce
Author: Sławek Kapłoński <email address hidden>
Date: Tue Feb 20 09:24:38 2018 +0100

    [Fullstack] Limit number of Neutron's api workers

    Default number of api workers in Neutron is set to be equal to
    number of CPU cores on host. That is fine on production environment
    but on fullstack tests, where each test spawns own neutron-server
    process it might cause host overload.

    This patch limits number of api_workers to 2 which should be enough
    for single test case and should make significantly lower load on host.

    Change-Id: I1e970e35883d5240f0bd30eaea50313d93900580
    Closes-Bug: #1750337
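
As a rough illustration of what the commit message describes, the sketch below renders a config fragment that either follows the default (one API worker per CPU core, as stated above) or pins api_workers to a fixed value. This is not the code from the patch: `build_server_config` is a hypothetical helper, and placing the option in the [DEFAULT] section is an assumption of this sketch.

```python
import configparser
import io
import os


def build_server_config(api_workers=None):
    # With no explicit value, mimic the default described in the commit
    # message: one API worker per CPU core on the host.
    if api_workers is None:
        api_workers = os.cpu_count() or 1
    config = configparser.ConfigParser()
    config["DEFAULT"] = {"api_workers": str(api_workers)}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# A fullstack-style configuration with the limit from this patch.
print(build_server_config(api_workers=2))
```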

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/545970

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.0.0b1

This issue was fixed in the openstack/neutron 13.0.0.0b1 development milestone.

tags: added: neutron-proactive-backport-potential
tags: added: neutron-easy-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/571665

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/571666

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/571665
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b160461a975e10767de59cd3e176763f00ec2184
Submitter: Zuul
Branch: stable/pike

commit b160461a975e10767de59cd3e176763f00ec2184
Author: Sławek Kapłoński <email address hidden>
Date: Tue Feb 20 09:24:38 2018 +0100

    [Fullstack] Limit number of Neutron's api workers

    Default number of api workers in Neutron is set to be equal to
    number of CPU cores on host. That is fine on production environment
    but on fullstack tests, where each test spawns own neutron-server
    process it might cause host overload.

    This patch limits number of api_workers to 2 which should be enough
    for single test case and should make significantly lower load on host.

    Change-Id: I1e970e35883d5240f0bd30eaea50313d93900580
    Closes-Bug: #1750337
    (cherry picked from commit 465ad6f3197b8591a401a9f0db2fabf6c70fdfce)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/571666
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3c462e4bfdd1cb5ffc21b8c08cc697740b784856
Submitter: Zuul
Branch: stable/ocata

commit 3c462e4bfdd1cb5ffc21b8c08cc697740b784856
Author: Sławek Kapłoński <email address hidden>
Date: Tue Feb 20 09:24:38 2018 +0100

    [Fullstack] Limit number of Neutron's api workers

    Default number of api workers in Neutron is set to be equal to
    number of CPU cores on host. That is fine on production environment
    but on fullstack tests, where each test spawns own neutron-server
    process it might cause host overload.

    This patch limits number of api_workers to 2 which should be enough
    for single test case and should make significantly lower load on host.

    Change-Id: I1e970e35883d5240f0bd30eaea50313d93900580
    Closes-Bug: #1750337
    (cherry picked from commit 465ad6f3197b8591a401a9f0db2fabf6c70fdfce)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.5

This issue was fixed in the openstack/neutron 11.0.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.7

This issue was fixed in the openstack/neutron 10.0.7 release.
