Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times

Bug #1737892 reported by Slawek Kaplonski
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

It looks like this test fails every time for the same reason:

2017-12-13 05:29:35.846 | Captured traceback:
2017-12-13 05:29:35.848 | ~~~~~~~~~~~~~~~~~~~
2017-12-13 05:29:35.850 | Traceback (most recent call last):
2017-12-13 05:29:35.852 | File "neutron/tests/base.py", line 132, in func
2017-12-13 05:29:35.854 | return f(self, *args, **kwargs)
2017-12-13 05:29:35.856 | File "neutron/tests/fullstack/test_qos.py", line 237, in test_bw_limit_qos_port_removed
2017-12-13 05:29:35.858 | self._wait_for_bw_rule_removed(vm, self.direction)
2017-12-13 05:29:35.860 | File "neutron/tests/fullstack/test_qos.py", line 121, in _wait_for_bw_rule_removed
2017-12-13 05:29:35.862 | self._wait_for_bw_rule_applied(vm, None, None, direction)
2017-12-13 05:29:35.864 | File "neutron/tests/fullstack/test_qos.py", line 219, in _wait_for_bw_rule_applied
2017-12-13 05:29:35.866 | lambda: vm.bridge.get_ingress_bw_limit_for_port(
2017-12-13 05:29:35.868 | File "neutron/common/utils.py", line 632, in wait_until_true
2017-12-13 05:35:14.907 | raise WaitTimeout("Timed out after %d seconds" % timeout)
2017-12-13 05:35:14.909 | neutron.common.utils.WaitTimeout: Timed out after 60 seconds

Example failure: http://logs.openstack.org/73/519573/12/check/legacy-neutron-dsvm-fullstack/f27cbf9/logs/devstack-gate-post_test_hook.txt#_2017-12-13_05_29_35_842

It failed 16 times in the 24 hours between 09:00 on 12.12.2017 and 09:00 on 13.12.2017: http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22neutron.tests.fullstack.test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed%5C%22%20AND%20message%3A%5C%22FAILED%5C%22

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I was trying to understand from the gate logs what the reason for this issue is, and I was also trying to reproduce it locally.
It looks to me like some kind of race that shows up when the load on the test host is high, because there is a very short time between adding the port and removing it.
I was able to reproduce this issue manually once. I will keep digging into it to confirm (or rule out) whether that is really the reason for this issue.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

So after more digging locally I found that my previous comments were wrong. It sometimes happens even without high load.
From the logs I can see that both the queue and qos rows should be deleted from OVS but they aren't. For now I don't have anything more, but it looks like it's not a bug in the tests but rather in the production code.

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I ran some tests locally. This issue can be reproduced locally as well, but it happens only about once per 20 or 30 runs. I did it with this command:

    i=1; while true; do echo "========= Test $i starting =========="; tox -e dsvm-fullstack neutron.tests.fullstack.test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed; RES=$?; i=$(( i + 1 )); if [ $RES -ne 0 ]; then break; fi; done

The problem is that I have no idea why it happens like that. From the logs (openvswitch agent and test log, see attachments) it looks like the specified "qos" and "queue" records are not deleted from the OVS DB. But the DbDestroy commands for both records are called properly, and there are no errors in those transactions. Another strange thing is that after the test is finished, those records are deleted from OVS.
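
A quick way to check for leftover rows right after a failed run is to dump the QoS and Queue tables straight from the OVS DB. A minimal sketch, assuming ovs-vsctl is on PATH and points at the same OVS instance the fullstack test used:

    # Dump the QoS and Queue tables from the local OVS DB to spot leftover
    # rows after a failed run.
    import subprocess

    for table in ("qos", "queue"):
        output = subprocess.check_output(["ovs-vsctl", "list", table]).decode()
        print("=== %s ===" % table)
        print(output)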

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/532247

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I'm still checking why this can happen. Comparing the logs of one failed and one good test run, it looks to me like a race between updating the port to "not bound" and deleting it from the bridge, which is done in the destroy() method (a condensed sketch of the agent flow involved follows after the list below):

There is one difference in the logs between the good and the "bad" run:
- good run: when the port_update message reaches the agent, the device info is in the cache and it returns something like: http://logs.openstack.org/73/519573/12/check/legacy-neutron-dsvm-fullstack/f27cbf9/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_ingress,openflow-cli_/neutron-openvswitch-agent--2017-12-13--05-25-07-575449.txt.gz#_2017-12-13_05_25_28_629
- bad run: the device info is not in the cache and there is something like: http://logs.openstack.org/73/519573/12/check/legacy-neutron-dsvm-fullstack/f27cbf9/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_ingress,openflow-native_/neutron-openvswitch-agent--2017-12-13--05-27-01-348269.txt.gz#_2017-12-13_05_27_23_884 - and then the qos extension driver is not called properly
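
To make the race easier to see, here is a condensed, self-contained illustration of the agent flow involved. This is not the real agent code: the class and method names are simplified stand-ins; the point is only that port_update() just records the port id, and the vif port is looked up later, when the device may already be gone from br-int.

    # Condensed illustration of the race (not the real agent implementation).
    class FakeAgent(object):
        def __init__(self, bridge):
            self.int_br = bridge
            self.updated_ports = set()
            self.skipped_devices = []

        def port_update(self, port):
            # RPC callback: just remember the port id, processing happens later
            self.updated_ports.add(port['id'])

        def process_updated_ports(self):
            for port_id in self.updated_ports:
                vif_port = self.int_br.get_vif_port_by_id(port_id)
                if not vif_port:
                    # device already removed from br-int: the port is only
                    # "skipped" and, before the fix, the extension manager
                    # was never told about it
                    self.skipped_devices.append(port_id)

    class FakeBridge(object):
        """Stands in for br-int after destroy() already removed the port."""
        def get_vif_port_by_id(self, port_id):
            return None

    agent = FakeAgent(FakeBridge())
    agent.port_update({'id': 'port-1'})   # port_update arrives...
    # ...destroy() removes the port from br-int in the meantime...
    agent.process_updated_ports()         # ...so the lookup finds nothing
    print(agent.skipped_devices)          # ['port-1']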

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

After a lot of debugging and checking logs from the gates and from local test runs, I found that the problem occurs when:

the OVS agent gets a "port_update" event (to set the port as unbound) and the port has already been removed from br-int when it tries to get the vif port in: https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1530 (the port is removed because the fullstack machine's destroy() method removes it right after calling the port_update() method).

In such a case the port is added to skipped_ports and is set as DOWN in the Neutron DB, but ext_manager is not called for that port, so it will not clear things like the bandwidth limit's QoS and Queue records, nor the DSCP marking OpenFlow rules for this port.
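
The fix proposed below follows from this: when devices end up on the skipped list, the L2 agent extension manager still has to be notified so that extensions such as qos can drop their per-port resources. A minimal sketch of the idea (not the verbatim patch; the placement inside ovs_neutron_agent.py and the surrounding data structures may differ):

    # Sketch of the cleanup idea: tell the agent extension manager about every
    # skipped device so extensions (e.g. qos) can remove per-port resources
    # such as QoS/Queue OVS DB rows and DSCP marking OpenFlow rules.
    def cleanup_skipped_devices(agent, skipped_devices):
        for device in skipped_devices:
            # AgentExtensionsManager.delete_port(context, data) dispatches to
            # every loaded extension; the qos extension reads 'port_id'.
            agent.ext_manager.delete_port(agent.context, {'port_id': device})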

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/533318

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/532247
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f7d65ea4c58b665c2b54fa9f2e8ec1f81668b48
Submitter: Zuul
Branch: master

commit 7f7d65ea4c58b665c2b54fa9f2e8ec1f81668b48
Author: Sławek Kapłoński <email address hidden>
Date: Tue Jan 9 17:37:02 2018 +0100

    [Fullstack] Mark test_bw_limit_qos_port_removed as unstable

    Mark TestBwLimitQoSOvs.test_bw_limit_qos_port_removed fullstack
    test as unstable until bug 1737892 will not be solved.

    Change-Id: Ia931b081d6afdbed00f49a106baa5598a1b43e0e
    Related-Bug: #1737892

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/533318
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a8271e978a1c540ae9888f568cf14b4c40ea1b6d
Submitter: Zuul
Branch: master

commit a8271e978a1c540ae9888f568cf14b4c40ea1b6d
Author: Sławek Kapłoński <email address hidden>
Date: Fri Jan 12 23:08:56 2018 +0100

    [OVS] Fix for cleaning after skipped_devices

    When Openvswitch agent will get "port_update" event
    (e.g. to set port as unbound) and port is already removed
    from br-int when agent tries to get vif port in
    treat_devices_added_updated() method (port is removed
    because e.g. nova-compute removes it) then resources set
    for port by L2 agent extension drivers (like qos) are not
    cleaned properly.

    In such case port is added to skipped_ports and is set
    as DOWN in neutron-db but ext_manager is not called then
    for such port so it will not clear stuff like bandwidth
    limit's QoS and queue records and also DSCP marking
    open flow rules for this port.

    This patch fixes this issue by adding call of
    ext_manager.delete_port() method for all skipped ports.

    Change-Id: I3cf5c57c7f232deaa190ab6b0129e398fdabe592
    Closes-Bug: #1737892

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.0.0b3

This issue was fixed in the openstack/neutron 12.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/539950

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/539950
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=68dcc4d3c89609125f3076300f2f67338989bb97
Submitter: Zuul
Branch: master

commit 68dcc4d3c89609125f3076300f2f67338989bb97
Author: Sławek Kapłoński <email address hidden>
Date: Thu Feb 1 14:32:42 2018 +0100

    Revert "[Fullstack] Mark test_bw_limit_qos_port_removed as unstable"

    This reverts commit 7f7d65ea4c58b665c2b54fa9f2e8ec1f81668b48.

    Now bug 1737892 is closed with patch
    I3cf5c57c7f232deaa190ab6b0129e398fdabe592
    so it should be safe to remove mark of this test as unstable.

    Change-Id: Ie3f16dc113455062a9f404de045412d4097b3f8b
    Related-Bug: #1737892

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/574885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/574885
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cdcc704b9e0c6eb3eb8500e5791b007e85653c13
Submitter: Zuul
Branch: stable/pike

commit cdcc704b9e0c6eb3eb8500e5791b007e85653c13
Author: Sławek Kapłoński <email address hidden>
Date: Fri Jan 12 23:08:56 2018 +0100

    [OVS] Fix for cleaning after skipped_devices

    When Openvswitch agent will get "port_update" event
    (e.g. to set port as unbound) and port is already removed
    from br-int when agent tries to get vif port in
    treat_devices_added_updated() method (port is removed
    because e.g. nova-compute removes it) then resources set
    for port by L2 agent extension drivers (like qos) are not
    cleaned properly.

    In such case port is added to skipped_ports and is set
    as DOWN in neutron-db but ext_manager is not called then
    for such port so it will not clear stuff like bandwidth
    limit's QoS and queue records and also DSCP marking
    open flow rules for this port.

    This patch fixes this issue by adding call of
    ext_manager.delete_port() method for all skipped ports.

    Change-Id: I3cf5c57c7f232deaa190ab6b0129e398fdabe592
    Closes-Bug: #1737892
    (cherry picked from commit a8271e978a1c540ae9888f568cf14b4c40ea1b6d)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/635650

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/635650
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=20fa3f267ebe7e2feb60d930af4a58a77cb9334c
Submitter: Zuul
Branch: stable/ocata

commit 20fa3f267ebe7e2feb60d930af4a58a77cb9334c
Author: Sławek Kapłoński <email address hidden>
Date: Fri Jan 12 23:08:56 2018 +0100

    [OVS] Fix for cleaning after skipped_devices

    When Openvswitch agent will get "port_update" event
    (e.g. to set port as unbound) and port is already removed
    from br-int when agent tries to get vif port in
    treat_devices_added_updated() method (port is removed
    because e.g. nova-compute removes it) then resources set
    for port by L2 agent extension drivers (like qos) are not
    cleaned properly.

    In such case port is added to skipped_ports and is set
    as DOWN in neutron-db but ext_manager is not called then
    for such port so it will not clear stuff like bandwidth
    limit's QoS and queue records and also DSCP marking
    open flow rules for this port.

    This patch fixes this issue by adding call of
    ext_manager.delete_port() method for all skipped ports.

    Change-Id: I3cf5c57c7f232deaa190ab6b0129e398fdabe592
    Closes-Bug: #1737892
    (cherry picked from commit a8271e978a1c540ae9888f568cf14b4c40ea1b6d)
    (cherry picked from commit cdcc704b9e0c6eb3eb8500e5791b007e85653c13)

tags: added: in-stable-ocata
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I found a new error related to this test: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_2c7/707222/2/check/neutron-fullstack/2c7fea3/testr_results.html

Reviewing the test log [1], the server log [2] and the OVS agent log [3], I found that the OVS agent creates the QoS and Queue registers after the test checks those values and raises an exception when it does not find them.
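
The related fix proposed further down addresses this by polling rather than asserting once: the test should wait until the expected QoS/Queue values actually appear in the OVS DB. A minimal sketch of that pattern, reusing the helpers visible in the traceback in the bug description (vm.port.name and the exact return value of get_ingress_bw_limit_for_port are assumptions; the real test helpers may differ):

    # Poll the OVS DB until the expected ingress limit/burst pair shows up
    # instead of checking it only once.
    from neutron.common import utils as common_utils

    def wait_for_ingress_bw_limit(vm, limit, burst, timeout=60):
        common_utils.wait_until_true(
            lambda: vm.bridge.get_ingress_bw_limit_for_port(
                vm.port.name) == (limit, burst),
            timeout=timeout,
            exception=RuntimeError(
                "QoS/Queue registers never appeared in the OVS DB"))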

Testcase execution:
2020-02-14 10:03:18.562 22517 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 127.0.0.1:19087 _new_conn /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py:225
2020-02-14 10:03:18.611 22517 DEBUG urllib3.connectionpool [-] http://127.0.0.1:19087 "GET /v2.0/subnets/4655dcb1-4768-43e9-8463-0e123620f71b HTTP/1.1" 200 615 _make_request /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py:437
2020-02-14 10:03:18.613 22517 DEBUG neutronclient.client [-] RESP: 200 {'Content-Type': 'application/json', 'Content-Length': '615', 'X-Openstack-Request-Id': 'req-659558b0-3cd4-4749-aba5-1273d3ae33f3', 'Date': 'Fri, 14 Feb 2020 10:03:18 GMT', 'Connection': 'keep-alive'} {"subnet":{"id":"4655dcb1-4768-43e9-8463-0e123620f71b","name":"subnet-test","tenant_id":"6b8ab93d-fa36-42a8-866f-582723b4dacf","network_id":"5e97836e-a2d7-4396-bdc0-f7e2cdd4436b","ip_version":4,"subnetpool_id":null,"enable_dhcp":false,"ipv6_ra_mode":null,"ipv6_address_mode":null,"gateway_ip":"10.0.0.1","cidr":"10.0.0.0/24","allocation_pools":[{"start":"10.0.0.2","end":"10.0.0.254"}],"host_routes":[],"dns_nameservers":[],"description":"","service_types":[],"tags":[],"created_at":"2020-02-14T10:03:14Z","updated_at":"2020-02-14T10:03:14Z","revision_number":0,"project_id":"6b8ab93d-fa36-42a8-866f-582723b4dacf"}} http_log_resp /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/common/utils.py:204
2020-02-14 10:03:18.613 22517 DEBUG neutronclient.client [-] GET call to neutron for http://127.0.0.1:19087/v2.0/subnets/4655dcb1-4768-43e9-8463-0e123620f71b used request id req-659558b0-3cd4-4749-aba5-1273d3ae33f3 _cs_request /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/client.py:124
2020-02-14 10:03:19.876 22517 DEBUG neutronclient.client [-] REQ: b'curl -i http://127.0.0.1:19087/v2.0/ports/f2457308-d9e1-464d-abf6-14b20a050bac -X DELETE -H "X-Auth-Token: {SHA256}e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855" -H "User-Agent: python-neutronclient"' http_log_req /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/neutronclient/common/utils.py:195
2020-02-14 10:03:19.879 22517 DEBUG urllib3.connectionpool [-] Starting new HTTP connection (1): 127.0.0.1:19087 _new_conn /home/zuul/src/opendev.org/openstack/neutron/.tox/dsvm-fullstack/lib/python3.6/site-packages/urllib3/connectionpool.py:225

OVS agent (snippet):
...
2020-02-14 10:03:19.713 26257 DEBUG ovsdbapp.backend.ovs_idl.transaction [req-c0d7c877-fc0c-4e0a-b0ab-1f66e282b280 - -...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/707870

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/707870
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f5c5755100c235b16c86cec782add8fe5b618598
Submitter: Zuul
Branch: master

commit f5c5755100c235b16c86cec782add8fe5b618598
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri Feb 14 16:11:28 2020 +0000

    [Fullstack] Wait until min QoS and Queue registers are set in DB

    Wait until the minimum bandwidth Qos and Queue registers are set in
    the OVS DB. In case of error, log the QoS and Queue registers present
    at this moment.

    Change-Id: Ifb2f731d7d04e237d4e47ba486169348684c8b78
    Related-Bug: #1737892

tags: removed: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/771903
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
