Debug neutron-tempest-plugin-dvr-multinode-scenario failures

Bug #1830763 reported by Miguel Lavalle on 2019-05-28
Affects: neutron
Importance: High
Assigned to: Slawek Kaplonski
Miguel Lavalle (minsel) on 2019-05-28
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
assignee: nobody → Miguel Lavalle (minsel)
tags: added: gate-failure
Miguel Lavalle (minsel) on 2019-06-04
description: updated

Reviewed: https://review.opendev.org/667547
Committed: https://git.openstack.org/cgit/openstack/neutron-tempest-plugin/commit/?id=6aae0d4b0e7e058e4da992b0cb74be84cedad433
Submitter: Zuul
Branch: master

commit 6aae0d4b0e7e058e4da992b0cb74be84cedad433
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 26 10:17:15 2019 +0200

    Change order of creating vms and plug routers in scenario test

    In the scenario tests in the test_connectivity module, the two
    VMs were created first and the subnets were plugged into the
    routers afterwards. That caused a race condition between the
    cloud-init script running during the VM boot process and the
    configuration of the metadata service in the routers. Because of
    that, an instance was often booted without its SSH key configured
    properly, so it was impossible to SSH into the VM and the test
    failed.

    As we don't have any way to ensure that the metadata is already
    configured inside the router, this patch just changes the order of
    operations so that the subnets are plugged into the router first
    and then the VMs are created. Thanks to this change the test
    should be at least much more reliable.

    Change-Id: Ieca8567965789f8d7763a77cecc82059c30b5ced
    Related-Bug: #1830763
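The reordering the commit describes can be sketched as follows. This is a hypothetical illustration with stand-in names (FakeEnv, create_vm, router_attach), not the actual neutron-tempest-plugin API:

```python
# Illustrative sketch only: FakeEnv and its methods are stand-ins for the
# tempest plugin helpers, recording the order of operations in memory.

class FakeEnv:
    """Tiny in-memory stand-in that records the order of operations."""
    def __init__(self, networks, subnets):
        self.networks = networks
        self.subnets = subnets
        self.ops = []

    def create_vm(self, network):
        self.ops.append(("vm", network))
        return network

    def router_attach(self, subnet):
        self.ops.append(("attach", subnet))


def build_topology_old(env):
    # Old order: VMs boot while no subnet is attached to the router, so
    # cloud-init races against metadata-proxy configuration.
    vms = [env.create_vm(net) for net in env.networks]
    for subnet in env.subnets:
        env.router_attach(subnet)
    return vms


def build_topology_new(env):
    # New order: attach subnets first, so the metadata proxy is far more
    # likely to be ready by the time the instances boot.
    for subnet in env.subnets:
        env.router_attach(subnet)
    return [env.create_vm(net) for net in env.networks]
```

The fix does not close the race, it only shrinks the window; that is why the commit message hedges with "should be at least much more reliable".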

Miguel Lavalle (minsel) wrote :

Looking at http://logs.openstack.org/14/668914/1/check/neutron-tempest-plugin-dvr-multinode-scenario/f2ce738/, we can see a failure of test_connectivity_through_2_routers due to the haproxy-metadata-proxy not being ready when an instance requests its metadata:

1) Instance b97e5dcc-7408-4ed4-8cef-95e65a31dec5, fixed IP 10.10.220.79. This instance landed on the Controller node and succeeded in getting its metadata

2) Instance 62f21169-36c3-418d-a1cb-eec491810425, fixed IP 10.10.210.94. This instance landed on the Compute1 node and failed to get its metadata:

Sending discover...
Sending select for 10.10.210.94...
Lease of 10.10.210.94 obtained, lease time 86400
route: SIOCADDRT: Invalid argument
WARN: failed: route add -net "169.254.169.254/32" gw "10.10.210.254"
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.10.210.254"
cirros-ds 'net' up at 11.29
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 11.70. request failed
failed 2/20: up 17.21. request failed
failed 3/20: up 20.19. request failed
failed 4/20: up 25.53. request failed
failed 5/20: up 28.54. request failed
failed 6/20: up 33.91. request failed
failed 7/20: up 36.92. request failed
failed 8/20: up 42.31. request failed
failed 9/20: up 45.31. request failed
failed 10/20: up 50.67. request failed
failed 11/20: up 53.65. request failed
failed 12/20: up 59.04. request failed
failed 13/20: up 62.01. request failed
failed 14/20: up 67.37. request failed
failed 15/20: up 70.38. request failed
failed 16/20: up 75.77. request failed
failed 17/20: up 78.78. request failed
failed 18/20: up 84.15. request failed
failed 19/20: up 87.17. request failed
failed 20/20: up 92.54. request failed
failed to read iid from metadata. tried 20
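The cirros console log above is the visible symptom of the race: the guest retries the EC2-style metadata URL a fixed number of times and then gives up. A minimal sketch of that probe (the fetch helper and its parameters are illustrative; only the URL is the real endpoint):

```python
import time
import urllib.request

METADATA_URL = "http://169.254.169.254/2009-04-04/instance-id"

def fetch_instance_id(url=METADATA_URL, attempts=20, sleep=3,
                      opener=urllib.request.urlopen):
    # Retry the metadata endpoint. urllib failures (urllib.error.URLError)
    # are OSError subclasses, so one except clause covers both refused
    # connections and timeouts while the haproxy metadata proxy starts.
    for attempt in range(1, attempts + 1):
        response = None
        try:
            response = opener(url, timeout=10)
            return response.read().decode()
        except OSError:
            time.sleep(sleep)
        finally:
            if response is not None:
                response.close()
    raise RuntimeError("failed to read iid from metadata. tried %d" % attempts)
```

In the failing run the proxy never comes up in the instance's router, so all 20 attempts fail and cloud-init never receives the SSH key.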

3) The routers are c9d4b134-8064-46b9-94a7-d4a6fd775edb:

2019-07-05 09:20:43,293 4546 INFO [tempest.lib.common.rest_client] Request (NetworkConnectivityTest:test_connectivity_through_2_routers): 201 POST http://10.209.129.140:9696/v2.0/routers 4.268s
2019-07-05 09:20:43,293 4546 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external_gateway_info": {"network_id": "ace47216-82e5-47cc-b721-8ffec818f30f"}, "name": "tempest-ap1_rt-1161204222", "admin_state_up": true}}
    Response - Headers: {'content-type': 'application/json', 'content-length': '753', 'x-openstack-request-id': 'req-dbb3cd88-32e4-438b-8717-905e721a1a50', 'date': 'Fri, 05 Jul 2019 09:20:43 GMT', 'connection': 'close', 'status': '201', 'content-location': 'http://10.209.129.140:9696/v2.0/routers'}
        Body: b'{"router": {"id": "c9d4b134-8064-46b9-94a7-d4a6fd775edb"

and 5a4d9551-0650-4ed1-9d18-2aaff5a5bc1f:

2019-07-05 09:20:43,424 4546 INFO [tempest.lib.common.rest_client] Request (NetworkConnectivityTest:test_connectivity_through_2_routers): 201 POST http://10.209.129.140:9696/v2.0/routers 0.130s
2019-07-05 09:20:43,424 4546 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external...


Miguel Lavalle (minsel) wrote :

At exactly the same time that the L3 agent on Compute1 reports failing to fetch information for router c9d4b134-8064-46b9-94a7-d4a6fd775edb, the neutron server log shows this traceback: http://paste.openstack.org/show/754239/. Interestingly, the neutron server references router 5a4d9551-0650-4ed1-9d18-2aaff5a5bc1f.

Miguel Lavalle (minsel) wrote :

Since merging https://review.opendev.org/#/c/667547/, the relative frequency of test case failures has changed: the frequency of test_connectivity_through_2_routers has decreased significantly. As a consequence, I did an analysis of failure frequency per test over the past 7 days (I went through the 500 occurrences that Kibana returns as a maximum; there may be more beyond that 500 limit). This is what I found:

test_qos_basic_and_update 48
test_from_dvr_to_dvr_ha 39
test_from_dvr_to_ha 38
test_from_dvr_to_legacy 23
test_connectivity_through_2_routers 17
test_snat_external_ip 17
test_vm_reachable_through_compute 10
test_trunk_subport_lifecycle 9
test_qos 8

So the big offenders as of now are test_qos_basic_and_update and the router migrations.

Miguel Lavalle (minsel) wrote :

Looking at the 48 failures of test_qos_basic_and_update, most of them happened at https://github.com/openstack/neutron-tempest-plugin/blob/a7bb1619d43b413eb8d5849eb6df8d0dee260660/neutron_tempest_plugin/scenario/test_qos.py#L251-L256. The way I read this test, everything seemed to work correctly (including the router) and, after updating the QoS policy / rule, a timeout occurs while measuring the actual bandwidth that the VM gets:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 256, in test_qos_basic_and_update
    sleep=1)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/common/utils.py", line 77, in wait_until_true
    while not predicate():
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 254, in <lambda>
    port=self.NC_PORT),
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 110, in _check_bw
    data = client_socket.recv(QoSTestMixin.BUFFER_SIZE)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
fixtures._fixtures.timeout.TimeoutException
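The timeout in this traceback comes from the polling helper wait_until_true. A plain-clock sketch of such a helper (the real neutron_tempest_plugin.common.utils implementation is eventlet-based; this is an approximation of its behavior):

```python
import time

class WaitTimeout(Exception):
    """Raised when the condition is not met within the timeout."""

def wait_until_true(predicate, timeout=60, sleep=1):
    # Poll `predicate` every `sleep` seconds until it returns a truthy
    # value, or raise WaitTimeout after `timeout` seconds have elapsed.
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)
```

In the QoS test the predicate is the bandwidth check itself, so a hang inside the socket recv() surfaces as the fixtures TimeoutException seen above rather than as a WaitTimeout.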

Miguel Lavalle (minsel) wrote :

In the case of test_from_dvr_to_dvr_ha, most of the occurrences failed while waiting for the router ports to go down: https://github.com/openstack/neutron-tempest-plugin/blob/a7bb1619d43b413eb8d5849eb6df8d0dee260660/neutron_tempest_plugin/scenario/test_migration.py#L81

Here's the test traceback:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 222, in test_from_dvr_to_dvr_ha
    after_dvr=True, after_ha=True)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 136, in _test_migration
    self._wait_until_router_ports_down(router['id'])
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 81, in _wait_until_router_ports_down
    timeout=300, sleep=5)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/common/utils.py", line 83, in wait_until_true
    raise WaitTimeout("Timed out after %d seconds" % timeout)
neutron_tempest_plugin.common.utils.WaitTimeout: Timed out after 300 seconds
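The condition the migration test times out on can be sketched as a simple predicate over the router's ports; `client` and its list_ports call below are illustrative stand-ins for the tempest network client, not the real API surface:

```python
# Hypothetical sketch of the check behind _wait_until_router_ports_down:
# list the ports owned by the router and succeed only once every one of
# them reports status DOWN.

def router_ports_down(client, router_id):
    ports = client.list_ports(device_id=router_id)["ports"]
    return all(port["status"] == "DOWN" for port in ports)
```

The WaitTimeout after 300 seconds means at least one router port never left ACTIVE during the migration, which is what bug #1838449 (filed below) tracks.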

Miguel Lavalle (minsel) wrote :

The case of test_from_dvr_to_ha is similar to test_from_dvr_to_dvr_ha, described in the previous note.

Slawek Kaplonski (slaweq) wrote :

Regarding the issue with reaching the metadata service from the VM and the SSH failure, I think I found the reason.
It is a race condition when 2 routers are created within a short time and configured on the same snat node. When both routers are configuring their external gateway, it may happen that one of the routers adds the external net to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router is told that it is not "first" and goes on to update the gateway port instead of creating it.
But if the gateway was in fact not created yet, this causes an exception in: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332
And if this happens, one of the routers will not have properly configured iptables rules to allow requests to 169.254.169.254, so metadata will not work for that instance.
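The first-subscriber logic at the heart of this race can be modeled in a few lines (patterned after FipNamespace.subscribe in neutron/agent/l3/dvr_fip_ns.py; the class below is an illustration, not the real code):

```python
# Simplified model: only the very first subscriber gets is_first=True and
# takes the "create gateway" path; every later router takes the "update"
# path, which fails if the gateway was never actually created.

class FipSubscribers:
    def __init__(self):
        self._subscribers = set()

    def subscribe(self, router_id):
        is_first = not self._subscribers
        self._subscribers.add(router_id)
        return is_first
```

The race is that "subscribe" and "finish creating the gateway" are not one atomic step, so the second router can observe is_first=False before the first router's gateway device exists.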

Fix proposed to branch: master
Review: https://review.opendev.org/673004

Changed in neutron:
assignee: Miguel Lavalle (minsel) → Slawek Kaplonski (slaweq)
status: Confirmed → In Progress
Slawek Kaplonski (slaweq) wrote :

Yesterday I did another round of debugging on this issue with reaching metadata in DVR jobs.
I think I finally found out what the issue is.
All my analysis yesterday was based on the failed test http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/testr_results.html.gz

I found out that 2 different routers were updated to set an external gateway at almost the same time, and two API workers discovered that there was no floating IP agent gateway port created yet for this L3 agent. See:

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_07_864615

and

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_07_933488

Those ports were created in:

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_09_261935

and

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_13_724372

Please note that for both of those ports the device_id is set to "ac1f0f11-3731-439a-b2e9-1708fd2a9ba2", and this device_id is just the L3 agent ID.

Each of those ports was then sent to the L3 agent to be created in the fip-XXX namespace. The first one was created fine, but the second one wasn't "first", so it went down the "update" code path in https://github.com/openstack/neutron/blob/e8b8a8498df4ea68e8ae3fc72e8fca74ab7d2243/neutron/agent/l3/dvr_fip_ns.py#L123

When it went down the "update" path but the fg-XXX device was not found in the namespace, it failed, and the router didn't have properly configured iptables rules (among other things) to be able to reach the 169.254.169.254 address. The error on the L3 agent's side is in http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/compute1/logs/screen-q-l3.txt.gz#_Jul_29_08_13_15_443994

Miguel Lavalle (minsel) wrote :

Filed this bug https://bugs.launchpad.net/neutron/+bug/1838449 for the router migrations failure.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/673004

Reviewed: https://review.opendev.org/673331
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7b81c1bc67d2d85e03b4c96a8c1c558a2f909836
Submitter: Zuul
Branch: master

commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
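The mitigation the commit describes, reduced to an in-process sketch: a lock serializes the check-then-create step per (network, host) so two concurrent callers cannot both observe "no port yet". The names below are hypothetical; the real patch uses neutron's locking utilities inside the server, and, as the message notes, an in-process lock cannot protect API workers on different nodes:

```python
import threading

# Hypothetical illustration of the lock-around-check-then-create pattern.
# A single process-wide lock keeps the simulation simple; the underlying
# weakness the commit message calls out (no protection across processes
# or nodes) applies to any in-process lock.

_creation_lock = threading.Lock()
_fip_agent_gw_ports = {}

def get_or_create_fip_agent_gw_port(network_id, host, create_port):
    key = (network_id, host)
    with _creation_lock:
        # Without the lock, two callers can both miss the cache here and
        # each call create_port, yielding the duplicate-port failure mode.
        if key not in _fip_agent_gw_ports:
            _fip_agent_gw_ports[key] = create_port(network_id, host)
        return _fip_agent_gw_ports[key]
```

A database-level uniqueness constraint on (network, host), as the commit message suggests, would close the cross-node window that this lock cannot.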

Reviewed: https://review.opendev.org/675838
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f7532f0c927a551ea40e81a35818eb83faba4b5a
Submitter: Zuul
Branch: stable/stein

commit f7532f0c927a551ea40e81a35818eb83faba4b5a
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/675846
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=99602ab15b77e8cc8641dc8a6ad760ddb04c3028
Submitter: Zuul
Branch: stable/queens

commit 99602ab15b77e8cc8641dc8a6ad760ddb04c3028
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)
    (cherry picked from commit f7532f0c927a551ea40e81a35818eb83faba4b5a)
    (cherry picked from commit 5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/675844
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85
Submitter: Zuul
Branch: stable/rocky

commit 5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)
    (cherry picked from commit f7532f0c927a551ea40e81a35818eb83faba4b5a)

tags: added: in-stable-rocky
tags: added: neutron-proactive-backport-potential