Debug neutron-tempest-plugin-dvr-multinode-scenario failures

Bug #1830763 reported by Miguel Lavalle on 2019-05-28
Affects: neutron
Importance: High
Assigned to: Slawek Kaplonski
Miguel Lavalle (minsel) on 2019-05-28
Changed in neutron:
importance: Undecided → High
status: New → Confirmed
assignee: nobody → Miguel Lavalle (minsel)
tags: added: gate-failure
Miguel Lavalle (minsel) on 2019-06-04
description: updated

Reviewed: https://review.opendev.org/667547
Committed: https://git.openstack.org/cgit/openstack/neutron-tempest-plugin/commit/?id=6aae0d4b0e7e058e4da992b0cb74be84cedad433
Submitter: Zuul
Branch: master

commit 6aae0d4b0e7e058e4da992b0cb74be84cedad433
Author: Slawek Kaplonski <email address hidden>
Date: Wed Jun 26 10:17:15 2019 +0200

    Change order of creating vms and plug routers in scenario test

    In the scenario tests in the test_connectivity module, the two
    VMs were created first and the subnets were plugged into the
    routers afterwards. That caused a race condition between the
    cloud-init script running during the VM boot process and the
    configuration of the metadata service in the routers. Because of
    that, an instance was often booted without its SSH key configured
    properly, so it was impossible to SSH into the VM and the test
    failed.

    As we don't have any way to ensure that the metadata is already
    configured inside the router, this patch just changes the order of
    operations so that the subnets are plugged into the router first
    and then the VMs are created. Thanks to this change the test
    should be at least much more reliable.

    Change-Id: Ieca8567965789f8d7763a77cecc82059c30b5ced
    Related-Bug: #1830763
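The reordering the commit describes can be sketched as follows. This is a hypothetical illustration with stand-in names (FakeEnv, create_vm, router_attach), not the actual neutron-tempest-plugin API:

```python
# Illustrative sketch only: FakeEnv and its methods are stand-ins for the
# tempest plugin helpers, recording the order of operations in memory.

class FakeEnv:
    """Tiny in-memory stand-in that records the order of operations."""
    def __init__(self, networks, subnets):
        self.networks = networks
        self.subnets = subnets
        self.ops = []

    def create_vm(self, network):
        self.ops.append(("vm", network))
        return network

    def router_attach(self, subnet):
        self.ops.append(("attach", subnet))


def build_topology_old(env):
    # Old order: VMs boot while no subnet is attached to the router, so
    # cloud-init races against metadata-proxy configuration.
    vms = [env.create_vm(net) for net in env.networks]
    for subnet in env.subnets:
        env.router_attach(subnet)
    return vms


def build_topology_new(env):
    # New order: attach subnets first, so the metadata proxy is far more
    # likely to be ready by the time the instances boot.
    for subnet in env.subnets:
        env.router_attach(subnet)
    return [env.create_vm(net) for net in env.networks]
```

The fix does not close the race, it only shrinks the window; that is why the commit message hedges with "should be at least much more reliable".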

Miguel Lavalle (minsel) wrote :

Looking at http://logs.openstack.org/14/668914/1/check/neutron-tempest-plugin-dvr-multinode-scenario/f2ce738/, we can see a failure of test_connectivity_through_2_routers due to the haproxy-metadata-proxy not being ready when an instance requests its metadata:

1) Instance b97e5dcc-7408-4ed4-8cef-95e65a31dec5, fixed IP 10.10.220.79. This instance landed on the Controller node and succeeded in getting its metadata

2) Instance 62f21169-36c3-418d-a1cb-eec491810425, fixed IP 10.10.210.94. This instance landed on the Compute1 node and failed to get its metadata:

Sending discover...
Sending select for 10.10.210.94...
Lease of 10.10.210.94 obtained, lease time 86400
route: SIOCADDRT: Invalid argument
WARN: failed: route add -net "169.254.169.254/32" gw "10.10.210.254"
route: SIOCADDRT: File exists
WARN: failed: route add -net "0.0.0.0/0" gw "10.10.210.254"
cirros-ds 'net' up at 11.29
checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 11.70. request failed
failed 2/20: up 17.21. request failed
failed 3/20: up 20.19. request failed
failed 4/20: up 25.53. request failed
failed 5/20: up 28.54. request failed
failed 6/20: up 33.91. request failed
failed 7/20: up 36.92. request failed
failed 8/20: up 42.31. request failed
failed 9/20: up 45.31. request failed
failed 10/20: up 50.67. request failed
failed 11/20: up 53.65. request failed
failed 12/20: up 59.04. request failed
failed 13/20: up 62.01. request failed
failed 14/20: up 67.37. request failed
failed 15/20: up 70.38. request failed
failed 16/20: up 75.77. request failed
failed 17/20: up 78.78. request failed
failed 18/20: up 84.15. request failed
failed 19/20: up 87.17. request failed
failed 20/20: up 92.54. request failed
failed to read iid from metadata. tried 20
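The cirros console log above is the visible symptom of the race: the guest retries the EC2-style metadata URL a fixed number of times and then gives up. A minimal sketch of that probe (the fetch helper and its parameters are illustrative; only the URL is the real endpoint):

```python
import time
import urllib.request

METADATA_URL = "http://169.254.169.254/2009-04-04/instance-id"

def fetch_instance_id(url=METADATA_URL, attempts=20, sleep=3,
                      opener=urllib.request.urlopen):
    # Retry the metadata endpoint. urllib failures (urllib.error.URLError)
    # are OSError subclasses, so one except clause covers both refused
    # connections and timeouts while the haproxy metadata proxy starts.
    for attempt in range(1, attempts + 1):
        response = None
        try:
            response = opener(url, timeout=10)
            return response.read().decode()
        except OSError:
            time.sleep(sleep)
        finally:
            if response is not None:
                response.close()
    raise RuntimeError("failed to read iid from metadata. tried %d" % attempts)
```

In the failing run the proxy never comes up in the instance's router, so all 20 attempts fail and cloud-init never receives the SSH key.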

3) The routers are c9d4b134-8064-46b9-94a7-d4a6fd775edb:

2019-07-05 09:20:43,293 4546 INFO [tempest.lib.common.rest_client] Request (NetworkConnectivityTest:test_connectivity_through_2_routers): 201 POST http://10.209.129.140:9696/v2.0/routers 4.268s
2019-07-05 09:20:43,293 4546 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external_gateway_info": {"network_id": "ace47216-82e5-47cc-b721-8ffec818f30f"}, "name": "tempest-ap1_rt-1161204222", "admin_state_up": true}}
    Response - Headers: {'content-type': 'application/json', 'content-length': '753', 'x-openstack-request-id': 'req-dbb3cd88-32e4-438b-8717-905e721a1a50', 'date': 'Fri, 05 Jul 2019 09:20:43 GMT', 'connection': 'close', 'status': '201', 'content-location': 'http://10.209.129.140:9696/v2.0/routers'}
        Body: b'{"router": {"id": "c9d4b134-8064-46b9-94a7-d4a6fd775edb"

and 5a4d9551-0650-4ed1-9d18-2aaff5a5bc1f:

2019-07-05 09:20:43,424 4546 INFO [tempest.lib.common.rest_client] Request (NetworkConnectivityTest:test_connectivity_through_2_routers): 201 POST http://10.209.129.140:9696/v2.0/routers 0.130s
2019-07-05 09:20:43,424 4546 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: {"router": {"external...


Miguel Lavalle (minsel) wrote :

At exactly the same time that the L3 agent on Compute1 reports failing to fetch information for router c9d4b134-8064-46b9-94a7-d4a6fd775edb, the neutron server log shows this traceback: http://paste.openstack.org/show/754239/. Interestingly, the neutron server references router 5a4d9551-0650-4ed1-9d18-2aaff5a5bc1f.

Miguel Lavalle (minsel) wrote :

Since merging https://review.opendev.org/#/c/667547/, the relative frequency of test case failures has changed: the frequency of test_connectivity_through_2_routers has decreased significantly. As a consequence, I did an analysis of failure frequency per test over the past 7 days (I went through the 500 occurrences that Kibana returns as a maximum; there may be more beyond that 500 limit). This is what I found:

test_qos_basic_and_update 48
test_from_dvr_to_dvr_ha 39
test_from_dvr_to_ha 38
test_from_dvr_to_legacy 23
test_connectivity_through_2_routers 17
test_snat_external_ip 17
test_vm_reachable_through_compute 10
test_trunk_subport_lifecycle 9
test_qos 8

So the big offenders as of now are test_qos_basic_and_update and the router migrations.

Miguel Lavalle (minsel) wrote :

Looking at the 48 failures of test_qos_basic_and_update, most of them happened at https://github.com/openstack/neutron-tempest-plugin/blob/a7bb1619d43b413eb8d5849eb6df8d0dee260660/neutron_tempest_plugin/scenario/test_qos.py#L251-L256. The way I read this test, everything seemed to work correctly (including the router) and, after updating the QoS policy / rule, a timeout occurs while measuring the actual bandwidth that the VM gets:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 256, in test_qos_basic_and_update
    sleep=1)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/common/utils.py", line 77, in wait_until_true
    while not predicate():
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 254, in <lambda>
    port=self.NC_PORT),
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_qos.py", line 110, in _check_bw
    data = client_socket.recv(QoSTestMixin.BUFFER_SIZE)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()
fixtures._fixtures.timeout.TimeoutException
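The timeout in this traceback comes from the polling helper wait_until_true. A plain-clock sketch of such a helper (the real neutron_tempest_plugin.common.utils implementation is eventlet-based; this is an approximation of its behavior):

```python
import time

class WaitTimeout(Exception):
    """Raised when the condition is not met within the timeout."""

def wait_until_true(predicate, timeout=60, sleep=1):
    # Poll `predicate` every `sleep` seconds until it returns a truthy
    # value, or raise WaitTimeout after `timeout` seconds have elapsed.
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() > deadline:
            raise WaitTimeout("Timed out after %d seconds" % timeout)
        time.sleep(sleep)
```

In the QoS test the predicate is the bandwidth check itself, so a hang inside the socket recv() surfaces as the fixtures TimeoutException seen above rather than as a WaitTimeout.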

Miguel Lavalle (minsel) wrote :

In the case of test_from_dvr_to_dvr_ha, most of the occurrences failed while waiting for the router ports to go down: https://github.com/openstack/neutron-tempest-plugin/blob/a7bb1619d43b413eb8d5849eb6df8d0dee260660/neutron_tempest_plugin/scenario/test_migration.py#L81

Here's the test traceback:

Traceback (most recent call last):
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 222, in test_from_dvr_to_dvr_ha
    after_dvr=True, after_ha=True)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 136, in _test_migration
    self._wait_until_router_ports_down(router['id'])
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/scenario/test_migration.py", line 81, in _wait_until_router_ports_down
    timeout=300, sleep=5)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.6/site-packages/neutron_tempest_plugin/common/utils.py", line 83, in wait_until_true
    raise WaitTimeout("Timed out after %d seconds" % timeout)
neutron_tempest_plugin.common.utils.WaitTimeout: Timed out after 300 seconds
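The condition the migration test times out on can be sketched as a simple predicate over the router's ports; `client` and its list_ports call below are illustrative stand-ins for the tempest network client, not the real API surface:

```python
# Hypothetical sketch of the check behind _wait_until_router_ports_down:
# list the ports owned by the router and succeed only once every one of
# them reports status DOWN.

def router_ports_down(client, router_id):
    ports = client.list_ports(device_id=router_id)["ports"]
    return all(port["status"] == "DOWN" for port in ports)
```

The WaitTimeout after 300 seconds means at least one router port never left ACTIVE during the migration, which is what bug #1838449 (filed below) tracks.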

Miguel Lavalle (minsel) wrote :

The case of test_from_dvr_to_ha is similar to test_from_dvr_to_dvr_ha, described in the previous note.

Slawek Kaplonski (slaweq) wrote :

Regarding the issue with reaching the metadata service from the VM and the SSH failure, I think I found the reason.
It is a race condition when 2 routers are created within a short time and configured on the same snat node. When both routers are configuring their external gateway, it may happen that one of the routers adds the external net to the subscribers list in https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L129, so the second router is told that it is not "first" and goes on to update the gateway port instead of creating it.
But if the gateway was in fact not created yet, this causes an exception in: https://github.com/openstack/neutron/blob/master/neutron/agent/l3/dvr_fip_ns.py#L332
And if this happens, one of the routers will not have properly configured iptables rules to allow requests to 169.254.169.254, so metadata will not work for that instance.
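The first-subscriber logic at the heart of this race can be modeled in a few lines (patterned after FipNamespace.subscribe in neutron/agent/l3/dvr_fip_ns.py; the class below is an illustration, not the real code):

```python
# Simplified model: only the very first subscriber gets is_first=True and
# takes the "create gateway" path; every later router takes the "update"
# path, which fails if the gateway was never actually created.

class FipSubscribers:
    def __init__(self):
        self._subscribers = set()

    def subscribe(self, router_id):
        is_first = not self._subscribers
        self._subscribers.add(router_id)
        return is_first
```

The race is that "subscribe" and "finish creating the gateway" are not one atomic step, so the second router can observe is_first=False before the first router's gateway device exists.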

Fix proposed to branch: master
Review: https://review.opendev.org/673004

Changed in neutron:
assignee: Miguel Lavalle (minsel) → Slawek Kaplonski (slaweq)
status: Confirmed → In Progress
Slawek Kaplonski (slaweq) wrote :

Yesterday I did another round of debugging on this issue with reaching metadata in DVR jobs.
I think I finally found out what the issue is.
All my analysis yesterday was based on the failed test http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/testr_results.html.gz

I found out that 2 different routers were updated to set an external gateway at almost the same time, and two API workers discovered that there was no floating IP agent gateway port created yet for this L3 agent. See:

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_07_864615

and

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_07_933488

Those ports were created in:

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_09_261935

and

http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/controller/logs/screen-q-svc.txt.gz#_Jul_29_08_13_13_724372

Please note that for both of those ports the device_id is set to "ac1f0f11-3731-439a-b2e9-1708fd2a9ba2", and this device_id is just the L3 agent ID.

Each of those ports was then sent to the L3 agent to be created in the fip-XXX namespace. The first one was created fine, but the second one wasn't "first", so it went down the "update" code path in https://github.com/openstack/neutron/blob/e8b8a8498df4ea68e8ae3fc72e8fca74ab7d2243/neutron/agent/l3/dvr_fip_ns.py#L123

When it went down the "update" path but the fg-XXX device was not found in the namespace, it failed, and the router didn't have properly configured iptables rules (among other things) to be able to reach the 169.254.169.254 address. The error on the L3 agent's side is in http://logs.openstack.org/80/671780/1/check/neutron-tempest-plugin-dvr-multinode-scenario/20b67fb/compute1/logs/screen-q-l3.txt.gz#_Jul_29_08_13_15_443994

Miguel Lavalle (minsel) wrote :

Filed this bug https://bugs.launchpad.net/neutron/+bug/1838449 for the router migrations failure.

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.opendev.org/673004

Reviewed: https://review.opendev.org/673331
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7b81c1bc67d2d85e03b4c96a8c1c558a2f909836
Submitter: Zuul
Branch: master

commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
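The mitigation the commit describes, reduced to an in-process sketch: a lock serializes the check-then-create step per (network, host) so two concurrent callers cannot both observe "no port yet". The names below are hypothetical; the real patch uses neutron's locking utilities inside the server, and, as the message notes, an in-process lock cannot protect API workers on different nodes:

```python
import threading

# Hypothetical illustration of the lock-around-check-then-create pattern.
# A single process-wide lock keeps the simulation simple; the underlying
# weakness the commit message calls out (no protection across processes
# or nodes) applies to any in-process lock.

_creation_lock = threading.Lock()
_fip_agent_gw_ports = {}

def get_or_create_fip_agent_gw_port(network_id, host, create_port):
    key = (network_id, host)
    with _creation_lock:
        # Without the lock, two callers can both miss the cache here and
        # each call create_port, yielding the duplicate-port failure mode.
        if key not in _fip_agent_gw_ports:
            _fip_agent_gw_ports[key] = create_port(network_id, host)
        return _fip_agent_gw_ports[key]
```

A database-level uniqueness constraint on (network, host), as the commit message suggests, would close the cross-node window that this lock cannot.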

Reviewed: https://review.opendev.org/675838
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f7532f0c927a551ea40e81a35818eb83faba4b5a
Submitter: Zuul
Branch: stable/stein

commit f7532f0c927a551ea40e81a35818eb83faba4b5a
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/675846
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=99602ab15b77e8cc8641dc8a6ad760ddb04c3028
Submitter: Zuul
Branch: stable/queens

commit 99602ab15b77e8cc8641dc8a6ad760ddb04c3028
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)
    (cherry picked from commit f7532f0c927a551ea40e81a35818eb83faba4b5a)
    (cherry picked from commit 5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85)

tags: added: in-stable-queens

Reviewed: https://review.opendev.org/675844
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85
Submitter: Zuul
Branch: stable/rocky

commit 5c1afcaf2b9cb1bd09267a26dd4f5d7f7e99bf85
Author: Slawek Kaplonski <email address hidden>
Date: Mon Jul 29 18:15:33 2019 +0200

    [DVR] Add lock during creation of FIP agent gateway port

    When a new external network is set as the gateway network for a
    DVR router, neutron tries to create a floating IP agent gateway port.
    There should always be at most 1 such port per network per L3 agent,
    but sometimes, when 2 requests to set the external gateway for 2
    different routers are executed at almost the same time, it may
    happen that 2 such ports are created.
    That will cause an error with the configuration of one of the routers
    on the L3 agent, and this will cause e.g. problems with access from
    VMs to the metadata service.
    Such issues are visible in DVR CI jobs from time to time. Please check
    the related bug for details.

    This patch adds a lock mechanism around the creation of such FIP
    gateway ports.
    This solution doesn't fully resolve the existing race condition: if
    the 2 requests are processed by API workers running on 2 different
    nodes, the race can still happen.
    But it should mitigate the issue and at least solve the problem in
    the upstream gates.
    For a proper fix we should probably add a constraint at the database
    level to prevent the creation of 2 such ports for one network and one
    host, but such a solution would not be easy to backport to stable
    branches, so I would prefer to first go with this easy workaround.

    Conflicts:
        neutron/db/l3_dvr_db.py

    Change-Id: Iabab7e4d36c7d6a876b2b74423efd7106a5f63f6
    Related-Bug: #1830763
    (cherry picked from commit 7b81c1bc67d2d85e03b4c96a8c1c558a2f909836)
    (cherry picked from commit f7532f0c927a551ea40e81a35818eb83faba4b5a)

tags: added: in-stable-rocky
tags: added: neutron-proactive-backport-potential