Tempest jobs failing with SSH timeout sometimes

Bug #1735154 reported by Daniel Alvarez on 2017-11-29
This bug affects 2 people.
Affects: networking-ovn | Importance: High | Assigned to: Daniel Alvarez

Bug Description

We've been observing lately that some tempest tests fail with a timeout when trying to SSH to the FIP of an instance. After debugging the failures, we can tell that the reason is that ARP replies from the FIP aren't coming through. This doesn't always happen and I can't reproduce it myself in devstack.

2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@172.24.5.10 after 16 attempts: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.10
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh File "tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/paramiko/client.py", line 357, in connect
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh raise NoValidConnectionsError(errors)
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.10
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh

Instance IP is 10.1.0.3.
NAT entry [0]:

187 OVSDB JSON 802 48c3f9ccd173a7722c22dea80c779ffc567fc162
188 {"_date":1511790428367,"Logical_Router":{"f89bbcb7-063b-42d7-b943-5825ee8a839e":{"nat":["set",[["uuid","5ab4b7f5-da9a-4aeb-aaa8-4f471633d724"],["uuid","b2681871-58a3-485f-838a-672d6e92e55e"]]]}},"NAT":{"b2681871-58a3-485f-838a-672d6e92e55e":{"external_ip":"172.24.5.10","logical_ip":"10.1.0.3","type":"dnat_and_snat"}},"Logical_Switch":{"e55f3c0e-f59e-4d25-b566-8c4c63f8a6b6":{"ports":["set",[["uuid","0135db16-8a8c-4dc4-990d-3e5921cf79ee"],["uuid","31aa3b52-821b-415c-9770-8dad54c13d12"],["uuid","4557f804-96b7-4035-933e-90d8c79f2332"],["uuid","d18bd12c-ab2d-4002-856c-9fb8cc111f26"],["uuid","d78d9e54-14fe-46df-8341-031d68816cef"],["uuid","dbd6621b-60ff-4466-b995-213a03edf175"],["uuid","e89ec11b-7b7e-490b-b8ab-69e85597d536"]]]}},"Logical_Switch_Port":{"2ec1ede6-ec94-48d6-8680-895ba3d7b117":null}}

Search for "Request who-has 172.24.5.10" at [1] and you won't see any replies back.

Logical flows look good to me (at first glance), so it could be an ovn-controller/ovs-vswitchd bug. We'll try to include periodic dumps of the OpenFlow tables and see what they look like when this failure occurs.

[0] http://logs.openstack.org/97/523097/2/check/networking-ovn-tempest-dsvm-ovs-release/9a70bc0/logs/ovs_dbs/ovnnb.txt.gz
[1] http://logs.openstack.org/97/523097/2/check/networking-ovn-tempest-dsvm-ovs-release/9a70bc0/logs/screen-br-ex-tcpdump.txt.gz
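The check against [1] boils down to scanning the capture for ARP requests for the FIP and replies from it. A minimal sketch of that scan; the sample capture line is illustrative, following the tcpdump ARP output format these jobs produce (on a live node the input would come from `tcpdump -l -n -e -i br-ex arp`):

```python
# Count ARP requests for the FIP and replies from it in a saved tcpdump
# text log like [1]. The sample line below is illustrative.
import re

FIP = "172.24.5.10"

capture = """\
17:38:31.086964 76:17:ae:71:f2:47 > fa:16:3e:a1:de:ed, ethertype ARP (0x0806), length 42: Request who-has 172.24.5.10 tell 172.24.5.1, length 28
"""

requests = len(re.findall(r"Request who-has %s\b" % re.escape(FIP), capture))
replies = len(re.findall(r"Reply %s is-at" % re.escape(FIP), capture))
print("requests=%d replies=%d" % (requests, replies))  # requests=1 replies=0
```

A non-zero request count with zero replies is the failure signature described above.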

Daniel Alvarez (dalvarezs) wrote :

This bug still shows up. Adding more info after some investigation:

1. A server is created and a FIP is associated to it:

Tempest log:

2018-04-12 17:38:26,073 6080 INFO [tempest.lib.common.rest_client] Request (TestServerBasicOps:test_server_basic_ops): 202 POST http://198.72.124.116/compute/v2.1/servers/d98c0fd9-e92c-4fd6-8391-e4452b7b32ae/action 1.282s
2018-04-12 17:38:26,074 6080 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'X-Auth-Token': '<omitted>', 'Accept': 'application/json'}
Body: {"addFloatingIp": {"address": "172.24.5.5"}}

neutron-server log:
The dnat_and_snat NAT entry is added (fixed IP is 10.1.0.9):

Apr 12 17:38:25.418601 ubuntu-xenial-inap-mtl01-0003484549 neutron-server[25773]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=2): AddNATRuleInLRouterCommand(lrouter=neutron-57884ee5-47a6-4587-949d-b066f5ea56c3, columns={'external_ip': '172.24.5.5', 'external_ids': {'neutron:router_name': 'neutron-57884ee5-47a6-4587-949d-b066f5ea56c3', 'neutron:fip_id': '20b39da8-413d-48f6-8d14-e0cc18a59a74', 'neutron:revision_number': '1', 'neutron:fip_port_id': '3846002b-ac5b-4a8e-b323-1117df6aa5a2'}, 'type': 'dnat_and_snat', 'logical_ip': '10.1.0.9'}) {{(pid=25866) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

br-int flows:

ARP entry looks good (table=9) with the MAC of the gateway port:

Apr 12 17:38:38.077970 ubuntu-xenial-inap-mtl01-0003484549 sh[23566]: cookie=0xfdc62a55, duration=12.576s, table=9, n_packets=0, n_bytes=0, idle_age=12, priority=90,arp,reg14=0x1,metadata=0x5c,arp_tpa=172.24.5.5,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],mod_dl_src:fa:16:3e:83:d5:0d,load:0xfa163e83d50d->NXM_NX_ARP_SHA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xac180505->NXM_OF_ARP_SPA[],load:0x1->NXM_NX_REG15[],load:0x1->NXM_NX_REG10[0],resubmit(,32)

Note the n_packets=0, this flow never got hit.

However, in table=12 we would expect to see a DNAT action for this FIP. We do see a DNAT action for the fixed IP address, but towards a different FIP (in a different datapath), so the expected DNAT action is not there:

Apr 12 17:38:38.116595 ubuntu-xenial-inap-mtl01-0003484549 sh[23566]: cookie=0xca5b11b2, duration=29.150s, table=12, n_packets=5, n_bytes=370, idle_age=12, priority=100,ip,reg14=0x1,metadata=0x57,nw_dst=172.24.5.7 actions=ct(commit,table=13,zone=NXM_NX_REG11[0..15],nat(dst=10.1.0.9))

This means that traffic directed to 172.24.5.5 will never reach the instance as the flow is not installed.
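That table=12 check can be sketched as a small filter; the helper name here is illustrative, and the sample flow line is the one from the log above (on a live node the input would come from `ovs-ofctl dump-flows br-int`):

```python
# Look for a table=12 DNAT flow (ct(...nat(dst=...))) matching a given FIP
# in a flow dump. Helper name is illustrative, not a real tool.
import re

def has_dnat_flow(flow_dump, fip):
    """Return True if table=12 contains a ct nat(dst=...) flow for fip."""
    pattern = r"table=12.*nw_dst=%s\b.*nat\(dst=" % re.escape(fip)
    return any(re.search(pattern, line) for line in flow_dump.splitlines())

flows = ("cookie=0xca5b11b2, duration=29.150s, table=12, n_packets=5, "
         "n_bytes=370, idle_age=12, priority=100,ip,reg14=0x1,metadata=0x57,"
         "nw_dst=172.24.5.7 actions=ct(commit,table=13,"
         "zone=NXM_NX_REG11[0..15],nat(dst=10.1.0.9))")

print(has_dnat_flow(flows, "172.24.5.7"))  # True
print(has_dnat_flow(flows, "172.24.5.5"))  # False
```

With the dump from the failing run, the FIP 172.24.5.5 comes back with no matching flow, which matches the broken behavior described above.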

tcpdump logs:

I would expect to see ARP responses for the FIP, but I don't see any:

Apr 12 17:38:31.367332 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]: 17:38:31.086964 76:17:ae:71:f2:47 > fa:16:3e:a1:de:ed, ethertype ARP (0x0806), length 42: Request who-has 172.24.5.5 tell 172.24.5.1, length 28
Apr 12 17:38:31.367518 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]: 0x0000: 0001 0800 0604 0001 7617 ae71 f247 ac18 ........v..q.G..
Apr 12 17:38:31.367672 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]: 0x0010: 0501 0000 0000 0000 ac18 0505 ...


Related fix proposed to branch: master
Review: https://review.openstack.org/602449

Daniel Alvarez (dalvarezs) wrote :

After almost going blind debugging and comparing against the working case, I can see that when traffic to the FIP is not working, there's no SNAT entry for the gateway:

Working case when adding a FIP (dnat_and_snat):

{"uuid-name":"row252a0a68_35ea_4d83_adc2_c62fd28ac862","table":"NAT","op":"insert","row":{"external_ip":"172.24.5.5","logical_ip":"10.1.0.5",...

"table":"Logical_Router","op":"update","row":{"nat":["set",[["named-uuid","row252a0a68_35ea_4d83_adc2_c62fd28ac862"],["uuid","48bb5186-e0f3-4e91-812d-72de2d8a8466"]]]}},{"where":[["_uuid","==",["uuid","35149ca5-df87-48c1-b01b-e7073e6f99ad"]]]

Basically, it creates a NAT row and adds it to the Logical Router, where another row had already been added with ID 48bb5186-e0f3-4e91-812d-72de2d8a8466.

Taking a look at that NAT row (previous transaction):

"NAT":{"48bb5186-e0f3-4e91-812d-72de2d8a8466":{"insert":{"external_ip":"172.24.5.16","logical_ip":"10.1.0.0/28","type":"snat"}}}

This port is the router gateway port:

_uuid : 9bfd89ac-aba3-4c60-9e93-a15066864f98
addresses : [router]
dhcpv4_options : []
dhcpv6_options : []
dynamic_addresses : []
enabled : true
external_ids : {"neutron:cidrs"="172.24.4.5/24 2001:db8::5/64", "neutron:device_id"="d40a90a2-d0fb-42f3-9db5-45deb1115e68", "neutron:device_owner"="network:router_gateway", "neutron:network_name"="neutron-7a0cde2a-5b80-48b3-9fcb-ecc3b110ac01", "neutron:port_name"="", "neutron:project_id"="", "neutron:revision_number"="6", "neutron:security_group_ids"=""}
name : "59d0903a-4aa2-47b1-89ce-934de5f7b88a"
options : {nat-addresses=router, router-port="lrp-59d0903a-4aa2-47b1-89ce-934de5f7b88a"}
parent_name : []
port_security : []
tag : []
tag_request : []
type : router
up : true

For the *non working* case, that SNAT rule is missing.

The DNAT entry is created, and I realized it is going to be the first NAT entry added to the Logical Router:

Sep 13 00:55:56.255737 ubuntu-xenial-ovh-gra1-0001985292 ovsdb-server[27529]: ovs|209671|jsonrpc|DBG|tcp:147.135.132.110:53442: send notification, method="update2", params=["5fa818c2-b6ea-11e8-abe5-fa163e7f50bc",{"Logical_Router":{"2eaf1ac4-e639-45bd-85e3-b0d018936db4":{"modify":{"nat":["uuid","43bd0972-26f0-4609-8a27-73bfad25484c"]}}},"NB_Global":{"4e3729fc-0435-4beb-9805-6d1d8dbd8fd6":{"modify":{"nb_cfg":4332}}},"NAT":{"43bd0972-26f0-4609-8a27-73bfad25484c":{"insert":{"external_ip":"172.24.5.2","logical_ip":"10.1.0.7","external_ids":["map",[["neutron:fip_id","8c8722ab-5d3d-4c69-a734-beba2459ade6"],["neutron:fip_port_id","7678dbfe-82c1-46c4-8461-cc7d73607890"],["neutron:revision_number","1"],["neutron:router_name","neutron-ff13d1be-f9ab-4240-8969-0da5e374c57d"]]],"type":"dnat_and_snat"}}}}]

The DNAT entry is 43bd0972-26f0-4609-8a27-73bfad25484c, which is the only one existing in the Logical Router.

Therefore, any attempt to reach the FIP won't succeed, as ovn-northd won't generate the required Logical Flows.

At this point the next step is to figure out whether the gateway port was never created or the SNAT entry failed to be created/added to the Logical ...


Related fix proposed to branch: master
Review: https://review.openstack.org/602490

Change abandoned by Miguel Angel Ajo (<email address hidden>) on branch: master
Review: https://review.openstack.org/602449
Reason: in favor of https://review.openstack.org/#/c/602490/1

Daniel Alvarez (dalvarezs) wrote :

After analyzing more logs, I confirmed that the gateway port, at least, is not present in the router:

- The switch has 3 ports (metadata port, logical port for the VM and the router port):

switch a8168e8d-49e5-4032-849b-2a7107e93610 (neutron-6cbe7647-d46f-4ad6-a193-fc59fdc352f3) (aka tempest-network-smoke--988884713)
    port c9998671-bfa3-44f3-8837-2977f257481b
        type: localport
        addresses: ["fa:16:3e:95:24:2b 10.1.0.2"]
    port a02af8d6-890f-4a44-ac94-32d2df9700e6
        type: router
        router-port: lrp-a02af8d6-890f-4a44-ac94-32d2df9700e6
    port 3dcf857a-5c6a-4f36-a509-c55ec4d52b87
        addresses: ["fa:16:3e:a5:25:2d 10.1.0.3"]
switch 9f07b6f6-0f09-459d-b614-9c0ba25d47d9 (neutron-46e7378b-0b97-455e-977d-3daead3bf546) (aka tempest-test-network--878432264)

- The router has the external network port missing:

router 9612b8f5-b2d6-4cc0-a671-cabe59cf177b (neutron-cde28807-1542-4329-a314-20406e4c3e92) (aka tempest-TestNetworkBasicOps-router-580083611)
    port lrp-a02af8d6-890f-4a44-ac94-32d2df9700e6
        mac: "fa:16:3e:a0:b8:e1"
        networks: ["10.1.0.1/28"]
    nat be099503-5bbd-48e4-9fac-fd68f2db1663
        external ip: "172.24.5.30"
        logical ip: "10.1.0.0/28"
        type: "snat"
    nat d3e69a5e-bbfc-42f1-8d82-c9f8f5b391a5
        external ip: "172.24.5.25"
        logical ip: "10.1.0.3"
        type: "dnat_and_snat"

Need to investigate why. Will update this soon.

You nailed it down dude!!! :)

Daniel Alvarez (dalvarezs) wrote :

More debugging.

Apparently, the gateway is not added to the router, so the problem must be somewhere around here:

https://github.com/openstack/networking-ovn/blob/9126c366ced27b773faffb3cc84093db7e23bc72/networking_ovn/common/ovn_client.py#L930-#L934

Either _get_v4_network_of_all_router_ports is returning no networks or router.get(l3.EXTERNAL_GW_INFO) is returning no info.

Daniel Alvarez (dalvarezs) wrote :

Also, the problematic router gets created while another router is being created at the same time:

11:59 <dalvarez> 1. AddLRouterCommand (router1)
11:59 <dalvarez> 2. AddLRouterCommand (router2) <--- problematic one
11:59 <dalvarez> 3. AddLRouterPortCommand (router1)
11:59 <dalvarez> 4. SetLRouterPortInLSwitchPortCommand (router1)
11:59 <dalvarez> 5. AddStaticRouteCommand (router1)
12:02 <dalvarez> create_router() does 1, 3, 4 and 5
12:02 <dalvarez> but somehow there's another router in between
12:02 <dalvarez> and surprisingly im seeing that im missing a LRP for 2

Daniel Alvarez (dalvarezs) wrote :

Sep 14 22:08:58.665604 ubuntu-xenial-rax-dfw-0002028832 neutron-server[345]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=0): AddLRouterCommand(may_exist=True, columns={'external_ids': {'neutron:gw_port_id': u'b26d240d-e5f1-452a-b42b-2d74b815799c', 'neutron:router_name': u'tempest-TestSecurityGroupsBasicOps-router-1842478056', 'neutron:revision_number': '2'}, 'enabled': True, 'options': {}}, name=neutron-780c5403-d1f2-44a3-9c98-9153cce6caec) {{(pid=467) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Sep 14 22:08:58.666326 ubuntu-xenial-rax-dfw-0002028832 neutron-server[345]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=1): AddLRouterCommand(may_exist=True, columns={'external_ids': {'neutron:gw_port_id': u'3d745598-db5e-4a39-8028-89dc22a869f3', 'neutron:router_name': u'tempest-TestNetworkBasicOps-router-580083611', 'neutron:revision_number': '2'}, 'enabled': True, 'options': {}}, name=neutron-cde28807-1542-4329-a314-20406e4c3e92) {{(pid=467) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Sep 14 22:08:58.666929 ubuntu-xenial-rax-dfw-0002028832 neutron-server[345]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=2): AddLRouterPortCommand(name=lrp-b26d240d-e5f1-452a-b42b-2d74b815799c, may_exist=True, lrouter=neutron-780c5403-d1f2-44a3-9c98-9153cce6caec, columns={'mac': u'fa:16:3e:49:6d:fc', 'gateway_chassis': [u'fcdbec9c-8125-49c2-b1ca-f3d83ddfb5f4'], 'external_ids': {'neutron:subnet_ids': u'6e60301b-bdf2-4cb4-afd1-71fb6a5a3aec 6c18098b-d8a3-4b95-af34-2e11fff2aeef', 'neutron:revision_number': '1', 'neutron:router_name': u'780c5403-d1f2-44a3-9c98-9153cce6caec'}, 'networks': [u'172.24.5.10/24', u'2001:db8::a/64']}) {{(pid=467) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Sep 14 22:08:58.667872 ubuntu-xenial-rax-dfw-0002028832 neutron-server[345]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=3): SetLRouterPortInLSwitchPortCommand(if_exists=True, lswitch_port=b26d240d-e5f1-452a-b42b-2d74b815799c, lrouter_port=lrp-b26d240d-e5f1-452a-b42b-2d74b815799c, is_gw_port=True, lsp_address=router) {{(pid=467) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Sep 14 22:08:58.668517 ubuntu-xenial-rax-dfw-0002028832 neutron-server[345]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn n=1 command(idx=4): AddStaticRouteCommand(lrouter=neutron-780c5403-d1f2-44a3-9c98-9153cce6caec, columns={'ip_prefix': '0.0.0.0/0', 'external_ids': {'neutron:subnet_id': u'6e60301b-bdf2-4cb4-afd1-71fb6a5a3aec', 'neutron:is_ext_gw': 'true'}, 'nexthop': '172.24.5.1'}) {{(pid=467) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}

Daniel Alvarez (dalvarezs) wrote :

Finally!

After matching against working cases, I could see that when the test fails, it's because some commands of the transaction don't get committed, while others (the first ones) get committed in what is supposed to be a different transaction (actually a different test running in the same worker thread). It turns out to be an ovsdbapp bug:

1. T1 creates a new transaction txn1
2. T1 adds a new command to it C1.1
3. T1 yields the execution (due to some I/O operation for example)
4. T2 requests a transaction so txn1 is returned as a nested transaction
5. T2 adds a new command to it C2.1
6. T2 adds a new command to it C2.2
7. T2 commits the transaction
8. T1 adds a new command to txn1 C1.2
9. T1 adds a new command to txn1 C1.3

Reported the bug there too:
https://bugs.launchpad.net/ovsdbapp/+bug/1793499

\o/
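The nine-step interleaving above can be modeled in a few lines of Python. This is a deliberately simplified sketch of the buggy behavior, not the real ovsdbapp API; the class and method names are illustrative:

```python
# Minimal model of the race: the same active transaction object is handed
# to a second thread (as a "nested" transaction) instead of a fresh one.

class Command:
    def __init__(self, name):
        self.name = name


class SharedTransaction:
    def __init__(self):
        self.commands = []
        self.committed_batches = []

    def add(self, cmd):
        self.commands.append(cmd)

    def commit(self):
        # Flushes *everything* queued so far, including commands another
        # thread queued but has not finished adding to.
        self.committed_batches.append([c.name for c in self.commands])
        self.commands = []


txn = SharedTransaction()

txn.add(Command("C1.1"))   # steps 1-2: T1 creates txn1 and adds C1.1
# step 3: T1 yields on I/O; step 4: T2 asks for a txn and gets txn1 back
txn.add(Command("C2.1"))   # step 5: T2 adds C2.1
txn.add(Command("C2.2"))   # step 6: T2 adds C2.2
txn.commit()               # step 7: T2 commits, taking C1.1 with it

txn.add(Command("C1.2"))   # step 8: T1 resumes and adds C1.2
txn.add(Command("C1.3"))   # step 9: T1 adds C1.3
txn.commit()               # T1's remaining commands land in a later txn

print(txn.committed_batches)
# [['C1.1', 'C2.1', 'C2.2'], ['C1.2', 'C1.3']]
```

T1's commands end up split across two transactions, which is exactly why the gateway port or SNAT entry can be missing while the rest of the router setup succeeds.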

Changed in networking-ovn:
assignee: nobody → Daniel Alvarez (dalvarezs)
importance: Undecided → High

This issue was fixed in the openstack/ovsdbapp 0.13.0 release.

This issue was fixed in the openstack/ovsdbapp 0.10.2 release.

This issue was fixed in the openstack/ovsdbapp 0.12.2 release.
