We've been observing lately that some tempest tests fail with a timeout when trying to SSH to the FIP of an instance. After debugging the failures, we can tell that the reason is that ARP replies for the FIP aren't coming through. This doesn't happen consistently and I can't reproduce it myself in devstack.
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh [-] Failed to establish authenticated ssh connection to cirros@172.24.5.10 after 16 attempts: NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.10
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh File "tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh File "/opt/stack/new/tempest/.tox/tempest/local/lib/python2.7/site-packages/paramiko/client.py", line 357, in connect
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh raise NoValidConnectionsError(errors)
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh NoValidConnectionsError: [Errno None] Unable to connect to port 22 on 172.24.5.10
2017-11-27 13:50:38.477 10303 ERROR tempest.lib.common.ssh
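For a quick manual check of the symptom outside of tempest, here is a minimal paramiko sketch (the FIP and the cirros credentials are assumptions for a manual run, not taken from the test itself):

    #!/usr/bin/env python
    # Quick manual check of SSH reachability of the FIP, mirroring what tempest
    # does through paramiko. FIP and cirros credentials are assumptions.
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect("172.24.5.10", username="cirros",
                       password="cubswin:)", timeout=10)
        print("SSH to the FIP works")
    except Exception as exc:
        print("SSH to the FIP failed: %s" % exc)
    finally:
        client.close()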
Instance IP is 10.1.0.3.
NAT entry [0]:
187 OVSDB JSON 802 48c3f9ccd173a7722c22dea80c779ffc567fc162
188 {"_date":1511790428367,"Logical_Router":{"f89bbcb7-063b-42d7-b943-5825ee8a839e":{"nat":["set",[["uuid","5ab4b7f5-da9a-4aeb-aaa8-4f471633d724"],["uuid","b2681871-58a3-485f-838a-672d6e92e5 5e"]]]}},"NAT":{"b2681871-58a3-485f-838a-672d6e92e55e":{"external_ip":"172.24.5.10","logical_ip":"10.1.0.3","type":"dnat_and_snat"}},"Logical_Switch":{"e55f3c0e-f59e-4d25-b566-8c4c63f8a6 b6":{"ports":["set",[["uuid","0135db16-8a8c-4dc4-990d-3e5921cf79ee"],["uuid","31aa3b52-821b-415c-9770-8dad54c13d12"],["uuid","4557f804-96b7-4035-933e-90d8c79f2332"],["uuid","d18bd12c-ab2 d-4002-856c-9fb8cc111f26"],["uuid","d78d9e54-14fe-46df-8341-031d68816cef"],["uuid","dbd6621b-60ff-4466-b995-213a03edf175"],["uuid","e89ec11b-7b7e-490b-b8ab-69e85597d536"]]]}},"Logical_Sw itch_Port":{"2ec1ede6-ec94-48d6-8680-895ba3d7b117":null}}
Search for "Request who-has 172.24.5.10" at [1] and you won't see any replies back.
Logical flows look good to me (at first glance) so it could be an ovn-controller/ovs-vswitchd bug. We'll try to include periodic dumps of the OpenFlow tables and see what they look like when this failure occurs (a starting-point sketch follows the links below).
[0] http://logs.openstack.org/97/523097/2/check/networking-ovn-tempest-dsvm-ovs-release/9a70bc0/logs/ovs_dbs/ovnnb.txt.gz
[1] http://logs.openstack.org/97/523097/2/check/networking-ovn-tempest-dsvm-ovs-release/9a70bc0/logs/screen-br-ex-tcpdump.txt.gz
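As a starting point for those periodic dumps, a minimal sketch that snapshots the br-int OpenFlow tables with ovs-ofctl (bridge name, interval and output path are assumptions; needs root to talk to ovs-vswitchd):

    #!/usr/bin/env python
    # Periodically snapshot the OpenFlow tables of br-int so we can compare
    # what was installed around the time an SSH/ARP failure happens.
    import subprocess
    import time

    BRIDGE = "br-int"                  # assumption: OVN integration bridge
    OUTPUT = "/tmp/br-int-flows.log"   # assumption: path collected by the CI job
    INTERVAL = 10                      # seconds between snapshots

    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        flows = subprocess.check_output(["ovs-ofctl", "dump-flows", BRIDGE])
        with open(OUTPUT, "a") as out:
            out.write("=== %s ===\n" % stamp)
            out.write(flows.decode("utf-8", "replace"))
        time.sleep(INTERVAL)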
This bug still shows up; adding more info after some investigation:
1. A server is created and a FIP is associated with it:
Tempest log:
2018-04-12 17:38:26,073 6080 INFO [tempest.lib.common.rest_client] Request (TestServerBasicOps:test_server_basic_ops): 202 POST http://198.72.124.116/compute/v2.1/servers/d98c0fd9-e92c-4fd6-8391-e4452b7b32ae/action 1.282s
2018-04-12 17:38:26,074 6080 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'X-Auth-Token': '<omitted>', 'Accept': 'application/json'}
        Body: {"addFloatingIp": {"address": "172.24.5.5"}}
neutron-server log:
The dnat_and_snat NAT entry is added (fixed IP is 10.1.0.9):
Apr 12 17:38:25.418601 ubuntu-xenial-inap-mtl01-0003484549 neutron-server[25773]: DEBUG ovsdbapp.backend.ovs_idl.transaction [-] Running txn command(idx=2): AddNATRuleInLRouterCommand(lrouter=neutron-57884ee5-47a6-4587-949d-b066f5ea56c3, columns={'external_ip': '172.24.5.5', 'external_ids': {'neutron:router_name': 'neutron-57884ee5-47a6-4587-949d-b066f5ea56c3', 'neutron:fip_id': '20b39da8-413d-48f6-8d14-e0cc18a59a74', 'neutron:revision_number': '1', 'neutron:fip_port_id': '3846002b-ac5b-4a8e-b323-1117df6aa5a2'}, 'type': 'dnat_and_snat', 'logical_ip': '10.1.0.9'}) {{(pid=25866) do_commit /usr/local/lib/python2.7/dist-packages/ovsdbapp/backend/ovs_idl/transaction.py:84}}
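To double-check that the entry actually landed in the NB database, a minimal sketch using ovn-nbctl lr-nat-list (router name and addresses come from the log above; needs access to the NB DB, typically root):

    #!/usr/bin/env python
    # Verify that the dnat_and_snat entry for the FIP is present on the logical router.
    # Router name and addresses come from the neutron-server log above.
    import subprocess

    ROUTER = "neutron-57884ee5-47a6-4587-949d-b066f5ea56c3"
    FIP = "172.24.5.5"
    FIXED_IP = "10.1.0.9"

    nat_table = subprocess.check_output(["ovn-nbctl", "lr-nat-list", ROUTER]).decode()
    print(nat_table)
    if FIP in nat_table and FIXED_IP in nat_table:
        print("dnat_and_snat %s -> %s is present in the NB DB" % (FIP, FIXED_IP))
    else:
        print("NAT entry missing from the NB DB")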
br-int flows:
The ARP responder flow looks good (table=9), replying with the MAC of the gateway port:
Apr 12 17:38:38.077970 ubuntu-xenial-inap-mtl01-0003484549 sh[23566]: cookie=0xfdc62a55, duration=12.576s, table=9, n_packets=0, n_bytes=0, idle_age=12, priority=90,arp,reg14=0x1,metadata=0x5c,arp_tpa=172.24.5.5,arp_op=1 actions=move:NXM_OF_ETH_SRC[]->NXM_OF_ETH_DST[],load:0x2->NXM_OF_ARP_OP[],move:NXM_NX_ARP_SHA[]->NXM_NX_ARP_THA[],mod_dl_src:fa:16:3e:83:d5:0d,load:0xfa163e83d50d->NXM_NX_ARP_SHA[],move:NXM_OF_ARP_SPA[]->NXM_OF_ARP_TPA[],load:0xac180505->NXM_OF_ARP_SPA[],load:0x1->NXM_NX_REG15[],load:0x1->NXM_NX_REG10[0],resubmit(,32)
Note n_packets=0: this flow never got hit.
However, in table=12 we would expect to see a dnat action for our FIP. We do see a dnat action for the fixed IP 10.1.0.9, but it is keyed to a different FIP on a different datapath (nw_dst=172.24.5.7, metadata=0x57), so the DNAT flow for 172.24.5.5 is not there:
Apr 12 17:38:38.116595 ubuntu-xenial-inap-mtl01-0003484549 sh[23566]: cookie=0xca5b11b2, duration=29.150s, table=12, n_packets=5, n_bytes=370, idle_age=12, priority=100,ip,reg14=0x1,metadata=0x57,nw_dst=172.24.5.7 actions=ct(commit,table=13,zone=NXM_NX_REG11[0..15],nat(dst=10.1.0.9))
This means that traffic directed to 172.24.5.5 will never reach the instance, since the required DNAT flow is not installed.
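A quick way to test for the missing flow on an affected node, as a minimal sketch (bridge, table and FIP match the dumps above; ovs-ofctl typically needs root):

    #!/usr/bin/env python
    # Check whether a dnat flow for the FIP exists in table=12 on br-int.
    # FIP and table number come from the analysis above.
    import subprocess

    FIP = "172.24.5.5"
    flows = subprocess.check_output(
        ["ovs-ofctl", "dump-flows", "br-int",
         "table=12,ip,nw_dst=%s" % FIP]).decode("utf-8", "replace")

    matching = [line for line in flows.splitlines() if "nat(dst=" in line]
    if matching:
        print("DNAT flow present:")
        print("\n".join(matching))
    else:
        print("No DNAT flow for %s in table=12 -- FIP traffic won't be translated." % FIP)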
tcpdump logs:
I would expect to see ARP replies for the FIP but I don't see any:
Apr 12 17:38:31.367332 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]: 17:38:31.086964 76:17:ae:71:f2:47 > fa:16:3e:a1:de:ed, ethertype ARP (0x0806), length 42: Request who-has 172.24.5.5 tell 172.24.5.1, length 28
Apr 12 17:38:31.367518 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]:         0x0000:  0001 0800 0604 0001 7617 ae71 f247 ac18  ........v..q.G..
Apr 12 17:38:31.367672 ubuntu-xenial-inap-mtl01-0003484549 tcpdump[23418]:         0x0010:  0501 0000 0000 0000 ac18 0505            ...