Tempest floatingip scenario tests failing on DVR Multinode setup with HA

Bug #1717302 reported by Swaminathan Vasudevan on 2017-09-14
This bug affects 3 people
Affects: neutron
Importance: High
Assigned to: Miguel Lavalle

Bug Description

neutron.tests.tempest.scenario.test_floatingip.FloatingIpSameNetwork and
neutron.tests.tempest.scenario.test_floatingip.FloatingIpSeparateNetwork are failing on every patch.

This trace is seen on the node-2 l3-agent.

Sep 13 07:16:43.404250 ubuntu-xenial-2-node-rax-dfw-10909819-895688 neutron-keepalived-state-change[5461]: ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 172.24.5.3 on qg-bf79c157-e2 in namespace qrouter-796b8715-ca01-43ad-bc08-f81a0b4db8cc: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           : ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           ERROR neutron.agent.linux.ip_lib Traceback (most recent call last):
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1082, in _arping
                                                                                                           ERROR neutron.agent.linux.ip_lib ip_wrapper.netns.execute(arping_cmd, extra_ok_codes=[1])
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 901, in execute
                                                                                                           ERROR neutron.agent.linux.ip_lib log_fail_as_error=log_fail_as_error, **kwargs)
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/utils.py", line 151, in execute
                                                                                                           ERROR neutron.agent.linux.ip_lib raise ProcessExecutionError(msg, returncode=returncode)
                                                                                                           ERROR neutron.agent.linux.ip_lib ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           ERROR neutron.agent.linux.ip_lib
                                                                                                           ERROR neutron.agent.linux.ip_lib

If this is a DVR router, then the GARP should not go through the qg interface for the floatingIP.

More information can be seen here.

http://logs.openstack.org/43/500143/5/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/0a58fce/logs/subnode-2/screen-q-l3.txt.gz?level=TRACE#_Sep_13_07_16_47_864052

summary: - Tempest floatingip scenario tests failing on DVR Multinode setup
+ Tempest floatingip scenario tests failing on DVR Multinode setup with HA

http://logs.openstack.org/30/503530/6/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/ba5131c/logs/subnode-2/screen-q-l3.txt.gz?level=DEBUG#_Sep_14_18_31_35_136440

This trace is also seen in the debug logs on Node 2.

Sep 14 20:26:41.262749 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: Traceback (most recent call last):
Sep 14 20:26:41.262909 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Sep 14 20:26:41.263056 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: timer()
Sep 14 20:26:41.263191 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
Sep 14 20:26:41.263333 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: cb(*args, **kw)
Sep 14 20:26:41.263468 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1124, in arping
Sep 14 20:26:41.263599 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: _arping(ns_name, iface_name, address, count, log_exception)
Sep 14 20:26:41.263731 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1060, in _arping
Sep 14 20:26:41.263870 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: for i in range(count):
Sep 14 20:26:41.264015 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: TypeError: range() integer end argument expected, got ConfigOpts.

Sep 13 07:16:43.404250 ubuntu-xenial-2-node-rax-dfw-10909819-895688 neutron-keepalived-state-change[5461]: ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 172.24.5.3 on qg-bf79c157-e2 in namespace qrouter-796b8715-ca01-43ad-bc08-f81a0b4db8cc: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address

The above trace is seen after a DVR router is migrated to a legacy HA router (that is my understanding). Maybe @Anilvenkata can comment on this.

@anil-venkata, can you comment on this bug?
I am trying to understand this scenario test case.
We are basically using a two-node setup, with one DVR node and one DVR_SNAT node.
With DVR, HA should not be enabled, since we only have one node.
If it can be enabled, then how are we testing the master/slave snat_namespace here?

Also, the issue I am seeing consistently is 'ip address 172.24.5.4/32 dev qg-c4da18e0-db, no longer exist'.

And it is trying to send the ARP message: 'Failed sending gratuitous ARP to 172.24.5.4 on qg-c4da18e0-db in namespace qrouter-617bfb93-834e-4cf8-9f9d-521279f4f580'

DVR routers do not create a qg- interface in the qrouter namespace.
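
The "bind: Cannot assign requested address" error is consistent with arping being asked to use a source address that is not configured on the device. Below is a minimal sketch of a pre-check, assuming the caller has already captured the text of `ip -o addr show dev <device>` inside the namespace. address_on_device is a hypothetical helper, not a neutron function, and the 172.24.5.2/24 entry in the sample is illustrative only.

```python
import ipaddress


def address_on_device(ip_addr_output, address):
    """Return True if `address` is listed as an inet/inet6 entry.

    `ip_addr_output` is the text printed by `ip -o addr show dev <device>`.
    """
    wanted = ipaddress.ip_address(address)
    for line in ip_addr_output.splitlines():
        tokens = line.split()
        for i, tok in enumerate(tokens[:-1]):
            if tok in ("inet", "inet6"):
                # The token after inet/inet6 is the address, usually with /prefix.
                if ipaddress.ip_interface(tokens[i + 1]).ip == wanted:
                    return True
    return False


# Illustrative sample only: the device name is from the log above,
# the configured address is made up.
SAMPLE_OUTPUT = (
    "12: qg-c4da18e0-db    inet 172.24.5.2/24 brd 172.24.5.255 "
    "scope global qg-c4da18e0-db\\       valid_lft forever "
    "preferred_lft forever\n"
)
```

With this sample, address_on_device(SAMPLE_OUTPUT, "172.24.5.4") is False, so a caller could skip the GARP instead of letting arping fail.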

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Brian Haley (brian-haley) wrote :

I think I have a patch for this bug, maybe I didn't tag it correctly with the BZ #.

Brian Haley (brian-haley) wrote :

https://bugs.launchpad.net/neutron/+bug/1696893 is the bug I've been tracking the arping fix under; it is mostly just a cosmetic error since it's asynchronous.

Brian Haley (brian-haley) wrote :

Even with the above patches (bad arping arguments, arping error), we still have a failure:

http://logs.openstack.org/84/500384/18/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/a427ec7/logs/testr_results.html.gz

Traceback (most recent call last):
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/test_floatingip.py", line 139, in test_east_west
    self._test_east_west()
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/test_floatingip.py", line 119, in _test_east_west
    dest_server['port']['fixed_ips'][0]['ip_address'])
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 279, in check_remote_connectivity
    source, dest, should_succeed, nic, mtu, fragmentation))
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 274, in _check_remote_connectivity
    1)
  File "tempest/lib/common/utils/test_utils.py", line 103, in call_until_true
    if func():
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 259, in ping_remote
    fragmentation=fragmentation)
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 254, in ping_host
    return source.exec_command(cmd)
  File "tempest/lib/common/ssh.py", line 151, in exec_command
    ssh = self._get_ssh_connection()
  File "tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.6 via SSH timed out.
User: ubuntu, Password: None

Still need to track this down.

tags: added: gate-failure

I was able to reproduce this issue locally.

These tests are failing randomly, and on further debugging here is what I could see in the two-node setup.

On Node 1 (Ubuntu-controller) there is one VM.
On Node 2 (Ubuntu-compute-new) there are two VMs.

Both VMs on Node 2 have a floating IP configured.
Here is the output of the router namespace iptables rules.

stack@ubuntu-compute-new:~/devstack$ sudo ip netns exec qrouter-6f01678c-64d6-4197-b09d-3285c46207ef bash
root@ubuntu-compute-new:~/devstack# iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N neutron-l3-agent-OUTPUT
-N neutron-l3-agent-POSTROUTING
-N neutron-l3-agent-PREROUTING
-N neutron-l3-agent-float-snat
-N neutron-l3-agent-snat
-N neutron-postrouting-bottom
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-POSTROUTING ! -i rfp-6f01678c-6 ! -o rfp-6f01678c-6 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 192.168.100.100/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.13
-A neutron-l3-agent-PREROUTING -d 192.168.100.114/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.14
-A neutron-l3-agent-float-snat -s 10.0.0.13/32 -j SNAT --to-source 192.168.100.100
-A neutron-l3-agent-float-snat -s 10.0.0.14/32 -j SNAT --to-source 192.168.100.114
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat
root@ubuntu-compute-new:~/devstack#

But what I see in the fip namespace is that the fixed IP 10.0.0.13 shows up there, responding to a floating IP.

stack@ubuntu-compute-new:~$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f tcpdump -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
 ^C19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 54785, seq 0, length 64
19:50:35.578246 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 55553, seq 0, length 64
19:50:39.153168 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 56321, seq 0, length 64
19:50:42.790410 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57089, seq 0, length 64
19:50:46.368505 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57857, seq 0, length 64
19:50:49.982396 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 58625, seq 0, length 64
19:50:53.553890 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 59393, seq 0, length 64
19:50:57.005240 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 60161, seq 0, length 64
19:51:00.557693 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 60929, seq 0, length 64
19:51:04.045430 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 61697, seq 0, length 64
19:51:07.579294 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 62465, seq 0, length 64
19:51:11.229360 IP 10.0.0...


Also, on Node 1 a floating IP is configured, but the DNAT rule is missing in the router namespace.

stack@ubuntu-controller:~/devstack$ neutron floatingip-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+
| id | tenant_id | fixed_ip_address | floating_ip_address | port_id |
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+
| 0fd57315-51d7-4277-9835-e3aee82a5773 | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.3 | 192.168.100.104 | 9187cca2-a96f-495f-abf4-041de154fc95 |
| 5ad5be80-f720-47b0-a05e-4b309d192daf | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.13 | 192.168.100.100 | 95e78c3c-21a2-4d62-9fc9-ad5451ef73cd |
| 6fc89fb9-ffc7-438d-8320-23c44de2ab09 | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.14 | 192.168.100.114 | e4b5e14e-6625-4bbb-884c-36f94dbc609d |
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+

root@ubuntu-controller:~/devstack# iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N neutron-l3-agent-OUTPUT
-N neutron-l3-agent-POSTROUTING
-N neutron-l3-agent-PREROUTING
-N neutron-l3-agent-float-snat
-N neutron-l3-agent-snat
-N neutron-postrouting-bottom
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-POSTROUTING ! -i rfp-6f01678c-6 ! -o rfp-6f01678c-6 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat

There are no DNAT rules in the router namespace.

But the IP rule "54170: from 10.0.0.3 lookup 16" is defined.
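
The comparison between the two nodes can be sketched as a small check over the `iptables -t nat -S` text. This is a sketch under assumptions: floating_ip_nat_rules and fips_missing_nat are hypothetical helpers, and the caller is assumed to pass only the floating IPs whose VMs live on the node being checked. The sample rules are copied from the outputs above.

```python
import shlex


def floating_ip_nat_rules(iptables_nat_s_output):
    """Extract DNAT and SNAT mappings from `iptables -t nat -S` text."""
    dnat, snat = {}, {}
    for line in iptables_nat_s_output.splitlines():
        if not line.startswith("-A "):
            continue  # skip policies, chain definitions, shell prompts
        args = shlex.split(line)
        if "-j" not in args[:-1]:
            continue
        target = args[args.index("-j") + 1]
        if target == "DNAT" and "-d" in args and "--to-destination" in args:
            fip = args[args.index("-d") + 1].split("/")[0]
            dnat[fip] = args[args.index("--to-destination") + 1]
        elif target == "SNAT" and "-s" in args and "--to-source" in args:
            fixed = args[args.index("-s") + 1].split("/")[0]
            snat[fixed] = args[args.index("--to-source") + 1]
    return dnat, snat


def fips_missing_nat(floating_ips, dnat, snat):
    """`floating_ips` maps floating IP -> fixed IP expected on this node."""
    return sorted(fip for fip, fixed in floating_ips.items()
                  if dnat.get(fip) != fixed or snat.get(fixed) != fip)


NODE2_RULES = """\
-A neutron-l3-agent-PREROUTING -d 192.168.100.100/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.13
-A neutron-l3-agent-float-snat -s 10.0.0.13/32 -j SNAT --to-source 192.168.100.100
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat
"""
NODE1_RULES = """\
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
"""
```

Run against the Node 1 rules with {"192.168.100.104": "10.0.0.3"}, the check reports 192.168.100.104 as missing, which matches the observation above.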

root@ubuntu-controller:~/devstack# ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
54170: from 10.0.0.3 lookup 16
167772161: from 10.0.0.1/28 lookup 167772161
root@ubuntu-controller:~/devstack#

The fip namespace also has the routes required to route traffic for the floating IP (192.168.100.104).

stack@ubuntu-controller:~/devstack$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f bash
root@ubuntu-controller:~/devstack# ifconfig
fg-687a771e-78 Link encap:Ethernet HWaddr fa:16:3e:81:97:2e
          inet addr:192.168.100.105 Bcast:192.168.100.255 Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fe81:972e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:290 errors:0 dropped:0 overruns:0 frame:0
          TX packets:114 er...


The odd behavior here is the following.

We do see that the DNAT rule is in place for incoming packets:
-A neutron-l3-agent-PREROUTING -d 192.168.100.100/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.13

We do see that the float-snat rule is in place for outgoing packets:
-A neutron-l3-agent-float-snat -s 10.0.0.13/32 -j SNAT --to-source 192.168.100.100

But what I see in the fip namespace is the fixed IP 10.0.0.13 responding to a floating IP. In theory, the neutron-l3-agent-float-snat rule above should have translated the source address 10.0.0.13 to 192.168.100.100, but that did not happen.

Not sure why.

stack@ubuntu-compute-new:~$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f tcpdump -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
 ^C19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 54785, seq 0, length 64
19:50:35.578246 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 55553, seq 0, length 64
19:50:39.153168 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 56321, seq 0, length 64
19:50:42.790410 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57089, seq 0, length 64
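
To spot this condition in a longer capture, the tcpdump text can be scanned for packets whose source is still a fixed IP. This is a minimal sketch: untranslated_packets is a hypothetical helper, and the pattern only handles lines without a port suffix on the address, such as the ICMP lines above. The sample lines are copied from the capture.

```python
import re

# Matches e.g. "19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP ..."
_TCPDUMP_IP_LINE = re.compile(
    r"(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>[^:\s]+):"
)


def untranslated_packets(tcpdump_output, fixed_ips):
    """Return (timestamp, src, dst) for packets whose source is a fixed IP."""
    hits = []
    for line in tcpdump_output.splitlines():
        match = _TCPDUMP_IP_LINE.search(line)
        if match and match.group("src") in fixed_ips:
            hits.append((match.group("ts"), match.group("src"),
                         match.group("dst")))
    return hits


CAPTURE = """\
 ^C19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 54785, seq 0, length 64
19:50:35.578246 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 55553, seq 0, length 64
"""
```

Any hit from untranslated_packets(CAPTURE, {"10.0.0.13", "10.0.0.14"}) is a reply that left the router namespace without the float-snat rule being applied.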

I am running out of ideas on this bug.
Can anyone else take a look at it?

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
Miguel Lavalle (minsel) on 2018-05-31
Changed in neutron:
assignee: Brian Haley (brian-haley) → Miguel Lavalle (minsel)