Tempest floatingip scenario tests failing on DVR Multinode setup with HA

Bug #1717302 reported by Swaminathan Vasudevan on 2017-09-14
This bug affects 4 people
Affects: neutron
Importance: High
Assigned to: Slawek Kaplonski

Bug Description

neutron.tests.tempest.scenario.test_floatingip.FloatingIpSameNetwork and
neutron.tests.tempest.scenario.test_floatingip.FloatingIpSeparateNetwork are failing on every patch.

This trace is seen on the node-2 l3-agent.

Sep 13 07:16:43.404250 ubuntu-xenial-2-node-rax-dfw-10909819-895688 neutron-keepalived-state-change[5461]: ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 172.24.5.3 on qg-bf79c157-e2 in namespace qrouter-796b8715-ca01-43ad-bc08-f81a0b4db8cc: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           : ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           ERROR neutron.agent.linux.ip_lib Traceback (most recent call last):
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1082, in _arping
                                                                                                           ERROR neutron.agent.linux.ip_lib ip_wrapper.netns.execute(arping_cmd, extra_ok_codes=[1])
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 901, in execute
                                                                                                           ERROR neutron.agent.linux.ip_lib log_fail_as_error=log_fail_as_error, **kwargs)
                                                                                                           ERROR neutron.agent.linux.ip_lib File "/opt/stack/new/neutron/neutron/agent/linux/utils.py", line 151, in execute
                                                                                                           ERROR neutron.agent.linux.ip_lib raise ProcessExecutionError(msg, returncode=returncode)
                                                                                                           ERROR neutron.agent.linux.ip_lib ProcessExecutionError: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
                                                                                                           ERROR neutron.agent.linux.ip_lib
                                                                                                           ERROR neutron.agent.linux.ip_lib

If this is a DVR router, then the GARP should not go through the qg interface for the floatingIP.

More information can be seen here.

http://logs.openstack.org/43/500143/5/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/0a58fce/logs/subnode-2/screen-q-l3.txt.gz?level=TRACE#_Sep_13_07_16_47_864052

summary: - Tempest floatingip scenario tests failing on DVR Multinode setup
+ Tempest floatingip scenario tests failing on DVR Multinode setup with HA

http://logs.openstack.org/30/503530/6/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/ba5131c/logs/subnode-2/screen-q-l3.txt.gz?level=DEBUG#_Sep_14_18_31_35_136440

This trace is also seen in the debug logs on Node2.

Sep 14 20:26:41.262749 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: Traceback (most recent call last):
Sep 14 20:26:41.262909 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Sep 14 20:26:41.263056 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: timer()
Sep 14 20:26:41.263191 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
Sep 14 20:26:41.263333 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: cb(*args, **kw)
Sep 14 20:26:41.263468 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1124, in arping
Sep 14 20:26:41.263599 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: _arping(ns_name, iface_name, address, count, log_exception)
Sep 14 20:26:41.263731 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: File "/opt/stack/new/neutron/neutron/agent/linux/ip_lib.py", line 1060, in _arping
Sep 14 20:26:41.263870 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: for i in range(count):
Sep 14 20:26:41.264015 ubuntu-xenial-2-node-rax-dfw-10937231-899019 neutron-l3-agent[8011]: TypeError: range() integer end argument expected, got ConfigOpts.
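
The TypeError above is what range() raises when it is handed something that is not an integer. As a minimal sketch of how that can happen, assume a configuration object lands in the position of the count argument. The real call site is not shown in this report, so FakeConfigOpts, send_garps and try_send below are hypothetical stand-ins, not neutron code:

```python
class FakeConfigOpts(object):
    """Hypothetical stand-in for a ConfigOpts instance (assumption)."""


def send_garps(ns_name, iface_name, address, count):
    """Return one (iteration, address) tuple per GARP that would be sent."""
    sent = []
    for i in range(count):  # raises TypeError if count is not an integer
        sent.append((i, address))
    return sent


def try_send(count):
    """Call send_garps and report whether range() rejected the count."""
    try:
        send_garps("qrouter-example", "qg-example", "172.24.5.3", count)
    except TypeError:
        return "TypeError"
    return "ok"
```

Calling try_send(3) succeeds, while try_send(FakeConfigOpts()) hits the same class of error as the trace, which suggests checking the argument order at the caller.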

Sep 13 07:16:43.404250 ubuntu-xenial-2-node-rax-dfw-10909819-895688 neutron-keepalived-state-change[5461]: ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 172.24.5.3 on qg-bf79c157-e2 in namespace qrouter-796b8715-ca01-43ad-bc08-f81a0b4db8cc: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address

The above trace is seen after a DVR router is migrated to a legacy HA router (that is my understanding). Maybe @Anilvenkata can comment on this.

@anil-venkata, can you comment on this bug?
I am trying to understand this scenario test case.
We are basically using a two-node setup, with one DVR node and one DVR_SNAT node.
With DVR, HA should not be enabled, since we only have one node.
If it can be enabled, then how are we testing the master/slave snat_namespace here?

Also, the issue I am seeing consistently is 'ip address 172.24.5.4/32 dev qg-c4da18e0-db, no longer exist'.

The agent then tries to send the ARP message and fails: 'Failed sending gratuitous ARP to 172.24.5.4 on qg-c4da18e0-db in namespace qrouter-617bfb93-834e-4cf8-9f9d-521279f4f580'

DVR routers do not create a qg- interface in the qrouter namespace.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Brian Haley (brian-haley) wrote :

I think I have a patch for this bug, maybe I didn't tag it correctly with the BZ #.

Brian Haley (brian-haley) wrote :

https://bugs.launchpad.net/neutron/+bug/1696893 is the bug I've been tracking the arping fix under. It is mostly just a cosmetic error, since the arping is asynchronous.

Brian Haley (brian-haley) wrote :

Even with the above patches (bad arping arguments, arping error), we still have a failure:

http://logs.openstack.org/84/500384/18/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/a427ec7/logs/testr_results.html.gz

Traceback (most recent call last):
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/test_floatingip.py", line 139, in test_east_west
    self._test_east_west()
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/test_floatingip.py", line 119, in _test_east_west
    dest_server['port']['fixed_ips'][0]['ip_address'])
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 279, in check_remote_connectivity
    source, dest, should_succeed, nic, mtu, fragmentation))
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 274, in _check_remote_connectivity
    1)
  File "tempest/lib/common/utils/test_utils.py", line 103, in call_until_true
    if func():
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 259, in ping_remote
    fragmentation=fragmentation)
  File "/opt/stack/new/neutron/neutron/tests/tempest/scenario/base.py", line 254, in ping_host
    return source.exec_command(cmd)
  File "tempest/lib/common/ssh.py", line 151, in exec_command
    ssh = self._get_ssh_connection()
  File "tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
    password=self.password)
tempest.lib.exceptions.SSHTimeout: Connection to the 172.24.5.6 via SSH timed out.
User: ubuntu, Password: None

Still need to track this down.

tags: added: gate-failure

I was able to reproduce this issue locally.

These tests fail intermittently. On further debugging, here is what I could see in the two-node setup:

Node 1 (Ubuntu-controller) has one VM.
Node 2 (Ubuntu-compute-new) has two VMs.

Both VMs on Node 2 have a floating IP configured.
Here is the output of the iptables rules in the router namespace.

stack@ubuntu-compute-new:~/devstack$ sudo ip netns exec qrouter-6f01678c-64d6-4197-b09d-3285c46207ef bash
root@ubuntu-compute-new:~/devstack# iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N neutron-l3-agent-OUTPUT
-N neutron-l3-agent-POSTROUTING
-N neutron-l3-agent-PREROUTING
-N neutron-l3-agent-float-snat
-N neutron-l3-agent-snat
-N neutron-postrouting-bottom
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-POSTROUTING ! -i rfp-6f01678c-6 ! -o rfp-6f01678c-6 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-PREROUTING -d 192.168.100.100/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.13
-A neutron-l3-agent-PREROUTING -d 192.168.100.114/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.14
-A neutron-l3-agent-float-snat -s 10.0.0.13/32 -j SNAT --to-source 192.168.100.100
-A neutron-l3-agent-float-snat -s 10.0.0.14/32 -j SNAT --to-source 192.168.100.114
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat
root@ubuntu-compute-new:~/devstack#

But what I see in the FIP namespace is the fixed IP 10.0.0.13 itself sending the echo replies to a floating IP (192.168.100.109).

stack@ubuntu-compute-new:~$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f tcpdump -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
 ^C19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 54785, seq 0, length 64
19:50:35.578246 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 55553, seq 0, length 64
19:50:39.153168 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 56321, seq 0, length 64
19:50:42.790410 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57089, seq 0, length 64
19:50:46.368505 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57857, seq 0, length 64
19:50:49.982396 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 58625, seq 0, length 64
19:50:53.553890 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 59393, seq 0, length 64
19:50:57.005240 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 60161, seq 0, length 64
19:51:00.557693 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 60929, seq 0, length 64
19:51:04.045430 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 61697, seq 0, length 64
19:51:07.579294 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 62465, seq 0, length 64
19:51:11.229360 IP 10.0.0...


Also, on Node 1 a floating IP is configured, but the DNAT rule is missing from the router namespace.

stack@ubuntu-controller:~/devstack$ neutron floatingip-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+
| id | tenant_id | fixed_ip_address | floating_ip_address | port_id |
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+
| 0fd57315-51d7-4277-9835-e3aee82a5773 | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.3 | 192.168.100.104 | 9187cca2-a96f-495f-abf4-041de154fc95 |
| 5ad5be80-f720-47b0-a05e-4b309d192daf | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.13 | 192.168.100.100 | 95e78c3c-21a2-4d62-9fc9-ad5451ef73cd |
| 6fc89fb9-ffc7-438d-8320-23c44de2ab09 | 948bc6fadbbc4ca4ad4d223dcc76b9f1 | 10.0.0.14 | 192.168.100.114 | e4b5e14e-6625-4bbb-884c-36f94dbc609d |
+--------------------------------------+----------------------------------+------------------+---------------------+--------------------------------------+

root@ubuntu-controller:~/devstack# iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N neutron-l3-agent-OUTPUT
-N neutron-l3-agent-POSTROUTING
-N neutron-l3-agent-PREROUTING
-N neutron-l3-agent-float-snat
-N neutron-l3-agent-snat
-N neutron-postrouting-bottom
-A PREROUTING -j neutron-l3-agent-PREROUTING
-A OUTPUT -j neutron-l3-agent-OUTPUT
-A POSTROUTING -j neutron-l3-agent-POSTROUTING
-A POSTROUTING -j neutron-postrouting-bottom
-A neutron-l3-agent-POSTROUTING ! -i rfp-6f01678c-6 ! -o rfp-6f01678c-6 -m conntrack ! --ctstate DNAT -j ACCEPT
-A neutron-l3-agent-PREROUTING -d 169.254.169.254/32 -i qr-+ -p tcp -m tcp --dport 80 -j REDIRECT --to-ports 9697
-A neutron-l3-agent-snat -j neutron-l3-agent-float-snat
-A neutron-postrouting-bottom -m comment --comment "Perform source NAT on outgoing traffic." -j neutron-l3-agent-snat

There are no DNAT rules in the router namespace.
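
The comparison between the two nodes can be automated. The following is a minimal sketch, assuming the output of iptables -t nat -S has been captured as text and that the rules keep the shape shown in the listings above; floating_ip_nat_state and missing_nat are hypothetical helper names, not part of neutron:

```python
import re

# Patterns follow the rule shapes shown in the listings above (assumption).
DNAT_RE = re.compile(
    r"-d (?P<fip>\d+\.\d+\.\d+\.\d+)/32 .*-j DNAT --to-destination (?P<fixed>\S+)")
SNAT_RE = re.compile(
    r"-s (?P<fixed>\d+\.\d+\.\d+\.\d+)/32 .*-j SNAT --to-source (?P<fip>\S+)")


def floating_ip_nat_state(rules_text):
    """Map each floating IP to the fixed IP and the NAT rules found for it."""
    state = {}
    for line in rules_text.splitlines():
        for kind, regex in (("dnat", DNAT_RE), ("snat", SNAT_RE)):
            match = regex.search(line)
            if match:
                entry = state.setdefault(
                    match.group("fip"),
                    {"fixed": match.group("fixed"), "dnat": False, "snat": False})
                entry[kind] = True
    return state


def missing_nat(rules_text, expected_fips):
    """Return the expected floating IPs that lack a DNAT or an SNAT rule."""
    state = floating_ip_nat_state(rules_text)
    return sorted(
        fip for fip in expected_fips
        if not (state.get(fip, {}).get("dnat") and state.get(fip, {}).get("snat")))
```

Fed the Node 1 rules and the floating IP 192.168.100.104, missing_nat reports that address as lacking NAT rules; fed the Node 2 rules and 192.168.100.100, it reports nothing missing.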

But ip rule shows that '54170: from 10.0.0.3 lookup 16' is defined.

root@ubuntu-controller:~/devstack# ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
54170: from 10.0.0.3 lookup 16
167772161: from 10.0.0.1/28 lookup 167772161
root@ubuntu-controller:~/devstack#

The FIP namespace also has the routes required to route traffic for the floating IP (192.168.100.104).

stack@ubuntu-controller:~/devstack$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f bash
root@ubuntu-controller:~/devstack# ifconfig
fg-687a771e-78 Link encap:Ethernet HWaddr fa:16:3e:81:97:2e
          inet addr:192.168.100.105 Bcast:192.168.100.255 Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fe81:972e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:290 errors:0 dropped:0 overruns:0 frame:0
          TX packets:114 er...


The odd behavior here is:

We do see that the DNAT rule is in place for incoming packets.
-A neutron-l3-agent-PREROUTING -d 192.168.100.100/32 -i rfp-6f01678c-6 -j DNAT --to-destination 10.0.0.13

We do see that the float-snat rule is in place for outgoing packets.
-A neutron-l3-agent-float-snat -s 10.0.0.13/32 -j SNAT --to-source 192.168.100.100

But in the FIP namespace, 10.0.0.13 is seen responding to a floating IP. In theory, the neutron-l3-agent-float-snat rule above should have translated the source address 10.0.0.13 to 192.168.100.100, but that did not happen.

I am not sure why.

stack@ubuntu-compute-new:~$ sudo ip netns exec fip-5c94b420-0b1f-4025-864a-9209d8e7211f tcpdump -i any icmp
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
 ^C19:50:32.073635 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 54785, seq 0, length 64
19:50:35.578246 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 55553, seq 0, length 64
19:50:39.153168 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 56321, seq 0, length 64
19:50:42.790410 IP 10.0.0.13 > 192.168.100.109: ICMP echo reply, id 57089, seq 0, length 64
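
To spot this condition in a longer capture, the tcpdump text can be scanned for packets whose source is still a fixed IP. This is only a sketch, assuming the one-line-per-packet format shown above and a fixed-to-floating mapping taken from the floating IP list; untranslated_sources is a hypothetical helper:

```python
import re

# Matches lines such as "... IP 10.0.0.13 > 192.168.100.109: ICMP echo reply".
PACKET_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+) > (?P<dst>\d+\.\d+\.\d+\.\d+): ICMP")


def untranslated_sources(tcpdump_text, fixed_to_floating):
    """Count ICMP packets whose source is a fixed IP that has a floating IP."""
    counts = {}
    for line in tcpdump_text.splitlines():
        match = PACKET_RE.search(line)
        if match and match.group("src") in fixed_to_floating:
            src = match.group("src")
            counts[src] = counts.get(src, 0) + 1
    return counts
```

A non-empty result for a capture taken in the FIP namespace means the SNAT to the floating IP did not happen for those packets.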

I am running out of ideas on this bug.
Can anyone else take a look at it?

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
Miguel Lavalle (minsel) on 2018-05-31
Changed in neutron:
assignee: Brian Haley (brian-haley) → Miguel Lavalle (minsel)
LIU Yulong (dragon889) wrote :

I think this is the root cause:
https://bugs.centos.org/view.php?id=11238
It's a kernel bug.

Reviewed: https://review.openstack.org/600197
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dbd6dbb5e19a269c17601a4038eae8eb050d182c
Submitter: Zuul
Branch: stable/pike

commit dbd6dbb5e19a269c17601a4038eae8eb050d182c
Author: Jakub Libosvar <email address hidden>
Date: Tue Oct 24 13:11:14 2017 +0000

    tests: Add decorator to mark unstable tests

    As it was agreed on Neutron CI meeting, we're going to mark unstable
    tests in fullstack suite with this decorator while working in paralel on
    stabilization of such tests.

    Mark the DVR east-west tests as unstable to prove it works.

    Conflicts:
        neutron/tests/tempest/scenario/base.py

    NOTE: This is a squash of the unstable decorator change and another
          to the neutron-tempest-plugin repo during the Queens cycle.

    Related-bug: #1717302

    Change-Id: I3beb6e7a4d96da778378e9d979cb8c6261f6036b
    (cherry picked from commit bdda46ade7f1f8a2742bcba6ea7556e3f059031f)
    (cherry picked from commit ba80045aabbdf5bbf66e39ed5aecad72eb3d86ef)
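
For readers unfamiliar with the idea, a decorator of this kind can be sketched as follows. This is an illustration of the technique only, not the code from the commit above; the name unstable_test, the message format and ExampleTest are assumptions:

```python
import functools
import unittest


def unstable_test(reason):
    """Turn a failure of the decorated test into a skip that records why."""
    def decorator(test_method):
        @functools.wraps(test_method)
        def wrapper(self, *args, **kwargs):
            try:
                return test_method(self, *args, **kwargs)
            except unittest.SkipTest:
                raise  # an explicit skip stays a skip
            except Exception as exc:
                self.skipTest("%s marked unstable (%s); failure was: %s"
                              % (self.id(), reason, exc))
        return wrapper
    return decorator


class ExampleTest(unittest.TestCase):
    """Hypothetical test that always fails, to show the effect."""

    @unstable_test("bug 1717302")
    def test_east_west(self):
        raise AssertionError("SSH timed out")


def run_example():
    """Run the example test and return (skipped, failures, errors) counts."""
    result = unittest.TestResult()
    ExampleTest("test_east_west").run(result)
    return len(result.skipped), len(result.failures), len(result.errors)
```

With this approach the failure is still visible in the skip message, so the test can be tracked while it no longer blocks the gate.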

tags: added: in-stable-pike
Miguel Lavalle (minsel) wrote :

Note that in related bug https://bugs.launchpad.net/neutron/+bug/1793118, the submitter reports that:

1) Despite the error in the log file, data plane works correctly.
2) After executing sysctl -w net.ipv4.ip_nonlocal_bind=1 in the router namespace, the error messages go away. I wonder if this patch is related to the issue: https://review.openstack.org/#/c/393886/
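
The workaround in point 2 can be scripted for a given router. A minimal sketch, assuming the namespace is named qrouter-<router id> as in the logs above and that the script runs as root; nonlocal_bind_cmd and set_nonlocal_bind are hypothetical helpers:

```python
import subprocess


def nonlocal_bind_cmd(router_id, value=1):
    """Build the command that sets ip_nonlocal_bind in a qrouter namespace."""
    namespace = "qrouter-%s" % router_id
    return ["ip", "netns", "exec", namespace,
            "sysctl", "-w", "net.ipv4.ip_nonlocal_bind=%d" % value]


def set_nonlocal_bind(router_id, value=1):
    """Run the command; needs root and an existing namespace."""
    return subprocess.check_call(nonlocal_bind_cmd(router_id, value))
```

This only reproduces the manual step reported in the related bug; whether it is an acceptable fix is a separate question.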

Miguel Lavalle (minsel) wrote :

I will bring this up in the next L3 sub-team meeting.

Gökhan (skylightcoder) wrote :

I think I found the problem. keepalived 1:1.2.24-1ubuntu0.16.04.1 is broken; if you downgrade it, it works properly. I ran sudo apt-get install --allow-downgrades keepalived=1:1.2.19-1 and it worked. We need to check the neutron side for keepalived 1:1.2.24-1ubuntu0.16.04.1.

Slawek Kaplonski (slaweq) wrote :

Hi Gökhan,

Speaking of keepalived, we already had problems with this version in functional tests; see https://bugs.launchpad.net/neutron/+bug/1788185

I even reported a bug against keepalived, https://bugs.launchpad.net/ubuntu/+source/keepalived/+bug/1789045 - maybe you can update it with your findings as well?

I know that a newer version of keepalived also worked fine in functional tests. The problem is only with the one specific version that you pointed out.

Gökhan (skylightcoder) wrote :

Hi Slawek,
Thanks for your explanation. You are right: the problem is only with this specific keepalived version, but unfortunately it is the stable version on Ubuntu Xenial. I will share my findings in https://bugs.launchpad.net/ubuntu/+source/keepalived/+bug/1789045

Changed in neutron:
assignee: Miguel Lavalle (minsel) → Slawek Kaplonski (slaweq)