[L3] arp issue in router namespace in compute node

Bug #1922923 reported by Andrei Panescu
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

Hello,

I have an issue with OpenStack Victoria. Since I moved to Victoria, sometimes a VM cannot be reached on its FIP, and it only gets fixed after I clear the ARP entry for the VM's private IP from the router network namespace.

I did some troubleshooting and found out that the port seems to be down in OVS.

 7(qr-4affa6db-67): addr:00:00:00:00:00:00
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
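
(For context, the output above is what "ovs-ofctl show" prints for the port; assuming the default integration bridge name br-int, it can be checked with something like:)

[root@compute-38 ~]# ovs-ofctl show br-int | grep -A3 qr-4affa6db-67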

The MAC address of the VM port is this one:

[root@compute-38 ~]# arp -a | grep 87
? (10.10.13.87) at fa:16:3e:ee:d1:57 [ether] PERM on qr-4affa6db-67

And the ping to the VM is not working:

[root@compute-38 ~]# ip r
10.10.13.0/24 dev qr-4affa6db-67 proto kernel scope link src 10.10.13.1
169.254.107.94/31 dev rfp-9b2225f1-b proto kernel scope link src 169.254.107.94
169.254.110.46/31 dev rfp-9b2225f1-b proto kernel scope link src 169.254.110.46
[root@compute-38 ~]# ping 10.10.13.1
PING 10.10.13.1 (10.10.13.1) 56(84) bytes of data.
64 bytes from 10.10.13.1: icmp_seq=1 ttl=64 time=0.050 ms
^C
--- 10.10.13.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.050/0.050/0.050/0.000 ms
[root@compute-38 ~]# ping 10.10.13.87
PING 10.10.13.87 (10.10.13.87) 56(84) bytes of data.
^C
--- 10.10.13.87 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 87ms

[root@compute-38 ~]#

The solution is to clear the ARP for 10.10.13.87 from the namespace.

[root@compute-38 ~]# ping 10.10.13.87
PING 10.10.13.87 (10.10.13.87) 56(84) bytes of data.
^C
--- 10.10.13.87 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 87ms

[root@compute-38 ~]# arp -d 10.10.13.87
[root@compute-38 ~]# arp -a | grep 87
? (10.10.13.87) at fa:16:3e:99:08:a5 [ether] on qr-4affa6db-67
[root@compute-38 ~]# ping 10.10.13.87
PING 10.10.13.87 (10.10.13.87) 56(84) bytes of data.
64 bytes from 10.10.13.87: icmp_seq=1 ttl=64 time=0.322 ms
64 bytes from 10.10.13.87: icmp_seq=2 ttl=64 time=0.239 ms
^C
--- 10.10.13.87 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 12ms
rtt min/avg/max/mdev = 0.239/0.280/0.322/0.044 ms
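
(For completeness, the same workaround can be run from the host without entering the namespace first; the qrouter-<router-id> name below is a placeholder and can be found with "ip netns":)

[root@compute-38 ~]# ip netns
[root@compute-38 ~]# ip netns exec qrouter-<router-id> arp -d 10.10.13.87
[root@compute-38 ~]# ip netns exec qrouter-<router-id> ping -c 2 10.10.13.87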

And now the FIP started to answer:

~]# ping 10.40.131.220
PING 10.40.131.220 (10.40.131.220) 56(84) bytes of data.
64 bytes from 10.40.131.220: icmp_seq=2500 ttl=61 time=1.60 ms
64 bytes from 10.40.131.220: icmp_seq=2501 ttl=61 time=0.462 ms
64 bytes from 10.40.131.220: icmp_seq=2502 ttl=61 time=0.536 ms

^C
--- 10.40.131.220 ping statistics ---
2545 packets transmitted, 46 received, 98% packet loss, time 2544013ms
rtt min/avg/max/mdev = 0.305/0.571/1.606/0.272 ms

And it's weird that, after it starts working, the port still looks down.

7(qr-4affa6db-67): addr:00:00:00:00:00:00
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

From what I see it seems like a bug, as it gets fixed after I do this workaround. This is happening to brand new VMs that are being deployed, but not to all of them.

Do you have any idea how I can fix this issue? I updated the containers last week to the latest stable release of Victoria.

Thanks!
Andrei

Revision history for this message
Andrei Panescu (apanescu) wrote :

Here are the logs when the VM was created from neutron-openvswitch-agent.log

2021-04-07 07:01:20.119 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Port 548dbbd3-bc08-447e-a144-b5de106a58df updated. Details: {'device': '548dbbd3-bc08-447e-a144-b5de106a58df', 'device_id': '2c77110e-02cf-4b2d-9b4d-295784d6f72f', 'network_id': 'a5486dd5-ad9f-4ae0-8922-6a2e70c17f7c', 'port_id': '548dbbd3-bc08-447e-a144-b5de106a58df', 'mac_address': 'fa:16:3e:99:08:a5', 'admin_state_up': True, 'network_type': 'vxlan', 'segmentation_id': 7, 'physical_network': None, 'fixed_ips': [{'subnet_id': '5224c1a8-5e7d-4e65-bc3c-82a27693b8ea', 'ip_address': '10.10.13.87'}], 'device_owner': 'compute:nova', 'allowed_address_pairs': [], 'port_security_enabled': True, 'qos_policy_id': None, 'network_qos_policy_id': None, 'profile': {}, 'vif_type': 'ovs', 'vnic_type': 'normal', 'security_groups': ['78f32d1a-ee3d-49be-b6cc-f4bc8ce706a0']}
2021-04-07 07:01:20.129 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] process_network_ports - iteration:299131 - treat_devices_added_or_updated completed. Skipped 0 and no activated binding devices 0 of 37 devices currently available. Time elapsed: 0.014
2021-04-07 07:01:20.135 7 INFO neutron.agent.securitygroups_rpc [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Preparing filters for devices {'548dbbd3-bc08-447e-a144-b5de106a58df'}
2021-04-07 07:01:22.521 7 INFO neutron.agent.securitygroups_rpc [req-3a7d7427-4d54-43f7-acef-dd6081f9b788 742a6700df704ffdbb29c0f73c57adae fa7cc4c1af894b7e92b8d0726f10d548 - - -] Security group member updated {'a808fd6a-dc24-433d-9b4d-e2680be38aba', '78f32d1a-ee3d-49be-b6cc-f4bc8ce706a0'}
2021-04-07 07:01:23.063 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] process_network_ports - iteration:299131 - agent port security group processed in 2.948
2021-04-07 07:01:23.892 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Configuration for devices up ['548dbbd3-bc08-447e-a144-b5de106a58df'] and devices down [] completed.
2021-04-07 07:01:23.893 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Agent rpc_loop - iteration:299131 - ports processed. Elapsed:3.787
2021-04-07 07:01:23.893 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Agent rpc_loop - iteration:299131 completed. Processed ports statistics: {'regular': {'added': 1, 'updated': 1, 'removed': 0}}. Elapsed:3.787
2021-04-07 07:01:23.894 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Agent rpc_loop - iteration:299132 started
2021-04-07 07:01:23.898 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-9889f4be-3b0d-455c-b20f-30688126b1da - - - - -] Agent rpc_loop - iteration:299132 - starting polling. Elapsed:0.004
20...


Revision history for this message
Mark Goddard (mgoddard) wrote :

Added neutron to affected projects.

Revision history for this message
LIU Yulong (dragon889) wrote :

Please try this fix: https://review.opendev.org/c/openstack/neutron/+/779463
It was released in version 17.1.1.

For more information, see this bug:
https://bugs.launchpad.net/neutron/+bug/1916761
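
For background: with DVR, the L3 agent pre-populates a permanent ARP entry in the qrouter namespace for every port it knows about on the router's subnets, which is why your entry shows up as PERM. It is roughly equivalent to the following (illustrative only; the agent programs this through its own IP library rather than the shell):

# What the DVR L3 agent effectively installs for each known port:
ip netns exec qrouter-<router-id> \
    ip neigh replace 10.10.13.87 lladdr fa:16:3e:99:08:a5 nud permanent dev qr-4affa6db-67

If such an entry is left pointing at an old MAC after the port is recreated, traffic from the router namespace keeps using the stale MAC until the entry is refreshed, which matches the behaviour you describe.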

Revision history for this message
Miguel Lavalle (minsel) wrote :

@Andrei,

Did you try the fix suggested by @Yulong? Did it work?

Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Andrei Panescu (apanescu) wrote :

Hello,

My tests above were run with neutron-17.1.2.dev12. I checked the code and it seems that the container code already contains the fix mentioned in bugfix 779463.

The problem still persists.

Do you have any other suggestion?

How can we proceed further?

Thanks!
Andrei

Revision history for this message
LIU Yulong (dragon889) wrote :

Your pasted command line outputs did not show any router namespace name. It's really hard to guess where you were running the commands. Could you add the namespace to the front of your commands? Something like this:
# ip netns exec qrouter-xxxxx-xxxx-xxx-xxx-xxx ip a
# ip netns exec qrouter-xxxxx-xxxx-xxx-xxx-xxx ip route

Some other helpful things would be your config files, log files and your test topology.

Revision history for this message
Andrei Panescu (apanescu) wrote :

Hello,

Sorry for the late response.

We encountered the same issue again and I went through the workaround steps once more.

So the faulty VM is on compute-40.

Here are all the namespaces configured on this compute node.

[root@compute-40 ~]# ip netns
fip-bb27c880-c0ac-4941-9aa2-70de032cbb7c (id: 5)
qrouter-7413152f-8b53-4bc5-bdb0-ca936f7aee17 (id: 2)
qrouter-9b2225f1-b8ba-4079-8441-6e3d599fafb1 (id: 0)
qrouter-0786134b-55a4-4549-99e1-d0f34a7bf897 (id: 1)
qrouter-7e060fb5-ca41-4d29-9cac-48f353509208 (id: 3)
qrouter-48248482-9412-444c-84e8-ece4309bb7a0 (id: 4)

Our namespace is the following:

[root@compute-40 ~]# ip netns exec qrouter-9b2225f1-b8ba-4079-8441-6e3d599fafb1 bash
[root@compute-40 ~]# ip r
10.10.13.0/24 dev qr-4affa6db-67 proto kernel scope link src 10.10.13.1
169.254.127.96/31 dev rfp-9b2225f1-b proto kernel scope link src 169.254.127.96
[root@compute-40 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: rfp-9b2225f1-b@if5: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether c2:12:6b:27:29:bd brd ff:ff:ff:ff:ff:ff link-netns fip-bb27c880-c0ac-4941-9aa2-70de032cbb7c
    inet 169.254.127.96/31 scope global rfp-9b2225f1-b
       valid_lft forever preferred_lft forever
    inet6 fe80::c012:6bff:fe27:29bd/64 scope link
       valid_lft forever preferred_lft forever
58: qr-4affa6db-67: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:e4:a6:aa brd ff:ff:ff:ff:ff:ff
    inet 10.10.13.1/24 brd 10.10.13.255 scope global qr-4affa6db-67
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee4:a6aa/64 scope link
       valid_lft forever preferred_lft forever

The IP of the VM is 10.10.13.195.
When I try to ping it, it does not answer and there is a PERM ARP entry for it.

[root@compute-40 ~]# ping 10.10.13.195
PING 10.10.13.195 (10.10.13.195) 56(84) bytes of data.
^C
--- 10.10.13.195 ping statistics ---
5 packets transmitted, 0 received, 100% packet loss, time 78ms

[root@compute-40 ~]# arp -a | grep 10.10.13.195
? (10.10.13.195) at fa:16:3e:c5:b9:23 [ether] PERM on qr-4affa6db-67
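
(A quick way to compare this stale entry with the MAC that Neutron currently has on record for the IP, assuming the openstack client and admin credentials are available, is something like:)

$ openstack port list --fixed-ip ip-address=10.10.13.195 -c ID -c 'MAC Address'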

After I delete the ARP entry, the VM starts answering pings and the ARP entry changes.

[root@compute-40 ~]# arp -d 10.10.13.195
[root@compute-40 ~]# ping 10.10.13.195
PING 10.10.13.195 (10.10.13.195) 56(84) bytes of data.
64 bytes from 10.10.13.195: icmp_seq=1 ttl=64 time=1.24 ms
64 bytes from 10.10.13.195: icmp_seq=2 ttl=64 time=0.394 ms
64 bytes from 10.10.13.195: icmp_seq=3 ttl=64 time=0.319 ms
64 bytes from 10.10.13.195: icmp_seq=4 ttl=64 time=0.363 ms
^C
--- 10.10.13.195 ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 41ms
rtt min/avg/max/mdev = 0.319/0.578/1.238/0.382 ms
[root@compute-40 ~]# arp -a | grep 10.10.13.195
? (10.10.13.195) at fa:16:3e:a4:76:8c [ether] on qr-4affa6db-67
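
(To list everything that has been programmed as a permanent entry on this interface from inside the namespace, something like this can be used:)

[root@compute-40 ~]# ip neigh show nud permanent dev qr-4affa6db-67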

So... I am not sure what log fil...

Revision history for this message
Andrei Panescu (apanescu) wrote :

It seems that my comments were not fully pasted. If you need any log files or config files, please let me know so I can paste them.

Thanks!
Andrei

Revision history for this message
LIU Yulong (dragon889) wrote :

There is a MAC inconsistency before and after the ARP entry deletion: "(10.10.13.195) at fa:16:3e:c5:b9:23 [ether] PERM on qr-4affa6db-67" versus "(10.10.13.195) at fa:16:3e:a4:76:8c [ether] on qr-4affa6db-67".

I guess it would be better to add more details from the very beginning of your case, such as the VM booting, router creation and so on.

summary: - OVS port issue
+ [L3] arp issue in router namespace in compute node
Revision history for this message
Andrei Panescu (apanescu) wrote :

Hi,

I managed to get another example:

I have the following VM id: 29d8a5ce-1347-4ebe-9441-2f9da57ee103

I found these logs in nova-compute.log, since it was created:

2021-06-07 10:50:57.723 7 INFO nova.compute.claims [req-f3d3adb9-1371-400b-ba17-7ca9660fd24e 2bda1d308eb0438da1ab307c20bc50cd ef6962f677d345249e0aedb708307c38 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Claim successful on node compute-40
2021-06-07 10:50:58.632 7 INFO nova.virt.libvirt.driver [req-f3d3adb9-1371-400b-ba17-7ca9660fd24e 2bda1d308eb0438da1ab307c20bc50cd ef6962f677d345249e0aedb708307c38 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Creating image
2021-06-07 10:51:26.022 7 INFO nova.compute.manager [req-e09743b5-4ba9-47de-a1c5-a639e52c9c79 - - - - -] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] VM Started (Lifecycle Event)
2021-06-07 10:51:26.079 7 INFO nova.compute.manager [req-e09743b5-4ba9-47de-a1c5-a639e52c9c79 - - - - -] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] VM Paused (Lifecycle Event)
2021-06-07 10:51:26.188 7 INFO nova.compute.manager [req-e09743b5-4ba9-47de-a1c5-a639e52c9c79 - - - - -] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] During sync_power_state the instance has a pending task (spawning). Skip.
2021-06-07 10:51:26.782 7 INFO nova.compute.manager [req-e09743b5-4ba9-47de-a1c5-a639e52c9c79 - - - - -] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] VM Resumed (Lifecycle Event)
2021-06-07 10:51:26.789 7 INFO nova.virt.libvirt.driver [-] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Instance spawned successfully.
2021-06-07 10:51:26.789 7 INFO nova.compute.manager [req-f3d3adb9-1371-400b-ba17-7ca9660fd24e 2bda1d308eb0438da1ab307c20bc50cd ef6962f677d345249e0aedb708307c38 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Took 28.16 seconds to spawn the instance on the hypervisor.
2021-06-07 10:51:26.916 7 INFO nova.compute.manager [req-f3d3adb9-1371-400b-ba17-7ca9660fd24e 2bda1d308eb0438da1ab307c20bc50cd ef6962f677d345249e0aedb708307c38 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Took 29.23 seconds to build instance.
2021-06-07 11:01:50.975 7 INFO nova.virt.libvirt.driver [req-2fbfa61d-7597-4b07-946f-ea9551141c7a 742a6700df704ffdbb29c0f73c57adae ef6962f677d345249e0aedb708307c38 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Instance shutdown successfully after 3 seconds.
2021-06-07 11:01:50.983 7 INFO nova.virt.libvirt.driver [-] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Instance destroyed successfully.
2021-06-07 11:02:05.449 7 INFO nova.compute.manager [-] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] VM Stopped (Lifecycle Event)
2021-06-07 11:02:19.258 7 INFO nova.virt.libvirt.driver [-] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Instance destroyed successfully.
2021-06-07 11:02:20.361 7 WARNING nova.compute.manager [req-596cd48e-7719-41da-993e-a9167f4353b8 ffbe9441e5b043059796254560cadb79 e1855b55c67942cda4fb76356fbffea4 - default default] [instance: 29d8a5ce-1347-4ebe-9441-2f9da57ee103] Received unexpected event network-vif-unplugged-b4d24314-7b8e-4b9a-85b8-fffd7d022b8f for instance with vm_state stopped a...

Revision history for this message
Andrei Panescu (apanescu) wrote :

Hello,

Do you have any update?

Thanks!
Andrei

Revision history for this message
Andrei Panescu (apanescu) wrote :

Hello,

Do you think this problem is solved in the latest version of OpenStack?

Thanks!
Andrei

Revision history for this message
LIU Yulong (dragon889) wrote :

Please try whether this patch works:
https://review.opendev.org/c/openstack/neutron/+/779463

It is included in neutron versions 17.1.1, 17.1.2 and 17.2.0.
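
(To confirm the exact neutron version running inside the agent container, assuming pip is available there, something like this can be used:)

# pip show neutron | grep Version
# python3 -c 'import pkg_resources; print(pkg_resources.get_distribution("neutron").version)'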

no longer affects: kolla-ansible