With LinuxBridge/VXLAN ARP proxy, ip neigh replace fails due to ARP entry limits
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Confirmed | Medium | yalei wang |
Bug Description
In an environment with over 600 instances, we observed the LinuxBridge agent with l2pop on the network nodes failing to add neighbor (ARP) entries when booting instances. Without an ARP entry, the qrouter namespaces could not communicate with the instances, because their ARP requests were not proxied and were dropped. The 'ip neigh replace' command could be seen failing in the log with a 'RTNETLINK answers: No buffer space available' message. To resolve this, we increased the gc_thresh sysctl parameters from their defaults.
To demonstrate, we booted four instances:
infra03_
+------
| ID | Name | Status || Power State | Networks |
+------
| 0b5678f8-
| be2ecc51-
| a41b432c-
| b2c4a80c-
Three of the four 'ip neigh replace' commands failed on one of the infra nodes running an L3 agent, which happened to be the node hosting the router for the respective tenant network:
2015-04-30 19:06:01.835 748 ERROR neutron.
Command: ['sudo', '/usr/local/
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
2015-04-30 19:06:08.825 748 INFO neutron.
2015-04-30 19:06:21.800 748 ERROR neutron.
Command: ['sudo', '/usr/local/
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
2015-04-30 19:06:34.585 748 INFO neutron.
2015-04-30 19:06:44.641 748 ERROR neutron.
Command: ['sudo', '/usr/local/
Exit code: 2
Stdout: ''
Stderr: 'RTNETLINK answers: No buffer space available\n'
The failure was verified by the lack of a permanent ARP entry on the infra node for the three instances above:
root@infra01_
? (10.87.80.39) at fa:16:3e:3e:4d:30 [ether] PERM on vxlan-17
root@infra01_
root@infra01_
root@infra01_
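The check above can be scripted. The following is a minimal sketch, assuming the table is printed in the same format as the surviving line of output ('? (IP) at MAC [ether] PERM on DEVICE'). The function names are hypothetical, and any addresses other than 10.87.80.39 used with it would be placeholders, since the report does not show the other instances' addresses.

```python
import re

# Matches lines such as:
#   ? (10.87.80.39) at fa:16:3e:3e:4d:30 [ether] PERM on vxlan-17
ARP_LINE = re.compile(
    r"\((?P<ip>[0-9.]+)\) at (?P<mac>[0-9a-fA-F:]{17}) "
    r"\[ether\] (?P<flags>.*?)\s*on (?P<dev>\S+)$"
)


def permanent_entries(arp_output):
    """Return {ip: (mac, device)} for every entry flagged PERM."""
    entries = {}
    for line in arp_output.splitlines():
        match = ARP_LINE.search(line.strip())
        if match and "PERM" in match.group("flags").split():
            entries[match.group("ip")] = (match.group("mac"), match.group("dev"))
    return entries


def missing_permanent(arp_output, expected_ips):
    """Return the expected IPs that have no permanent entry, in input order."""
    present = permanent_entries(arp_output)
    return [ip for ip in expected_ips if ip not in present]
```

Passing the captured table text and the list of instance addresses to missing_permanent() returns the addresses for which the agent's 'ip neigh replace' did not take effect.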
We increased the gc_thresh sysctl parameters from their defaults:
FROM:
net.ipv4.
net.ipv4.
net.ipv4.
TO:
sysctl -w net.ipv4.
sysctl -w net.ipv4.
sysctl -w net.ipv4.
These may not be ideal values, but increasing them allowed subsequent instances to boot without issue.
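To judge how close a node is to the limit, the current neighbor entry count can be compared against gc_thresh3, which the kernel treats as the hard maximum for the table. The sketch below assumes the standard Linux location /proc/sys/net/ipv4/neigh/default for the IPv4 thresholds. The function names and the 0.8 warning ratio are hypothetical, and the caller supplies the entry count (for example, the number of lines printed by 'ip -4 neigh show'). Entries held in other network namespaces may also count toward the limit, so a per-namespace count is best treated as a lower bound.

```python
from pathlib import Path

# Standard location of the IPv4 neighbor table thresholds on Linux.
NEIGH_DEFAULT = Path("/proc/sys/net/ipv4/neigh/default")


def read_gc_thresholds(base=NEIGH_DEFAULT):
    """Return (gc_thresh1, gc_thresh2, gc_thresh3) as integers."""
    return tuple(int((Path(base) / f"gc_thresh{i}").read_text()) for i in (1, 2, 3))


def neighbor_headroom(entry_count, gc_thresh3, warn_ratio=0.8):
    """Classify neighbor table usage against the hard limit gc_thresh3.

    warn_ratio is an arbitrary illustrative cutoff, not a kernel setting.
    """
    if gc_thresh3 <= 0:
        raise ValueError("gc_thresh3 must be positive")
    usage = entry_count / gc_thresh3
    if entry_count >= gc_thresh3:
        status = "full"
    elif usage >= warn_ratio:
        status = "warning"
    else:
        status = "ok"
    return {
        "status": status,
        "free": max(gc_thresh3 - entry_count, 0),
        "usage": usage,
    }
```

A "full" status corresponds to the state in which adding further permanent entries can be expected to fail as described in this report.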
Changed in neutron:
assignee: nobody → yalei wang (yalei-wang)
Thanks for the report. What do you think would be good behavior here? Should the agent auto-adjust, or should we just have better logging?
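As an illustration of the "better logging" option only, here is a minimal sketch that recognizes the error text quoted in the report and turns it into an actionable message. It is not neutron's code: the function names are hypothetical, and it assumes the caller already has the exit code and stderr of the failed command.

```python
import logging

LOG = logging.getLogger(__name__)

# Error text taken from the Stderr lines quoted in the report.
ENOBUFS_MARKER = "No buffer space available"


def explain_ip_neigh_failure(exit_code, stderr):
    """Return a hint if the failure looks like neighbor table exhaustion, else None."""
    if exit_code != 0 and ENOBUFS_MARKER in stderr:
        return (
            "ip neigh replace failed with '%s'; the kernel neighbor table "
            "may be full. Consider raising the gc_thresh sysctl parameters."
            % ENOBUFS_MARKER
        )
    return None


def log_ip_neigh_failure(exit_code, stderr, log=LOG):
    """Log the hint at ERROR level; return True if a hint was logged."""
    hint = explain_ip_neigh_failure(exit_code, stderr)
    if hint:
        log.error(hint)
    return hint is not None
```

Auto-adjusting the thresholds would be a separate and more invasive change; this sketch only makes the existing failure easier to diagnose.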