Failures to connect to localhost due to arp table overflow disrupt whole cluster

Bug #1488938 reported by Bogdan Dobrelya
This bug affects 2 people
Affects             Status        Importance  Assigned to             Milestone
Fuel for OpenStack  Fix Released  Critical    Sergey Kolekonov
6.0.x               Won't Fix     Critical    Alexander Nevenchannyy
6.1.x               Fix Released  Critical    Denis Meltsaykin

Bug Description

When a network controller hosts a router that connects networks with many VMs, the node's ARP table overflows and local connections start to fail.

This affects OCF monitoring where some scripts use client libraries which send requests to localhost.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The UX impact of this issue is critical.

Changed in fuel:
importance: Undecided → Critical
tags: added: customer-found tricky
Changed in fuel:
milestone: none → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

And from the MOS perspective, the neutron agents' code should be inspected for possible flaws that cause so many connections (?) and make conntrack behave so unstably.

Changed in mos:
assignee: nobody → MOS Neutron (mos-neutron)
importance: Undecided → High
milestone: none → 7.0
Changed in mos:
assignee: MOS Neutron (mos-neutron) → Eugene Nikanorov (enikanorov)
status: New → Confirmed
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Indeed, it's hardly a neutron bug.
We need to test these conditions in our scale lab.
Right now it's not clear what we can do on the neutron side.

Meanwhile, the solution might be to configure conntrack on controllers and increase limits.

I'm putting this to Incomplete until we test it.

Changed in mos:
status: Confirmed → Incomplete
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Repro should involve creating the same number of resources/connections as on the node that hit the issue:

netstat -apn | wc -l
~1000

conntrack -L | wc -l
~4000

neutron floatingip-list | wc -l
~600

tags: added: scale
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Additional details:
router namespace has

conntrack -L | wc -l
430000

Reaching such a load requires kernel parameters to be adjusted.
But it seems possible to reproduce the issue with fewer connections under the default settings.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Update: the ARP neighbour cache may be undersized on nodes with many namespaces, resulting in thrashing and garbage collection of the ARP table.
The following potential fix should be tested against the 'ping: sendmsg: Invalid argument' and 'net_ratelimit' warning signs:
sysctl net.ipv4.neigh.default.gc_thresh1=1024
sysctl net.ipv4.neigh.default.gc_thresh2=2048
sysctl net.ipv4.neigh.default.gc_thresh3=4096

while current out-of-box values are:
net.ipv4.neigh.default.gc_thresh1 = 128
net.ipv4.neigh.default.gc_thresh2 = 512
net.ipv4.neigh.default.gc_thresh3 = 1024
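
To make the role of the three thresholds concrete, here is a minimal sketch. It assumes the semantics given in the kernel's neighbour-table documentation: below gc_thresh1 the garbage collector does not run, gc_thresh2 is a soft maximum that may only be exceeded briefly, and gc_thresh3 is the hard maximum. The function name `neigh_pressure` and its output labels are hypothetical and are not part of the proposed fix.

```shell
# neigh_pressure COUNT THRESH1 THRESH2 THRESH3
# Classify a neighbour-table entry count against the three GC thresholds.
# (Hypothetical helper for illustration only.)
neigh_pressure() {
    count=$1; t1=$2; t2=$3; t3=$4
    if [ "$count" -lt "$t1" ]; then
        echo "ok"           # below gc_thresh1: garbage collector does not run
    elif [ "$count" -le "$t2" ]; then
        echo "gc-eligible"  # up to gc_thresh2: entries may be collected
    elif [ "$count" -lt "$t3" ]; then
        echo "soft-limit"   # above gc_thresh2: tolerated only briefly
    else
        echo "hard-limit"   # at or above gc_thresh3: table is full
    fi
}
```

For instance, `neigh_pressure "$(ip -4 neigh show | wc -l)" 128 512 1024` would classify the current namespace's IPv4 entry count against the out-of-box values. On a node with many namespaces, a count taken in a single namespace may understate the real pressure on the table.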

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Reproduced by lowering net.ipv4.neigh.default.gc_thresh1/2/3 down to 4, 8, and 16 respectively.

Changed in mos:
status: Incomplete → Confirmed
Changed in fuel:
status: New → Confirmed
no longer affects: mos
Changed in fuel:
assignee: nobody → Sergey Kolekonov (skolekonov)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

thank you, @Eugene, for the great job done

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/218204

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/218204
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=6d32ad9e3e2fefa7819bfa7207623aa6667bf8bb
Submitter: Jenkins
Branch: master

commit 6d32ad9e3e2fefa7819bfa7207623aa6667bf8bb
Author: Sergey Kolekonov <email address hidden>
Date: Fri Aug 28 13:54:43 2015 +0300

    Avoid neighbour table overflow problem

    The “Neighbour table overflow” error occurs in large networks when there are
    too many ARP requests for the server to reply to.

    This problem can occur when there are a lot of connections in Neutron routers'
    namespaces. Tune sysctl to avoid this problem.

    Closes-bug: #1488938
    Co-Authored-By: Eugene Nikanorov <email address hidden>
    Co-Authored-By: Bogdan Dobrelya <email address hidden>
    Change-Id: I00e954c4792fd4d7993fd5e36fef9be86af22196

Changed in fuel:
status: In Progress → Fix Committed
summary: - Failure with a kernel panic when the limit of conntrack being hit under
- load
+ Failures to connect to localhost due to arp table overflow disrupt whole
+ cluster
description: updated
Revision history for this message
Andrew Woodward (xarses) wrote :

+ Customer found on 6.0; added series for 6.0 and 6.1. Please backport.

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 6.0-updates, as the fix affects fuel-library; will document as a known issue. Nominated for MU3 for 6.1.

tags: added: on-verification
Revision history for this message
Alexander Saprykin (cutwater) wrote :

Verified on ISO 288

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/223069

Roman Rufanov (rrufanov)
tags: added: support
tags: added: release-notes rn6.0-mu-7
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.1)

Reviewed: https://review.openstack.org/223069
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=35012e035389b4f3ccb8387133bf30117d4b7821
Submitter: Jenkins
Branch: stable/6.1

commit 35012e035389b4f3ccb8387133bf30117d4b7821
Author: Sergey Kolekonov <email address hidden>
Date: Fri Aug 28 13:54:43 2015 +0300

    Avoid neighbour table overflow problem

    The “Neighbour table overflow” error occurs in large networks when there are
    too many ARP requests for the server to reply to.

    This problem can occur when there are a lot of connections in Neutron routers'
    namespaces. Tune sysctl to avoid this problem.

    Closes-bug: #1488938
    Co-Authored-By: Eugene Nikanorov <email address hidden>
    Co-Authored-By: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 6d32ad9e3e2fefa7819bfa7207623aa6667bf8bb)
    Change-Id: I00e954c4792fd4d7993fd5e36fef9be86af22196

tags: added: on-verification
Revision history for this message
Alexey Stupnikov (astupnikov) wrote :

Verified on Fuel-6.1

tags: removed: on-verification
Revision history for this message
Alexander Bozhenko (alexbozhenko) wrote :

Could you please add the same for IPv6? There is a customer who uses it in production.
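
A rough sketch of what an IPv6 counterpart might look like. The kernel exposes the IPv6 neighbour table limits under net.ipv6.neigh.default.gc_thresh1/2/3, mirroring the IPv4 keys. The helper name `emit_neigh_sysctl` is hypothetical, and the values in the usage note are simply the ones proposed earlier in this thread; the values applied by the merged fix are not shown in this report, so treat them as an assumption.

```shell
# emit_neigh_sysctl FAMILY THRESH1 THRESH2 THRESH3
# Print sysctl.conf-style lines for one address family (ipv4 or ipv6).
# (Hypothetical helper; it only prints text and changes nothing on the host.)
emit_neigh_sysctl() {
    family=$1
    case "$family" in
        ipv4|ipv6) ;;
        *) echo "unsupported family: $family" >&2; return 1 ;;
    esac
    echo "net.$family.neigh.default.gc_thresh1 = $2"
    echo "net.$family.neigh.default.gc_thresh2 = $3"
    echo "net.$family.neigh.default.gc_thresh3 = $4"
}
```

Running `emit_neigh_sysctl ipv6 1024 2048 4096` prints three lines that could be reviewed and then placed in a sysctl configuration file, which keeps the IPv4 and IPv6 settings in step.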
