ipv6 neighbor advertisement storm

Bug #1532338 reported by Ryan Moats
This bug affects 4 people
Affects: neutron
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

Stable/liberty cloud with 20 network nodes, running OVS and supporting 1200 projects. Each project has one network with IPv4 and IPv6 subnets and one project router to attach to the external network.

Network nodes are seeing 1000 IPv6 Neighbour Advertisements within 2.3 seconds.

Assaf Muller (amuller)
Changed in neutron:
importance: Undecided → High
tags: added: ipv6
Changed in neutron:
status: New → Confirmed
tags: added: kilo-backport-potential
Sean M. Collins (scollins) wrote :

Currently working on adjusting the router advertisement interval

https://review.openstack.org/#/c/265451/

Changed in neutron:
assignee: nobody → Sean M. Collins (scollins)
Sean M. Collins (scollins) wrote :

Ryan - can you find out what type of NAs are being sent? My concern is that if these are neighbor discovery messages from compute instances (type 135 neighbor solicitations and type 136 neighbor advertisements), it becomes more difficult to fix, compared to just regulating the rate of type 134 packets (router advertisements).
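
For reference, RFC 4861 assigns a distinct ICMPv6 type number to each Neighbor Discovery message, which is what the question above hinges on. A minimal Python sketch for tallying a capture by type follows. The function name and the input (a list of type numbers already extracted from a capture) are assumptions for illustration, not part of neutron.

```python
from collections import Counter

# ICMPv6 Neighbor Discovery message types defined in RFC 4861.
ND_TYPES = {
    133: "Router Solicitation",
    134: "Router Advertisement",
    135: "Neighbor Solicitation",
    136: "Neighbor Advertisement",
    137: "Redirect",
}


def tally_nd_types(icmpv6_types):
    """Count ND messages by name; non-ND types are grouped as 'other'."""
    counts = Counter()
    for t in icmpv6_types:
        counts[ND_TYPES.get(t, "other")] += 1
    return dict(counts)
```

A capture dominated by type 134 points at radvd; one dominated by 135/136 points at the instances or at flooding in the data path.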

Sean M. Collins (scollins) wrote :

Just jotting more of my thoughts down - if it ends up being a flood of 135 and 136 packets, we may need to work on some code similar to the L2POP feature where we proxy the neighbor discovery protocol on the compute node, to reduce the amount of ND traffic that has to be multicast, similar to how L2POP handles ARP.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/265629

Ryan Moats (rmoats) wrote : Re: ipv6 neighbor advertisement storm seen in stable/liberty cloud

Sean - all of the packets in the capture I'm looking at have an ICMPv6 type field of 136, so I think we are going to have to look a bit deeper (sadpanda)

Sean M. Collins (scollins) wrote :

OK - in your packet captures, can you see whether the ICMPv6 neighbor advertisements have the "Solicited" flag set?

https://tools.ietf.org/html/rfc4861#section-4.4

      S Solicited flag. When set, the S-bit indicates that
                     the advertisement was sent in response to a
                     Neighbor Solicitation from the Destination address.
                     The S-bit is used as a reachability confirmation
                     for Neighbor Unreachability Detection. It MUST NOT
                     be set in multicast advertisements or in
                     unsolicited unicast advertisements.

The reason I ask is that if they are sent in response to an ICMPv6 neighbor solicitation, we can intercept the solicitations and answer them with a mechanism similar to l2pop: since we know that the IP address has been assigned and who it was assigned to, we can respond immediately and not have to flood the solicitation over the tunnel.

Unsolicited neighbor advertisements are a little more tricky to handle.
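
The local-responder idea can be sketched as a lookup keyed on the solicited target address. This is a toy illustration only: the class, its method names, and the returned dictionary are hypothetical and do not correspond to any existing neutron or l2pop code.

```python
import ipaddress


class NeighborProxy:
    """Toy local responder: answers Neighbor Solicitations from a known table."""

    def __init__(self):
        self._table = {}  # IPv6Address -> MAC string

    def learn(self, ip, mac):
        """Record an address assignment the control plane already knows about."""
        self._table[ipaddress.IPv6Address(ip)] = mac.lower()

    def handle_solicitation(self, target_ip):
        """Return the fields of a solicited NA, or None to fall back to flooding."""
        target = ipaddress.IPv6Address(target_ip)
        mac = self._table.get(target)
        if mac is None:
            return None
        return {
            "icmpv6_type": 136,      # Neighbor Advertisement
            "target": str(target),
            "target_lladdr": mac,    # carried in option type 2
            "solicited": True,       # S flag: this is a reply to a solicitation
            "override": True,
        }
```

Only solicitations for unknown targets would still need to cross the tunnel; unsolicited advertisements are not covered by this approach at all.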

Assaf Muller (amuller) wrote :

I removed the 'in stable/liberty' part from the bug title - I'm not aware of a reason to believe this bug affects only a specific version of OpenStack.

summary: - ipv6 neighbor advertisement storm seen in stable/liberty cloud
+ ipv6 neighbor advertisement storm
Ryan Moats (rmoats) wrote :

Sean, all of the packets have a flags value of 0x20000 (the router and solicited bits are not set and the override bit is set)
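
A short Python sketch of how the three NA flags map onto the 32-bit flags/reserved word from RFC 4861 section 4.4. It assumes the value quoted above is shorthand for the full word 0x20000000 as a capture tool would display it; that reading is an assumption, and the function name is made up for illustration.

```python
# Flag masks for the 32-bit flags/reserved word of a Neighbor Advertisement
# (RFC 4861, section 4.4): R, S and O occupy the three most significant bits.
NA_FLAG_ROUTER = 0x80000000
NA_FLAG_SOLICITED = 0x40000000
NA_FLAG_OVERRIDE = 0x20000000


def decode_na_flags(word):
    """Decode the R, S and O bits from the 32-bit word after the checksum."""
    return {
        "router": bool(word & NA_FLAG_ROUTER),
        "solicited": bool(word & NA_FLAG_SOLICITED),
        "override": bool(word & NA_FLAG_OVERRIDE),
    }
```

Under that assumption, `decode_na_flags(0x20000000)` reports only the override bit as set, i.e. unsolicited advertisements from hosts rather than routers.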

Bhalachandra Banavalikar (bhal-banavalikar) wrote :

Sean,

Also, please note that all of the neighbor advertisements (136) carry a single ICMPv6 option (Type: Target link-layer address (2)).

Thanks,
Bhal
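
For anyone inspecting the raw bytes, the Target Link-Layer Address option mentioned above has a simple layout: one byte of type (2), one byte of length in units of 8 octets, then the address. A minimal parsing sketch, assuming Ethernet (6-byte) addresses; the function name is hypothetical.

```python
import struct


def parse_target_lladdr_option(data):
    """Parse an ND option of type 2 carrying a 6-byte Ethernet address."""
    if len(data) < 8:
        raise ValueError("option too short")
    opt_type, opt_len = struct.unpack_from("!BB", data, 0)
    if opt_type != 2:
        raise ValueError("not a Target Link-Layer Address option")
    if opt_len != 1:
        # Length is in units of 8 octets; 1 is the value for Ethernet.
        raise ValueError("unexpected length for an Ethernet address")
    return ":".join(f"{b:02x}" for b in data[2:8])
```

Grouping the storm by this address (rather than by IPv6 source) may help show whether a few ports or many are responsible.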

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/266613

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/266613
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=18ec2e424e48fd9999235deeffffbffcff91d56f
Submitter: Jenkins
Branch: master

commit 18ec2e424e48fd9999235deeffffbffcff91d56f
Author: Brian Haley <email address hidden>
Date: Tue Jan 12 18:53:42 2016 -0500

    Register RA and PD config options in l3-agent

    In order for the l3-agent to see the RA and PD config options,
    it needs to register them when it starts. Noticed this when I
    went to override something for a test and it wouldn't work.
    It now passes the config down to radvd on start so the correct
    values are picked-up.

    Change-Id: Iec0e0d16eed4f12af77fcd4f0b93b641b1146293
    Related-Bug: #1532338

tags: added: needs-attention
Ryan Moats (rmoats) wrote :

Capture of Type 136 NA messages seen in storm

Sean M. Collins (scollins) wrote : Re: [Bug 1532338] Re: ipv6 neighbor advertisement storm

Continuing to hunt this down. The things that jump out at me are the
fact that the NA packets from each node are sent on extremely short
intervals, and 5 times.

RFC4861 specifies the following values for hosts, when sending NA
packets

MAX_ANYCAST_DELAY_TIME = 1 second
MAX_NEIGHBOR_ADVERTISEMENT = 3

So the fact that we're seeing 5 packets in the space of a couple
milliseconds has me suspicious. Kevin Benton reminded us on IRC about duplicate
FDB entries ( https://bugs.launchpad.net/bugs/1531013 ) - maybe this is
related.

--
Sean M. Collins
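
One way to check a capture for the pattern described above is to look, per source, for more advertisements in a short window than RFC 4861 would lead one to expect. The sketch below uses MAX_NEIGHBOR_ADVERTISEMENT as a rough threshold only; the one-second window, the function name, and the input format (timestamp, source) are assumptions for illustration.

```python
from collections import defaultdict

MAX_NEIGHBOR_ADVERTISEMENT = 3  # RFC 4861, section 10


def find_bursts(packets, window=1.0, limit=MAX_NEIGHBOR_ADVERTISEMENT):
    """Return sources that sent more than `limit` NAs within any `window` seconds.

    `packets` is an iterable of (timestamp_seconds, source_address) tuples.
    """
    by_source = defaultdict(list)
    for ts, src in packets:
        by_source[src].append(ts)

    offenders = {}
    for src, stamps in by_source.items():
        stamps.sort()
        start = 0
        worst = 0
        for end, ts in enumerate(stamps):
            # Slide the window start forward until it spans at most `window`.
            while ts - stamps[start] > window:
                start += 1
            worst = max(worst, end - start + 1)
        if worst > limit:
            offenders[src] = worst
    return offenders
```

If the same advertisement shows up five times within milliseconds, duplication in the data path (for example duplicate FDB entries) is a more likely explanation than the sender retransmitting.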

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/265451
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=d9e4d20da86f29f7ebdb3c6b07086924888edd39
Submitter: Jenkins
Branch: master

commit d9e4d20da86f29f7ebdb3c6b07086924888edd39
Author: Sean M. Collins <email address hidden>
Date: Fri Jan 8 13:32:14 2016 -0800

    Make advertisement intervals for radvd configurable

    Currently a global setting that is applied for all managed radvd
    processes. Per-process setting could be done in the future.

    For large clouds, it may be useful to increase the intervals, to reduce
    multicast storms.

    Co-Authored-By: Brian Haley <email address hidden>

    DocImpact Router advertisement intervals for radvd are now configurable
    Related-Bug: #1532338

    Change-Id: I6cc313599f0ee12f7d51d073a22321221fca263f
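
To illustrate what the configurable intervals end up controlling, here is a small Python sketch that renders a radvd.conf interface stanza with explicit MinRtrAdvInterval and MaxRtrAdvInterval values and checks them against the bounds in RFC 4861 section 6.2.1. The function and the interface name are hypothetical; this is not the template neutron itself uses.

```python
def render_radvd_interface(name, min_interval, max_interval):
    """Render one radvd.conf interface stanza with explicit RA intervals."""
    # Bounds from RFC 4861, section 6.2.1.
    if not 4 <= max_interval <= 1800:
        raise ValueError("MaxRtrAdvInterval must be between 4 and 1800 seconds")
    if not 3 <= min_interval <= 0.75 * max_interval:
        raise ValueError(
            "MinRtrAdvInterval must be between 3 seconds and 0.75 * MaxRtrAdvInterval"
        )
    return (
        f"interface {name}\n"
        "{\n"
        "   AdvSendAdvert on;\n"
        f"   MinRtrAdvInterval {min_interval};\n"
        f"   MaxRtrAdvInterval {max_interval};\n"
        "};\n"
    )


if __name__ == "__main__":
    print(render_radvd_interface("qr-example", 30, 100))
```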

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/265460
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=92a81c34ece6b28b599acc822a9b54f9a26324a3
Submitter: Jenkins
Branch: master

commit 92a81c34ece6b28b599acc822a9b54f9a26324a3
Author: Brian Haley <email address hidden>
Date: Fri Jan 8 17:47:34 2016 -0500

    Increase default IPv6 router advertisement interval

    The current values of min:3 and max:10 mean radvd is sending
    an RA about every 7 seconds, which can be excessive when we
    have thousands of routers. Let's relax it by 10x since most
    VMs will send a Router Solicitation at boot, obviating the need
    for a small interval.

    Related-Bug: #1532338
    Change-Id: Ie0a411f67d10ec1469841d70fb643409f77be56f
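
A back-of-the-envelope sketch of the effect of this change. RFC 4861 has a router pick each unsolicited RA delay uniformly between MinRtrAdvInterval and MaxRtrAdvInterval, so the mean gap is about (min + max) / 2. The figure of 1200 routers comes from the bug description; treating every router as running radvd in steady state is an assumption, and the helper names are made up.

```python
def mean_ra_interval(min_interval, max_interval):
    """Mean gap between unsolicited RAs for a uniformly chosen delay."""
    return (min_interval + max_interval) / 2


def aggregate_ra_rate(routers, min_interval, max_interval):
    """Approximate unsolicited RAs per second across all routers."""
    return routers / mean_ra_interval(min_interval, max_interval)


if __name__ == "__main__":
    for lo, hi in ((3, 10), (30, 100)):
        rate = aggregate_ra_rate(1200, lo, hi)
        print(f"min={lo}s max={hi}s -> about {rate:.1f} RAs/second cloud-wide")
```

Under these assumptions the old defaults give roughly 185 RAs per second across the cloud and the new ones roughly 18. This covers type 134 traffic only; it does not account for the type 136 packets reported in this bug.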

Ryan Moats (rmoats) wrote :

While this problem may still exist, it has led to an architectural rethink that has resulted
in a deeper ongoing exploration of the capabilities and scaling of OVN rather than
continuing with the reference ML2 plugin/L3 agent implementation

Sean M. Collins (scollins) wrote :

Marking this as incomplete, as a result of the reporter changing their environment away from the OVS ML2 plugin. It will now require someone to independently reproduce.

Changed in neutron:
status: Confirmed → Incomplete
status: Incomplete → Triaged
Sean M. Collins (scollins) wrote :

On second thought, I'm going to set it to triaged - since I wonder if this is related to https://bugs.launchpad.net/bugs/1531013

Assaf Muller (amuller) wrote :

@Ryan, if you gather any interesting data comparing OVN and the ref. impl., that could be useful to a lot of people.

Armando Migliaccio (armando-migliaccio) wrote :

Any update on this one?

Changed in neutron:
status: Triaged → Incomplete
assignee: Sean M. Collins (scollins) → nobody
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/308228

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/308228
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7560c84e7ab2b7162c5d1bfeb5ef6cb9bfb91bb5
Submitter: Jenkins
Branch: stable/liberty

commit 7560c84e7ab2b7162c5d1bfeb5ef6cb9bfb91bb5
Author: Brian Haley <email address hidden>
Date: Tue Jan 12 18:53:42 2016 -0500

    Register RA and PD config options in l3-agent

    In order for the l3-agent to see the RA and PD config options,
    it needs to register them when it starts. Noticed this when I
    went to override something for a test and it wouldn't work.
    It now passes the config down to radvd on start so the correct
    values are picked-up.

    Conflicts:
     neutron/agent/l3_agent.py
     neutron/opts.py
     neutron/tests/unit/agent/l3/test_agent.py

    Liberty modifications:
    - Liberty does not have a config generator, so updated l3_agent.ini file
      manually.

    Change-Id: Iec0e0d16eed4f12af77fcd4f0b93b641b1146293
    Related-Bug: #1532338
    (cherry picked from commit 18ec2e424e48fd9999235deeffffbffcff91d56f)

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/265629
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired