Route update loop with "allow_transit=true" in VN

Bug #1401010 reported by Praveen
This bug affects 1 person
Affects             Status                    Importance   Assigned to     Milestone
Juniper Openstack   Status tracked in Trunk
  Trunk             Fix Released              High         Nischal Sheth
OpenContrail        Fix Committed             High         Nischal Sheth

Bug Description

Hello,

To close on this topic: the issue eventually turned out to be a customer mistake (see details below) made while using the brand new transitivity feature (1.2). Nevertheless, the bottom line is: could Contrail engineering perhaps devise a mechanism to prevent such a mistake?

Here is a quick summary of the issue that we observed in our lab last week.

For test purposes, we configured the following type of setup:

{Access VN1}___[VM]___ _________
                      {         }
                      { Cust VN }--[VM]--{Internet VN}
{Access VN2}___[VM]___{_________}

The {Cust VN} is configured with “allow_transit=true” to allow the default route from the Internet VN to be learned by the Access VNx.
In this setup, each VN is created with an additional separate RT.

The issue happened when, by mistake, we configured 4 customer setups that all used the same set of RTs and the same IP ranges.
Because the Cust VNs are transitive, they were re-originating the routes in a loop.
The compute node CPUs were running at 100% (1000%, as we have 10 cores) and the vrouter process was taking as much RAM as it could. It reached up to 22 GB of RAM (on our system with 32 GB of RAM).
At the same time, the contrail control node was running at around 300 to 400% CPU, without any increase in memory usage.
Finally, the Contrail analytics node was very busy, receiving tens of thousands of route messages. The contrail-analytics-api and Cassandra (Java) processes were using all the available RAM and CPU.

The issue was solved when the RTs were updated to be distinct on each VN. However, the memory used by the vrouter process on the compute node (22 GB) was not freed. A restart of the contrail and nova processes finally cleared all the problems.

While the issue came from a provisioning mistake on the “service” side, it shows a weakness in the transitivity and route re-origination process. A route that was already re-originated by the transitivity feature should not be re-originated again.

Cheers,

Nicolas

On Dec 3, 2014, at 18:17, Nicolas Marcoux <email address hidden> wrote:

Thx!

It is evening here now and the customer has left for the day.

Let’s try to plan this tomorrow, I will get back to you.

Cheers,

Nicolas

On Dec 3, 2014, at 17:58, Praveen K V <email address hidden> wrote:

Hi Nicolas,

I would like to log in and take a look at the setup. Can you arrange for remote access and let me know?

Regards,
Praveen

From: Nicolas Marcoux <email address hidden>
Date: Wednesday, December 3, 2014 at 10:10 PM
To: ask-contrail <email address hidden>
Cc: Nicolas Marcoux <email address hidden>
Subject: Fwd: Contrail V-Router memory leak...

Hello,

OBS has very likely hit a memory leak on the Contrail vRouter (version 1.2): it is using 22 GB of RAM!

=> Is it a known issue? If not, would it be possible to get assistance for debugging and finding the root cause? (platform available for remote access)

Cheers,

Nicolas

Begin forwarded message:

From: <email address hidden>
Subject: Contrail V-Router memory leak...
Date: December 3, 2014 at 16:57:36 GMT+1
To: "<email address hidden>" <email address hidden>
Cc: GUINET Jean-Pierre SCE/IBNF <email address hidden>, GALLOT Frédéric SCE/IBNF <email address hidden>

Nicolas,

As discussed, we are seeing abnormal memory utilization by the vRouter process on our compute nodes:

sdn@RNET-SDN1:~$ top
top - 16:55:51 up 28 days, 14 min, 1 user, load average: 2.12, 0.74, 0.49
Tasks: 415 total, 1 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.9%us, 0.7%sy, 0.0%ni, 98.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32644756k total, 32434692k used, 210064k free, 17204k buffers
Swap: 33517820k total, 30602472k used, 2915348k free, 40080k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 2842 root 20 0 79.7g 22g 36m S 0 73.1 368:13.15 contrail-vroute
 2592 nova 20 0 4412m 2.1g 3272 S 0 6.8 474:32.45 nova-compute
49071 libvirt- 20 0 5013m 485m 1400 S 6 1.5 510:37.18 qemu-system-x86
37332 libvirt- 20 0 4945m 457m 1400 S 6 1.4 525:38.29 qemu-system-x86
48910 libvirt- 20 0 4945m 355m 1416 S 7 1.1 527:02.85 qemu-system-x86
27659 libvirt- 20 0 5142m 340m 1416 S 6 1.1 515:59.12 qemu-system-x86
27458 libvirt- 20 0 6115m 324m 1416 S 6 1.0 522:00.75 qemu-system-x86
50875 libvirt- 20 0 5012m 243m 1400 S 6 0.8 512:30.28 qemu-system-x86
 2841 root 20 0 151m 11m 2320 S 0 0.0 7:15.52 python
16552 sdn 20 0 25032 7592 1748 S 0 0.0 0:00.25 bash

The compute node is hosting only 6 VMs and is not heavily used...
Is this a known issue?
We are open to a troubleshooting session if needed.

Regards,


Pierre Aubry
EQUANT/IBNF/ENDD/NDE/NIE

fixe : +33 2 23 28 32 37
<email address hidden>


Nicolas Marcoux
m: +33 6 86 73 94 72
<email address hidden>
www.juniper.net


Nischal Sheth (nsheth)
tags: added: contrail-control
Nischal Sheth (nsheth) wrote :

We plan to protect against this by adding a new OriginVnList attribute that
logically works the same way as AsPath. We can detect re-origination loops
this way.
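
The AsPath analogy can be illustrated with a minimal Python sketch. This is not Contrail code: the function name `is_loop` and the VN names are hypothetical, and the only assumption is that a route carries an ordered list of the VNs that have already re-originated it.

```python
from typing import Sequence


def is_loop(origin_vn_list: Sequence[str], local_vn: str) -> bool:
    """AS_PATH-style check: a route is looping if the local VN
    already appears in the list of VNs it has passed through.

    `origin_vn_list` and `local_vn` are illustrative names only.
    """
    return local_vn in origin_vn_list


# Hypothetical example: a route that has already been re-originated
# by cust-vn-1 comes back to cust-vn-1 and must be rejected there.
path = ["cust-vn-2", "cust-vn-1", "internet-vn"]
looping = is_loop(path, "cust-vn-1")      # True: reject
fresh = is_loop(path, "cust-vn-3")        # False: accept
```

As with BGP AS_PATH, the check is purely local: each VN only needs to look for its own identifier in the list.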

Changed in opencontrail:
status: New → Triaged
assignee: Praveen (praveen-karadakal) → Nischal Sheth (nsheth)
Nischal Sheth (nsheth)
Changed in opencontrail:
status: Triaged → In Progress
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/6020
Committed: http://github.org/Juniper/contrail-controller/commit/0a70be6d79b21e9ad47e41cc2a3ec8fa8b230ad8
Submitter: Zuul
Branch: master

commit 0a70be6d79b21e9ad47e41cc2a3ec8fa8b230ad8
Author: Nischal Sheth <email address hidden>
Date: Tue Dec 30 14:00:46 2014 -0800

Prevent loops when re-originating service chain routes

Implement a new attribute called OriginVnPath to keep track of VNs from
which a given route has been re-originated. The OriginVn of ServiceChain
destination routing instance gets prepended to OriginVnPath when adding
the ServiceChain route. Further, a route is not re-originated if source
routing instance's OriginVn is already in the OriginVnPath.

This prevents route re-origination loops even if VNs are mis-configured
such that multiple VNs have the same route target and there are transit
VNs with mis-configured route targets as well.

Change-Id: Iab76ad4b07a69f71383db8e72531fb31bade04bd
Closes-Bug: 1401010
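
The rule described in the commit message can be sketched roughly as follows. This is a simplified Python model, not the contrail-controller implementation (which is C++): the `Route` class, the `reoriginate` function and the VN names are hypothetical, and the model assumes each routing instance is identified by a single OriginVn string.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    """Hypothetical route model: a prefix plus its OriginVnPath."""
    prefix: str
    origin_vn_path: Tuple[str, ...] = ()


def reoriginate(route: Route, source_vn: str, dest_vn: str) -> Optional[Route]:
    """Re-originate `route` for a service chain, or return None on a loop.

    `dest_vn` is the OriginVn of the service chain destination routing
    instance and `source_vn` that of the source routing instance,
    following the wording of the commit message.
    """
    # Do not re-originate if the source's OriginVn is already in the path.
    if source_vn in route.origin_vn_path:
        return None
    # Otherwise prepend the destination's OriginVn to the path.
    return Route(route.prefix, (dest_vn,) + route.origin_vn_path)


# Two mis-configured transit VNs that import each other's routes.
default = Route("0.0.0.0/0")
step1 = reoriginate(default, source_vn="cust-a", dest_vn="internet")
step2 = reoriginate(step1, source_vn="cust-b", dest_vn="cust-a")
step3 = reoriginate(step2, source_vn="cust-a", dest_vn="cust-b")  # None
```

In this model the third step is refused because cust-a is already in the path, so the route stops circulating instead of being re-originated indefinitely.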

Nischal Sheth (nsheth)
Changed in opencontrail:
status: In Progress → Fix Committed