Route update loop with "allow_transit=true" in VN
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Juniper Openstack | Status tracked in Trunk | | |
Trunk | Fix Released | High | Nischal Sheth |
OpenContrail | Fix Committed | High | Nischal Sheth |
Bug Description
Hello,
To close on this topic: the issue eventually turned out to be a customer mistake (details below) made while using the brand-new transitivity feature (1.2). Nevertheless, the bottom line is that Contrail engineering could perhaps devise a mechanism to prevent such a mistake.
Here is a quick summary of the issue that we observed in our lab last week.
For test purposes, we configured the following type of setup:
{Access VN1}___[VM]___ _________
{Access VN2}___
The {Cust VN} is configured with “allow_
In this setup, each VN is created with an additional, separate route target (RT).
The issue occurred when, by mistake, we configured four customer setups that all used the same set of RTs and the same IP ranges.
Because the Cust VNs were transitive, they kept re-originating the routes in a loop.
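A toy model helps show why identical RTs combined with transitivity never converge. The sketch below is a deliberately simplified Python model, not Contrail's control-node logic. The VN names and the prefix are hypothetical. Every Cust VN is assumed to import and export one shared RT, and routes are counted as messages, with no best-path selection and no de-duplication. Nothing in the model terminates the chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    originator: str  # VN that (re-)originated this copy of the route


def reorigination_messages(vns, prefix, rounds):
    """Count re-originated route messages per round when every VN imports and
    exports the same RT and every VN is transitive.

    Simplified model: no best-path selection, no de-duplication, no loop check.
    """
    pending = [Route(prefix, vns[0])]  # the first VN originates the prefix
    counts = []
    for _ in range(rounds):
        reoriginated = []
        for route in pending:
            # Every other VN imports the route through the shared RT and,
            # being transitive, re-originates it under its own name.
            for vn in vns:
                if vn != route.originator:
                    reoriginated.append(Route(route.prefix, vn))
        counts.append(len(reoriginated))
        pending = reoriginated
    return counts


print(reorigination_messages(["cust1", "cust2", "cust3", "cust4"], "10.0.0.0/24", 4))
# [3, 9, 27, 81]
```

The real growth pattern will differ. With only two VNs, the model already ping-pongs forever (one message per round), and that missing termination condition is the point being illustrated.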
The compute node CPUs were running at 100% (1000%, as we have 10 cores), and the vrouter process was consuming as much RAM as it could, reaching up to 22 GB (on a system with 32 GB of RAM).
At the same time, the Contrail control node was running at around 300 to 400% CPU, with no increase in memory usage.
Finally, the Contrail analytics node was very busy, receiving tens of thousands of route messages. The contrail-
The issue was solved when the RTs were updated to be distinct on each VN. However, the memory used by the vrouter process on the compute node (22 GB) was not freed. Restarting the Contrail and Nova processes finally cleared all the problems.
While the issue came from a provisioning error on the “service” side, it reveals a weakness in the transitivity and route re-origination process: a route that has already been re-originated by the transitivity feature should not be re-originated again.
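The rule suggested here (never re-originate a route that was itself produced by re-origination) can be sketched as a single marker on each route. This is a hypothetical Python model, not Contrail code. The `reoriginated` field and the function name are illustrative assumptions, and routes are counted as messages, as in a simple simulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    prefix: str
    originator: str
    reoriginated: bool = False  # hypothetical marker: set once a transit VN re-originates


def messages_with_flag(vns, prefix, max_rounds):
    """Count messages per round when a transitive VN never re-originates a
    route that was itself produced by re-origination."""
    pending = [Route(prefix, vns[0])]
    counts = []
    for _ in range(max_rounds):
        next_round = []
        for route in pending:
            if route.reoriginated:
                continue  # imported as-is, but not re-originated a second time
            for vn in vns:
                if vn != route.originator:
                    next_round.append(Route(route.prefix, vn, reoriginated=True))
        if not next_round:
            break  # converged: nothing left to re-originate
        counts.append(len(next_round))
        pending = next_round
    return counts


print(messages_with_flag(["cust1", "cust2", "cust3", "cust4"], "10.0.0.0/24", 10))
# [3]
```

Propagation stops after one round. The trade-off is that a single flag permits only one transit hop, so a legitimate chain of transit VNs would be cut short as well.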
Cheers,
Nicolas
On Dec 3, 2014, at 18:17, Nicolas Marcoux <email address hidden> wrote:
Thx!
It is now evening and the customer is off.
Let’s try to plan this for tomorrow; I will get back to you.
Cheers,
Nicolas
On Dec 3, 2014, at 17:58, Praveen K V <email address hidden> wrote:
Hi Nicolas,
I would like to login and take a look at the setup. Can you arrange for remote access and let me know?
Regards,
Praveen
From: Nicolas Marcoux <email address hidden>
Date: Wednesday, December 3, 2014 at 10:10 PM
To: ask-contrail <email address hidden>
Cc: Nicolas Marcoux <email address hidden>
Subject: Fwd: Contrail V-Router memory leak...
Hello,
OBS has very likely hit a memory leak in the Contrail vRouter (version 1.2): it is using 22 GB of RAM!
=> Is this a known issue? If not, would it be possible to get assistance with debugging and finding the root cause? (The platform is available for remote access.)
Cheers,
Nicolas
Begin forwarded message:
From: <email address hidden>
Subject: Contrail V-Router memory leak...
Date: December 3, 2014 at 16:57:36 GMT+1
To: "<email address hidden>" <email address hidden>
Cc: GUINET Jean-Pierre SCE/IBNF <email address hidden>, GALLOT Frédéric SCE/IBNF <email address hidden>
Nicolas,
As discussed, we are seeing abnormal memory utilization by the vRouter process on our compute nodes:
sdn@RNET-SDN1:~$ top
top - 16:55:51 up 28 days, 14 min, 1 user, load average: 2.12, 0.74, 0.49
Tasks: 415 total, 1 running, 414 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.9%us, 0.7%sy, 0.0%ni, 98.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32644756k total, 32434692k used, 210064k free, 17204k buffers
Swap: 33517820k total, 30602472k used, 2915348k free, 40080k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2842 root 20 0 79.7g 22g 36m S 0 73.1 368:13.15 contrail-vroute
2592 nova 20 0 4412m 2.1g 3272 S 0 6.8 474:32.45 nova-compute
49071 libvirt- 20 0 5013m 485m 1400 S 6 1.5 510:37.18 qemu-system-x86
37332 libvirt- 20 0 4945m 457m 1400 S 6 1.4 525:38.29 qemu-system-x86
48910 libvirt- 20 0 4945m 355m 1416 S 7 1.1 527:02.85 qemu-system-x86
27659 libvirt- 20 0 5142m 340m 1416 S 6 1.1 515:59.12 qemu-system-x86
27458 libvirt- 20 0 6115m 324m 1416 S 6 1.0 522:00.75 qemu-system-x86
50875 libvirt- 20 0 5012m 243m 1400 S 6 0.8 512:30.28 qemu-system-x86
2841 root 20 0 151m 11m 2320 S 0 0.0 7:15.52 python
16552 sdn 20 0 25032 7592 1748 S 0 0.0 0:00.25 bash
The compute node is hosting only 6 VMs and is not heavily used...
Is this a known issue?
We are open to a troubleshooting session if needed.
Regards,
Pierre Aubry
EQUANT/
Landline: +33 2 23 28 32 37
<email address hidden>
_______
This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.
Nicolas Marcoux
m: +33 6 86 73 94 72
<email address hidden>
www.juniper.net
tags: added: contrail-control
Changed in opencontrail: status: Triaged → In Progress
Changed in opencontrail: status: In Progress → Fix Committed
We plan to protect against this by adding a new OriginVnList attribute that
logically works the same way as AsPath. We can detect re-origination loops
this way.
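The planned OriginVnList check can be illustrated with a minimal Python model, assuming it behaves like BGP AS_PATH loop detection as described: each re-originating VN appends itself, and a VN that already appears in the list drops the route. The `origin_vn_list` field and both function names are hypothetical stand-ins for illustration, not the actual Contrail attribute encoding or API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Route:
    prefix: str
    # Illustrative stand-in for OriginVnList: every VN that has (re-)originated
    # this route, oldest first.
    origin_vn_list: Tuple[str, ...]


def reoriginate(vn: str, route: Route) -> Optional[Route]:
    """Return the route as re-originated by `vn`, or None if that would loop.

    Same idea as BGP AS_PATH loop detection: a VN that already appears in the
    list drops the route instead of re-originating it.
    """
    if vn in route.origin_vn_list:
        return None
    return Route(route.prefix, route.origin_vn_list + (vn,))


def messages_with_origin_list(vns, prefix, max_rounds):
    """Count re-originated messages per round until propagation stops."""
    pending = [Route(prefix, (vns[0],))]  # the first VN originates the prefix
    counts = []
    for _ in range(max_rounds):
        next_round = []
        for route in pending:
            for vn in vns:
                new_route = reoriginate(vn, route)
                if new_route is not None:
                    next_round.append(new_route)
        if not next_round:
            break  # every remaining path would revisit a VN
        counts.append(len(next_round))
        pending = next_round
    return counts


print(messages_with_origin_list(["cust1", "cust2", "cust3", "cust4"], "10.0.0.0/24", 100))
# [3, 6, 6]
```

Unlike a single "already re-originated" flag, the list still lets a route cross several transit VNs in sequence and rejects only paths that return to a VN already visited. The cost is an attribute that grows with the number of hops.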