Rebooting the CN causes traffic loss for existing active streams

Bug #1481932 reported by Ranjit patro
Affects        Status   Importance   Assigned to   Milestone
OpenContrail   New      Undecided    Unassigned

Bug Description

After the controller (CN) is rebooted, traffic loss is seen on existing traffic streams that are already learned and installed in the forwarding DB.
Remote MACs are being deleted on multiple TORs.
The CN and TSN are running the latest build, 2.0-74.

Tags: blocker bms qfx
Ranjit patro (rpatro)
tags: added: bms
Ranjit patro (rpatro) wrote :

Even after the CN comes back up after the reboot, traffic never recovers for most of the streams.
The ToR agent shows "initializing" forever unless I stop all traffic.
When I stop all traffic, the TSN state recovers to active.

When I start traffic again, the TSN and CN states are fine but traffic doesn't recover (fxpc is seen at 100%; I will open a separate QFX PR for this).

root@cd-st-lnxserver15:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager active
contrail-discovery:0 active
contrail-schema active
contrail-svc-monitor active
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
supervisor-database: active
contrail-database active
contrail-database-nodemgr active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

root@cd-st-lnxserver15:~#

root@cd-st-lnxserver16:~# contrail-status >>>>>>>>>>
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-1 initializing (ToR:pdt-elit-01 connection down)
contrail-tor-agent-2 initializing (ToR:st-pdt-opus02 connection down)
contrail-tor-agent-3 active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active

========Run time service failures=============
/var/crashes/core.contrail-vroute.12194.cd-st-lnxserver16.1438089702
/var/crashes/core.contrail-vroute.2154.cd-st-lnxserver16.1438739145
/var/crashes/core.contrail-vroute.7206.cd-st-lnxserver16.1438770395
/var/crashes/core.contrail-vroute.2150.cd-st-lnxserver16.1437999440
/var/crashes/core.contrail-vroute.6335.cd-st-lnxserver16.1438109983
/var/crashes/core.contrail-tor-ag.2151.cd-st-lnxserver16.1438023139
/var/crashes/core.contrail-vroute.24050.cd-st-lnxserver16.1438095047
/var/crashes/core.contrail-vroute.18873.cd-st-lnxserver16.1438755472
/var/crashes/core.contrail-vroute.24639.cd-st-lnxserver16.1438783974
/var/crashes/core.contrail-vrou...
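
The stuck ToR agents in the output above are easy to miss among the healthy services. A minimal sketch, not part of the original report, that picks out every service line whose state is not "active" from pasted contrail-status text. The function name and the line-splitting rule are assumptions based only on the output format shown here; treating every state other than "active" as a problem is also an assumption.

```python
def non_active_services(status_text):
    """Return (service, state, detail) for each service line not in state 'active'.

    Assumes the layout seen in this report: "<service>[:] <state> [(detail)]".
    Section banners ("== ... =="), core-file paths and shell prompts are skipped.
    """
    flagged = []
    for raw in status_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("=", "/")):
            continue  # blank line, section banner, or core-file path
        parts = line.split(None, 2)
        if len(parts) < 2 or not parts[1].isalpha():
            continue  # not a "<service> <state>" line (e.g. the shell prompt)
        name = parts[0].rstrip(":")
        state = parts[1]
        detail = parts[2] if len(parts) > 2 else ""
        if state != "active":
            flagged.append((name, state, detail))
    return flagged


SAMPLE = """\
root@cd-st-lnxserver16:~# contrail-status
== Contrail vRouter ==
supervisor-vrouter: active
contrail-tor-agent-1 initializing (ToR:pdt-elit-01 connection down)
contrail-tor-agent-2 initializing (ToR:st-pdt-opus02 connection down)
contrail-tor-agent-3 active
contrail-vrouter-agent active
contrail-vrouter-nodemgr active
"""

for name, state, detail in non_active_services(SAMPLE):
    print(name, state, detail)
```

Run against the sample, this lists only contrail-tor-agent-1 and contrail-tor-agent-2, the two agents reporting "connection down".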


Ranjit patro (rpatro)
tags: added: blocker qfx
Hari Prasad Killi (haripk) wrote :

Hi Prabhu,
I think this issue is also due to the same problem. I tried rebooting one QFX in the setup and saw things come up fine after that. Please let us know if you see anything else.

Regards,
Hari

From: Prabhu Seshachellam <email address hidden>
Date: Wednesday, August 5, 2015 12:28 PM
To: Hari <email address hidden>, Ranjit Patro <email address hidden>, Sudarsan Paragiri Mohan <email address hidden>
Cc: Ashish Ranjan <email address hidden>, Anoop Kumar Sahu <email address hidden>, Rahul Kasralikar <email address hidden>, Prabhjot Singh Sethi <email address hidden>, Manish Singh <email address hidden>, Praveen K V <email address hidden>, Divakar Dharanalakota <email address hidden>, Anand H Krishnan <email address hidden>
Subject: RE: Contrail OVSDB with ELIT 15.1X53

Hi Hari,

Thanks for the update.

Can your team look into the second issue, since you have the setup?

"1109621 - ELIT_PDT(VxLAN OVSDB): rebooting one of VTEP TOR, permanent traffic loss is seen b/w all other active TORs"

As per Sudarsan, this is also contrail related.

/prabhu

From: Hari Prasad Killi
Sent: Tuesday, August 04, 2015 11:54 PM
To: Ranjit Patro; Sudarsan Paragiri Mohan; Prabhu Seshachellam
Cc: Ashish Ranjan; Anoop Kumar Sahu; Rahul Kasralikar; Prabhjot Singh Sethi; Manish Singh; Praveen K V; Divakar Dharanalakota; Anand H Krishnan
Subject: Re: Contrail OVSDB with ELIT 15.1X53

Hi,
Divakar and Anand have looked into this. Here is the summary of the issue:
The traffic sent wasn't broadcast traffic. Also, the destination MAC of the data packets is not known.
This results in the vrouter going through the overflow table (size 4K) for each packet, which takes more CPU cycles.
The SSL packets also go through the same CPU cores and are dropped randomly.
We changed the overflow table size to 1 (in /etc/modprobe.d/vrouter.conf) and reloaded vrouter. Things seem to be stable now. We will create a bug for this.

The QFX not seeing any response to its inactivity probe would also be due to the same reason. Please use this workaround on your TSN nodes while we work on solving this.

Regards,
Hari
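
The explanation above (an unknown destination MAC forces a walk through the overflow table for every packet) can be illustrated with a toy model. This is only a sketch under stated assumptions, not vrouter code: the class ToyTable, its bucket layout, and the rule that a miss compares against every overflow slot are all hypothetical, chosen to show why the cost of a miss would scale with the overflow table size.

```python
class ToyTable:
    """Toy hash table: fixed-size buckets plus one shared overflow area.

    Assumption for illustration: a lookup that misses in its bucket scans
    the whole overflow area linearly, one comparison per slot.
    """

    def __init__(self, buckets, slots_per_bucket, overflow_size):
        self.buckets = [[] for _ in range(buckets)]
        self.slots_per_bucket = slots_per_bucket
        self.overflow = [None] * overflow_size

    def insert(self, key):
        """Place key in its bucket, or in the first free overflow slot."""
        bucket = self.buckets[hash(key) % len(self.buckets)]
        if len(bucket) < self.slots_per_bucket:
            bucket.append(key)
            return True
        for i, slot in enumerate(self.overflow):
            if slot is None:
                self.overflow[i] = key
                return True
        return False  # both the bucket and the overflow area are full

    def lookup(self, key):
        """Return (found, probes), where probes counts entries compared."""
        probes = 0
        for entry in self.buckets[hash(key) % len(self.buckets)]:
            probes += 1
            if entry == key:
                return True, probes
        for slot in self.overflow:
            probes += 1
            if slot == key:
                return True, probes
        return False, probes


UNKNOWN_MAC = 0x0000DEADBEEF  # hypothetical destination MAC never inserted

big = ToyTable(buckets=8, slots_per_bucket=4, overflow_size=4096)
small = ToyTable(buckets=8, slots_per_bucket=4, overflow_size=1)

print(big.lookup(UNKNOWN_MAC)[1])    # probes per miss with a 4K overflow area
print(small.lookup(UNKNOWN_MAC)[1])  # probes per miss with overflow size 1
```

In this model every packet to an unknown MAC costs 4096 comparisons with the 4K overflow area and 1 comparison with size 1, which is the effect the workaround relies on. The trade-off is that a size-1 overflow area leaves almost no room for entries that do not fit in their bucket.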

Ranjit patro (rpatro) wrote : Re: [Bug 1481932] Re: Rebooting the CN causes traffic loss for existing active streams

Hi Hari,
  Regarding "https://bugs.launchpad.net/opencontrail/+bug/1481932":
In this PR I see the issue even after the overflow table size is set to 1.
Traffic never resumes even after the controller is up. Please note there are
a couple of issues w.r.t. this PR.
Below are the observations with the overflow table size set.

1) As soon as the CN is rebooted, remote MAC deletes are seen on multiple TORs.
2) Even after the CN comes online, 50-60% of streams do not recover.
3) The ToR agent shows "initializing" forever unless I stop all traffic.
When I stop all traffic, the TSN state recovers to active.
4) When I start traffic again, the TSN and CN states are fine but traffic
doesn't recover (fxpc is seen at 100%; I will open a separate QFX PR for this).

On 8/10/15, 4:38 AM, "<email address hidden> on behalf of Hari Prasad
Killi" <<email address hidden> on behalf of <email address hidden>> wrote:

>*** This bug is a duplicate of bug 1481606 ***
> https://bugs.launchpad.net/bugs/1481606
>
>
>** This bug has been marked a duplicate of bug 1481606
> TSN : High cpu utilization ...
