Vrouter agent crashed in FlowEntry::SetRpfNH

Bug #1472796 reported by Vinod Nair
This bug affects 2 people
Affects             Status                    Importance  Assigned to  Milestone
Juniper Openstack   Status tracked in Trunk
  R2.20             Fix Released              High        Naveen N
  Trunk             Fix Committed             High        Naveen N

Bug Description

During a long-running traffic test, the vrouter agent crashed on multiple compute nodes.
The cores are in http://10.87.129.3/pxe/Standard/vin/agent_7/
The backtrace is below.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/bin/contrail-vrouter-agent'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fb38b827cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
56 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
(gdb) bt
#0 0x00007fb38b827cc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007fb38b82b0d8 in __GI_abort () at abort.c:89
#2 0x00007fb38b820b86 in __assert_fail_base (fmt=0x7fb38b971830 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=assertion@entry=0x1005a1b "l3_flow_ == true", file=file@entry=0x1005c90 "controller/src/vnsw/agent/pkt/flow_table.cc",
    line=line@entry=1291, function=function@entry=0x1006720 "bool FlowEntry::SetRpfNH(FlowTable*, const AgentRoute*)") at assert.c:92
#3 0x00007fb38b820c32 in __GI___assert_fail (assertion=0x1005a1b "l3_flow_ == true",
    file=0x1005c90 "controller/src/vnsw/agent/pkt/flow_table.cc", line=1291,
    function=0x1006720 "bool FlowEntry::SetRpfNH(FlowTable*, const AgentRoute*)") at assert.c:101
#4 0x0000000000ad9a06 in FlowEntry::SetRpfNH(FlowTable*, AgentRoute const*) ()
#5 0x0000000000ae43d2 in FlowEntry::InitFwdFlow(PktFlowInfo const*, PktInfo const*, PktControlInfo const*, PktControlInfo const*) ()
#6 0x0000000000afbb2f in PktFlowInfo::Add(PktInfo const*, PktControlInfo*, PktControlInfo*) ()
#7 0x0000000000b06d5b in FlowHandler::Run() ()
#8 0x0000000000b03a26 in Proto::ProcessProto(boost::shared_ptr<PktInfo>) ()
#9 0x0000000000b0451c in boost::detail::function::function_obj_invoker1<boost::_bi::bind_t<bool, boost::_mfi::mf1<bool, Proto, boost::shared_ptr<PktInfo> >, boost::_bi::list2<boost::_bi::value<Proto*>, boost::arg<1> > >, bool, boost::shared_ptr<PktInfo> >::invoke(boost::detail::function::function_buffer&, boost::shared_ptr<PktInfo>) ()
#10 0x0000000000b05b8f in QueueTaskRunner<boost::shared_ptr<PktInfo>, WorkQueue<boost::shared_ptr<PktInfo> > >::RunQueue() ()
#11 0x0000000000fa3000 in TaskImpl::execute() ()
#12 0x00007fb38c3f6b3a in ?? () from /usr/lib/libtbb.so.2
#13 0x00007fb38c3f2816 in ?? () from /usr/lib/libtbb.so.2
#14 0x00007fb38c3f1f4b in ?? () from /usr/lib/libtbb.so.2
#15 0x00007fb38c3ee0ff in ?? () from /usr/lib/libtbb.so.2
#16 0x00007fb38c3ee2f9 in ?? () from /usr/lib/libtbb.so.2
#17 0x00007fb38c612182 in start_thread (arg=0x7fb33c779700) at pthread_create.c:312
#18 0x00007fb38b8eb47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Version: 2.2 Build 59, Juno, Ubuntu 14.04

Tags: vrouter
no longer affects: juniperopenstack/r2.30
tags: added: vrouter
Revision history for this message
Naveen N (naveenn) wrote :

The RPF nexthop for a VM interface flow is set from the layer-3 route table.
For intra-VN traffic originated by an ECMP source, the layer-3 route is
used to set the RPF nexthop, and that nexthop is a composite (ECMP) nexthop.
Ideally all ECMP traffic should be treated as layer-3 flows. In this
scenario, however, the destination is unicast, so the packet gets bridged and
the flow is layer 2. The agent then asserts, because ECMP is not expected for
layer-2 traffic.

The solution is to force routing by replying to the ARP request with the VRRP MAC
when the source route points to ECMP.
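
The ARP decision described above can be sketched roughly as follows. This is only an illustration, not the agent's or vrouter's actual code: MacAddress, RouteInfo, ArpReplyMac and the kVrrpMac value are hypothetical names and values chosen for the example.

```cpp
#include <array>
#include <cstdint>

// All names below are hypothetical; they are not the agent's real types.
using MacAddress = std::array<std::uint8_t, 6>;

// Illustrative router MAC taken from the VRRP range (00:00:5e:00:01:xx).
// The value actually used by vrouter is an assumption here.
constexpr MacAddress kVrrpMac = {0x00, 0x00, 0x5e, 0x00, 0x01, 0x00};

struct RouteInfo {
    bool is_ecmp;             // true if the route resolves to a composite (ECMP) nexthop
    MacAddress stitched_mac;  // MAC of the VM that owns the address
};

// Pick the MAC to put in the ARP reply sent to the requesting VM.
// If the requester's own (source) route is ECMP, answer with the router MAC
// so that the VM sends its packets to the router and they are L3 routed.
// Otherwise answer with the destination VM's stitched MAC (L2 switching).
inline MacAddress ArpReplyMac(const RouteInfo &src, const RouteInfo &dst) {
    if (src.is_ecmp) {
        return kVrrpMac;
    }
    return dst.stitched_mac;
}
```

With this choice, a VM behind an ECMP route never learns the destination VM's MAC directly, so its flows arrive at the agent as layer-3 flows and the RPF check on a composite nexthop is valid.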

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/12576
Submitter: Divakar Dharanalakota (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12582
Submitter: Divakar Dharanalakota (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12583
Submitter: Divakar Dharanalakota (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12582
Committed: http://github.org/Juniper/contrail-vrouter/commit/13ed9d0ceb53f9acde1af5b97a8758287d4ddeaa
Submitter: Zuul
Branch: R2.20

commit 13ed9d0ceb53f9acde1af5b97a8758287d4ddeaa
Author: Divakar <email address hidden>
Date: Thu Jul 23 16:22:40 2015 +0530

Forcing L3 processing for Ecmp

When an ARP request is received from a VM which is part of ECMP to a non
ECMP destination, ARP response is sent using the stitching
information of destination route. This leads to VM learning destination
VM's MAC directly enabling an L2 switching. But as source VM is part of
ECMP, the packets need to be L3 routed, as there is no ECMP support for L2
switched packet and also the reverse path packets (from non ECMP to ECMP)
are as such L3 routed.
Responding to VM with stitched MAC leads to the following issue in particular. IP
packet from VM will come to Vrouter module with L2 Mac, and this results in
flow trap to Agent. Agent while setting the forward L2 flow, attempts to set
RPF nexthop, which is composite Nexthop and asserts as L2 flow can not have
composite nexthop.

To fix this, ARP response is sent now with VRRP mac, which makes the flow as
L3 flow.

Change-Id: I28948ba73ed5c08602812e665bc4c49c5e1906aa
closes-bug: #1472796

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/12583
Committed: http://github.org/Juniper/contrail-vrouter/commit/a4069fab9f652728bdf67e91fd564938fd5830e3
Submitter: Zuul
Branch: master

commit a4069fab9f652728bdf67e91fd564938fd5830e3
Author: Divakar <email address hidden>
Date: Thu Jul 23 16:36:39 2015 +0530

Forcing L3 processing for Ecmp

When an ARP request is received from a VM which is part of ECMP to a non
ECMP destination, ARP response is sent using the stitching
information of destination route. This leads to VM learning destination
VM's MAC directly enabling an L2 switching. But as source VM is part of
ECMP, the packets need to be L3 routed, as there is no ECMP support for L2
switched packet and also the reverse path packets (from non ECMP to ECMP)
are as such L3 routed.
Responding to VM with stitched MAC leads to the following issue in particular. IP
packet from VM will come to Vrouter module with L2 Mac, and this results in
flow trap to Agent. Agent while setting the forward L2 flow, attempts to set
RPF nexthop, which is composite Nexthop and asserts as L2 flow can not have
composite nexthop.

To fix this, ARP response is sent now with VRRP mac, which makes the flow as
L3 flow.

Change-Id: I3ba4c66288f86e190d39f8761d4952b42050ebd8
closes-bug: #1472796

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12700
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13982
Submitter: Divakar Dharanalakota (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/14033
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/14033
Committed: http://github.org/Juniper/contrail-controller/commit/940e4c1252b4d9dea4a410a30069436cfda4ff87
Submitter: Zuul
Branch: R2.20

commit 940e4c1252b4d9dea4a410a30069436cfda4ff87
Author: Naveen N <email address hidden>
Date: Thu Sep 24 02:24:07 2015 -0700

* If an ECMP flow gets processed as a layer-2 flow, mark it as a short flow

If the source route is composite, then the flow has to be processed as
layer 3 so that proper load distribution happens. This is
forced by proxying with the vrouter MAC for ECMP destinations,
and if an ECMP source originates the packet, vrouter
also proxies with the VRRP MAC so that the packet gets processed
as a layer-3 flow. If a flow transitions from non-ECMP to
ECMP, then for some time the flow would be layer 2. Mark such a flow
as a short flow for that time.
Closes-bug:#1472796

Change-Id: I4256011c5838a68c4d9180d63df5d561f5d878f1
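
A minimal sketch of the behaviour change this commit describes, assuming a simplified flow object. FlowSketch, SetRpfNhSketch and the reason string are hypothetical and do not reproduce the real FlowEntry::SetRpfNH; the point is only that the layer-2 plus ECMP case is flagged instead of asserted on.

```cpp
#include <string>

// Hypothetical, simplified stand-in for the agent's flow entry.
struct FlowSketch {
    bool l3_flow = true;       // false when the packet was bridged (layer 2)
    bool short_flow = false;   // set when the flow is marked as a short flow
    std::string short_reason;  // free-form reason, for illustration only
};

// Returns true if the RPF nexthop can be set from the source route.
// Before the fix, the layer-2 + ECMP combination hit
// assert(l3_flow_ == true) and aborted the agent. Here the flow is
// marked as a short flow instead and the caller keeps running.
inline bool SetRpfNhSketch(FlowSketch &flow, bool src_route_is_ecmp) {
    if (src_route_is_ecmp && !flow.l3_flow) {
        flow.short_flow = true;
        flow.short_reason = "ECMP source route on a layer-2 flow";
        return false;
    }
    return true;
}
```

A flow that is still layer 2 while its source route has just become ECMP takes the first branch and is only flagged; layer-3 flows and non-ECMP sources are unaffected.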

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/14086
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Review in progress for https://review.opencontrail.org/16103
Submitter: Naveen N (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16103
Committed: http://github.org/Juniper/contrail-controller/commit/5e9a68085190956a467a12a5867d7d4ea6f3f984
Submitter: Zuul
Branch: master

commit 5e9a68085190956a467a12a5867d7d4ea6f3f984
Author: Naveen N <email address hidden>
Date: Wed Jan 6 11:41:08 2016 +0530

* If an ECMP flow gets processed as a layer-2 flow, mark it as a short flow

If the source route is composite, then the flow has to be processed as
layer 3 so that proper load distribution happens. This is
forced by proxying with the vrouter MAC for ECMP destinations,
and if an ECMP source originates the packet, vrouter
also proxies with the VRRP MAC so that the packet gets processed
as a layer-3 flow. If a flow transitions from non-ECMP to
ECMP, then for some time the flow would be layer 2. Mark such a flow
as a short flow for that time.
Closes-bug:#1472796,#1523059

Change-Id: I3eab6ed0772f758aadecb7f22cb28822277bdda9

information type: Proprietary → Public
Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Verified this using Contrail Solution Test:

Traffic test with 10 tenants, ~135 VMs, ~60 BMS; all VMs/BMS sending traffic to each other; features are vDNS/LLS/FIP/snat/lbaas; lots of flow setup/teardown; high b/w long lived flows; (3-4Gbps on each vRouter).
