[R2.20] Incorrect NH on TSN causes VM<>BMS packets to be dropped

Bug #1450683 reported by amit surana
Affects                                      Status         Importance  Assigned to   Milestone
Juniper Openstack (status tracked in Trunk)
  R2.20                                      Fix Committed  Medium      Manish Singh
  Trunk                                      Fix Committed  Medium      Manish Singh

Bug Description

R2.20, Ubuntu 14.04

The setup ended up in a state where the overlay VM is unable to reach the BMS server. I haven't attempted to recreate it to nail down the exact steps, because I wanted to leave the setup as is in case someone wants to debug it.

Started with 2 TORs (ovs/vtep), each having 1 BMS. The Tor Agent connections to the 2 TORs are up and the tunnels are set up correctly. The BMS MACs are known to the Tor Agent. Then added a guest VM in VN1. The BMSs are part of the same VN.

From the vrouter hosting the VM (14.1.1.3):

Flags: L=Label Valid, P=Proxy ARP, T=Trap ARP, F=Flood ARP
Destination           PPL  Flags  Label  Nexthop  Stitched MAC(Index)
14.1.1.1/32            32  PT     -      6        -
14.1.1.2/32            32  PT     -      6        -
14.1.1.3/32            32  P      -      15       2:42:4d:b0:a8:6e(172048)
169.254.169.254/32     32  PT     -      5        -

Kernel L2 Bridge table 0/1

Flags: L=Label Valid, Df=DHCP flood

Index    DestMac             Flags  Label/VNID  Nexthop
97192    ff:ff:ff:ff:ff:ff   L      4           23
150292   0:10:18:96:f0:b6           -           2
156248   0:1:0:0:0:3b        L      4           13
172048   2:42:4d:b0:a8:6e           -           18
242336   0:1:0:0:0:3c        L      4           12
252916   0:0:5e:0:1:0               -           2

0:1:0:0:0:3b and 0:1:0:0:0:3c are the two BMSs.

The VxLAN tunnels to the BMSs are set up correctly:

root@c4-fpc10:~# nh --get 13
Id:013 Type:Tunnel Fmly: AF_INET Flags:Valid, Vxlan, Rid:0 Ref_cnt:2 Vrf:0
 Oif:0 Len:14 Flags Valid, Vxlan, Data:f8 c0 01 23 13 c5 00 10 18 96 f0 b6 08 00
 Vrf:0 Sip:11.1.7.2 Dip:11.1.9.2

root@c4-fpc10:~# nh --get 12
Id:012 Type:Tunnel Fmly: AF_INET Flags:Valid, Vxlan, Rid:0 Ref_cnt:2 Vrf:0
 Oif:0 Len:14 Flags Valid, Vxlan, Data:f8 c0 01 23 13 c5 00 10 18 96 f0 b6 08 00
 Vrf:0 Sip:11.1.7.2 Dip:11.1.6.2

The L2 broadcast route points to the TSN:

root@c4-fpc10:~# nh --get 23
Id:023 Type:Composite Fmly:AF_BRIDGE Flags:Valid, Multicast, L2, Rid:0 Ref_cnt:4 Vrf:1
 Sub NH(label): 22(0) 19(0)

root@c4-fpc10:~# nh --get 22
Id:022 Type:Composite Fmly: AF_INET Flags:Valid, Fabric, Rid:0 Ref_cnt:2 Vrf:1
 Sub NH(label): 21(1030)

root@c4-fpc10:~# nh --get 21
Id:021 Type:Tunnel Fmly: AF_INET Flags:Valid, MPLSoGRE, Rid:0 Ref_cnt:2 Vrf:0
 Oif:0 Len:14 Flags Valid, MPLSoGRE, Data:f8 c0 01 23 13 c5 00 10 18 96 f0 b6 08 00
 Vrf:0 Sip:11.1.7.2 Dip:11.1.1.2
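
The chain 23 -> 22 -> 21 shown above can be followed mechanically. The following is a minimal sketch, not vrouter code: the `NEXTHOPS` dictionary is transcribed by hand from the `nh --get` output in this report, NH 19 is left out because it was never dumped, and treating a sub-label of 0 as "no label at this level" is an assumption made for the illustration. `resolve_tunnels` is a hypothetical helper name.

```python
# Nexthop data copied by hand from the `nh --get` output in this report.
# NH 19 was not dumped, so it is deliberately missing from the table.
NEXTHOPS = {
    23: {"type": "Composite", "subs": [(22, 0), (19, 0)]},
    22: {"type": "Composite", "subs": [(21, 1030)]},
    21: {"type": "Tunnel", "encap": "MPLSoGRE",
         "sip": "11.1.7.2", "dip": "11.1.1.2"},
}


def resolve_tunnels(nh_id, label=None, table=NEXTHOPS):
    """Return a list of (encap, dip, label) tuples reachable from nh_id.

    Sub-nexthops that are not in the table are reported as
    ("unknown", None, label) instead of raising an error.
    """
    nh = table.get(nh_id)
    if nh is None:
        return [("unknown", None, label)]
    if nh["type"] == "Tunnel":
        return [(nh["encap"], nh["dip"], label)]
    results = []
    for sub_id, sub_label in nh["subs"]:
        # Assumption: a sub-label of 0 carries no label at this level,
        # so the label inherited from the parent (if any) is kept.
        effective = sub_label if sub_label != 0 else label
        results.extend(resolve_tunnels(sub_id, effective, table))
    return results


for encap, dip, label in resolve_tunnels(23):
    print(encap, dip, label)
```

For the data in this report, the first resolved branch is an MPLSoGRE tunnel to 11.1.1.2 carrying label 1030, which is the label the receiving node must have in its MPLS input label map.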

However, on the TSN, there is no corresponding entry for MPLS label 1030.

root@c1-qa15:~# mpls --dump
MPLS Input Label Map

   Label NextHop
-------------------
      16 9

As such, all the broadcast packets from the compute server hosting the guest VM are dropped on the TSN with reason 'invalid NH'.
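
The mismatch described above (label 1030 sent by the compute node, but absent on the TSN) can be checked with a small script. This is a sketch under assumptions: `parse_mpls_dump` is a hypothetical helper, and the parsing only relies on the two-column layout of the `mpls --dump` output pasted in this report.

```python
def parse_mpls_dump(text):
    """Parse `mpls --dump` style output into {label: nexthop_id}.

    Only lines made of exactly two integer fields are treated as table
    rows; headers and separator lines are skipped.
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            table[int(fields[0])] = int(fields[1])
    return table


# Output of `mpls --dump` on the TSN, copied from this report.
TSN_DUMP = """\
MPLS Input Label Map

   Label NextHop
-------------------
      16 9
"""

tsn_labels = parse_mpls_dump(TSN_DUMP)
expected_label = 1030  # label used by NH 22 on the compute node

print(expected_label in tsn_labels)  # prints False for this dump
```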

Topology:

#Role definition of the hosts.
env.roledefs = {
    'all': [host1, host2, host3, host4, host5],
    'cfgm': [host3],
    'openstack': [host3],
    'control': [host3],
    'compute': [host1, host2, host4, host5],
    'collector': [host3],
    'webui': [host3],
    'database': [host3],
    'build': [host_build],
    'toragent': [host2],
    'tsn': [host2],
    'storage-master': [host3],
    'storage-compute': [host1, host2, host4]
}

controller: 10.87.129.245 root:n1keenA

Changed in juniperopenstack:
assignee: nobody → Manish Singh (manishs)
amit surana (asurana-t)
tags: added: vrouter
amit surana (asurana-t) wrote :

The core file can be found here:

10.84.5.100:/cs-shared/bugs/1450683/core.9946

In addition, a Tor Agent crash was also seen on the setup. If it turns out that this bug and the crash have different root causes, we can open another bug to track the latter.

Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x0000000000824380 in AgentPath::nexthop() const ()
(gdb) bt
#0 0x0000000000824380 in AgentPath::nexthop() const ()
#1 0x00000000009c9c24 in AgentXmppChannel::BuildTorMulticastMessage(autogen::EnetItemType&, std::basic_stringstream<char, std::char_traits<char>, std::allocator<char> >&, AgentRoute*, boost::asio::ip::address_v4 const*, std::string const&, std::vector<int, std::allocator<int> > const*, unsigned int, unsigned int, bool) ()
#2 0x00000000009cb7ec in AgentXmppChannel::ControllerSendEvpnRouteCommon(AgentRoute*, boost::asio::ip::address_v4 const*, std::string, std::vector<int, std::allocator<int> > const*, unsigned int, unsigned int, bool) ()
#3 0x00000000009cbe2c in AgentXmppChannel::ControllerSendEvpnRouteDelete(AgentXmppChannel*, AgentRoute*, std::string, unsigned int, unsigned int) ()
#4 0x00000000009b55ab in RouteExport::MulticastNotify(AgentXmppChannel*, bool, DBTablePartBase*, DBEntryBase*) ()
#5 0x0000000000ca0e82 in DBTableBase::RunNotify(DBTablePartBase*, DBEntryBase*) ()
#6 0x0000000000ca3518 in DBTablePartBase::RunNotify() ()
#7 0x0000000000c9f9ed in DBPartition::QueueRunner::Run() ()
#8 0x0000000000d9da70 in TaskImpl::execute() ()
#9 0x00007fe62c67ab3a in ?? () from /usr/lib/libtbb.so.2
#10 0x00007fe62c676816 in ?? () from /usr/lib/libtbb.so.2
#11 0x00007fe62c675f4b in ?? () from /usr/lib/libtbb.so.2
#12 0x00007fe62c6720ff in ?? () from /usr/lib/libtbb.so.2
#13 0x00007fe62c6722f9 in ?? () from /usr/lib/libtbb.so.2
#14 0x00007fe62c896182 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fe62bb6efbd in clone () from /lib/x86_64-linux-gnu/libc.so.6

tags: added: blocker
information type: Proprietary → Public
Manish Singh (manishs) wrote :

Control node did not send the nexthop for Tor 11.1.7.2

2015-04-30 23:40:24.981 XmppRxStream: Received xmpp message from: 11.1.4.2 Port 5269 Size: 1416 Packet:
<?xml version="1.0"?>
<message <email address hidden>" to="c1-qa15/bgp-peer">
  <event xmlns="http://jabber.org/protocol/pubsub">
    <items node="25/242/default-domain:admin:VN-1-L3:VN-1-L3">
      <item id="4-ff:ff:ff:ff:ff:ff,0.0.0.0/32">
        <entry>
          <nlri>
            <af>25</af>
            <safi>242</safi>
            <ethernet-tag>4</ethernet-tag>
            <mac>ff:ff:ff:ff:ff:ff</mac>
            <address>0.0.0.0/32</address>
          </nlri>
          <next-hops />
          <olist />
          <virtual-network>unresolved</virtual-network>
          <sequence-number>0</sequence-number>
          <security-group-list />
          <local-preference>100</local-preference>
          <edge-replication-not-supported>true</edge-replication-not-supported>
          <assisted-replication-supported>true</assisted-replication-supported>
          <leaf-olist>
            <next-hop>
              <af>1</af>
              <address>11.1.6.2</address>
              <label>4</label>
              <tunnel-encapsulation-list>
                <tunnel-encapsulation>vxlan</tunnel-encapsulation>
              </tunnel-encapsulation-list>
            </next-hop>
            <next-hop>
              <af>1</af>
              <address>11.1.9.2</address>
              <label>4</label>
              <tunnel-encapsulation-list>
                <tunnel-encapsulation>vxlan</tunnel-encapsulation>
              </tunnel-encapsulation-list>
            </next-hop>
          </leaf-olist>
          <replicator-address></replicator-address>
        </entry>
      </item>
    </items>
  </event>
</message>
$ controller/src/xmpp/xmpp_connection.cc 474

Since the agent never received this information, it did not program the nexthop.
We will have to check what the state was in the control node.
Please let us know if this is seen again, because we will need to check the CN.
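
The point about the missing nexthop can be checked against the XMPP payload itself. The sketch below parses a trimmed copy of the `<entry>` element from the trace (the full `<message>` is not well-formed as logged because the address was redacted, so only the inner entry is used). `summarize_entry` is a hypothetical helper, not part of the agent or control node code.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the <entry> element from the XMPP trace in this report.
ENTRY_XML = """
<entry>
  <nlri>
    <af>25</af><safi>242</safi><ethernet-tag>4</ethernet-tag>
    <mac>ff:ff:ff:ff:ff:ff</mac><address>0.0.0.0/32</address>
  </nlri>
  <next-hops />
  <olist />
  <leaf-olist>
    <next-hop><af>1</af><address>11.1.6.2</address><label>4</label></next-hop>
    <next-hop><af>1</af><address>11.1.9.2</address><label>4</label></next-hop>
  </leaf-olist>
</entry>
"""


def summarize_entry(xml_text):
    """Return the (address, label) pairs found under each nexthop list."""
    entry = ET.fromstring(xml_text)

    def hops(tag):
        parent = entry.find(tag)
        if parent is None:
            return []
        return [(nh.findtext("address"), int(nh.findtext("label")))
                for nh in parent.findall("next-hop")]

    return {tag: hops(tag) for tag in ("next-hops", "olist", "leaf-olist")}


summary = summarize_entry(ENTRY_XML)
for tag, hop_list in summary.items():
    print(tag, hop_list)
```

In this message `next-hops` and `olist` are both empty, and only `leaf-olist` carries the two VxLAN endpoints 11.1.6.2 and 11.1.9.2 with label 4.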

Manish Singh (manishs) wrote :

In addition to the explanation above, broadly this can happen for two reasons:
1) the CN had an issue, or 2) the Tor-agent for 11.1.7.2 didn't send a subscription.
Since supervisor-vrouter was restarted, there is no evidence left to tell which of the two happened.

If this issue is seen again, I recommend collecting the following:
1) Route table introspect from all tor-agents and the vrouter-agent (TSN).
2) Route table from the CN.
3) XMPP trace messages (XmppMessageTrace).

amit surana (asurana-t) wrote :

11.1.7.2 is not the TOR; it's a compute node.

The TORs are 11.1.6.2 and 11.1.9.2. The TSN/TorAgent process is running on 11.1.1.2.

amit surana (asurana-t) wrote :

Recreated on build 30.

The setup has 2 TORs, each with 2 interfaces/BMSs, one in vlan10 and the other in vlan20. Initially, all the routes are correctly programmed and connectivity is fine. After deleting the physical/logical interfaces belonging to one of the TORs (TOR-2) from the UI, the symptoms described in this bug appear: the broadcast route on the vRouter points to the TSN, but the TSN node has no corresponding entries for the incoming MPLS labels. As a result, all traffic is dropped with the error 'invalid NH', exactly as described above.

Setup details:

Controller: 10.87.129.245, root:n1keenA
TSN: 10.87.129.74 (mgmt), root:n1keenA, 11.1.1.2 (Data)
Compute: 10.87.140.126 (mgmt), root:juniper123, 11.1.7.2 (Data)
TOR-2: 10.87.140.225 (mgmt), root:juniper123, 11.1.9.2 (Data)
TOR-3: 10.87.130.83 (mgmt), root:n1keenA, 11.1.6.2 (Data)

Manish Singh (manishs) wrote :

This will get fixed via https://review.opencontrail.org/#/c/10902/
Please verify after that.

A wrong interface type causes the multicast code to delete the multicast object.

Mirai Neeku (mirai-neeku) wrote :

I am facing the same issue, but I installed the OpenStack multi-node package from a .deb file downloaded from the Juniper website.

How can I resolve this issue? I don't have any directory Juniper/contrail-controller / src/vnsw/agent/oper/vm_interface.cc on my system.
