vrouter crashes with NetworkPolicy using a high number of community tags

Bug #1794702 reported by Slobodan Blatnjak
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
  R4.1:  Fix Committed / High / Pramodh D'Souza
  R5.0:  Fix Committed / High / Pramodh D'Souza
  Trunk: Fix Committed / High / Pramodh D'Souza

Bug Description

The 4.1.1 vrouter systematically crashes with a coredump when creating a Heat stack with a network policy that contains a high number of community tags (CT).
The limit appears to be around 60 CT; with a higher number, the customer can reproduce the crash systematically.

Will attach:
- several coredump generated just after the creation
- the contrail logs directory
- the heat stack with environment file

Revision history for this message
Slobodan Blatnjak (sblatnjak) wrote :
Changed in juniperopenstack:
importance: Undecided → High
milestone: none → r4.1.2.0
milestone: r4.1.2.0 → none
Revision history for this message
Slobodan Blatnjak (sblatnjak) wrote :

Hello,

Customer asked for an update.
They are very close to the limit at the moment (~50), and are concerned about reaching it with new services planned by the end of this year. At that time, they will have 4.1.2 in production.

Thanks,
Slobodan

Revision history for this message
Pramodh D'Souza (psdsouza) wrote :

Please provide the backtrace (bt) when filing these bugs. It is also better to include the OS version, etc.

Jeba Paulaiyan (jebap)
tags: added: blocker
Jeba Paulaiyan (jebap)
tags: removed: blocker
Revision history for this message
Pramodh D'Souza (psdsouza) wrote :

bt
Program terminated with signal 11, Segmentation fault.
#0 __strlen_sse2_pminub () at ../sysdeps/x86_64/multiarch/strlen-sse2-pminub.S:49
49 pcmpeqb (%rax), %xmm0
(gdb) bt
#0 __strlen_sse2_pminub () at ../sysdeps/x86_64/multiarch/strlen-sse2-pminub.S:49
#1 0x000000000251733e in pugi::impl::(anonymous namespace)::strlength (s=0x3030313a3939393e <Address 0x3030313a3939393e out of bounds>) at build/third_party/pugixml/src/pugixml.cpp:173
#2 0x0000000002518c69 in pugi::impl::(anonymous namespace)::strcpy_insitu (dest=@0x2b3914aafc40: 0x0, header=@0x2b3914aafc30: 47524159861456, header_mask=8, source=0x3030313a3939393e <Address 0x3030313a3939393e out of bounds>) at build/third_party/pugixml/src/pugixml.cpp:1528
#3 0x000000000251de47 in pugi::xml_attribute::set_value (this=0x2b38fcdfb940, rhs=0x3030313a3939393e <Address 0x3030313a3939393e out of bounds>) at build/third_party/pugixml/src/pugixml.cpp:3874
#4 0x000000000251dd05 in pugi::xml_attribute::operator= (this=0x2b38fcdfb940, rhs=0x3030313a3939393e <Address 0x3030313a3939393e out of bounds>) at build/third_party/pugixml/src/pugixml.cpp:3835
#5 0x0000000002516b8d in XmlPugi::AddAttribute (this=0x2b39149a2a00, key="node", value=<error reading variable: Cannot access memory at address 0x3030313a39393926>) at controller/src/xml/xml_pugi.cc:289
Python Exception <type 'exceptions.ValueError'> Cannot find type const VnListType::_Rep_type:
#6 0x000000000217f536 in AgentXmppChannel::ControllerSendV4V6UnicastRouteCommon (this=0x2b39005eda10, route=0x2b39149ee680, vn_list=std::set with 1 elements, sg_list=0x2b39142647b0, tag_list=0x2b39142647c8, communities=0x2b39142647e0, mpls_label=78, bmap=6, path_preference=..., associate=true,
    type=Agent::INET4_UNICAST, ecmp_load_balance=..., native_vrf_id=4294967295) at controller/src/vnsw/agent/controller/controller_peer.cc:1843
Python Exception <type 'exceptions.ValueError'> Cannot find type const VnListType::_Rep_type:
#7 0x0000000002183bce in AgentXmppChannel::ControllerSendRouteAdd (peer=0x2b39005eda10, route=0x2b39149ee680, nexthop_ip=0x54de3d0, vn_list=std::set with 1 elements, label=78, bmap=6, sg_list=0x2b39142647b0, tag_list=0x2b39142647c8, communities=0x2b39142647e0, type=Agent::INET4_UNICAST,
    path_preference=..., ecmp_load_balance=..., native_vrf_id=4294967295) at controller/src/vnsw/agent/controller/controller_peer.cc:2440
#8 0x0000000002169021 in RouteExport::UnicastNotify (this=0x2b393009e1d0, bgp_xmpp_peer=0x2b39005eda10, partition=0x2b3910031e60, e=0x2b39149ee680, type=Agent::INET4_UNICAST) at controller/src/vnsw/agent/controller/controller_export.cc:237
#9 0x0000000002168b60 in RouteExport::Notify (this=0x2b393009e1d0, agent=0x54de150, bgp_xmpp_peer=0x2b39005eda10, associate=true, type=Agent::INET4_UNICAST, partition=0x2b3910031e60, e=0x2b39149ee680) at controller/src/vnsw/agent/controller/controller_export.cc:166
#10 0x000000000216be36 in boost::_mfi::mf6<void, RouteExport, Agent const*, AgentXmppChannel*, bool, Agent::RouteTableType, DBTablePartBase*, DBEntryBase*>::operator() (this=0x2b39148d9460, p=0x2b393009e1d0, a1=0x54de150, a2=0x2b39005eda10, a3=true, a4=Agent::INET4_UNICAST, a5=0x2b3910031e60,
    a6=0x2b39149ee680) a...


Revision history for this message
Pramodh D'Souza (psdsouza) wrote :

Observations: 81 communities found in the route.
(gdb) p item
$1 = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b4d0 <vtable for autogen::ItemType+16>}, entry = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b4f0 <vtable for autogen::EntryType+16>}, nlri = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b650 <vtable for autogen::IPAddressType+16>},
      af = 1, safi = 1, address = "100.64.0.20/32"}, next_hops = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b590 <vtable for autogen::NextHopListType+16>}, next_hop = std::vector of length 1, capacity 1 = {{<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b5b0 <vtable for autogen::NextHopType+16>},
          af = 1, address = "198.19.56.212", mac = "", label = 78, vni = 0, tunnel_encapsulation_list = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b630 <vtable for autogen::TunnelEncapsulationListType+16>}, tunnel_encapsulation = std::vector of length 2, capacity 2 = {"gre", "udp"}},
          virtual_network = "default-domain:GEN-TNT-INTERNET:GEN-VRF-INTERNET", tag_list = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b610 <vtable for autogen::TagListType+16>}, tag = std::vector of length 0, capacity 0}}}}, version = 1, virtual_network = "", mobility = {<AutogenProperty> = {
        _vptr.AutogenProperty = 0x2a8b510 <vtable for autogen::MobilityType+16>}, seqno = 0, sticky = false}, sequence_number = 0, security_group_list = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b530 <vtable for autogen::SecurityGroupListType+16>},
      security_group = std::vector of length 0, capacity 0}, community_tag_list = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b570 <vtable for autogen::CommunityTagListType+16>}, community_tag = std::vector of length 81, capacity 81 = {"no-reoriginate", "999:10001", "999:10002", "999:10003",
        "999:10004", "999:10005", "999:10006", "999:10007", "999:10008", "999:10009", "999:10010", "999:10011", "999:10012", "999:10013", "999:10014", "999:10015", "999:10016", "999:10017", "999:10018", "999:10019", "999:10020", "999:10021", "999:10022", "999:10023", "999:10024", "999:10025", "999:10026",
        "999:10027", "999:10028", "999:10029", "999:10030", "999:10031", "999:10032", "999:10033", "999:10034", "999:10035", "999:10036", "999:10037", "999:10038", "999:10039", "999:10040", "999:10041", "999:10042", "999:10043", "999:10044", "999:10045", "999:10046", "999:10047", "999:10048", "999:10049",
        "999:10050", "999:10051", "999:10052", "999:10053", "999:10054", "999:10055", "999:10056", "999:10057", "999:10058", "999:10059", "999:10060", "999:10061", "999:10062", "999:10063", "999:10064", "999:10065", "999:10066", "999:10067", "999:10068", "999:10069", "999:10070", "999:10071", "999:10072",
        "999:10073", "999:10074", "999:10075", "999:10076", "999:10077", "999:10078", "999:10079", "999:10080"}}, local_preference = 100, med = 0, load_balance = {<AutogenProperty> = {_vptr.AutogenProperty = 0x2a8b550 <vtable for autogen::LoadBalanceType+16>}, load_balance_fields = {<AutogenProperty> = {
          _vptr.AutogenProperty = 0x2a8b5f0 <vtable for autogen::LoadBalanceFieldListType+16>}, load_balance_field_list = std::vector of length 0...


Revision history for this message
Pramodh D'Souza (psdsouza) wrote :

While encoding the above route, there was a buffer overrun.

(gdb) p sizeof(data_)
$13 = 4096
(gdb) p datalen_
$3 = 4511
(gdb)
(gdb) p data_
$2 = "<?xml version=\"1.0\"?>\n<iq type=\"set\" from=\"pop296b-compute0.snow.rennes.lab\" to=\"<email address hidden>/bgp-peer\" id=\"pubsub16728\">\n<pubsub xmlns=\"http://jabber.org/protocol/pubsub\">\n<publish node=\"1/1/default-domain:GEN-TNT-INTERNET:GEN-VRF-INTERNET:GEN-VRF-INTERNET/100.64.0.20/32\">\n<item>\n<entry>\n<nlri>\n<af>1</af>\n<safi>1</safi>\n<address>100.64.0.20/32</address>\n</nlri>\n<next-hops>\n<next-hop>\n<af>1</af>\n<address>198.19.56.212</address>\n<mac></mac>\n<label>78</label>\n<vni>0</vni>\n<tunnel-encapsulation-list>\n<tunnel-encapsulation>gre</tunnel-encapsulation>\n<tunnel-encapsulation>udp</tunnel-encapsulation>\n</tunnel-encapsulation-list>\n<virtual-network>default-domain:GEN-TNT-INTERNET:GEN-VRF-INTERNET</virtual-network>\n<tag-list />\n</next-hop>\n</next-hops>\n<version>1</version>\n<virtual-network></virtual-network>\n<mobility seqno=\"0\" sticky=\"false\" />\n<sequence-number>0</sequence-number>\n<security-group-list />\n<community-tag-list>\n<community-tag>no-reoriginate</community-tag>\n<community-tag>999:10001</community-tag>\n<community-tag>999:10002</community-tag>\n<community-tag>999:10003</community-tag>\n<community-tag>999:10004</community-tag>\n<community-tag>999:10005</community-tag>\n<community-tag>999:10006</community-tag>\n<community-tag>999:10007</community-tag>\n<community-tag>999:10008</community-tag>\n<community-tag>999:10009</community-tag>\n<community-tag>999:10010</community-tag>\n<community-tag>999:10011</community-tag>\n<community-tag>999:10012</community-tag>\n<community-tag>999:10013</community-tag>\n<community-tag>999:10014</community-tag>\n<community-tag>999:10015</community-tag>\n<community-tag>999:10016</community-tag>\n<community-tag>999:10017</community-tag>\n<community-tag>999:10018</community-tag>\n<community-tag>999:10019</community-tag>\n<community-tag>999:10020</community-tag>\n<community-tag>999:10021</community-tag>\n<community-tag>999:10022</community-tag>\n<community-tag>999:10023</communit
y-tag>\n<community-tag>999:10024</community-tag>\n<community-tag>999:10025</community-tag>\n<community-tag>999:10026</community-tag>\n<community-tag>999:10027</community-tag>\n<community-tag>999:10028</community-tag>\n<community-tag>999:10029</community-tag>\n<community-tag>999:10030</community-tag>\n<community-tag>999:10031</community-tag>\n<community-tag>999:10032</community-tag>\n<community-tag>999:10033</community-tag>\n<community-tag>999:10034</community-tag>\n<community-tag>999:10035</community-tag>\n<community-tag>999:10036</community-tag>\n<community-tag>999:10037</community-tag>\n<community-tag>999:10038</community-tag>\n<community-tag>999:10039</community-tag>\n<community-tag>999:10040</community-tag>\n<community-tag>999:10041</community-tag>\n<community-tag>999:10042</community-tag>\n<community-tag>999:10043</community-tag>\n<community-tag>999:10044</community-tag>\n<community-tag>999:10045</community-tag>\n<community-tag>999:10046</community-tag>\n<community-tag>999:10047</community-tag>\n<community-tag>999:10048</comm...


Revision history for this message
Pramodh D'Souza (psdsouza) wrote :

Regarding the fix and the investigation needed to provide one:

- The Agent currently uses a fixed buffer of 4096 bytes per route to encode the XMPP message.
  This is widespread and used for all types of routes today.
- The Control Node uses a different (third-party) library API that does not require a local buffer to send messages to the Agent.
- The Control Node handles sending large routes received from the Agent via XMPP:
   * The max buffer is 32K for non-bgpaas peers and 4K for bgpaas peers. Note that the route above, although > 4K in XMPP format, is around 300 bytes at most when sent to a BGP peer.
   * It has logic to pack multiple routes into a single update and also handles a single route exceeding the buffer limit (suppresses the update).

A clean fix requires transitioning the Agent to the same APIs the Control Node uses for sending messages (a non-trivial fix), plus UT.

A shorter fix would be to change the Agent buffers from the current 4K to 8K; this requires UT to be added on the Agent and CN.

Recommend moving out of 4.1.2.

Jeba Paulaiyan (jebap)
tags: added: releasenote
Revision history for this message
Himanshu (bhimanshu) wrote :

@Slobodan, based on the internal discussion with Engineering, this can't be a blocker, as we understand the issue and have a workaround. The fix, however, involves changing the existing behavior, which would fall under a feature request. Please share your thoughts.

We also requested Scott Whyte to document these types of known limitations/thresholds in our user guide or release notes.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/47454
Submitter: Pramodh D'Souza (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/47484
Submitter: Pramodh D'Souza (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.1

Review in progress for https://review.opencontrail.org/47485
Submitter: Pramodh D'Souza (<email address hidden>)

Revision history for this message
Jeba Paulaiyan (jebap) wrote :

Notes:

It is recommended to keep the number of community tags in a Network Policy below 50.

Jeba Paulaiyan (jebap)
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/47454
Submitter: Pramodh D'Souza (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/47454
Committed: http://github.com/Juniper/contrail-controller/commit/5723f35b577855a31a3564b5f9f88ce9c59f26b6
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 5723f35b577855a31a3564b5f9f88ce9c59f26b6
Author: Pramodh D'Souza <email address hidden>
Date: Thu Nov 1 16:41:55 2018 -0700

Prevent Agent crashes due to buffer overrun

Currently a fixed buffer of 4096 bytes is used while encoding xmpp messages
sent to the control node. The size of the message varies depending on
configuration and features enabled, hence it is not easy to estimate the maximum
xmpp message size. The xmpp messages tend to be quite large since they are in
plain text format and are encoded based on the names of fields in the schema
(.xsd); some names are rather lengthy, and when such elements are a list the
messages explode in size. It should be noted that for routes, the real limit
would be reached by BGP on the Control Node, where the same route could be one
twentieth of the size of the xmpp message. The control node handles the case
where the route is too large to send in a single update message. After
considering the pros and cons it seems better to use a variable buffer on the
Agent while encoding messages sent to the Control Node, just as the Control Node
does when sending messages to the Agent. Moreover, just changing the buffer
size to some arbitrary size still leaves us with the problem of explaining
limitations to customers in terms of how many extended communities, tags etc.
will be supported. Note that when this problem occurs the agent is likely to
continuously reboot and not recover, and could also potentially exhibit strange
behaviour due to memory corruption.

Change-Id: Iddc7ef653a5dbad3307bdabfbe691b569c866985
Closes-Bug: 1794702

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/47485
Committed: http://github.com/Juniper/contrail-controller/commit/75a15427ba70f4498a676d6aa23995f1b78b1330
Submitter: Zuul (<email address hidden>)
Branch: R4.1

commit 75a15427ba70f4498a676d6aa23995f1b78b1330
Author: Pramodh D'Souza <email address hidden>
Date: Thu Nov 1 16:41:55 2018 -0700

Prevent Agent crashes due to buffer overrun

(Same commit message as the master-branch merge above.)

Change-Id: Iddc7ef653a5dbad3307bdabfbe691b569c866985
Closes-Bug: 1794702

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/47484
Committed: http://github.com/Juniper/contrail-controller/commit/1179af07fd9733ac3aa4db2c96ed0edaea16758f
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 1179af07fd9733ac3aa4db2c96ed0edaea16758f
Author: Pramodh D'Souza <email address hidden>
Date: Thu Nov 1 16:41:55 2018 -0700

Prevent Agent crashes due to buffer overrun

(Same commit message as the master-branch merge above.)

Change-Id: Iddc7ef653a5dbad3307bdabfbe691b569c866985
Closes-Bug: 1794702
