[baremetal] Broadcast storm is generated by controller node during startup

Bug #1590170 reported by Artem Panchenko
Affects: Fuel for OpenStack
  Status: Fix Released · Importance: High · Assigned to: Fuel QA Team
Milestone: Mitaka
  Status: Invalid · Importance: High · Assigned to: Fuel QA Team

Bug Description

Fuel version info (9.0 mos #427): http://paste.openstack.org/show/508639/

During system startup, the controller (haproxy/vrouter namespaces) generates a lot of broadcast traffic (>100K pps); it looks like incoming ARP requests/replies are promiscuously forwarded into the management and public networks (see the attached dump2 file). This leads to a packet storm in the network infrastructure, and in most cases switches shut down such ports (see, for example, bug #1589530):

Jun 7 20:07:10.351: %PM-4-ERR_DISABLE: storm-control error detected on Gi0/4, putting Gi0/4 in err-disable state
Jun 7 20:07:10.686: %STORM_CONTROL-3-SHUTDOWN: A packet storm was detected on Gi0/4. The interface has been disabled.
Jun 7 20:07:10.703: %SW_MATM-4-MACFLAP_NOTIF: Host b235.6fa0.7c50 in vlan 365 is flapping between port Gi0/4 and port Gi0/3
Jun 7 20:07:11.634: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down

(NOTE: Gi0/3 is connected to eno1 on node-7, Gi0/4 to eno2 on node-7; b235.6fa0.7c50 is the MAC of the 'b_management' interface on node-7 in the 'haproxy' namespace)

Logs from node-7: http://paste.openstack.org/show/508768/
TCPdump statistics (for mirrored uplink port, 3 seconds between 20:07:09 and 20:07:11): http://paste.openstack.org/show/508769/

Steps to reproduce:

1. Enable storm control on the ports the controller node is attached to; set the broadcast limit to 10% of 1 Gb/s (i.e. 100 Mb/s)
2. Reboot controller node

Expected result: the node is rebooted and no broadcast storm is detected.

Actual result: the node is rebooted, a broadcast storm is detected and the ports are disabled on the switch.

Environment configuration:
3 controllers
3 computes
3 ceph-osd nodes
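As a rough back-of-envelope check of the numbers above (assuming minimum-size 64-byte Ethernet frames plus 8 bytes of preamble and a 12-byte inter-frame gap; the on-wire sizes are an assumption, not from the report):

```python
# How close does the observed ARP flood get to the 100 Mb/s
# (10% of 1 Gb/s) storm-control threshold from step 1?
# Assumption: each ARP packet occupies a minimum-size 64-byte Ethernet
# frame plus 8 bytes of preamble and a 12-byte inter-frame gap.

FRAME_ON_WIRE_BYTES = 64 + 8 + 12          # 84 bytes per frame on the wire
OBSERVED_PPS = 100_000                     # ">100K pps" from the description
THRESHOLD_BPS = 0.10 * 1_000_000_000       # 10% of 1 Gb/s = 100 Mb/s

flood_bps = OBSERVED_PPS * FRAME_ON_WIRE_BYTES * 8
print(f"flood ~= {flood_bps / 1e6:.1f} Mb/s "
      f"({100 * flood_bps / THRESHOLD_BPS:.0f}% of the threshold)")

# Maximum packet rate the threshold allows for minimum-size frames:
max_pps = THRESHOLD_BPS / (FRAME_ON_WIRE_BYTES * 8)
print(f"threshold allows at most ~{max_pps:,.0f} pps of minimum-size frames")
```

Since ">100K pps" is only a lower bound, a flood of small broadcast frames at that rate is already in the same ballpark as the configured limit, which is consistent with the switch tripping storm control.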

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
  • Attachment: dump2 (27.8 MiB, application/octet-stream)
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

Diagnostic snapshot: https://drive.google.com/file/d/0B-ky-xP2ZjWaNTg5eW8ycVFTaVE/view?usp=sharing

Controller node that was rebooted: node-7

Maciej Relewicz (rlu)
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Maksim Malchuk (mmalchuk)
tags: added: area-library move-to-mu
tags: added: blocker-for-qa
description: updated
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Agreed that it may be a lab misconfiguration; I will check.

Changed in fuel:
status: Confirmed → Incomplete
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Artem Panchenko (apanchenko-8) agreed.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

We can't prove that it's not a lab misconfiguration (besides, I'm not sure it's QA's responsibility to do so), but this will still affect us during baremetal acceptance testing; we can't re-set up this lab on the verge of HCF.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

According to the snapshot and timestamps provided by Artem, this flood coincides with the following commands in the ns_IPaddr2 OCF script:

https://github.com/openstack/fuel-library/blob/3508d8bc3499f580205f32e5cdaf1dbbcedf7728/files/fuel-ha-utils/ocf/ns_IPaddr2#L395-L401

According to tcpdump, the packets are identical to those produced by this arping command:

https://github.com/openstack/fuel-library/blob/3508d8bc3499f580205f32e5cdaf1dbbcedf7728/files/fuel-ha-utils/ocf/ns_IPaddr2#L400-L401

So the final command looks like this:
ip netns exec vrouter arping -A -c 32 -w 10 -I b_vrouter_pub -q 172.16.162.70

But I'm not able to reproduce it with this command. Moreover, this command generates no more than 32 packets, so it can't generate 100k packets in 1 second.
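The point above can be made quantitative: the `-c 32` flag alone caps a single arping invocation at 32 frames total (and `-w 10` additionally caps the run at 10 seconds), which is orders of magnitude below the observed flood rate. A small sanity-check sketch:

```python
# Why the ns_IPaddr2 arping cannot explain the storm:
# `arping -A -c 32 -w 10 ...` sends at most 32 gratuitous ARP frames in
# total (hard cap from -c), while the storm was measured at
# >100,000 packets *per second*.

ARPING_MAX_FRAMES = 32          # hard cap from -c 32
OBSERVED_PPS = 100_000          # lower bound from the bug description

ratio = OBSERVED_PPS / ARPING_MAX_FRAMES
print(f"storm rate is >{ratio:,.0f}x the *total* arping output, per second")
```

So even if every arping frame were duplicated many times by the network, the script itself cannot be the packet source; it can at most trigger amplification elsewhere.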

Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Yes, Alex, this command starts the ARP flow, but the storm itself is most likely produced by a wrong switch configuration in the lab.

Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Yeah; also, while rebooting the controller and debugging with tcpdump, I captured 20k packets within 1 second from some foreign IPs like this:
http://paste.openstack.org/show/510565/
Please note those IPs are not even from the env I was working on.

At the same time, I see only 16 packets of "Reply 172.16.162.70 is-at 26:8f:21:ce:51:c2, length 46" that were sent by the vip__vrouter_pub start.

Changed in fuel:
assignee: Maksim Malchuk (mmalchuk) → Fuel QA Team (fuel-qa)
Revision history for this message
Maksim Malchuk (mmalchuk) wrote :

Thanks, Aleksandr, for confirming my thoughts.

Revision history for this message
Timur Nurlygayanov (tnurlygayanov) wrote :

Please ignore my previous comment; it is related to another Launchpad issue.

Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :

Failed to reproduce on 9.0 ISO #477.
Rebooted all three controllers one by one (including the primary), all three at once, and even the cinder node (node-7 from the original bug).
After rebooting everything works fine, no storm detected.

Changed in fuel:
status: Incomplete → Invalid
status: Invalid → Incomplete
tags: removed: area-library
Revision history for this message
Volodymyr Shypyguzov (vshypyguzov) wrote :

Reproduced again on the HA environment with ceph as storage.

Changed in fuel:
status: Incomplete → Confirmed
description: updated
Revision history for this message
tata (tatayo) wrote :

Sorry, bad click :/ Can't change it back to "confirmed"

Changed in fuel:
status: Confirmed → Fix Released