icmp packet loss with async mtu check

Bug #2046202 reported by Márton Kiss
Affects: charm-magpie
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

In some customer environments with Cisco Nexus switches, the magpie tests randomly lose a single ICMP packet under load.

2023-12-08 08:45:10 INFO unit.magpie-internal-space/0.juju-log server.go:325 ping stdout PING 10.99.14.7 (10.99.14.7) 8972(9000) bytes of data.
8980 bytes from 10.99.14.7: icmp_seq=1 ttl=64 time=0.209 ms
8980 bytes from 10.99.14.7: icmp_seq=2 ttl=64 time=0.184 ms
8980 bytes from 10.99.14.7: icmp_seq=3 ttl=64 time=0.202 ms
8980 bytes from 10.99.14.7: icmp_seq=4 ttl=64 time=0.196 ms
8980 bytes from 10.99.14.7: icmp_seq=5 ttl=64 time=0.522 ms
8980 bytes from 10.99.14.7: icmp_seq=6 ttl=64 time=2.34 ms <- response time raised
8980 bytes from 10.99.14.7: icmp_seq=7 ttl=64 time=1.58 ms
8980 bytes from 10.99.14.7: icmp_seq=8 ttl=64 time=0.860 ms
8980 bytes from 10.99.14.7: icmp_seq=9 ttl=64 time=1.29 ms
8980 bytes from 10.99.14.7: icmp_seq=10 ttl=64 time=1.42 ms <- packet loss happening after this event
8980 bytes from 10.99.14.7: icmp_seq=12 ttl=64 time=2.01 ms
8980 bytes from 10.99.14.7: icmp_seq=13 ttl=64 time=2.24 ms
8980 bytes from 10.99.14.7: icmp_seq=14 ttl=64 time=1.94 ms
8980 bytes from 10.99.14.7: icmp_seq=15 ttl=64 time=0.151 ms
8980 bytes from 10.99.14.7: icmp_seq=16 ttl=64 time=0.154 ms <- response time normalised
...

--- 10.99.14.7 ping statistics ---
40 packets transmitted, 39 received, 2.5% packet loss, time 4028ms
rtt min/avg/max/mdev = 0.151/0.502/2.340/0.652 ms

When the magpie test was executed with check_iperf=false, no packet loss was experienced at all. The goal of the mtu check should be to make sure that, for example, a 9k ICMP packet is able to pass. However, this call uses the same async ping call as the ping mesh check, with the same settings and parallel execution. Based on several test runs, it seems that the mtu test floods the switches with large ICMP packets, and (due to the hook execution order) it runs on multiple units at the same time, together with the iperf testing. This appears to saturate the bandwidth of the 25Gbit link, and ICMP has no QoS configured in those environments.

https://opendev.org/openstack/charm-magpie/src/branch/master/src/lib/charms/layer/magpie_tools.py#L801
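For illustration, the pattern described above is roughly an asyncio fan-out like the following. This is a minimal sketch of the concurrency shape only, not the actual magpie_tools.py code; the helper names and the one-packet-per-try structure are assumptions:

import asyncio

PING_TRIES = 20   # charm default referenced in this report
PAYLOAD = 8972    # 9000-byte frame minus 28 bytes of ICMP/IP headers

async def ping_once(target: str) -> int:
    # -c 1: single echo request, -M do: forbid fragmentation,
    # -s: payload size so the packet is 9000 bytes on the wire
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", "1", "-M", "do", "-s", str(PAYLOAD), target,
        stdout=asyncio.subprocess.DEVNULL,
    )
    return await proc.wait()

async def mtu_check(targets: list[str]) -> None:
    # Every target x every try is scheduled at once, so the whole
    # burst of jumbo ICMP packets hits the fabric within a second or two.
    tasks = [ping_once(t) for t in targets for _ in range(PING_TRIES)]
    await asyncio.gather(*tasks)

# asyncio.run(mtu_check(["10.99.14.7", "10.99.14.8"]))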

I reverted the ping code to synchronous execution for the mtu check, and in that case the packet loss did not happen at all. It would be great to consider sending only a single packet for testing, and running only one iperf + mtu ICMP check per deployment at a time, to avoid overloading the switches from multiple directions (for example via subsequent charm actions instead of random hook execution).

The patch used for sync mtu testing:
https://pastebin.canonical.com/p/VdjTgdmryC/
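Conceptually, the synchronous variant serializes the same probes, along these lines. Again this is only a sketch under the same assumptions as above; the actual change is in the pastebin:

import subprocess

def mtu_check_sync(targets, tries=20, payload=8972):
    # One ping at a time: the jumbo ICMP probes are spread out
    # instead of being fired as a single burst across all peers.
    for target in targets:
        for _ in range(tries):
            subprocess.run(
                ["ping", "-c", "1", "-M", "do", "-s", str(payload), target],
                stdout=subprocess.DEVNULL,
                check=False,
            )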

Revision history for this message
Nobuto Murata (nobuto) wrote :

Can you confirm what value was set for the following options?

- check_iperf
- ping_mesh_mode

It sounds like both are set to True. In our internal process nowadays, check_iperf=false is expected and the iperf check can be run using an action. But we couldn't flip the default value of check_iperf out of the box because of backward compatibility.

Changed in charm-magpie:
status: New → Incomplete
Revision history for this message
Márton Kiss (marton-kiss) wrote :

Both were set to true; that is the default setting.

Revision history for this message
Nobuto Murata (nobuto) wrote :

I think we can flip the default of ping_mesh_mode to be on the safe side. However, for the check_iperf part, it's hard to flip the default, and you should really disable it and use the action instead.

Changed in charm-magpie:
status: Incomplete → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-magpie (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-magpie/+/903431

Revision history for this message
Márton Kiss (marton-kiss) wrote :

Nobuto, I think the ICMP packet loss here was happening with check_iperf=true, and the ping mesh does not add much to the story. When check_iperf is enabled, the execution is the following:
- ping mtu check
- iperf speed test

The ping mtu check in the current code uses the config settings from the charm, where the default ping_tries is set to 20. Due to the async execution of the pings, they happen as fast as possible, so with 9 nodes that means 9x20 pings with a 9k packet size within 1-2 seconds (see the quick count below). Those steps are then repeated on, for example, half of the units at the same time, and a usual 9-node deployment with vlans consists of 9 x 7 units.
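To make the scale concrete, a quick count using the figures from this comment (the 1-2 second window and the "half of the units" estimate are taken from the report as-is):

# Quick count of the jumbo ICMP burst described above, using the
# figures from the report (9 nodes, ping_tries=20, 9 x 7 units,
# roughly half of them running the check in the same hook window).
nodes = 9
ping_tries = 20
units = 9 * 7                        # 63 units across the vlans
concurrent_units = units // 2        # ~31 units at the same time

pings_per_unit = nodes * ping_tries            # 180 jumbo pings per unit
total_pings = concurrent_units * pings_per_unit
print(pings_per_unit, total_pings)             # 180 per unit, ~5580 within 1-2 s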

So if using check_iperf is not recommended, it would be great to deprecate this option, or to add documentation for the config option explaining the drawbacks of using it.

Revision history for this message
Nobuto Murata (nobuto) wrote :

> So if using check_iperf is not recommended, it would be great to deprecate this option, or to add documentation for the config option explaining the drawbacks of using it.

Again, the default of on is there for backward compatibility. The charm is being rewritten with the operator framework, and the behavioral change could be done as part of that, as a new major version that breaks backward compatibility.
