ICMP packet loss with async MTU check
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
charm-magpie | Confirmed | Undecided | Unassigned |
Bug Description
In some customer environments with Cisco Nexus switches, the magpie tests randomly lose a single ICMP packet under load.
2023-12-08 08:45:10 INFO unit.magpie-
8980 bytes from 10.99.14.7: icmp_seq=1 ttl=64 time=0.209 ms
8980 bytes from 10.99.14.7: icmp_seq=2 ttl=64 time=0.184 ms
8980 bytes from 10.99.14.7: icmp_seq=3 ttl=64 time=0.202 ms
8980 bytes from 10.99.14.7: icmp_seq=4 ttl=64 time=0.196 ms
8980 bytes from 10.99.14.7: icmp_seq=5 ttl=64 time=0.522 ms
8980 bytes from 10.99.14.7: icmp_seq=6 ttl=64 time=2.34 ms <- response time raised
8980 bytes from 10.99.14.7: icmp_seq=7 ttl=64 time=1.58 ms
8980 bytes from 10.99.14.7: icmp_seq=8 ttl=64 time=0.860 ms
8980 bytes from 10.99.14.7: icmp_seq=9 ttl=64 time=1.29 ms
8980 bytes from 10.99.14.7: icmp_seq=10 ttl=64 time=1.42 ms <- packet loss happening after this event
8980 bytes from 10.99.14.7: icmp_seq=12 ttl=64 time=2.01 ms
8980 bytes from 10.99.14.7: icmp_seq=13 ttl=64 time=2.24 ms
8980 bytes from 10.99.14.7: icmp_seq=14 ttl=64 time=1.94 ms
8980 bytes from 10.99.14.7: icmp_seq=15 ttl=64 time=0.151 ms
8980 bytes from 10.99.14.7: icmp_seq=16 ttl=64 time=0.154 ms <- response time normalised
...
--- 10.99.14.7 ping statistics ---
40 packets transmitted, 39 received, 2.5% packet loss, time 4028ms
rtt min/avg/max/mdev = 0.151/0.
When the magpie test was executed with check_iperf=false, no packet loss was observed at all. The goal of the MTU check should only be to verify that, for example, a 9k ICMP packet can pass. However, it uses the same async ping call as the ping mesh check, with the same settings and parallel execution. Based on several test runs, the MTU test appears to flood the switches with large ICMP packets, and (due to the hook execution order) it runs on multiple units at the same time, alongside the iperf testing. This seems to saturate the bandwidth of the 25Gbit link, and ICMP has no QoS configured in those environments.
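For illustration only (this is not charm-magpie's actual code), the pattern described above looks roughly like the sketch below: every peer is pinged concurrently with a jumbo payload, so several units doing this at once, on top of iperf, can easily saturate the link. The peer addresses and helper names here are made up; the 8972-byte payload is an assumption, though it is consistent with the 8980-byte replies in the log (8972 data bytes + 8 bytes of ICMP header, i.e. a 9000-byte IP packet).

```python
# Illustrative sketch only -- not the charm's implementation.
import asyncio

PEERS = ["10.99.14.7", "10.99.14.8", "10.99.14.9"]  # example addresses
MTU_PAYLOAD = 8972  # 9000-byte MTU minus 28 bytes of IP + ICMP headers (assumed)

async def ping(host: str, count: int = 40) -> int:
    """Run one large-payload, non-fragmenting ping and return its exit code."""
    proc = await asyncio.create_subprocess_exec(
        "ping", "-c", str(count), "-s", str(MTU_PAYLOAD), "-M", "do", host,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    await proc.communicate()
    return proc.returncode

async def main() -> None:
    # All peers are pinged concurrently, so each unit pushes ~9k ICMP
    # packets to every peer at once, in parallel with iperf traffic.
    results = await asyncio.gather(*(ping(peer) for peer in PEERS))
    print(dict(zip(PEERS, results)))

asyncio.run(main())
```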
I reverted the ping code to synchronous execution for the MTU check, and in that case no packet loss occurred at all. It would be great to consider sending only a single packet for this test, and running only one iperf + MTU ICMP check per deployment at a time, to avoid overloading the switches from multiple directions (for example, via sequential charm actions instead of whatever order the hooks happen to run in).
The patch used for sync mtu testing:
https:/
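This is not the linked patch, just a minimal sketch of the single-packet idea under the same assumptions as above (iputils ping on Linux, 8972-byte payload for a 9000-byte MTU; the helper name is hypothetical):

```python
# Minimal sketch of the suggested alternative: one synchronous,
# non-fragmenting jumbo probe per target instead of a parallel flood.
import subprocess

def check_mtu(host: str, payload: int = 8972, timeout_s: int = 5) -> bool:
    """Return True if a single unfragmented ~9k ICMP packet gets through."""
    result = subprocess.run(
        ["ping", "-c", "1", "-M", "do", "-s", str(payload), "-W", str(timeout_s), host],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0

for peer in ["10.99.14.7"]:  # probe peers one at a time
    print(f"{peer}: jumbo MTU {'ok' if check_mtu(peer) else 'FAILED'}")
```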
Can you confirm what values were set for the following options?
- check_iperf
- ping_mesh_mode
Sounds like both are set to True. In our internal process nowadays check_iperf=false is expected, and the iperf check can be run using an action. However, we couldn't flip the default value of check_iperf out of the box, in order to keep backward compatibility.