[RFE] Test for hash mismatch

Bug #1974466 reported by Peter Jose De Sousa
Affects: charm-magpie
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Hi,

Recently we experienced issues with a deployment where the network was sending and receiving packets on different ports (hash mismatch). We discovered that running MTR on the node with different ports selected helped to replicate/catch the issue (packet loss).

This bug is to document an improvement to magpie that adds testing, potentially using MTR or another tool, to detect a hash mismatch on the underlying infrastructure.

Thank you,

Peter
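
For illustration, a minimal Python sketch of the kind of port-varying probe described in the report above, assuming an mtr binary that supports --report, --report-cycles, --no-dns, --udp and --port; the function names, port list and 5% spread threshold are illustrative, not existing magpie code. It probes the same target over several UDP destination ports and flags uneven loss, one symptom of flows being hashed onto a misbehaving path.

import re
import subprocess

def mtr_loss(target: str, port: int, cycles: int = 20) -> float:
    """Run mtr in report mode against one UDP destination port and return
    the loss percentage reported for the final hop (format assumption:
    each hop line carries a single '<loss>%' field)."""
    out = subprocess.run(
        ["mtr", "--report", "--no-dns", "--udp",
         "--port", str(port), "--report-cycles", str(cycles), target],
        capture_output=True, text=True, check=True,
    ).stdout
    last_hop = out.strip().splitlines()[-1]
    match = re.search(r"(\d+(?:\.\d+)?)%", last_hop)
    return float(match.group(1)) if match else 0.0

def check_loss_across_ports(target: str, ports=(5001, 5002, 5003, 5004)):
    """Probe several destination ports; a large spread in loss between
    ports hints that some flows hash onto a bad link or path."""
    results = {port: mtr_loss(target, port) for port in ports}
    if max(results.values()) - min(results.values()) > 5.0:
        print(f"possible hash mismatch towards {target}: {results}")
    return results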

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

For reference on this issue, see the following pcaps and check communication to/from 192.168.108.109: https://private-fileshare.canonical.com/~pjds/hash-mismatch-issue/

Changed in charm-magpie:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Nobuto Murata (nobuto) wrote (last edit):

@Peter, out of curiosity, what was the fix in the end? It would help in thinking about what would be best to detect it.

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Hi @Nobuto, it would be great if we could detect packet loss at the application level. We saw these packets coming back on a different NIC, meaning the application would drop them.

Recapping the algorithm for others who might read this thread later:

We typically use layer 2+3, which calculates the hash as something like: SOURCE_IP XOR SOURCE_PORT XOR DST_IP XOR DST_PORT XOR 0x0ffff.

I'm not quite sure of the exact methodology for how these bits are XOR'd between the nodes, but it would be great if we could incorporate a check for this into magpie somehow.

Cheers,
Peter
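
As an illustration of the flow-hash idea recapped above, here is a minimal Python sketch of an XOR-style hash in the spirit of the Linux bonding layer3+4 transmit hash policy; the exact kernel formula differs, and the function name and link count are illustrative assumptions, not magpie code. Two nodes could compute this for a sample of 5-tuples and compare the predicted link with the NIC a reply actually arrives on.

import ipaddress

def flow_hash(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              n_links: int = 2) -> int:
    """Approximate XOR-style flow hash: fold the source/destination IPs
    down to 16 bits, mix in the L4 ports, then reduce modulo the number
    of bonded links. Illustrative only, not the exact kernel formula."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    h = (src ^ dst) & 0xFFFF      # fold IPs down to 16 bits
    h ^= (src_port ^ dst_port)    # mix in the L4 ports
    return h % n_links            # which bond slave/path the flow maps to

# The same 5-tuple should always map to the same link; replies observed
# on a different NIC than predicted point at a hash mismatch.
print(flow_hash("192.168.108.109", 45678, "192.168.108.110", 5001))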

Revision history for this message
Nobuto Murata (nobuto) wrote :

There is an ongoing review:
https://review.opendev.org/c/openstack/charm-magpie/+/841826

It will bring basic packet-loss detection with ping (assuming that works in your case, since mtr helped), and it will show status like:

> magpie/0* blocked executing 0 192.168.151.104 (upgrade-charm) icmp failed: ['1: 10% packet loss', '2: 10% packet loss'], local hostname ok (famous-gibbon), dns ok, iperf leader, mtu: 1500
> magpie/1 blocked idle 1 192.168.151.102 icmp failed: ['0: 15% packet loss', '2: 25% packet loss'], local hostname ok (up-tahr), dns ok, net mtu ok: 1500, 13675 mbit/s
> magpie/2 blocked idle 2 192.168.151.103 icmp failed: ['0: 5% packet loss', '1: 15% packet loss'], local hostname ok (more-ibex), dns ok, network mtu check failed, 12757 mbit/s
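
For reference, a rough Python sketch of that kind of ping-based loss check, assuming a Linux ping whose summary line contains 'X% packet loss'; the peer addresses, function name and zero-loss threshold are illustrative assumptions, not taken from the review itself.

import re
import subprocess

def ping_loss(target: str, count: int = 20) -> float:
    """Ping the target and return the packet-loss percentage parsed from
    the summary line, e.g. '20 packets transmitted, ... 10% packet loss'."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", target],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0

peers = ["192.168.151.102", "192.168.151.103"]  # other units, illustrative
failures = []
for peer in peers:
    loss = ping_loss(peer)
    if loss > 0:
        failures.append(f"{peer}: {loss:g}% packet loss")
if failures:
    print("icmp failed:", failures)
else:
    print("icmp ok")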

Revision history for this message
Peter Jose De Sousa (pjds) wrote (last edit):

To add to this bug: in that deployment our cloud passed magpie testing, so I propose converting this bug into general magpie improvements: testing for the hash mismatch, and checking Mellanox error counters, e.g. rx_pcs_symbol_err_phy (https://support.mellanox.com/s/article/understanding-mlx5-ethtool-counters).

Later in our troubleshooting we discovered that this counter was rising, and the node was also exhibiting link_down events.

In every deployment these counters are non-zero on at least one node. I think if any of these counters are above zero we should raise a warning and a blocked state, unless there are scenarios where the counter is legitimately non-zero; right now I am not aware of one.
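
A minimal Python sketch of the proposed counter check, assuming 'ethtool -S <interface>' output made of 'name: value' lines; rx_pcs_symbol_err_phy comes from the Mellanox article above, while the other counter names, the interface name and the warn/block decision are illustrative assumptions rather than agreed magpie behaviour.

import subprocess

# Counters expected to stay at zero on a healthy link. rx_pcs_symbol_err_phy
# is from the linked Mellanox article; the others are examples of similar
# error/link counters and are assumptions, not a vetted list.
SUSPECT_COUNTERS = ("rx_pcs_symbol_err_phy",
                    "rx_crc_errors_phy",
                    "link_down_events_phy")

def read_counters(iface: str) -> dict:
    """Parse 'ethtool -S <iface>' output ('name: value' lines) into a dict."""
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value.strip())
    return counters

def check_iface(iface: str) -> list:
    """Return the suspect counters that are above zero on this interface;
    a non-empty result would justify a warning / blocked unit state."""
    counters = read_counters(iface)
    return [name for name in SUSPECT_COUNTERS if counters.get(name, 0) > 0]

print(check_iface("enp5s0f0"))  # illustrative interface name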
