[RFE] Test for hash mismatch

Bug #1974466 reported by Peter Jose De Sousa
Affects: charm-magpie
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Hi,

Recently we experienced issues with a deployment where the network was sending and receiving packets on different ports (hash mismatch). We discovered that running MTR on the node with different ports selected helped to replicate/catch the issue (packet loss).

This bug is to document an improvement to magpie that adds testing, potentially using MTR or another tool, to detect a hash mismatch on the underlying infrastructure.

Thank you,

Peter
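
For illustration, a minimal Python sketch of the kind of port-varying probe described in the report above, assuming an mtr binary that supports --report, --report-cycles, --no-dns, --udp and --port; the function names, port list and 5% spread threshold are illustrative, not existing magpie code. It probes the same target over several UDP destination ports and flags uneven loss, one symptom of flows being hashed onto a misbehaving path.

import re
import subprocess

def mtr_loss(target: str, port: int, cycles: int = 20) -> float:
    """Run mtr in report mode against one UDP destination port and return
    the loss percentage reported for the final hop (format assumption:
    each hop line carries a single '<loss>%' field)."""
    out = subprocess.run(
        ["mtr", "--report", "--no-dns", "--udp",
         "--port", str(port), "--report-cycles", str(cycles), target],
        capture_output=True, text=True, check=True,
    ).stdout
    last_hop = out.strip().splitlines()[-1]
    match = re.search(r"(\d+(?:\.\d+)?)%", last_hop)
    return float(match.group(1)) if match else 0.0

def check_loss_across_ports(target: str, ports=(5001, 5002, 5003, 5004)):
    """Probe several destination ports; a large spread in loss between
    ports hints that some flows hash onto a bad link or path."""
    results = {port: mtr_loss(target, port) for port in ports}
    if max(results.values()) - min(results.values()) > 5.0:
        print(f"possible hash mismatch towards {target}: {results}")
    return results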

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

For reference on this issue, see the following pcaps and check communication to/from 192.168.108.109: https://private-fileshare.canonical.com/~pjds/hash-mismatch-issue/

Changed in charm-magpie:
status: New → Triaged
importance: Undecided → Wishlist
Revision history for this message
Nobuto Murata (nobuto) wrote (last edit):

@Peter, out of curiosity, what was the fix in the end? It would help in thinking about what would be best to detect it.

Revision history for this message
Peter Jose De Sousa (pjds) wrote :

Hi @Nobuto, it would be great if we could detect packet loss at the application level. We saw these packets coming back on a different NIC, meaning the application would drop them.

Recapping the algorithm for others who might read this thread later:

We typically use layer 2+3, which calculates the hash as something like: SOURCE_IP XOR SOURCE_PORT XOR DST_IP XOR DST_PORT XOR 0x0ffff.

I'm not quite sure of the exact methodology for how these bits are XOR'd between the nodes, but it would be great if we could incorporate a check for this into magpie somehow.

Cheers,
Peter
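
As an illustration of the flow-hash idea recapped above, here is a minimal Python sketch of an XOR-style hash in the spirit of the Linux bonding layer3+4 transmit hash policy; the exact kernel formula differs, and the function name and link count are illustrative assumptions, not magpie code. Two nodes could compute this for a sample of 5-tuples and compare the predicted link with the NIC a reply actually arrives on.

import ipaddress

def flow_hash(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              n_links: int = 2) -> int:
    """Approximate XOR-style flow hash: fold the source/destination IPs
    down to 16 bits, mix in the L4 ports, then reduce modulo the number
    of bonded links. Illustrative only, not the exact kernel formula."""
    src = int(ipaddress.ip_address(src_ip))
    dst = int(ipaddress.ip_address(dst_ip))
    h = (src ^ dst) & 0xFFFF      # fold IPs down to 16 bits
    h ^= (src_port ^ dst_port)    # mix in the L4 ports
    return h % n_links            # which bond slave/path the flow maps to

# The same 5-tuple should always map to the same link; replies observed
# on a different NIC than predicted point at a hash mismatch.
print(flow_hash("192.168.108.109", 45678, "192.168.108.110", 5001))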

Revision history for this message
Nobuto Murata (nobuto) wrote :

There is an ongoing review:
https://review.opendev.org/c/openstack/charm-magpie/+/841826

It will bring basic packet-loss detection with ping (assuming that works in your case, since mtr helped), and it will show status like:

> magpie/0* blocked executing 0 192.168.151.104 (upgrade-charm) icmp failed: ['1: 10% packet loss', '2: 10% packet loss'], local hostname ok (famous-gibbon), dns ok, iperf leader, mtu: 1500
> magpie/1 blocked idle 1 192.168.151.102 icmp failed: ['0: 15% packet loss', '2: 25% packet loss'], local hostname ok (up-tahr), dns ok, net mtu ok: 1500, 13675 mbit/s
> magpie/2 blocked idle 2 192.168.151.103 icmp failed: ['0: 5% packet loss', '1: 15% packet loss'], local hostname ok (more-ibex), dns ok, network mtu check failed, 12757 mbit/s
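
For reference, a rough Python sketch of that kind of ping-based loss check, assuming a Linux ping whose summary line contains 'X% packet loss'; the peer addresses, function name and zero-loss threshold are illustrative assumptions, not taken from the review itself.

import re
import subprocess

def ping_loss(target: str, count: int = 20) -> float:
    """Ping the target and return the packet-loss percentage parsed from
    the summary line, e.g. '20 packets transmitted, ... 10% packet loss'."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", target],
        capture_output=True, text=True,
    ).stdout
    match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
    return float(match.group(1)) if match else 100.0

peers = ["192.168.151.102", "192.168.151.103"]  # other units, illustrative
failures = []
for peer in peers:
    loss = ping_loss(peer)
    if loss > 0:
        failures.append(f"{peer}: {loss:g}% packet loss")
if failures:
    print("icmp failed:", failures)
else:
    print("icmp ok")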

Revision history for this message
Peter Jose De Sousa (pjds) wrote (last edit):

To add to this bug: in that deployment our cloud passed magpie testing, so I propose converting this bug into general magpie improvements: testing for the hash mismatch, and checking Mellanox error counters, e.g. rx_pcs_symbol_err_phy (https://support.mellanox.com/s/article/understanding-mlx5-ethtool-counters).

Later in our troubleshooting we discovered that this counter was rising, and the node was also exhibiting link_down events.

In every deployment these counters are non-zero on at least one node. I think if any of these counters are above zero we should raise a warning and a blocked state, unless there are scenarios where the counter is legitimately non-zero; right now I am not aware of one.
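
A minimal Python sketch of the proposed counter check, assuming 'ethtool -S <interface>' output made of 'name: value' lines; rx_pcs_symbol_err_phy comes from the Mellanox article above, while the other counter names, the interface name and the warn/block decision are illustrative assumptions rather than agreed magpie behaviour.

import subprocess

# Counters expected to stay at zero on a healthy link. rx_pcs_symbol_err_phy
# is from the linked Mellanox article; the others are examples of similar
# error/link counters and are assumptions, not a vetted list.
SUSPECT_COUNTERS = ("rx_pcs_symbol_err_phy",
                    "rx_crc_errors_phy",
                    "link_down_events_phy")

def read_counters(iface: str) -> dict:
    """Parse 'ethtool -S <iface>' output ('name: value' lines) into a dict."""
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True, check=True).stdout
    counters = {}
    for line in out.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value.strip())
    return counters

def check_iface(iface: str) -> list:
    """Return the suspect counters that are above zero on this interface;
    a non-empty result would justify a warning / blocked unit state."""
    counters = read_counters(iface)
    return [name for name in SUSPECT_COUNTERS if counters.get(name, 0) > 0]

print(check_iface("enp5s0f0"))  # illustrative interface name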
