Backport: bonding: fix state transition issue in link monitoring

Bug #1852077 reported by Aleksei on 2019-11-11
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Status tracked in Focal
Bionic
Medium
Po-Hsu Lin
Disco
Medium
Po-Hsu Lin
Eoan
Medium
Po-Hsu Lin
Focal
Medium
Po-Hsu Lin

Bug Description

== Justification ==
From the well explained commit message:

Since de77ecd4ef02 ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.

 Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding. This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.

 The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL. On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN. The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.

 Resolve this by combining the two variables into one.

== Fixes ==
* 1899bb32 (bonding: fix state transition issue in link monitoring)

This patch can be cherry-picked into E/F

For older releases like B/D, it will needs to be backported as they are
missing the slave_err() printk marco added in 5237ff79 (bonding: add
slave_foo printk macros) as well as the commit to replace netdev_err()
with slave_err() in e2a7420d (bonding/main: convert to using slave
printk macros)

For Xenial, the commit that causes this issue, de77ecd4, does not exist.

== Test ==
Test kernels can be found here:
https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

The X-hwe and Disco kernel were tested by the bug reporter, Aleksei,
the patched kernel works as expected.

== Regression Potential ==
Low.
This patch just unify the variable used in link state change commit
logic to prevent the occurrence of an incorrect state. And the changes
are limited to the bonding driver itself.

(Although the include/net/bonding.h will be used in other drivers, but
the changes to that file is only affecting this bond_main.c driver)

== Original Bug Report ==
There's an issue with bonding driver in the current ubuntu kernels.
Sometimes one link stuck in a weird state.
It was fixed with patch https://www.spinics.net/lists/netdev/msg609506.html in upstream.
Commit 1899bb325149e481de31a4f32b59ea6f24e176ea.

We see this bug with linux 4.15 (ubuntu xenial, hwe kernel), but it should be reproducible with other current kernel versions.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1852077

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Po-Hsu Lin (cypressyew) on 2019-11-12
tags: added: bionic disco
Po-Hsu Lin (cypressyew) on 2019-11-12
tags: added: eoan focal
Po-Hsu Lin (cypressyew) wrote :

Hi Aleksei,

can you give the X-hwe kernel here with the backported patch a try?
https://people.canonical.com/~phlin/kernel/lp-1852077-bonding/

It looks like other kernels, B/D/E/F, are affected as well, the same backported patch has been applied to B/D, and the upstream fix can be cherry-picked into E/F.

It will be great to verify them as well.

Thanks in advance.

Changed in linux (Ubuntu Bionic):
status: New → In Progress
Changed in linux (Ubuntu Disco):
status: New → In Progress
Changed in linux (Ubuntu Eoan):
status: New → In Progress
Changed in linux (Ubuntu Focal):
status: Incomplete → In Progress
Changed in linux (Ubuntu Bionic):
assignee: nobody → Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Disco):
assignee: nobody → Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Eoan):
assignee: nobody → Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu Focal):
assignee: nobody → Po-Hsu Lin (cypressyew)
Aleksei (zakharov-a-g) wrote :

Hi,
thank you!

I checked X-hwe and D kernels and i couldn't reproduce the bug, driver seems to work fine.

Po-Hsu Lin (cypressyew) wrote :

Awesome,
thanks for testing these, I will SRU them.

Po-Hsu Lin (cypressyew) wrote :
description: updated
description: updated

Still waiting on these patches being committed to all the Ubuntu trees.
Any ETA? Is this waiting on being picked up via -stable?

tags: added: sts

FWIW, the fix has been committed to -stable:

"bonding: fix state transition issue in link monitoring"
Commit: 1899bb325149e481de31a4f32b59ea6f24e176ea

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/drivers/net/bonding?id=1899bb325149e481de31a4f32b59ea6f24e176ea

Stefan Bader (smb) on 2019-11-28
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
Changed in linux (Ubuntu Disco):
importance: Undecided → Medium
Changed in linux (Ubuntu Eoan):
importance: Undecided → Medium
Changed in linux (Ubuntu Focal):
importance: Undecided → Medium
Stefan Bader (smb) on 2019-11-28
Changed in linux (Ubuntu Disco):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Aleksei (zakharov-a-g) wrote :

I can see that fixes were released, from changelog here: https://launchpad.net/ubuntu/+source/linux/+changelog

Should statuses be updated on this page? Or it's not fully fixed?

Po-Hsu Lin (cypressyew) wrote :

Hello,
right now it's still on it's way to the proposed pocket, and needs to be tested later.
The status will be changed to fix-released after it's been pushed to the release pocket,
Thanks.

Hi Po,

IIUC this bug is related to commit 627450ac21f7f4a44b949c5d5e2c35829ff1784f, which is in 4.15.0-74, which I see now in -updates / -security. Isn't it completed yet?

$ git tag --contains 627450ac21f7f4a44b949c5d5e2c35829ff1784f
Ubuntu-4.15.0-73.82
Ubuntu-4.15.0-74.84
Ubuntu-raspi2-4.15.0-1053.57
Ubuntu-snapdragon-4.15.0-1070.77

$ rmadison linux-image-4.15.0-74-generic
 linux-image-4.15.0-74-generic | 4.15.0-74.83~16.04.1 | xenial-security | amd64, arm64, armhf, i386, ppc64el, s390x
 linux-image-4.15.0-74-generic | 4.15.0-74.83~16.04.1 | xenial-updates | amd64, arm64, armhf, i386, ppc64el, s390x
 linux-image-4.15.0-74-generic | 4.15.0-74.84 | bionic-security | amd64, arm64, armhf, i386, ppc64el, s390x
 linux-image-4.15.0-74-generic | 4.15.0-74.84 | bionic-updates | amd64, arm64, armhf, i386, ppc64el, s390x

Po-Hsu Lin (cypressyew) wrote :

Hello Fabio,
yes you're right!

The patch has been applied via the stable update process. So my patchset was superseded.

B/D - https://lists.ubuntu.com/archives/kernel-team/2019-November/105911.html
E/F - https://lists.ubuntu.com/archives/kernel-team/2019-December/106055.html

I will mark this one as Fix-released.
Thanks for the follow-up.

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Disco):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Focal):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers