VF-lag failed to activate in Nvidia\Mellanox Nics at deployment stage

Bug #2020085 reported by waleed mousa
This bug affects 1 person
Affects: os-net-config
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description of problem:

The following network configuration was provisioned:
- type: ovs_bridge
  name: br-link0
  mtu: 9000
  use_dhcp: false
  members:
  - type: linux_bond
    name: mx-bond
    mtu: 9000
    bonding_options: "mode=active-backup"
    members:
      - type: sriov_pf
        name: nic11
        numvfs: 10
        primary: true
        promisc: true
        use_dhcp: false
        defroute: false
        link_mode: switchdev

      - type: sriov_pf
        name: nic12
        numvfs: 10
        promisc: true
        use_dhcp: false
        defroute: false
        link_mode: switchdev

- type: vlan
  device: mx-bond
  vlan_id: {{ lookup('vars', networks_lower['Tenant'] ~ '_vlan_id') }}
  addresses:
  - ip_netmask: {{ lookup('vars', networks_lower['Tenant'] ~ '_ip') }}/{{ lookup('vars', networks_lower['Tenant'] ~ '_cidr') }}

nic11 and nic12 are Mellanox NICs.

After provisioning the above configuration, VF-LAG activation fails with the following kernel error messages:

[ 1005.396799] mlx5_core 0000:03:00.0: mlx5_cmd_out_err:783:(pid 6933): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb), err(-22)
[ 1005.411593] mlx5_core 0000:03:00.0: mlx5_create_lag:522:(pid 6933): Failed to create LAG (-22)
[ 1005.420340] mlx5_core 0000:03:00.0: mlx5_activate_lag:583:(pid 6933): Failed to activate VF LAG
               Make sure all VFs are unbound prior to VF LAG activation or deactivation

It may also break connectivity over the VLAN interface.
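
To check whether a node hit this failure, the kernel messages quoted above can be searched for mechanically. The following is a minimal sketch, not part of os-net-config: the function name `find_vf_lag_failures` and the regular expression are illustrative assumptions, derived only from the three log lines in this report, and other kernel versions may word these messages differently.

```python
import re
from typing import Dict, List

# Pattern inferred from the log lines in this report (an assumption, not a
# stable kernel interface): PCI address, reporting function, failure text.
LAG_ERROR = re.compile(
    r"mlx5_core (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?P<func>mlx5_\w+):\d+:\(pid \d+\): "
    r"(?P<msg>Failed to (?:create LAG|activate VF LAG).*)"
)


def find_vf_lag_failures(kernel_log: str) -> List[Dict[str, str]]:
    """Return one dict (pci, func, msg) per VF-LAG failure line in the log."""
    failures = []
    for line in kernel_log.splitlines():
        match = LAG_ERROR.search(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

The text passed in would typically be the output of `dmesg` captured on the affected node. An empty result only means these particular messages were not found, not that VF-LAG is active.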

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-net-config (master)
Changed in os-net-config:
status: New → In Progress
waleed mousa (waleedm)
summary: - VF-lag failed to activate in Nvidia\Mellanox Nics at deployment sate
+ VF-lag failed to activate in Nvidia\Mellanox Nics at deployment stage
OpenStack Infra (hudson-openstack) wrote : Fix merged to os-net-config (master)

Reviewed: https://review.opendev.org/c/openstack/os-net-config/+/883503
Committed: https://opendev.org/openstack/os-net-config/commit/b1a7c9c5f0c2832ff504ea4557305ca1b94a196e
Submitter: "Zuul (22348)"
Branch: master

commit b1a7c9c5f0c2832ff504ea4557305ca1b94a196e
Author: waleedm <email address hidden>
Date: Thu May 18 12:54:29 2023 +0000

    Fix breaking vf-lag functionality in os-net-config

    Because of racing issue to activate vf-lag after moving the second
    sriov_pf interface to switchdev mode in Nvidia\Mellanox nics, we may
    bind sriov_vfs while the LAG is not active yet.
    Another reason for breaking vf-lag functionality is that we are doing
    ifdown/ifup for sriov_pfs after binding the vfs(in case of linux_bond
    is member of ovs_bridge).

    As a solution for this issue, we are doing the binding after assuring
    the LAG is active, and also moving the ifdown/ifup before start binding

    Closes-Bug: #2020085
    Change-Id: If0cad8c856ee62064205b9a88f0148980653fcb2
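
The ordering described in the commit message above (restart the PFs first, wait until the LAG is active, only then bind the VFs) can be sketched as follows. This is an illustrative sketch and not the code from the merged change: `wait_until`, `apply_in_fixed_order`, and the three callbacks `restart_pf`, `lag_is_active`, and `bind_vfs` are hypothetical names, and how LAG state is actually detected is left to the caller.

```python
import time
from typing import Callable, Sequence


def wait_until(predicate: Callable[[], bool], timeout: float = 30.0,
               interval: float = 0.5,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


def apply_in_fixed_order(pfs: Sequence[str],
                         restart_pf: Callable[[str], None],
                         lag_is_active: Callable[[], bool],
                         bind_vfs: Callable[[str], None],
                         timeout: float = 30.0,
                         sleep: Callable[[float], None] = time.sleep) -> None:
    """Hypothetical driver: all callbacks are supplied by the caller."""
    # 1. Do the ifdown/ifup of every PF before any VF is bound.
    for pf in pfs:
        restart_pf(pf)
    # 2. Block until the LAG is reported active; never bind on a timeout.
    if not wait_until(lag_is_active, timeout=timeout, sleep=sleep):
        raise TimeoutError("VF LAG did not become active; VFs left unbound")
    # 3. Only now bind the VFs of each PF.
    for pf in pfs:
        bind_vfs(pf)
```

Raising on timeout rather than binding anyway reflects the kernel hint in the bug description that all VFs must be unbound before VF LAG activation.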

Changed in os-net-config:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-net-config (stable/zed)

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/os-net-config/+/884746

OpenStack Infra (hudson-openstack) wrote : Fix proposed to os-net-config (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/os-net-config/+/884747

OpenStack Infra (hudson-openstack) wrote : Fix merged to os-net-config (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/os-net-config/+/884747
Committed: https://opendev.org/openstack/os-net-config/commit/de876c5f8f91bffa1a30e9f27929bf30b7836e2a
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit de876c5f8f91bffa1a30e9f27929bf30b7836e2a
Author: waleedm <email address hidden>
Date: Thu May 18 12:54:29 2023 +0000

    Fix breaking vf-lag functionality in os-net-config

    Because of racing issue to activate vf-lag after moving the second
    sriov_pf interface to switchdev mode in Nvidia\Mellanox nics, we may
    bind sriov_vfs while the LAG is not active yet.
    Another reason for breaking vf-lag functionality is that we are doing
    ifdown/ifup for sriov_pfs after binding the vfs(in case of linux_bond
    is member of ovs_bridge).

    As a solution for this issue, we are doing the binding after assuring
    the LAG is active, and also moving the ifdown/ifup before start binding

    Closes-Bug: #2020085
    Change-Id: If0cad8c856ee62064205b9a88f0148980653fcb2
    (cherry picked from commit b1a7c9c5f0c2832ff504ea4557305ca1b94a196e)

tags: added: in-stable-wallaby