MAAS does not properly detect max interface speed for interfaces which use multiple physical ports (Cisco UCS B200 M5 blade)

Bug #1881821 reported by Jeff Hillman
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
MAAS | Fix Released | Critical | Lee Trager |
2.7 | Fix Released | Critical | Lee Trager |
2.8 | Fix Released | Critical | Lee Trager |
linux (Ubuntu) | Invalid | Undecided | Unassigned |

Bug Description

MAAS 2.7.1
Ubuntu 18.04.4

When attempting to commission a Cisco UCS B200 M5 blade, commissioning finishes (if smartctl-validate is disabled). However, the Networking tab for the machine in question shows only 2 NICs, eth0 and eth1, and no storage is shown.

Connecting over SSH in the middle of commissioning and running lshw shows all 10 NICs and the disk at the proper size.

This is causing the buckets.yaml for FCE to be inaccurate, so a bucketsconfig.yaml cannot be written, since the actual devices don't show up.

This was tried with bionic {ga,hwe,hwe-edge} and focal ga kernels for commissioning, all have the same behavior.

I was previously hitting this bug https://bugs.launchpad.net/maas/+bug/1878643 but applied the fix from comment #10, and am no longer getting any commissioning errors.

Tags: cpe-onsite

Revision history for this message
Jeff Hillman (jhillman) wrote :

fdisk -l output on a node in commissioning state

Revision history for this message
Jeff Hillman (jhillman) wrote :

ip a from node in a commissioning state

Revision history for this message
Jeff Hillman (jhillman) wrote :

lshw from a node in a commissioning state

Revision history for this message
Jeff Hillman (jhillman) wrote :

Subscribing field critical as this is blocking a deployment

Revision history for this message
Lee Trager (ltrager) wrote :

From IRC conversation

ethtool output:
https://pastebin.canonical.com/p/N2p5hnWKd5/

50-commissioning-01 output:
https://pastebin.canonical.com/p/2PjBJ7bKJK/

Revision history for this message
Lee Trager (ltrager) wrote :

You can see from that output that link speed is being reported as 20000 while interface speed is reported as 10000. MAAS does not allow the link speed to exceed the interface speed which causes the failure. Given that both LXD and ethtool show the same data this seems to be a kernel bug.
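For illustration only, the kind of check described above can be sketched as follows. The function and exception names are hypothetical and are not taken from the MAAS source; speeds are in Mbit/s, using the 10000 and 20000 values reported here.

```python
class SpeedMismatch(ValueError):
    """Raised when a link claims to be faster than its interface (hypothetical)."""


def check_link_speed(interface_speed: int, link_speed: int) -> None:
    # Mirrors the rule described in this comment: the negotiated link speed
    # may not exceed the maximum speed the interface reports as supported.
    if link_speed > interface_speed:
        raise SpeedMismatch(
            f"link speed {link_speed} exceeds interface speed {interface_speed}"
        )


if __name__ == "__main__":
    try:
        # Values reported for the UCS B200 M5 vNIC in this bug.
        check_link_speed(interface_speed=10000, link_speed=20000)
    except SpeedMismatch as exc:
        print(f"commissioning would fail: {exc}")
```

With a strict rule like this, a vNIC backed by more than one physical port is rejected even though the hardware is behaving as designed.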

Changed in maas:
status: New → Triaged
summary: - node inventory isn't being properly reported for Cisco UCS B200 M5 blade
+ Kernel does not report interface speed correctly for Cisco UCS B200 M5
+ blade
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1881821

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Jeff Hillman (jhillman) wrote : Re: Kernel does not report interface speed correctly for Cisco UCS B200 M5 blade

The environment is proxied and very heavily firewalled. When I try to run that command, I cannot reach Launchpad. If there are any specific logs requested, I can gather them.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Jeff Hillman (jhillman) wrote :

RE: potential kernel bug issue.

The enic driver that we ship is 2.3.0-45. As a test, the 3.2.210.18-738.12 driver was downloaded from Cisco.com and compiled. It functions, but it also reports a 20 Gb link while advertising nothing above 10 Gb.

Revision history for this message
Thadeu Lima de Souza Cascardo (cascardo) wrote :

That speed comes from drivers/net/ethernet/cisco/enic/enic_ethtool.c:enic_get_ksettings.

static int enic_get_ksettings(struct net_device *netdev,
                              struct ethtool_link_ksettings *ecmd)
{
        struct enic *enic = netdev_priv(netdev);
        struct ethtool_link_settings *base = &ecmd->base;

        ethtool_link_ksettings_add_link_mode(ecmd, supported,
                                             10000baseT_Full);
        ethtool_link_ksettings_add_link_mode(ecmd, supported, FIBRE);
        ethtool_link_ksettings_add_link_mode(ecmd, advertising,
                                             10000baseT_Full);
        ethtool_link_ksettings_add_link_mode(ecmd, advertising, FIBRE);
        base->port = PORT_FIBRE;

        if (netif_carrier_ok(netdev)) {
                base->speed = vnic_dev_port_speed(enic->vdev);
                base->duplex = DUPLEX_FULL;
        } else {
                base->speed = SPEED_UNKNOWN;
                base->duplex = DUPLEX_UNKNOWN;
        }

        base->autoneg = AUTONEG_DISABLE;

        return 0;
}

The value returned by vnic_dev_port_speed comes from the hardware; see drivers/net/ethernet/cisco/enic/vnic_dev.c:vnic_dev_notify_setcmd.

static int vnic_dev_notify_setcmd(struct vnic_dev *vdev,
        void *notify_addr, dma_addr_t notify_pa, u16 intr)
{
        u64 a0, a1;
        int wait = 1000;
        int r;

        memset(notify_addr, 0, sizeof(struct vnic_devcmd_notify));
        vdev->notify = notify_addr;
        vdev->notify_pa = notify_pa;

        a0 = (u64)notify_pa;
        a1 = ((u64)intr << 32) & 0x0000ffff00000000ULL;
        a1 += sizeof(struct vnic_devcmd_notify);

        r = vnic_dev_cmd(vdev, CMD_NOTIFY, &a0, &a1, wait);
        vdev->notify_sz = (r == 0) ? (u32)a1 : 0;
        return r;
}

struct vnic_devcmd_notify {
        u32 csum; /* checksum over following words */

        u32 link_state; /* link up == 1 */
        u32 port_speed; /* effective port speed (rate limit) */
        u32 mtu; /* MTU */
        u32 msglvl; /* requested driver msg lvl */
        u32 uif; /* uplink interface */
        u32 status; /* status bits (see VNIC_STF_*) */
        u32 error; /* error code (see ERR_*) for first ERR */
        u32 link_down_cnt; /* running count of link down transitions */
        u32 perbi_rebuild_cnt; /* running count of perbi rebuilds */
};
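The same mismatch can be observed from userspace by comparing the highest supported link mode with the current speed in ethtool's text output. The following is a minimal sketch: the helper names are hypothetical, and the sample text is an abbreviated illustration modeled on the values reported in this bug, not a copy of the pastebin linked earlier.

```python
import re

# Illustrative, abbreviated ethtool-style output (not the real pastebin).
SAMPLE = """\
Settings for eth0:
    Supported ports: [ FIBRE ]
    Supported link modes:   10000baseT/Full
    Supported pause frame use: No
    Supports auto-negotiation: No
    Advertised link modes:  10000baseT/Full
    Speed: 20000Mb/s
    Duplex: Full
"""


def max_supported_speed(text):
    """Return the highest speed (Mbit/s) among supported link modes, or None."""
    # Capture everything after the field name up to the next "Name:" line,
    # so link modes wrapped onto continuation lines are included.
    section = re.search(
        r"Supported link modes:(.*?)(?=\n\s*[A-Z][^\n:]*:)", text, re.S
    )
    if section is None:
        return None
    speeds = [int(s) for s in re.findall(r"(\d+)base", section.group(1))]
    return max(speeds) if speeds else None


def current_speed(text):
    """Return the current link speed (Mbit/s), or None if it is unknown."""
    match = re.search(r"^\s*Speed:\s*(\d+)Mb/s", text, re.M)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(max_supported_speed(SAMPLE), current_speed(SAMPLE))
```

On the sample, the supported maximum is 10000 while the current speed is 20000, which matches what enic_get_ksettings produces: a hard-coded supported mode alongside a speed taken from the hardware.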

Revision history for this message
Jeff Hillman (jhillman) wrote :

RE: it being a kernel bug

A link can span multiple interfaces, so the reported speed of a link can be higher than that of the individual interfaces. See page 5 of https://www.cisco.com/c/dam/en/us/products/collateral/interfaces-modules/unified-computing-system-adapters/vic-tuning-wp.pdf

Revision history for this message
Andy Whitcroft (apw) wrote :

This appears to be a bug in the Cisco enic driver. It will always report 10000baseT/Full, regardless of the link speed itself. It has always been like this in mainline, and I would not be surprised if the upstream driver is similarly broken.

static int enic_get_ksettings(struct net_device *netdev,
                              struct ethtool_link_ksettings *ecmd)
{
[...]
        ethtool_link_ksettings_add_link_mode(ecmd, supported,
                                             10000baseT_Full);
        ethtool_link_ksettings_add_link_mode(ecmd, supported, FIBRE);
        ethtool_link_ksettings_add_link_mode(ecmd, advertising,
                                             10000baseT_Full);
        ethtool_link_ksettings_add_link_mode(ecmd, advertising, FIBRE);
        base->port = PORT_FIBRE;

        if (netif_carrier_ok(netdev)) {
                base->speed = vnic_dev_port_speed(enic->vdev);
                base->duplex = DUPLEX_FULL;
        } else {
                base->speed = SPEED_UNKNOWN;
                base->duplex = DUPLEX_UNKNOWN;
        }
[...]
}

Revision history for this message
Andy Whitcroft (apw) wrote :

It may be that we are not able to assume these figures are the same.

Changed in maas:
importance: Undecided → Critical
assignee: nobody → Lee Trager (ltrager)
Revision history for this message
Andy Whitcroft (apw) wrote :

Ahh yes, comment #11 even details how this is reasonable, that the interface speed exceeds the per-link advertisements.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

I tested with MAAS 2.6, and the network interfaces were detected correctly. This points to MAAS 2.7 having a bug in the way it tests the link speeds. Also, the smartctl-validate tests were successful.

Revision history for this message
Adam Collard (adam-collard) wrote :

Thanks for the background info. In 2.7, MAAS started modelling the speeds of interfaces and links, so I expect that 2.6 will not have the same problem.

Revision history for this message
Camille Rodriguez (camille.rodriguez) wrote :

Could we change the name of the bug? It is not a kernel bug

Revision history for this message
Lee Trager (ltrager) wrote :

MAAS 2.7 introduced network testing. As part of network testing we added the following features which require accurate interface and link speed information.

1. The interface and uplink speeds are now shown in the API and over the UI.
2. MAAS now warns when an interface is connected to an uplink which is slower than its maximum supported speed.
3. Users may now acquire machines based on their uplink speed.

I could stop verifying that link speed <= max interface speed; however, that would mean 1 is incorrect and 2 is broken. I understand that some interfaces achieve higher speeds by using more physical ports, but neither ethtool nor LXD gives any information on this.

Does anyone know the proper way to calculate the maximum supported interface speed on a device like this?
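One possible way to keep the three features above working is to treat a faster-than-advertised link as evidence of the interface's real maximum instead of as an error. The sketch below only illustrates that idea; it is an assumption, not necessarily the fix that was released, and the function names are hypothetical.

```python
def reconcile_speeds(interface_speed: int, link_speed: int) -> tuple[int, int]:
    """Return (interface_speed, link_speed) in Mbit/s after reconciliation."""
    # Assumption: if the link runs faster than the advertised maximum, the
    # advertisement is incomplete (e.g. several physical ports behind one
    # vNIC), so raise the interface speed to match the link.
    if link_speed > interface_speed:
        interface_speed = link_speed
    return interface_speed, link_speed


def uplink_is_slower(interface_speed: int, link_speed: int) -> bool:
    """Feature 2 above: warn when the uplink is slower than the interface."""
    # A link speed of 0 is treated here as "unknown", so no warning is raised.
    return 0 < link_speed < interface_speed


if __name__ == "__main__":
    iface, link = reconcile_speeds(interface_speed=10000, link_speed=20000)
    print(iface, link, uplink_is_slower(iface, link))
```

The trade-off is that the displayed interface speed then depends on the currently connected uplink, so it may understate what the adapter could reach in a different IOM or port-channel configuration.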

summary: - Kernel does not report interface speed correctly for Cisco UCS B200 M5
- blade
+ MAAS does not properly detect max interface speed for interfaces which
+ use multiple physical ports (Cisco UCS B200 M5 blade)
Revision history for this message
Jeff Hillman (jhillman) wrote : Re: [Bug 1881821] Re: Kernel does not report interface speed correctly for Cisco UCS B200 M5 blade

That is not an easy answer.

In the world of UCS, there are several factors that can affect this:

1) Type of IOM (I/O Module)

2) Port config of the IOM (single link or port-channel)

3) Number of links from the Fabric Interconnect to said IOM

4) Model of adapter in the blade.

Depending on the combination of these things, it can be a 10, 20, or 40 Gb
link.

On Wed, Jun 3, 2020, 9:40 PM Lee Trager <email address hidden> wrote:

> MAAS 2.7 introduced network testing. As part of network testing we added
> the following features which require accurate interface and link speed
> information.
>
> 1. The interface and uplink speeds are now shown in the API and over the
> UI.
> 2. MAAS now warns when an interface is connected to an uplink which is
> slower than its maximum supported speed.
> 3. Users may now acquire machines based on their uplink speed.
>
> I could stop verifying that link speed <= max interface speed however
> that will mean 1 is incorrect and 2 is broken. I understand that some
> interfaces achieve higher speeds by using more physical ports but
> neither ethtool nor LXD give any information on this.
>
> Does anyone know the proper way to calculate the maximum supported
> interface speed on a device like this?

Alberto Donato (ack)
Changed in maas:
milestone: 2.8.0rc1 → 2.8.0
Revision history for this message
Francis Ginther (fginther) wrote :

Setting to invalid for the kernel.

Changed in linux (Ubuntu):
status: Confirmed → Invalid
Alberto Donato (ack)
Changed in maas:
milestone: 2.8.0 → 2.9.0b1
status: Triaged → Fix Committed
Lee Trager (ltrager)
no longer affects: maas/2.7
Revision history for this message
Laurie Matthews (lschatzel) wrote :

Is there a release date for 2.7, or a formal way to request an escalation for it? We have an active deployment that we need to turn over to the customer for testing by June 25.

Lee Trager (ltrager)
Changed in maas:
status: Fix Committed → Fix Released