Mellanox MT28800 Family [ConnectX-5 Ex] does not report 100G as max interface speed

Bug #1878643 reported by Luis
This bug affects 2 people
Affects    Status         Importance   Assigned to   Milestone
MAAS       Fix Released   High         Lee Trager
2.7        Fix Released   High         Lee Trager
lxd        Fix Released   Unknown

Bug Description

Commissioning fails because of a smartctl-validate failure.

MAAS server Ubuntu release: 18.04
MAAS v. 2.7.0 (8235-g.fea3a1678)
Commissioning OS: 18.04 & 20.04 same results.

Status

Failed

Exit Status

2

logs combined:

Unable to run 'smartctl-validate': 'MAAS did not detect any storage devices during commissioning!'

Given parameters:
{'storage': {'argument_format': '{path}', 'type': 'storage', 'value': 'all'}}

Discovered storage devices:

[{'NAME': 'sda', 'MODEL': 'MTFDDAK480TDC-1A', 'SERIAL': '500a0751249c230c'},
 {'NAME': 'sdb', 'MODEL': 'MTFDDAK480TDC-1A', 'SERIAL': '500a0751249c2278'},
 {'NAME': 'sdc', 'MODEL': 'MTFDDAK480TDC-1A', 'SERIAL': '500a07512494f86e'},
 {'NAME': 'sdd', 'MODEL': 'MTFDDAK480TDC-1A', 'SERIAL': '500a0751249c23ff'},
 {'NAME': 'sde', 'MODEL': 'MTFDDAK480TDC-1A', 'SERIAL': '500a0751249c2408'}]

Discovered interfaces:

{'bc:97:e1:14:7d:90': 'ens5f0np0'}

Running smartctl-validate against a device, or smartctl -xa <device>, directly on the server (over SSH, with the machine prevented from powering off) works fine.

Last lines of cloud-init-output.log:

Starting testing scripts...
Installing apt packages for smartctl-validate (id: 871, script_version_id: 1)
Starting smartctl-validate (id: 871, script_version_id: 1)
Failed to execute smartctl-validate (id: 871, script_version_id: 1): 2
1 test scripts failed to run
Cloud-init v. 20.1-10-g71af48df-0ubuntu5 running 'modules:final' at Thu, 14 May 2020 15:44:48 +0000. Up 103.37 seconds.

Luis (luis-ramirez)
summary: - Smart-validate failure on Lenovo RS635 with LSI MegaRaid 730-9i
+ Smart-validate failure on Lenovo RS635 with LSI MegaRaid 730-8i
Lee Trager (ltrager) wrote : Re: Smart-validate failure on Lenovo RS635 with LSI MegaRaid 730-8i

Can you confirm no storage devices were discovered? You can do this by clicking on the storage tab on the machine details page.

Please upload the following to the bug:
* The MAAS logs for each controller, stored in /var/log/maas
* The output of the commissioning script 50-maas-01-commissioning from the failing machine

Changed in maas:
status: New → Incomplete
Luis (luis-ramirez) wrote :

Yes, no devices were discovered.
MAAS log attached.

Luis (luis-ramirez) wrote :

Hi

Please find the 50-maas-01-commissioning log file attached.

Lee Trager (ltrager) wrote :

Could you also upload /var/log/maas/regiond.log?

Luis (luis-ramirez) wrote :

Uploaded!

Thx in advance.

Lee Trager (ltrager) wrote :

MAAS is failing to process the commissioning results. When that happens, storage devices and interfaces don't get created. This seems to be due to the Mellanox cards; both of them report the following:

"supported_modes": [
    "1000baseKX/Full",
    "10000baseKR/Full",
    "40000baseKR4/Full",
    "40000baseCR4/Full",
    "40000baseSR4/Full",
    "40000baseLR4/Full",
    "25000baseCR/Full"
],
"link_speed": 100000,

The cards report support for speeds of up to 40 Gb/s while being connected to a 100 Gb/s link. MAAS verifies that the link speed doesn't exceed the maximum interface speed, and that check fails, as you can see in regiond.log:

django.core.exceptions.ValidationError: ['link_speed may not exceed interface_speed']
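
The check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual MAAS code: the helper names max_supported_speed and validate_link_speed are hypothetical, a plain ValueError stands in for the Django ValidationError, and the mode strings are assumed to follow the "<speed>base<type>/<duplex>" pattern seen in the commissioning output.

```python
import re

# Values taken from the commissioning output quoted above.
supported_modes = [
    "1000baseKX/Full",
    "10000baseKR/Full",
    "40000baseKR4/Full",
    "40000baseCR4/Full",
    "40000baseSR4/Full",
    "40000baseLR4/Full",
    "25000baseCR/Full",
]
link_speed = 100000  # Mbit/s


def max_supported_speed(modes):
    """Return the highest speed (Mbit/s) found in mode strings such as
    '40000baseKR4/Full'. Hypothetical helper, for illustration only."""
    speeds = []
    for mode in modes:
        match = re.match(r"(\d+)base", mode)
        if match:
            speeds.append(int(match.group(1)))
    return max(speeds, default=0)


def validate_link_speed(link_speed, interface_speed):
    """Hypothetical stand-in for the check that fails in regiond.log."""
    if link_speed > interface_speed:
        raise ValueError("link_speed may not exceed interface_speed")


interface_speed = max_supported_speed(supported_modes)
print(interface_speed)  # 40000

try:
    validate_link_speed(link_speed, interface_speed)
except ValueError as error:
    print(error)  # link_speed may not exceed interface_speed
```

With the truncated mode list the derived maximum is 40000, so a 100000 link speed trips the check.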

Could you try updating the firmware on your Mellanox cards?

summary: - Smart-validate failure on Lenovo RS635 with LSI MegaRaid 730-8i
+ Mellanox MT28800 Family [ConnectX-5 Ex] does not report 100G as max
+ interface speed
Lee Trager (ltrager) wrote :

Another thing to try is switching the commissioning operating system to Focal. This can be done on the settings page after you download the image. When commissioning, select "Allow SSH access and prevent machine powering off". If commissioning fails, SSH in and gather the output of:

uname -a
ethtool <dev> (for each Mellanox card)

Luis (luis-ramirez) wrote :

ubuntu@grand-jennet:~$ ethtool ens3f0
Settings for ens3f0:
        Supported ports: [ Backplane ]
        Supported link modes: 1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None RS
        Advertised link modes: 1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: RS
        Link partner advertised link modes: Not reported
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Port: Direct Attach Copper
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: on
Cannot get wake-on-lan settings: Operation not permitted
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes
ubuntu@grand-jennet:~$ ethtool ens3f1
Settings for ens3f1:
        Supported ports: [ Backplane ]
        Supported link modes: 1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                50000baseKR2/Full
                                100000baseKR4/Full
                                100000baseSR4/Full
                          ...


Lee Trager (ltrager) wrote :

Thanks for gathering that. This appears to be a bug in LXD. I have created a bug with LXD and linked it to this one.
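
For anyone wanting to repeat the comparison, here is a rough sketch of pulling the supported modes and the current speed out of ethtool text. The function names are hypothetical, the sample is abbreviated from the output above, and the parsing assumes the "Key: value" layout shown there; it is not how LXD or MAAS gathers this data.

```python
import re

# Abbreviated from the ethtool output posted above.
ETHTOOL_SAMPLE = """\
Settings for ens3f0:
        Supported ports: [ Backplane ]
        Supported link modes: 1000baseKX/Full
                                10000baseKR/Full
                                40000baseKR4/Full
                                25000baseCR/Full
                                50000baseCR2/Full
                                100000baseKR4/Full
                                100000baseLR4_ER4/Full
        Supported pause frame use: Symmetric
        Speed: 100000Mb/s
"""


def parse_supported_modes(text):
    """Collect the entries listed under 'Supported link modes:'."""
    modes = []
    in_section = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Supported link modes:"):
            in_section = True
            stripped = stripped.split(":", 1)[1].strip()
        elif in_section and ":" in stripped:
            # The next "Key: value" line ends the section.
            break
        if in_section:
            modes.extend(stripped.split())
    return modes


def parse_link_speed(text):
    """Return the current speed in Mbit/s, or None if it is not reported."""
    match = re.search(r"Speed:\s*(\d+)Mb/s", text)
    return int(match.group(1)) if match else None


modes = parse_supported_modes(ETHTOOL_SAMPLE)
max_speed = max(int(re.match(r"(\d+)base", m).group(1)) for m in modes)
print(max_speed, parse_link_speed(ETHTOOL_SAMPLE))  # 100000 100000
```

Here the maximum supported speed matches the link speed, unlike the commissioning data, where the mode list stopped at 25000baseCR/Full.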

Changed in maas:
status: Incomplete → Triaged
milestone: none → 2.8.0rc1
Lee Trager (ltrager) wrote :

The LXD team believes this may have been recently fixed. Can you download this attachment and extract it in /usr/share/maas/machine-resources?

Lee Trager (ltrager)
Changed in maas:
assignee: nobody → Lee Trager (ltrager)
importance: Undecided → High
Changed in lxd:
status: Unknown → New
Changed in lxd:
status: New → Fix Released
Lee Trager (ltrager) wrote :

In LP:1878685 I had to update the LXD sources used for commissioning. This should contain the fix for this bug; however, I don't have access to hardware to verify it.

Changed in maas:
status: Triaged → Incomplete
Luis (luis-ramirez) wrote :

I'll try to apply it tomorrow and let you know if everything goes well.

Alberto Donato (ack)
Changed in maas:
milestone: 2.8.0rc1 → 2.8.0
Yoshi Kadokawa (yoshikadokawa) wrote :

I'm facing the same issue for a customer with a 50GbE Mellanox MT27800 Family [ConnectX-5] in MAAS v2.7.1.
I haven't replaced the machine-resource binary directly in /usr/share/maas/machine-resources,
but after running the binary on one of the nodes that has this NIC, I am now seeing the expected output.

"supported_modes": [
    "1000baseKX/Full",
    "10000baseKR/Full",
    "40000baseKR4/Full",
    "40000baseCR4/Full",
    "40000baseSR4/Full",
    "40000baseLR4/Full",
    "25000baseCR/Full",
    "25000baseKR/Full",
    "25000baseSR/Full",
    "50000baseCR2/Full",
    "50000baseKR2/Full"
],

The last two modes,
    "50000baseCR2/Full",
    "50000baseKR2/Full"
were not present in the 50-maas-01-commissioning output, so I believe this LXD fix should resolve the issue.
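
The difference can be illustrated with a short sketch. The old list is an assumption reconstructed from the statement that only the last two modes were missing, the 50000 Mbit/s link speed is assumed from the 50GbE card description, and top_speed is a hypothetical helper.

```python
# Modes reported by the updated binary (from the output above).
new_modes = [
    "1000baseKX/Full",
    "10000baseKR/Full",
    "40000baseKR4/Full",
    "40000baseCR4/Full",
    "40000baseSR4/Full",
    "40000baseLR4/Full",
    "25000baseCR/Full",
    "25000baseKR/Full",
    "25000baseSR/Full",
    "50000baseCR2/Full",
    "50000baseKR2/Full",
]
# Assumption: the old commissioning output lacked only the last two modes.
old_modes = new_modes[:-2]
link_speed = 50000  # Mbit/s, assumed for a 50GbE link


def top_speed(modes):
    """Highest speed (Mbit/s) encoded in mode strings like '50000baseCR2/Full'."""
    return max(int(mode.split("base")[0]) for mode in modes)


missing = [mode for mode in new_modes if mode not in old_modes]
print(missing)  # ['50000baseCR2/Full', '50000baseKR2/Full']
print(link_speed <= top_speed(old_modes))  # False
print(link_speed <= top_speed(new_modes))  # True
```

Under these assumptions the old list tops out at 40000, so a 50000 link fails the speed check, while the updated list passes it.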

Yoshi Kadokawa (yoshikadokawa) wrote :

I'm adding field-critical since there is no clean workaround other than downgrading to v2.6.

Nobuto Murata (nobuto)
Changed in maas:
status: Incomplete → Confirmed
Alberto Donato (ack)
Changed in maas:
status: Confirmed → Fix Committed
Alberto Donato (ack)
Changed in maas:
status: Fix Committed → Fix Released
Changed in lxd:
status: Fix Released → Unknown
Trent Lloyd (lathiat) wrote :

According to Nobuto, for MAAS 2.7 this is fixed by the same commit as for Bug #1881821 - can we please confirm that and mark this Fix Committed for the 2.7 series.

Nobuto Murata (nobuto) wrote :

> According to Nobuto, for MAAS 2.7 this is fixed by the same commit as for Bug #1881821 - can we please confirm that and mark this Fix Committed for the 2.7 series.

Just to be precise, the fix in bug #1881821 makes the "link_speed may not exceed interface_speed" error non-fatal.
https://code.launchpad.net/~ltrager/maas/+git/maas/+merge/385546
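
As a rough illustration of what "non-fatal" means here (this is not the content of the merge proposal; check_link_speed, its fatal flag, and the logger name are hypothetical), the mismatch can be logged as a warning instead of aborting the processing of commissioning results:

```python
import logging

logger = logging.getLogger("commissioning-sketch")  # hypothetical logger name


def check_link_speed(link_speed, interface_speed, fatal=False):
    """Return True when the speeds are consistent.

    On a mismatch, raise if fatal is set; otherwise log a warning and
    return False so the caller can keep processing the results.
    """
    if link_speed <= interface_speed:
        return True
    message = "link_speed may not exceed interface_speed"
    if fatal:
        raise ValueError(message)
    logger.warning(
        "%s (link_speed=%d, interface_speed=%d)",
        message,
        link_speed,
        interface_speed,
    )
    return False


# Non-fatal mode: the mismatch is reported, but execution continues.
consistent = check_link_speed(100000, 40000)
print(consistent)  # False
```

With a check like this, storage devices and interfaces can still be created even when a NIC reports inconsistent speeds.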

However, the refresh of the LXD resources code fixes the original issue (the supported speeds were not properly detected), which was addressed by lxd commit 83791015327fdf2ee29b070808eba11cabe1f06d in:
https://github.com/lxc/lxd/issues/7370

Either way, the issue doesn't occur with 2.7.2~rc1, which has both fixes and is available in:
https://launchpad.net/~maas/+archive/ubuntu/2.7-next

Changed in lxd:
status: Unknown → Fix Released