ipmi-config connection timeout

Bug #1762555 reported by Douglas
This bug affects 1 person
Affects        Status   Importance  Assigned to  Milestone
maas (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

I'm trying to commission a machine and I'm getting the following error in the maas.log:

Apr 9 00:24:39 maas maas.drivers.power.ipmi: [warn] Failed to change the boot order to PXE xxxxxx: /usr/sbin/ipmi-config: connection timeout
Apr 9 00:25:45 maas maas.drivers.power.ipmi: [warn] Failed to change the boot order to PXE xxxxxx: /usr/sbin/ipmi-config: connection timeout
Apr 9 00:27:13 maas maas.power: [error] Error changing power state (on) of node: server1 (8w7s4m)
Apr 9 00:27:14 maas maas.node: [info] server1: Status transition from COMMISSIONING to FAILED_COMMISSIONING
Apr 9 00:27:14 maas maas.node: [error] server1: Marking node failed: Power on for the node failed: Could not contact node's BMC: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.

MaaS Versions:
# dpkg -l | grep maas
ii maas 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS client and command-line interface
ii maas-common 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all Region Controller for MAAS
ii python3-django-maas 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.3.0-6434-gd354690-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Ubuntu version:
# lsb_release -rd
Description: Ubuntu 16.04.3 LTS
Release: 16.04

Expected result:
MaaS is able to correctly reboot, start, and stop the machine; it just can't change the boot order to PXE. I would expect that if it can't change the boot order after 1-2 attempts, MaaS would not fail the commissioning immediately but would allow another 5-10 minutes for the server to PXE boot, since the server is most likely already configured with PXE as the first boot option. It would also be nice to have an option that stops MaaS from trying to set the boot order, so we don't waste time resetting the server when we know the setting can't be applied.

What happens:
Right now, MaaS turns the server on, gets the first ipmi-config timeout, resets the server, gets the second timeout, resets the server, and then immediately fails the commissioning.

Tags: ipmi maas
Douglas (dgrosvenor)
tags: added: ipmi maas
Changed in maas (Ubuntu):
status: New → Incomplete
Andres Rodriguez (andreserl) wrote :

Hi Douglas,

If MAAS cannot set the boot order to PXE, it doesn't matter, because that step is best-effort and does not cause a failure.

The failure, however, is that your BMC is either reporting that it is failing to power on, failing to report that it powered on correctly, or not reporting anything at all. MAAS does this:

1. MAAS attempts to set the machine to PXE boot. If that fails, it doesn't matter; MAAS continues.
2. MAAS tells the machine to power on and checks whether it powered on. If it didn't, MAAS re-attempts the power on and checks again.

MAAS repeats step 2 at intervals of (1, 2, 2, 4, 6, 8, 12) seconds, unless the tool reports a fatal error.
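
As a rough illustration of the power-on-and-check loop described above, here is a minimal Python sketch. It is not the MAAS source: the names `power_on_with_retries`, `PowerError`, `power_on`, and `query_state` are hypothetical, and only the wait schedule is taken from this comment. The `sleep` parameter exists so the loop can be exercised without real delays.

```python
import time

# Wait schedule quoted in this comment, in seconds.
WAIT_TIMES = (1, 2, 2, 4, 6, 8, 12)


class PowerError(Exception):
    """Hypothetical error: the machine never reported the requested state."""


def power_on_with_retries(power_on, query_state,
                          wait_times=WAIT_TIMES, sleep=time.sleep):
    """Ask for power on, wait, then check the state; repeat per schedule.

    `power_on` and `query_state` are caller-supplied callables (assumed
    here, not MAAS APIs). Any exception they raise is not caught, so a
    fatal error from the underlying tool stops the loop immediately.
    """
    for wait in wait_times:
        power_on()
        sleep(wait)
        if query_state() == "on":
            return
    raise PowerError("machine did not report 'on' after all attempts")
```

With this schedule the waits alone add up to 35 seconds (1 + 2 + 2 + 4 + 6 + 8 + 12), not counting however long each BMC command takes to answer or time out.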

That said, in the few cases where we have seen the issues you are reporting, the cause has typically been a buggy BMC that locks itself up. As such, I would recommend that you start by upgrading the firmware.

Once that is done, could you also provide the output of:

ipmipower -W opensesspriv -D LAN_2_0 -u <user> -p <password> -h <host> --cycle --on-if-off
ipmipower -W opensesspriv -D LAN_2_0 -u <user> -p <password> -h <host> --stat

And could you repeat that, scripted if possible, to see whether your BMC locks up or reports a failure?
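
One possible way to script that repetition is sketched below in Python. The `ipmipower` flags are copied from the two commands above; everything else (the function names `build_commands` and `probe`, the `rounds` and `delay` defaults, and the injectable `run` and `sleep` parameters) is an assumption made for illustration. Note that the first command power-cycles the machine on every round.

```python
import subprocess
import time
from datetime import datetime


def build_commands(host, user, password):
    """Return the two ipmipower invocations quoted above as (label, argv)."""
    base = ["ipmipower", "-W", "opensesspriv", "-D", "LAN_2_0",
            "-u", user, "-p", password, "-h", host]
    return [
        ("cycle", base + ["--cycle", "--on-if-off"]),  # power-cycles the host
        ("stat", base + ["--stat"]),
    ]


def probe(host, user, password, rounds=10, delay=3,
          run=subprocess.run, sleep=time.sleep):
    """Run both commands `rounds` times and collect timestamped output.

    Returns a list of (timestamp, label, output) tuples so that lock-ups
    or failures reported by the BMC are easy to spot afterwards.
    """
    log = []
    for _ in range(rounds):
        for label, argv in build_commands(host, user, password):
            result = run(argv, capture_output=True, text=True)
            output = (result.stdout or result.stderr).strip()
            stamp = datetime.now().isoformat(timespec="seconds")
            log.append((stamp, label, output))
        sleep(delay)
    return log
```

A call such as `probe("<host>", "<user>", "<password>")` would return the collected tuples, which can then be printed or attached to the bug.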

Douglas (dgrosvenor) wrote :

Andres,
Thanks for the quick reply and the info on the MaaS subsystem. A couple things I have noticed:

1. MaaS can power the machine on and off correctly and can detect whether the system is powered on or off.
2. I believe the servers I'm trying to boot are using shared NICs, which seem to flap when the system is starting up or shutting down. That might explain why the MaaS check fails. It looks like the BMC deactivates for about a minute:

# bash ./ipmi.sh server1 user
Mon Apr 9 23:41:52 EDT 2018
server1: ok
Mon Apr 9 23:41:52 EDT 2018
server1: on
Mon Apr 9 23:41:52 EDT 2018
# bash ./ipmi.sh server1 user
Mon Apr 9 23:41:53 EDT 2018
server1: ok
Mon Apr 9 23:41:53 EDT 2018
server1: on
Mon Apr 9 23:41:53 EDT 2018
# bash ./ipmi.sh server1 user
Mon Apr 9 23:41:53 EDT 2018
server1: ok
Mon Apr 9 23:41:53 EDT 2018
server1: on
Mon Apr 9 23:41:53 EDT 2018
<I powered down the server through the IPMI interface>
# bash ./ipmi.sh server1 user
Mon Apr 9 23:41:59 EDT 2018
server1: connection timeout
Mon Apr 9 23:42:19 EDT 2018
server1: connection timeout
Mon Apr 9 23:42:39 EDT 2018
# bash ./ipmi.sh server1 user
Mon Apr 9 23:42:52 EDT 2018
server1: ok
Mon Apr 9 23:43:05 EDT 2018
server1: on
Mon Apr 9 23:43:05 EDT 2018

Even when just sending those same commands continually, it looks like the BMC is locking up or flapping:

# for i in `seq 1 100`;do bash ./ipmi.sh server1 user; sleep 3; done
Tue Apr 10 00:09:31 EDT 2018
server1: ok
Tue Apr 10 00:09:31 EDT 2018
server1: off
Tue Apr 10 00:09:31 EDT 2018
Tue Apr 10 00:09:34 EDT 2018
server1: ok
Tue Apr 10 00:09:34 EDT 2018
server1: off
Tue Apr 10 00:09:35 EDT 2018
Tue Apr 10 00:09:38 EDT 2018
server1: connection timeout
Tue Apr 10 00:09:58 EDT 2018
server1: connection timeout
Tue Apr 10 00:10:18 EDT 2018
Tue Apr 10 00:10:21 EDT 2018
server1: ok
Tue Apr 10 00:10:32 EDT 2018
server1: on
Tue Apr 10 00:10:32 EDT 2018
Tue Apr 10 00:10:35 EDT 2018
server1: ok
Tue Apr 10 00:10:36 EDT 2018
server1: on
Tue Apr 10 00:10:36 EDT 2018
Tue Apr 10 00:10:39 EDT 2018
server1: connection timeout
Tue Apr 10 00:10:59 EDT 2018
server1: connection timeout
Tue Apr 10 00:11:19 EDT 2018
Tue Apr 10 00:11:22 EDT 2018
server1: connection timeout
Tue Apr 10 00:11:42 EDT 2018
server1: on
Tue Apr 10 00:11:46 EDT 2018
Tue Apr 10 00:11:49 EDT 2018
server1: ok
Tue Apr 10 00:11:49 EDT 2018
server1: on
Tue Apr 10 00:11:49 EDT 2018
Tue Apr 10 00:11:52 EDT 2018
server1: ok
Tue Apr 10 00:11:52 EDT 2018
server1: on
Tue Apr 10 00:11:52 EDT 2018
Tue Apr 10 00:11:55 EDT 2018
server1: connection timeout
Tue Apr 10 00:12:15 EDT 2018
server1: connection timeout
Tue Apr 10 00:12:36 EDT 2018
Tue Apr 10 00:12:39 EDT 2018
server1: connection timeout
Tue Apr 10 00:12:59 EDT 2018
server1: on
Tue Apr 10 00:13:02 EDT 2018
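
To put numbers on how long each call hangs, a transcript like the ones above can be reduced to per-command durations. The Python sketch below assumes that `ipmi.sh` (whose contents are not shown in this report) prints a `date` line before and after every command, which is what the output suggests; the names `parse_stamp` and `command_durations` are made up for this example.

```python
import re
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Matches `date` output such as "Tue Apr 10 00:09:38 EDT 2018".
STAMP = re.compile(
    r"^\w{3} (\w{3}) +(\d{1,2}) (\d\d):(\d\d):(\d\d) \w+ (\d{4})$")


def parse_stamp(line):
    """Return a datetime for a `date` line, or None for any other line."""
    match = STAMP.match(line.strip())
    if not match:
        return None
    month, day, hh, mm, ss, year = match.groups()
    return datetime(int(year), MONTHS[month], int(day),
                    int(hh), int(mm), int(ss))


def command_durations(lines):
    """Pair each "host: status" line with the seconds between the
    timestamps printed just before and just after it."""
    results = []
    last_stamp = None
    pending = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is not None:
            if pending is not None and last_stamp is not None:
                elapsed = (stamp - last_stamp).total_seconds()
                results.append((pending, elapsed))
            pending = None
            last_stamp = stamp
        elif ": " in line:
            pending = line.split(": ", 1)[1].strip()
    return results
```

Fed the lines from 00:09:38 to 00:10:18 above, this yields two `connection timeout` entries of 20 seconds each, so every failed call blocks for about 20 seconds before giving up.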

Unfortunately, I think this server already has the latest BMC firmware. I'll try using the dedicated BMC NIC to see if that helps.

Douglas (dgrosvenor) wrote :

Apologies, I thought I had responded to this a while ago. After switching the cable to the dedicated IPMI NIC, everything worked without an issue. Kind of sucks that the shared IPMI flakes out, but it is what it is. Thanks for your help!

Launchpad Janitor (janitor) wrote :

[Expired for maas (Ubuntu) because there has been no activity for 60 days.]

Changed in maas (Ubuntu):
status: Incomplete → Expired