maas.power Error changing power state (on) while commissioning the node

Bug #1635107 reported by Mohammad
This bug affects 8 people
Affects: MAAS
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I have five Dell 1950 servers with the RAC, BMC and network card firmware updated. I have configured one of the servers as the MAAS cluster controller, and the others simply boot from PXE as nodes.

My problem is that when I commission a node, MAAS fails and logs this error:

Failed to power on node - Node could not be powered on:
Could not contact node's BMC: Connection timed out while
performing power action. Check BMC configuration and
connectivity and try again.

However, after 10 to 20 seconds it detects the power status:

Queried node's BMC - Power state queried: off

I noticed that if I turn off the server 5 to 8 seconds before starting commissioning, the process succeeds.

In addition, I can turn the server on and off by using ipmipower and etherwake.

It seems to be an IPMI query timeout problem. In older versions it was possible to change the timeout in /etc/maas/templates/power/ipmi.template, but that file is not available in the new version.

How can I change the timeout value, and where is it set?

As you can see below, the node was returning its power state, but after commissioning starts the server needs 10 to 15 seconds before it responds to power state queries again (meanwhile I cannot ping the node's BMC IP address).

Node Log:
Queried node's BMC - Power state queried: off Thu, 20 Oct. 2016 11:34:48
 Failed to power on node - Node could not be powered on: Could not contact node's BMC: Connection timed out while performing power action. Check BMC configuration and connectivity and try again. Thu, 20 Oct. 2016 11:34:28
 Node changed status - From 'Commissioning' to 'Failed commissioning' Thu, 20 Oct. 2016 11:34:28
 Marking node failed - Node could not be powered on: Could not contact node's BMC: Connection timed out while performing power action. Check BMC configuration and connectivity and try again. Thu, 20 Oct. 2016 11:34:28
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:27
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:26
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:26
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:26
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:25
 User aborting node commissioning - (root) Thu, 20 Oct. 2016 11:34:24
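
The node log above shows the power-on attempt failing at 11:34:28 and a power query succeeding at 11:34:48, which suggests the BMC is simply unreachable for a short window. One way to measure that window outside MAAS is to poll the BMC until it answers. The following is a minimal sketch, not MAAS code: `wait_for_bmc` and `ipmipower_stat` are hypothetical helper names, the 20000 ms session timeout is an arbitrary illustrative value, and the check on the `ipmipower --stat` output assumes the usual `host: on` / `host: off` format.

```python
import subprocess
import time


def ipmipower_stat(host, user, password, session_timeout_ms=20000):
    """Return True if `ipmipower --stat` reports a power state for the host.

    Assumption: on success ipmipower prints a line ending in ": on" or
    ": off"; anything else (for example a timeout message) counts as failure.
    """
    cmd = [
        "ipmipower", "-h", host, "-u", user, "-p", password,
        "--stat", "--session-timeout=%d" % session_timeout_ms,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout.strip().endswith((": on", ": off"))


def wait_for_bmc(probe, deadline_s=30.0, interval_s=2.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or deadline_s has elapsed.

    Returns the number of seconds waited, or None if the BMC never answered.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

A call such as `wait_for_bmc(lambda: ipmipower_stat("10.0.0.5", "root", "secret"))` (address and credentials are placeholders) would report how long the BMC stays silent after a power action, which helps when choosing a timeout.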

MAAS.LOG:

Oct 20 11:38:53 maas maas.import-images: [INFO] Downloading image descriptions from http://localhost:5240/MAAS/images-stream/streams/v1/index.json
Oct 20 11:38:54 maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 20 11:40:47 maas maas.power: [ERROR] Power state could not be queried: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:40:48 maas maas.power: [ERROR] primary: Failed to refresh power state: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:46:02 maas maas.power: [ERROR] Power state could not be queried: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:46:03 maas maas.power: [ERROR] primary: Failed to refresh power state: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:51:18 maas maas.power: [ERROR] Power state could not be queried: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:51:18 maas maas.power: [ERROR] primary: Failed to refresh power state: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:56:18 maas maas.power: [ERROR] Power state could not be queried: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:56:18 maas maas.power: [ERROR] primary: Failed to refresh power state: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 11:58:53 maas maas.import-images: [INFO] Started importing boot images.
Oct 20 11:58:53 maas maas.import-images: [INFO] Downloading image descriptions from http://localhost:5240/MAAS/images-stream/streams/v1/index.json
Oct 20 11:58:54 maas maas.import-images: [INFO] Finished importing boot images, the region does not have any new images.
Oct 20 12:01:32 maas maas.power: [ERROR] Power state could not be queried: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.
Oct 20 12:01:33 maas maas.power: [ERROR] primary: Failed to refresh power state: Connection timed out while performing power action. Check BMC configuration and connectivity and try again.

Output from dpkg -l '*maas*'|cat

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===============================-==============================-============-=================================================
ii maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all "Metal as a Service" is a physical cloud and IPAM
ii maas-cli 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS client and command-line interface
un maas-cluster-controller <none> <none> (no description available)
ii maas-common 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server common files
ii maas-dhcp 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DHCP server
ii maas-dns 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS DNS server
ii maas-proxy 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS Caching Proxy
ii maas-rack-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Rack Controller for MAAS
ii maas-region-api 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region controller API service for MAAS
ii maas-region-controller 2.0.0+bzr5189-0ubuntu1~16.04.1 all Region Controller for MAAS
un maas-region-controller-min <none> <none> (no description available)
un python-django-maas <none> <none> (no description available)
un python-maas-client <none> <none> (no description available)
un python-maas-provisioningserver <none> <none> (no description available)
ii python3-django-maas 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server Django web framework (Python 3)
ii python3-maas-client 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS python API client (Python 3)
ii python3-maas-provisioningserver 2.0.0+bzr5189-0ubuntu1~16.04.1 all MAAS server provisioning libraries (Python 3)

Revision history for this message
Newell Jensen (newell-jensen) wrote :

You can change the "timeout" in the actual power driver now which is located at:

/usr/lib/python3/dist-packages/provisioningserver/drivers/power/ipmi.py.

See if this helps your situation.

Revision history for this message
Leonardo Borda (lborda) wrote :

@newell-jensen

Hi,
I am running into the same error message here, and I cannot find where to set the timeout for the IPMI driver. The wait_time value is there in /usr/lib/python3/dist-packages/provisioningserver/drivers/power/ipmi.py but, from what I can see, it is not being used.

maas 2.1.1+bzr5544-0ubuntu1~16.04.1

Revision history for this message
Newell Jensen (newell-jensen) wrote :

Leonardo,

The wait_time is being used. The retry logic is handled by the base class.
Did you restart rackd after modifying and saving the file? If not, you need to.
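
To make the "retry logic in the base class" idea concrete, here is a rough sketch of how a tuple like wait_time can drive retries: send the power command, pause for the next value in the tuple, query the state, and stop as soon as the state matches. This is an illustration only, not the actual MAAS implementation; `change_power_state`, `PowerActionError`, the callback parameters and the default `(4, 8, 16)` values are all hypothetical.

```python
import time


class PowerActionError(Exception):
    """Raised when the node never reaches the requested power state."""


def change_power_state(desired, send_command, query_state,
                       wait_time=(4, 8, 16), sleep=time.sleep):
    """Send a power command, pause, then check the state; retry on mismatch.

    Each entry in wait_time is the pause (in seconds) used for one attempt.
    Returns the number of attempts that were needed.
    """
    for attempt, pause in enumerate(wait_time, start=1):
        send_command(desired)   # e.g. ask the BMC to power the node on
        sleep(pause)            # give the BMC and chassis time to react
        if query_state() == desired:
            return attempt
    raise PowerActionError(
        "node did not reach state %r after %d attempts"
        % (desired, len(wait_time))
    )
```

Under this model, a subclass that only overrides the tuple still changes behaviour, because the loop lives in shared code. It also shows why short pauses can produce repeated power commands on hardware that is slow to respond.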

Revision history for this message
Björn Tillenius (bjornt) wrote :

It would be good to change the timeout as Newell suggested and see whether it helps.

Changed in maas:
status: New → Incomplete
Revision history for this message
Daniele Pizzolli (daniele-pizzolli) wrote :

Hi,

I have a similar problem using maas 2.2.0+bzr6054-0ubu.
The target hosts are Dell R210 II.

To get commissioning to finish successfully, I changed the timeout as follows.

With the following patch, the machines are powered on and off 5 times
before PXE boot. This works but is not ideal. Do you see any drawbacks
to increasing the timeouts from the start, e.g. (16, 32, 64)?

Could you add a user-configurable timeout?
Also, there is no real indication that the problem is a timeout.
The logs (sorry, not saved) simply say something about a BMC error,
and it took a dedicated search to find this bug report.

Best.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for MAAS because there has been no activity for 60 days.]

Changed in maas:
status: Incomplete → Expired
Revision history for this message
Mike Kingsbury (mike.kingsbury) wrote :

This is still a problem. In my case it is Dell R710 hardware with a NIC shared with the iDRAC (the IPMI hardware), which becomes unavailable for a short time during boot cycles. I've hacked the timeout switches into ipmi.py to get by; otherwise deploys fail.

Revision history for this message
Daniele Pizzolli (daniele-pizzolli) wrote :

There are affected people and patches to review.

Changed in maas:
status: Expired → Confirmed
Revision history for this message
Michael Quiniola (qthepirate) wrote :

This affects me too.
I verified that the firmware is up to date.

I believe there is a more general issue, as all of my current nodes have stopped responding to power checks (they all give an error) even though they worked before.

Revision history for this message
Michael Quiniola (qthepirate) wrote :

An Update:

Changing the wait timeout to (64, 128, 256) was the only thing that worked for me.
There may also be an issue on certain servers (this one was an R710) where the memory configuration step at the start of boot delays things. I'm not an expert, so this is only a guess.

Revision history for this message
ananke (ananke) wrote :

Was wondering what the status of this bug is. I'm in a similar situation with Dell PE710 systems with iDRACs. When attempting to commission a node MAAS turns on the node in question, waits about 60 seconds, then power cycles it. That of course is not long enough to even go through half of the POST.

If MAAS can't automatically determine the timeout based on the IPMI vendor, it would be nice to have these timeouts available as a configurable option, even if it's buried in a config file. Having to adjust the code is suboptimal.

Thanks!

Revision history for this message
Rodney Wild (rdwild) wrote :

I am still having this issue with Dell R610 and R710 machines. Every time I update, the update resets the timeouts to the old values and I have to go in and re-edit them. This needs to be fixed at the source.
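
Until the value is configurable, the re-edit after each package update could be scripted. The sketch below only rewrites text: it assumes the driver source contains a single assignment of the form `wait_time = (...)`, which is an assumption based on the comments in this thread, not a verified description of ipmi.py. The function name `set_wait_time` and the sample source string are made up for illustration.

```python
import re

# Matches an assignment such as "    wait_time = (4, 8, 16, 22)".
WAIT_TIME_RE = re.compile(r"^(\s*wait_time\s*=\s*)\([^)]*\)", re.MULTILINE)


def set_wait_time(source, new_values):
    """Return source with the first wait_time tuple replaced by new_values."""
    items = ", ".join(str(v) for v in new_values)
    if len(new_values) == 1:
        items += ","  # keep a one-element tuple a tuple
    new_source, count = WAIT_TIME_RE.subn(
        lambda m: m.group(1) + "(" + items + ")", source, count=1
    )
    if count == 0:
        raise ValueError("no wait_time assignment found")
    return new_source


# Made-up stand-in for the driver file contents.
sample = "class Driver:\n    wait_time = (4, 8, 16, 22)\n    name = 'ipmi'\n"
print(set_wait_time(sample, (64, 128, 256)))
```

In practice the function would be applied to the text read from the driver file and the result written back, followed by the rackd restart mentioned earlier in this thread. Keeping a backup of the original file is advisable.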

Revision history for this message
Adam Collard (adam-collard) wrote :

This bug has not seen any activity in the last 6 months, so it is being automatically closed.

If you are still experiencing this issue, please feel free to re-open.

MAAS Team

Changed in maas:
status: Confirmed → Invalid
Revision history for this message
Mr.K (mrk131211) wrote :

Error: Failed to query node's BMC - Connection timed out while performing power action. Check BMC configuration and connectivity and try again.

I am still facing this issue. Does anybody have a solution?

Thank You

Revision history for this message
Daniele Pizzolli (daniele-pizzolli) wrote :

Hi, our workaround for old Dell servers is to enable all the ciphers (we have a trusted IPMI network, so security is not a primary concern). As far as I know, there is no way to edit the script from the UI, and MAAS loads the script from the database rather than from the file system.

We think there is some misalignment between the ciphers enabled by MAAS and the ones actually enabled in the Dell firmware, but we have had no time to investigate it further.

So the workaround is to edit the file in place, then delete the script from the database and reload it.

Our patch is the following:

*** bmc_config.py.ori 2021-08-05 14:19:34.500857402 +0200
--- bmc_config.py 2021-08-05 14:26:25.647118148 +0200
***************
*** 554,559 ****
--- 554,561 ----
                  # Leave secure ciphers as is. Most tools default to 3 while
                  # 17 is considered the most secure.
                  new_cipher_suite_privs += c
+ new_cipher_suite_privs = "aaaaaaaaaaaaaaX"
+ print("INFO: Hardcoded found working for old dell servers: %s" % new_cipher_suite_privs)

          if new_cipher_suite_privs != current_suite_privs:
              channel, _ = self._get_ipmitool_lan_print()
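
For readers unfamiliar with the hardcoded string in the patch: in the `cipher_privs` notation documented for `ipmitool lan set`, each character position corresponds to a cipher suite ID (starting at 0) and each letter gives the maximum privilege level allowed for that suite (X = unused, c = CALLBACK, u = USER, o = OPERATOR, a = ADMIN, O = OEM). The sketch below decodes such a string. It assumes the value in the patch follows that notation; `decode_cipher_privs` is a hypothetical helper, not part of MAAS.

```python
# Letter codes as listed in the ipmitool man page for "lan set cipher_privs".
PRIVILEGE_NAMES = {
    "X": "unused",
    "c": "CALLBACK",
    "u": "USER",
    "o": "OPERATOR",
    "a": "ADMIN",
    "O": "OEM",
}


def decode_cipher_privs(privs):
    """Map each position in the string to (cipher suite ID, privilege name)."""
    return [(suite_id, PRIVILEGE_NAMES[ch]) for suite_id, ch in enumerate(privs)]


# The value hardcoded in the patch above.
for suite_id, level in decode_cipher_privs("aaaaaaaaaaaaaaX"):
    print("cipher suite %2d: %s" % (suite_id, level))
```

Read this way, the patch grants ADMIN on every listed suite except the last one, which is why it is only reasonable on a trusted IPMI network as described in the comment.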

The procedure to reload the file in the database is the following:

maas-region shell
from metadataserver.models import Script
target = Script.objects.filter(name="30-maas-01-bmc-config")[0]
target.delete()

from metadataserver.builtin_scripts import load_builtin_scripts
load_builtin_scripts()

Hope that helps!

Revision history for this message
Zoltan Arnold Nagy (zoltan) wrote :

On MAAS 3.1 this is still an issue. When trying to commission or deploy a box that takes a long time to power on (one with A100 GPUs, for example), deployment fails because it takes 3-4 minutes for the box to actually power on.
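
As a back-of-the-envelope check on the values mentioned in this thread, the total time a wait tuple allows can be compared with how long a machine takes to respond. This sketch assumes the budget is simply the sum of the tuple entries and ignores the time spent running the IPMI commands themselves; `total_wait` and `covers` are hypothetical helper names.

```python
def total_wait(wait_time):
    """Total seconds spent pausing if every retry is used."""
    return sum(wait_time)


def covers(wait_time, boot_seconds):
    """True if the summed pauses are at least boot_seconds."""
    return total_wait(wait_time) >= boot_seconds


# Tuples suggested earlier in the thread, checked against a 4-minute power-on.
for candidate in [(16, 32, 64), (64, 128, 256)]:
    print(candidate, total_wait(candidate), covers(candidate, 4 * 60))
```

Under that assumption, (16, 32, 64) allows 112 seconds and (64, 128, 256) allows 448 seconds, so only the larger tuple would span a 3-4 minute power-on.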
