MAAS fails to power up machines when trying to install nodes

Bug #1171418 reported by Raphaël Badin on 2013-04-22
This bug affects 3 people
Affects          Importance  Assigned to
MAAS             High        Gavin Panella
1.2              High        Raphaël Badin
maas (Ubuntu)    High        Unassigned
Precise          High        Unassigned
Quantal          High        Unassigned
Raring           High        Unassigned

Bug Description

Integration tests on raring (package built from trunk) have been failing since Apr 19, 2013 11:31:01 PM.
http://10.189.74.2:8080/view/MAAS/job/raring-adt-maas-daily/133/ARCH=amd64,label=lenovo-RD230-01/console
http://10.189.74.2:8080/view/MAAS/job/raring-adt-maas-daily/135/ARCH=amd64,label=lenovo-RD230-01/console
MAAS fails to power up machines when installing a node. This does not happen every time, so it looks like a race condition.

[Impact]

This affects corner cases where MAAS tells nodes to start before they have finished shutting down. The fix makes MAAS issue the power action regardless of the node's current state.

This is a corner case because MAAS ensures that all nodes are turned off before deploying them. However, a node may not have finished powering off after commissioning when it is told to deploy.

[Test Case]
To reproduce, simply do the following:

1. Install maas and enlist/commission IPMI nodes.
2. Manually turn on one of the nodes.
3. With maas, deploy the node.
4. With the fix, the node will be rebooted and the installation will proceed. Without the fix, the installation will never start.

[Regression Potential]
Minimal. This change has been tested in the lab and through manual testing. It ensures that the machine gets powered off/on regardless of its current power state, so MAAS always performs the action when requested. The Server Team and MAAS team are committed to providing appropriate fixes in the event of a regression.

Raphaël Badin (rvb) wrote :

After investigating the issue, we found that the fix landed by https://code.launchpad.net/~vanhoof/maas/ipmi-state-fix_lp1086160/+merge/159714 (fix for bug 1086160) is responsible for the problem: the fix landed in this branch uncovered a bug in how MAAS deals with ipmi.

Before bug 1086160 was fixed, the ipmi template was *always* issuing the power command (because get_power_state() was broken). Now that we check the state of the node before powering it up, if the node is being brought down but is still up when get_power_state() is called, the ipmipower command won't be issued.

This is an example of what happens: right after "--off" is issued, the node is still up and thus "--stat" returns "on":
ubuntu@lenovo-RD230-01:~$ ipmipower -h 192.168.22.33 -u root -p ubuntu --off && ipmipower -h 192.168.22.33 -u root -p ubuntu --stat
192.168.22.33: ok <- this is the result of the "--off" command
192.168.22.33: on <- this is the result of the "--stat" command

ipmipower is clever enough to understand that, if "--on" is issued while the node is being powered down, the node needs to be powered up after it has gone down:
ubuntu@lenovo-RD230-01:~$ ipmipower -h 192.168.22.33 -u root -p ubuntu --off && ipmipower -h 192.168.22.33 -u root -p ubuntu --on
=> the node is powered down *then up*.

In conclusion, we should probably revert to the old behavior and not check the return value of "--stat" at all, just issue the --on/--off command. (Note that MAAS executes these ipmi commands asynchronously [using celery], so we cannot use ipmipower's --wait-until-on/--wait-until-off options to solve this problem.)
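
To make the race concrete, here is a minimal Python sketch of the two behaviours. It is a toy model, not MAAS code: FakeBMC, power_on_checked, power_on_unconditional and commission_then_deploy are hypothetical names, and the "queue the power-on until shutdown finishes" logic is only an approximation of what was observed with ipmipower above.

```python
class FakeBMC:
    """Toy stand-in for an IPMI controller whose power-off is not instant.

    Hypothetical model for illustration only; it is not part of MAAS.
    """

    def __init__(self, shutdown_ticks=3):
        self.state = "on"
        self.shutdown_ticks = shutdown_ticks
        self._ticks_left = 0           # > 0 while a shutdown is in progress
        self._power_on_queued = False  # "--on" received during shutdown

    def off(self):
        # Start shutting down; the node keeps reporting "on" until done.
        if self.state == "on" and self._ticks_left == 0:
            self._ticks_left = self.shutdown_ticks
        return "ok"

    def on(self):
        if self._ticks_left > 0:
            # Assumed behaviour: power up once the shutdown has completed.
            self._power_on_queued = True
        else:
            self.state = "on"
        return "ok"

    def stat(self):
        return self.state

    def tick(self):
        # Advance time by one step and finish the shutdown when due.
        if self._ticks_left > 0:
            self._ticks_left -= 1
            if self._ticks_left == 0:
                self.state = "on" if self._power_on_queued else "off"
                self._power_on_queued = False


def power_on_checked(bmc):
    """Check the state first and skip the command if it reports 'on'."""
    if bmc.stat() != "on":
        bmc.on()


def power_on_unconditional(bmc):
    """Always issue the power-on command, whatever the reported state."""
    bmc.on()


def commission_then_deploy(power_on):
    """Power off, request power on right away, then let the shutdown finish."""
    bmc = FakeBMC()
    bmc.off()
    power_on(bmc)
    for _ in range(bmc.shutdown_ticks):
        bmc.tick()
    return bmc.stat()


if __name__ == "__main__":
    print("checked:", commission_then_deploy(power_on_checked))
    print("unconditional:", commission_then_deploy(power_on_unconditional))
```

In this model the checked variant leaves the node off, because the state is still "on" when it is queried, while the unconditional variant ends with the node powered up.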

Changed in maas:
importance: Undecided → Critical
status: New → Triaged
tags: added: power
summary: - MAAS fails to power up machines when trying to install a node.
+ MAAS fails to power up machines when trying to install nodes
Raphaël Badin (rvb) on 2013-04-23
Changed in maas:
assignee: nobody → Gavin Panella (allenap)
status: Triaged → Fix Committed
Raphaël Badin (rvb) wrote :

Reducing the priority to 'High' because this only happens when nodes are deployed milliseconds after a node is commissioned. This is what happens in the lab but this is unlikely to happen in real life.

Changed in maas:
importance: Critical → High
Changed in maas (Ubuntu):
status: New → Confirmed
importance: Undecided → High
description: updated
Clint Byrum (clint-fewbar) wrote :

Hi, this is missing Regression Potential. Please see https://wiki.ubuntu.com/StableReleaseUpdates#Procedure for more information.

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.3+bzr1461+dfsg-0ubuntu3

---------------
maas (1.3+bzr1461+dfsg-0ubuntu3) saucy; urgency=low

  * debian/patches:
    - 99-fix-ipmi-stat-lp1086160.patch: Drop. The following patch removes
      the need for this fix. (LP: #1171988)
    - 99-fix-ipmi-lp1171418.patch: Do not check current node state when
      executing an ipmi command, which ensures that nodes are always
      turned on/off regardless of their power state. This fixes corner
      cases found when running automated tests. (LP: #1171418)
    - 99-fix-comissioning-lp1131418.patch: Fixes the commissioning process,
      allowing nodes to successfully commission, when tag's with no
      definition have been created. This issue will only appear when these
      special tags are created. (LP: #1131418)
    - 99-import-raring-images-lp1182642.patch: Enables the import of raring
      images by default (LP: #1182642)
    - 99-fix-new-image-install-lp1182646.patch: Fixes the installation of
      new ephemeral images, that fail due to not being able to overwrite
      a symlink. (LP: #1182646)
 -- Andres Rodriguez <email address hidden> Tue, 23 Apr 2013 14:02:33 -0400

Changed in maas (Ubuntu):
status: Confirmed → Fix Released
Changed in maas:
status: Fix Committed → Fix Released

Hello Raphaël, or anyone else affected,

Accepted maas into raring-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/maas/1.3+bzr1461+dfsg-0ubuntu2.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in maas (Ubuntu Raring):
status: New → Fix Committed
tags: added: verification-needed
Raphaël Badin (rvb) wrote :

Hi Chris Halse Rogers, I've just tested the package (1.3+bzr1461+dfsg-0ubuntu2.1) in our QA lab and it fixes the bug.

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.3+bzr1461+dfsg-0ubuntu2.1

---------------
maas (1.3+bzr1461+dfsg-0ubuntu2.1) raring-proposed; urgency=low

  * debian/patches:
    - 99-fix-ipmi-stat-lp1086160.patch: Drop. The following patch removes
      the need for this fix. (LP: #1171988)
    - 99-fix-ipmi-lp1171418.patch: Do not check current node state when
      executing an ipmi command, which ensures that nodes are always
      turned on/off regardless of their power state. This fixes corner
      cases found when running automated tests. (LP: #1171418)
    - 99-fix-comissioning-lp1131418.patch: Fixes the commissioning process,
      allowing nodes to successfully commission, when tag's with no
      definition have been created. This issue will only appear when these
      special tags are created. (LP: #1131418)
    - 99-import-raring-images-lp1182642.patch: Enables the import of raring
      images by default (LP: #1182642)
    - 99-fix-new-image-install-lp1182646.patch: Fixes the installation of
      new ephemeral images, that fail due to not being able to overwrite
      a symlink. (LP: #1182646)
 -- Andres Rodriguez <email address hidden> Tue, 23 Apr 2013 14:02:33 -0400

Changed in maas (Ubuntu Raring):
status: Fix Committed → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in maas (Ubuntu Precise):
importance: Undecided → High
Changed in maas (Ubuntu Quantal):
importance: Undecided → High
Changed in maas (Ubuntu Raring):
importance: Undecided → High

Hello Raphaël, or anyone else affected,

Accepted maas into precise-proposed. The package will build now and be available at http://launchpad.net/ubuntu/+source/maas/1.2+bzr1373+dfsg-0ubuntu1~12.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in maas (Ubuntu Precise):
status: New → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
Chris Long (clong-i) wrote :

Hi All.
I'm completely new to posting here so please forgive my inexperience.
Here is my problem and a fix that works on the command line. What I would like to do is add this fix to the ipmi.template so that MAAS works correctly.
This happens when running MAAS on an HP Blade server C7000 series with ProLiant BL2x220c G5 server blades (iLo 2.0).

non working example:
ipmipower -h 10.200.0.36 -u maas -p xxxxx --stat
10.200.0.36: authentication type unavailable for attempted privilege level

working example:
ipmipower -D LAN_2_0 -h 10.200.0.36 -u maas -p xxxxx --stat
10.200.0.36: off

Another non-working example, this time for the commands that control power on and off:
ipmipower -h 10.200.0.36 -u maas -p xxxxx --on
10.200.0.36: authentication type unavailable for attempted privilege level

ipmipower -h 10.200.0.36 -u maas -p xxxxx --off
10.200.0.36: authentication type unavailable for attempted privilege level

working example:
ipmipower -D LAN_2_0 -h 10.200.0.36 -u maas -p xxxxx --on
10.200.0.36: ok

ipmipower -D LAN_2_0 -h 10.200.0.36 -u maas -p xxxxx --off
10.200.0.36: ok

So where and how do I add "-D LAN_2_0" to enable the IPMI commands to work within MAAS? I'm sure other people will be experiencing this particular issue.

Do I add anything to either of these files?
/usr/share/pyshared/provisioningserver/power/templates/ipmi.template
/usr/lib/python2.7/dist-packages/provisioningserver/power/templates/ipmi.template

Thanks in advance.
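
For illustration, the working and non-working invocations above differ only in the optional "-D LAN_2_0" argument. The Python sketch below builds the ipmipower argument list with an optional driver type. It is not the contents of ipmi.template, whose structure is not shown in this report; ipmipower_command and its parameters are hypothetical names, and only the flags already used in this comment appear in it.

```python
def ipmipower_command(host, user, password, action, driver_type=None):
    """Return the ipmipower argument list for one power action.

    Hypothetical helper for illustration; not part of MAAS or FreeIPMI.
    `driver_type` is passed through "-D" only when it is given.
    """
    if action not in ("on", "off", "stat"):
        raise ValueError("unsupported action: %r" % (action,))
    cmd = ["ipmipower"]
    if driver_type is not None:
        # e.g. "LAN_2_0", as in the working examples above
        cmd += ["-D", driver_type]
    cmd += ["-h", host, "-u", user, "-p", password, "--" + action]
    return cmd


if __name__ == "__main__":
    # Prints the command line without executing it.
    print(" ".join(ipmipower_command(
        "10.200.0.36", "maas", "xxxxx", "stat", driver_type="LAN_2_0")))
```

Keeping the driver type optional means that nodes which already work with the default driver are not affected.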

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package maas - 1.2+bzr1373+dfsg-0ubuntu1~12.04.2

---------------
maas (1.2+bzr1373+dfsg-0ubuntu1~12.04.2) precise-proposed; urgency=low

  * MAAS Stable Release Update, debian/patches:
    - 99_filestorage_empty_files_lp1204507.patch: Fix to allow the storage
      of empty files when using Juju Go, otherwise machines will fail to
      bootstrap. (LP: #1204507)
    - 99_fix_highbank_localboot_lp1172966.patch: Fix to PXE LOCALBOOT on
      highbank servers by removing a PXE message. Otherwise highbank will
      fail to pxe boot. (LP: #1172966)
    - 99_no_ipmi_detection_kvm_lp1064527.patch: Fix to ensure that IPMI
      detection does not happen on KVM VM's, otherwise enlistment and
      commissioning process will take too long. (LP: #1064527)
    - 99_update_cluster_info_cli_lp1172193.patch: Fix to allow admins to
      update cluster information from the API/CLI and not only restrict it
      to the WebUI. (LP: #1172193)
    - 99_fix_ipmi_power_command_lp1171418: Fix to ensure that ipmi commands
      are always executed regardless of the state of the machine.
      (LP: #1171418)
    - 99_default_timezone_utc_lp1211447.patch: Default to UTC for the
      deployed nodes. (LP: #1211447)
 -- Andres Rodriguez <email address hidden> Mon, 12 Aug 2013 12:18:34 -0400

Changed in maas (Ubuntu Precise):
status: Fix Committed → Fix Released
Rolf Leggewie (r0lf) wrote :

quantal has seen the end of its life and is no longer receiving any updates. Marking the quantal task for this ticket as "Won't Fix".

Changed in maas (Ubuntu Quantal):
status: New → Won't Fix