Allow tuning of IPMI wait_time on power change
Bug #1921616 reported by Victor Tapia
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MAAS | Status tracked in 3.6 | | |
3.4 | Won't Fix | Medium | Unassigned |
3.5 | Won't Fix | Medium | Unassigned |
3.6 | Triaged | Medium | Unassigned |
Bug Description
Some IPMI implementations on machines with many devices to enumerate at boot, such as Dell R6525 systems with iDRAC firmware 4.32.xx and later, can take a long time to update the power status after a power change (12+ seconds in some tests, and possibly longer). It would be great to have a way to customize the wait time per machine instead of relying on the fixed iteration of 4, 8, 16 and 32 second waits. That iteration can force an "--on-if-off --cycle" on a machine that is already booting, which, depending on the firmware, can leave it powered off and make the deployment fail. This feature would help deploy machines with such irregular IPMI behavior.
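To make the request concrete, here is a minimal sketch of a retry loop whose wait times come from a per-machine setting rather than a fixed tuple. It is not the MAAS driver code: `parse_wait_time`, `change_power`, the comma-separated setting format, and the callback parameters are all hypothetical names chosen for illustration. Only the 4, 8, 16 and 32 second defaults come from the description above.

```python
import time
from typing import Callable, Optional, Sequence, Tuple

# Default iteration quoted in the bug description.
DEFAULT_WAIT_TIME: Tuple[int, ...] = (4, 8, 16, 32)


def parse_wait_time(
    raw: Optional[str], default: Sequence[int] = DEFAULT_WAIT_TIME
) -> Tuple[int, ...]:
    """Parse a hypothetical per-machine setting such as "20,30,60".

    An empty or missing value falls back to the default iteration.
    """
    if raw is None or not raw.strip():
        return tuple(default)
    values = tuple(int(part) for part in raw.split(","))
    if any(value <= 0 for value in values):
        raise ValueError("wait times must be positive integers")
    return values


def change_power(
    issue_command: Callable[[], None],
    query_state: Callable[[], str],
    desired: str,
    wait_time: Sequence[int],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send a power command, wait, then query; retry with the next wait.

    Returns the number of times the command had to be sent.
    """
    for attempts, delay in enumerate(wait_time, start=1):
        issue_command()  # hypothetical hook, e.g. the IPMI power-on call
        sleep(delay)     # give the BMC time to update its power status
        if query_state() == desired:
            return attempts
    raise TimeoutError(f"machine did not reach state {desired!r}")
```

With a BMC that needs about 12 seconds to report the new state, the default iteration sends the command a second time, while a single longer wait (for example `parse_wait_time("20")`) sends it only once, which is the behavior this request is after.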
summary: |
- [Feature Request] Tunable IPMI wait_time on power change
+ Allow tuning of IPMI wait_time on power change
Changed in maas: | |
status: | Invalid → Triaged |
importance: | Undecided → Medium |
Changed in maas: | |
status: | Triaged → Confirmed |
Changed in maas: | |
milestone: | none → 3.4.0 |
status: | Confirmed → Triaged |
Changed in maas: | |
milestone: | 3.4.0 → 3.4.x |
Changed in maas: | |
milestone: | 3.4.x → 3.5.x |
I'm presently using a workaround that modifies the hard-coded wait_time values in the IPMI power driver; however, this isn't desirable for all the obvious reasons.
The vendor has said there's no way they can guarantee how long it takes for the power-on status to be reported, and that it may vary from chassis to chassis. As systems (particularly AMD) get more densely packed, it'll take longer. While the IPMI spec itself suggests it's a power-is-applied style status, Dell have interpreted it differently.
It only became an issue with later iDRAC firmware, as there appears to have been a change in the time-to-respond behaviour between 4.30 and 4.32. 4.30 used to take 10 to 15 seconds to respond to the first status query, whereas 4.32 responds much faster, which triggers the retry loop. Why the two subsequent "--on-if-off --cycle" commands cause the system itself to power off is a mystery; we can only assume it is some kind of protection feature against flapping.
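The timing interaction described above can be illustrated with a toy model. This is a sketch under stated assumptions, not measured behavior: it assumes the BMC starts reporting "on" a fixed number of seconds after the first power command, that repeated commands do not reset that clock, and that a slow status query answers with the state at the moment it returns. The function name and parameters are hypothetical.

```python
from typing import Optional, Sequence


def commands_sent(
    status_ready_after: float,
    query_latency: float,
    wait_time: Sequence[float] = (4, 8, 16, 32),
) -> Optional[int]:
    """Count power commands sent before the BMC reports 'on'.

    status_ready_after: seconds until the BMC reports 'on' (assumed fixed).
    query_latency: seconds a status query takes to return (assumed fixed).
    Returns None if every wait in the iteration is used up first.
    """
    clock = 0.0
    for sent, delay in enumerate(wait_time, start=1):
        clock += delay          # driver waits after sending the command
        clock += query_latency  # status query blocks for this long
        if clock >= status_ready_after:
            return sent         # status now reads 'on', loop stops
    return None


# Slow query (4.30-like, about 10 s): the first check already sees 'on'.
print(commands_sent(status_ready_after=12, query_latency=10))
# Fast query (4.32-like, assumed 0.5 s): the first check sees 'off',
# so the command is sent again.
print(commands_sent(status_ready_after=12, query_latency=0.5))
```

Under these assumptions, the slow status query on 4.30 effectively hid the delay, while the fast response on 4.32 exposes it and causes the power command to be repeated.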