P8 node "modoc" is not very stable for kernel SRU testing

Bug #1883245 reported by Po-Hsu Lin
This bug affects 1 person
Affects: The Ubuntu-power-systems project
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

This Power8 node "modoc" is not very stable for kernel SRU testing.

MAAS deployment has not been working reliably recently.

The failure rate is 45 out of 63 attempts (about 71%).

Also, when you try to release it with disk erasing, it sometimes ends up in "Failed disk erasing".

I have tried to revive it in the following ways:
1. Using the ipmi chassis power cycle command to power cycle it
2. Manually turning the PDU off / on from bos01-b-07-fa1-12 and bos01-b-07-fb1-12

Yet it is still very unstable.
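
For reference, the IPMI power cycle mentioned in the first attempt can be sketched as below. This is only a rough illustration, not the exact command used on modoc: it assumes the BMC is reachable over the network with the `lanplus` interface of `ipmitool`, and the host name, user and password in the usage note are hypothetical placeholders.

```python
import subprocess
from typing import List

# Power actions accepted by "ipmitool chassis power" that this sketch allows.
ALLOWED_ACTIONS = {"status", "on", "off", "cycle", "reset", "soft"}


def ipmi_power_command(host: str, user: str, password: str,
                       action: str = "cycle") -> List[str]:
    """Build the ipmitool argument list for a chassis power action.

    Assumption: the BMC speaks IPMI over LAN (lanplus); adjust -I if not.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported power action: {action}")
    return ["ipmitool", "-I", "lanplus", "-H", host,
            "-U", user, "-P", password,
            "chassis", "power", action]


def run_power_command(cmd: List[str], timeout_s: float = 30.0) -> str:
    """Run the command and return its stdout; raises if ipmitool fails."""
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_s, check=True)
    return result.stdout.strip()
```

A call such as `run_power_command(ipmi_power_command("bmc.example", "admin", "secret"))` would then issue the power cycle; running it with `action="status"` first is a cheap way to check whether the BMC responds at all.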

Tags: ppc64el
Frank Heimes (fheimes)
tags: added: ppc64el
Frank Heimes (fheimes) wrote :

If modoc failed to deploy with MAAS in the past, did you notice any error messages in MAAS?
Especially in the 'Tests', 'Logs' or 'Events' section of the Machine details in MAAS, or did it just hang?

Po-Hsu Lin (cypressyew) wrote :

IIRC there should be a web console for the BMC, but when I tried to access it, it was not accessible (refused to connect).

Andrew Cloke (andrew-cloke) wrote :

If modoc is giving inconsistent results, do you want to switch kernel testing to witchita while we investigate?

Changed in ubuntu-power-systems:
status: New → Triaged
Po-Hsu Lin (cypressyew) wrote :

I have watched one deployment failure event: the deployment fails in ~1 minute with "Installation has failed and no output was given."

The event suggests that the BMC is not working:
    Marking node failed - Power on for the node failed: Failed talking to node's BMC: Failed to power c3mfb7. BMC never transitioned from off to on.

I need to power modoc_psu1 and modoc_psu2 off / on to get the power supply working again.

Frank Heimes (fheimes) wrote :

Depending on the Power system, the boot-up process (especially a cold start, but also a restart/recycle) may take quite some time (several minutes).
I'm wondering if MAAS sometimes just times out, i.e. does not wait long enough (and if such a timeout is configurable ...)
It looks like there is no official way in the MAAS UI to tweak the ipmi timeout, but I found this:
https://bugs.launchpad.net/maas/+bug/1521290/comments/8
(Anyway, I will not just change any config w/o getting a +1 from an expert ...)
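
To illustrate the timeout concern, here is a minimal sketch of a power-on wait loop with a configurable timeout. It is not how MAAS implements its power driver; the function name, the default timeout values and the `get_state` callback (which would wrap something like an IPMI power status query) are all assumptions made for the example.

```python
import time
from typing import Callable


def wait_for_power_state(get_state: Callable[[], str],
                         wanted: str = "on",
                         timeout_s: float = 600.0,
                         poll_interval_s: float = 5.0,
                         sleep: Callable[[float], None] = time.sleep,
                         clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll get_state() until it returns `wanted` or the timeout expires.

    Returns True if the state was reached, False on timeout.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if get_state() == wanted:
            return True
        if clock() >= deadline:
            # Gave up: the node never reported the wanted state in time.
            return False
        sleep(poll_interval_s)
```

With a short `timeout_s`, a slow cold start on a Power system would be reported as "never transitioned from off to on" even though the node might come up a few minutes later; a longer timeout trades slower failure detection for fewer false failures.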

Andrew Cloke (andrew-cloke) wrote :

Following a firmware update, modoc appears to be functioning better now.
Marking as "Fix Released" for now, but if it starts to misbehave again, please feel free to reopen.

Changed in ubuntu-power-systems:
status: Triaged → Fix Released
Po-Hsu Lin (cypressyew) wrote :

Yes, I have stressed modoc with 4 consecutive Jenkins testing jobs, with different series deployed. The results are positive: I didn't see the "Failed talking to node's BMC" error in the events again.

Thanks for your help.
