i3.metal flavour type fails to respond after a reboot

Bug #1822175 reported by Sean Feole on 2019-03-28
This bug affects 1 person
Affects: linux-aws (Ubuntu)
Importance: High
Assigned to: Colin Ian King

Bug Description

Series: Cosmic
Instance Size: I3.Metal
Region: (Default) US-WEST-2
Kernel: linux-aws

During SRU testing, the i3.metal instance type sometimes fails to respond after the instance is rebooted. This is usually seen at least 2 or 3 times during a test cycle.

While rebooting an i3.metal instance on the AWS cloud, I observed the following crash, which resulted in tearing down the instance and starting over. The instance had only been restarted ~4 times at the time of this failure.

[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
         Starting Reboot...
         Stopping LVM2 metadata daemon...
[  OK  ] Stopped LVM2 metadata daemon.
[ 447.340575] INFO: rcu_sched self-detected stall on CPU
[ 447.340577] INFO: rcu_sched self-detected stall on CPU
[ 447.340580] INFO: rcu_sched self-detected stall on CPU
[ 447.340587] INFO: rcu_sched self-detected stall on CPU
[ 447.340590] INFO: rcu_sched self-detected stall on CPU
[ 447.340592] INFO: rcu_sched self-detected stall on CPU
[ 447.340595] Uhhuh. NMI received for unknown reason 21 on CPU 0.
[ 447.340599] INFO: rcu_sched self-detected stall on CPU
[ 447.340602] INFO: rcu_sched self-detected stall on CPU
[ 447.340606] INFO: rcu_sched self-detected stall on CPU
[ 447.340614] 53-...!: (43 GPs behind) idle=7ce/1/0 softirq=392/392 fqs=0
[ 447.340617] INFO: rcu_sched self-detected stall on CPU
[ 447.340621] Do you have a strange power saving mode enabled?
[ 447.340628] 1-...!: (1 ticks this GP) idle=79e/1/0 softirq=881/881 fqs=0
[ 447.340632] INFO: rcu_sched self-detected stall on CPU
[ 447.340634] INFO: rcu_sched self-detected stall on CPU
[ 447.340636] INFO: rcu_sched self-detected stall on CPU
[ 447.340639] INFO: rcu_sched self-detected stall on CPU
[ 447.340641] INFO: rcu_sched self-detected stall on CPU
[ 447.340644] INFO: rcu_sched self-detected stall on CPU
[ 447.340647] INFO: rcu_sched self-detected stall on CPU

The full log can be seen in the attached file.
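
To compare console logs across test runs, the stall and NMI reports can be counted with a small helper. This is only a sketch: summarize_console_log is a hypothetical function name, and the log path passed to it is whatever file the console output was saved to.

```shell
#!/bin/sh
# summarize_console_log is a hypothetical helper (not part of any package).
# Pass it the path of a saved console log.
summarize_console_log() {
    log=$1
    # grep -c prints the number of matching lines (0 if there are none).
    stalls=$(grep -c 'rcu_sched self-detected stall on CPU' "$log")
    nmis=$(grep -c 'NMI received for unknown reason' "$log")
    echo "rcu_sched stalls: $stalls"
    echo "unknown NMIs: $nmis"
}
```

A non-zero NMI count alongside the stalls would match the pattern seen in the excerpt above.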

Sean Feole (sfeole) wrote :
Changed in linux-aws (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Changed in linux-aws (Ubuntu):
importance: Undecided → High
status: New → In Progress
Colin Ian King (colin-king) wrote :

I've been running 4 i3 instances (bionic, cosmic and xenial) without any boot hangs, using the following kernel boot parameters: nmi_watchdog=0 pcie_aspm=off nohpet

These have been running now for 8+ hours, with easily over 70 reboots per instance w/o any issues. I'll try to work out which of these parameters are actually required.
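
For anyone wanting to try the same parameters, here is a minimal sketch of appending them to the kernel command line. It assumes the stock Ubuntu GRUB layout, where GRUB_CMDLINE_LINUX_DEFAULT is set in /etc/default/grub; cloud images may instead set it in a drop-in under /etc/default/grub.d/, so check which file applies on the instance. The helper name add_kernel_params is illustrative only, and the sed -i form assumes GNU sed.

```shell
#!/bin/sh
# add_kernel_params is a hypothetical helper, not part of any package.
# Usage: add_kernel_params FILE PARAM...
add_kernel_params() {
    file=$1
    shift
    extra=$*
    # Insert the extra parameters just before the closing quote of the
    # existing GRUB_CMDLINE_LINUX_DEFAULT value (GNU sed, edits in place).
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 ${extra}\"|" "$file"
}

# On a real system (assumption: stock /etc/default/grub), run as root:
#   add_kernel_params /etc/default/grub nmi_watchdog=0 pcie_aspm=off nohpet
#   update-grub
#   reboot
```

After the reboot, /proc/cmdline should show the added parameters if the right file was edited.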

Colin Ian King (colin-king) wrote :

I've been running this for 3+ days now and cannot reproduce this specific issue. From the look of the error it appears to be a hardware-related NMI issue, so perhaps we have some faulty H/W in this specific case.

Running these tests for several days, with and without the kernel parameters, I have observed the following:

1. A reboot can take more than 10-15 minutes.
2. Our instances were being accidentally deleted by a Jenkins job, which could be a reason why some of our original assumptions that the VM had died on reboot were incorrect.
3. When rebooting almost immediately after ssh access becomes available, the reboot gets stuck with systemd issues:

sudo reboot
systemctl status reboot.target
Failed to get properties: Connection timed out

and the only way to reboot is using the following:
sudo systemctl --force reboot

This could also be a reason why the automated reboot testing got locked up and we mistakenly believed that reboots were failing due to H/W issues.
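
One way an automated reboot test might avoid this is to wait until systemd reports that boot has finished before issuing the reboot, and fall back to the forced reboot otherwise. The sketch below is untested against these instances; wait_for_boot, safe_reboot, the 300 second default and the 5 second poll interval are all hypothetical choices. It polls `systemctl is-system-running` rather than relying on a blocking wait option, since older systemd releases may not provide one.

```shell
#!/bin/sh
# wait_for_boot and safe_reboot are hypothetical helpers for illustration.

# Poll systemd until it reports the system as "running" or "degraded".
# Returns 0 once boot has settled, 1 if the timeout (seconds) expires.
wait_for_boot() {
    timeout=${1:-300}
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        state=$(systemctl is-system-running 2>/dev/null)
        case "$state" in
            running|degraded) return 0 ;;
        esac
        sleep 5
        elapsed=$((elapsed + 5))
    done
    return 1
}

# Reboot normally once boot has settled; otherwise use the forced reboot.
safe_reboot() {
    if wait_for_boot "${1:-300}"; then
        sudo reboot
    else
        sudo systemctl --force reboot
    fi
}
```

A test harness would call safe_reboot over ssh in place of a bare `sudo reboot`.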

Colin Ian King (colin-king) wrote :

I can't repro this specific issue; I believe it may be a H/W issue as reported in this bug. Other reasons for the reboot tests failing are noted in comment #3. Marking this as Won't Fix.

Changed in linux-aws (Ubuntu):
status: In Progress → Won't Fix
Changed in linux-aws (Ubuntu):
status: Won't Fix → In Progress
Changed in linux-aws (Ubuntu):
status: In Progress → Won't Fix
status: Won't Fix → Invalid