Nodes stuck in 'Graceful Recovery Wait' status
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Eric MacDonald | |
Bug Description
The host-level task/status sometimes remains stuck at 'Graceful Recovery Wait' after the nodes have already recovered.
Severity: Low
Steps to Reproduce:
On a VirtualBox 2+2 TIC system.
1. Modify the service parameter 'heartbeat_period' to 1000 so that heartbeat loss is easier to control.
2. Run the commands below on the VirtualBox host to take the VMs' mgmt link down and back up (NIC 2 of each VM is the mgmt interface here):
VBoxManage controlvm compute-0 setlinkstate2 off
VBoxManage controlvm compute-1 setlinkstate2 off
sleep 4
VBoxManage controlvm compute-0 setlinkstate2 on
VBoxManage controlvm compute-1 setlinkstate2 on
sleep 15
VBoxManage controlvm compute-0 setlinkstate2 off
VBoxManage controlvm compute-1 setlinkstate2 off
sleep 4
VBoxManage controlvm compute-0 setlinkstate2 on
VBoxManage controlvm compute-1 setlinkstate2 on
3. After a while, some nodes are seen to get stuck in 'Graceful Recovery Wait' status
(run step 2 again if the issue doesn't occur).
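Since the issue is intermittent and step 2 may need repeating, the link-flap sequence can be wrapped in a small script. The following is a minimal sketch, not part of the original report: the function names and the VBOX_CMD, DOWN_SECS and GAP_SECS variables are hypothetical helpers introduced here, and it assumes NIC 2 is the mgmt interface as in step 2.

```shell
#!/bin/sh
# Sketch of the step 2 link-flap sequence for VirtualBox VMs.
# VBOX_CMD is a hypothetical override (e.g. "echo" for a dry run);
# it defaults to the real VBoxManage binary.
VBOX_CMD="${VBOX_CMD:-VBoxManage}"

# set_links <on|off> <vm>...
# Sets the link state of NIC 2 (mgmt in this setup) on every VM given.
set_links() {
    state="$1"; shift
    for vm in "$@"; do
        $VBOX_CMD controlvm "$vm" setlinkstate2 "$state"
    done
}

# flap_once <down_seconds> <vm>...
# Takes the mgmt link down, waits, then brings it back up.
flap_once() {
    down="$1"; shift
    set_links off "$@"
    sleep "$down"
    set_links on "$@"
}

# reproduce <vm>...
# Two flaps, 4 s down each and 15 s apart, matching step 2.
# DOWN_SECS and GAP_SECS are hypothetical knobs for experimenting.
reproduce() {
    flap_once "${DOWN_SECS:-4}" "$@"
    sleep "${GAP_SECS:-15}"
    flap_once "${DOWN_SECS:-4}" "$@"
}
```

After sourcing the file, `reproduce compute-0 compute-1` runs the sequence once; setting `VBOX_CMD=echo` beforehand prints the commands instead of executing them.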
Expected Behavior: No stuck status
Actual Behavior: Stuck status sometimes on some hosts
Reproducibility: Intermittent
System Configuration: Multi-node system
Branch/Pull Time/Commit: 18.03 patch current (13) and likely in current Aug 25, 2020 content
Last Pass: Never seen. Test escape.
Timestamp/Logs: task: Graceful Recovery Wait
Test Activity: Other
Workaround: Swact to other controller
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
The following merged update fixes this issue
update: Fix Graceful Recovery handling while in Graceful Recovery handling
review: https://review.opendev.org/c/starlingx/metal/+/780976
commit: https://opendev.org/starlingx/metal/commit/5c83453fdf8775e5d776a02a2b5c06810d84cb55