compute node hangs on unlock due to ovs-vswitchd memory initialization error.
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | High | ChenjieXu | | |
Bug Description
Brief Description
-----------------
A worker node is failing to unlock because there is no memory allocated/reserved for vswitch use. This system has 4 worker nodes. All 4 are configured identically, but a single node is failing to unlock because ovs-vswitchd is failing to start. Over ~6 different installations this has happened twice on initial unlock.
Severity
--------
Major.
Steps to Reproduce
------------------
Install controller-0 and configure it with Ansible, then install, configure, and unlock the remaining nodes. Observe that some worker nodes may never reach unlocked/enabled and instead hang during their post-unlock initialization.
Expected Behavior
------------------
Nodes should unlock without issue.
Actual Behavior
----------------
A worker node is hung during initial post-unlock initialization.
Reproducibility
---------------
30-50%
System Configuration
-------
2+4
Branch/Pull Time/Commit
-------
Private load rebased on May 10th.
Last Pass
---------
Passes occasionally on this load.
Timestamp/Logs
--------------
Comparing the system memory configuration of a good node (compute-2) and a bad node (compute-3) shows a clear discrepancy in the total memory reported for each node.
[wrsroot@
+------
| processor | mem_total(MiB) | mem_platform(MiB) | mem_avail(MiB) | hugepages(hp)_configured | vs_hp_size(MiB) | vs_hp_total | vs_hp_avail | vs_hp_reqd | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
+------
| 0 | 1024 | 8000 | 1024 | True | 1024 | 0 | 0 | None | 0 | 0 | 0 | None | 1 | 1 | None | True |
| 1 | 4408 | 2000 | 4408 | True | 1024 | 0 | 0 | None | 866304 | 0 | 0 | None | 1 | 1 | None | True |
+------
[wrsroot@
+------
| processor | mem_total(MiB) | mem_platform(MiB) | mem_avail(MiB) | hugepages(hp)_configured | vs_hp_size(MiB) | vs_hp_total | vs_hp_avail | vs_hp_reqd | vm_total_4K | vm_hp_total_2M | vm_hp_avail_2M | vm_hp_pending_2M | vm_hp_total_1G | vm_hp_avail_1G | vm_hp_pending_1G | vm_hp_use_1G |
+------
| 0 | 58316 | 8000 | 57292 | True | 1024 | 1 | 0 | None | 0 | 28646 | 28646 | None | 0 | 0 | None | True |
| 1 | 62064 | 2000 | 61040 | True | 1024 | 1 | 0 | None | 865894 | 28829 | 28829 | None | 0 | 0 | None | True |
+------
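The size of the discrepancy is easier to see once the per-processor totals are summed. The following is a minimal Python sketch, with the mem_total(MiB) values copied from the two tables above. The names table_a and table_b are placeholders chosen for illustration, since the prompts that identify each host are truncated in the output.

```python
# mem_total(MiB) per processor, copied from the two tables above.
# table_a / table_b are placeholder names; the host prompts are truncated.
table_a = {0: 1024, 1: 4408}
table_b = {0: 58316, 1: 62064}


def total_mib(per_processor):
    """Sum the per-processor mem_total(MiB) values reported for one host."""
    return sum(per_processor.values())


if __name__ == "__main__":
    print("first table :", total_mib(table_a), "MiB")
    print("second table:", total_mib(table_b), "MiB")
```

The first table adds up to 5432 MiB and the second to 120380 MiB, roughly a 22x difference between nodes that are described as identically configured.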
The local information on the node does not seem to agree with the system inventory data:
compute-3:~$ free -g
total used free shared buff/cache available
Mem: 125 1 123 0 0 123
Swap: 0 0 0
compute-3:~$ sudo cat /proc/meminfo
Password:
MemTotal: 131810660 kB
MemFree: 129926964 kB
MemAvailable: 129766652 kB
Buffers: 36456 kB
Cached: 414384 kB
SwapCached: 0 kB
Active: 489440 kB
Inactive: 227012 kB
Active(anon): 271960 kB
Inactive(anon): 8348 kB
Active(file): 217480 kB
Inactive(file): 218664 kB
Unevictable: 5424 kB
Mlocked: 5424 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 148 kB
Writeback: 0 kB
AnonPages: 271020 kB
Mapped: 63744 kB
Shmem: 10692 kB
Slab: 149848 kB
SReclaimable: 59264 kB
SUnreclaim: 90584 kB
KernelStack: 10928 kB
PageTables: 8144 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 65905328 kB
Committed_AS: 740776 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 574016 kB
VmallocChunk: 34291828732 kB
HardwareCorrupted: 0 kB
CmaTotal: 16384 kB
CmaFree: 9216 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 17036 kB
DirectMap2M: 3028992 kB
DirectMap1G: 133169152 kB
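To compare the kernel's view with the inventory, MemTotal has to be converted from kB to MiB. The sketch below is one way to do that in Python; the helper names parse_meminfo and mem_total_mib are made up for this example and are not part of any StarlingX tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    Values with a "kB" unit are kept in kB; unitless fields such as
    HugePages_Total are plain counts. Lines without a numeric value
    are ignored.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def mem_total_mib(text):
    """Return MemTotal converted from kB to MiB, rounded down."""
    return parse_meminfo(text)["MemTotal"] // 1024


# Two lines copied from the compute-3 output above.
SAMPLE = "MemTotal: 131810660 kB\nHugePages_Total: 0\n"

if __name__ == "__main__":
    print(mem_total_mib(SAMPLE), "MiB")
```

For the compute-3 output above this gives 128721 MiB, far more than the 5432 MiB that the smaller of the two inventory tables adds up to.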
compute-3:~$ sudo find /sys -name "nr_huge*"
/sys/devices/
/sys/devices/
/sys/devices/
/sys/devices/
/sys/kernel/
/sys/kernel/
/sys/kernel/
/sys/kernel/
compute-3:~$ sudo find /sys -name "nr_huge*" | xargs -L1 grep -E "^"
0
0
0
0
0
0
0
0
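The same check can be scripted so that it reports which counters were read. This is a hedged Python sketch that mirrors the find and grep pipeline above by walking a directory tree and reading every file whose name starts with nr_huge. The function names are illustrative only, and the root directory is a parameter so that the truncated paths above do not need to be guessed.

```python
import os


def read_hugepage_counters(root="/sys"):
    """Return {path: int} for every file under root named nr_huge*.

    Mirrors `find /sys -name "nr_huge*"` followed by reading each file.
    Unreadable or non-numeric files are skipped.
    """
    counters = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.startswith("nr_huge"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    counters[path] = int(f.read().strip())
            except (OSError, ValueError):
                continue
    return counters


def no_hugepages_allocated(counters):
    """True when at least one counter was found and all of them are zero."""
    return bool(counters) and all(v == 0 for v in counters.values())


if __name__ == "__main__":
    found = read_hugepage_counters()
    for path, value in sorted(found.items()):
        print(path, value)
    print("no hugepages allocated:", no_hugepages_allocated(found))
```

On a node in the state shown above, every counter would read 0 and the final line would report True.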
compute-3:~$ cat /proc/cmdline
BOOT_IMAGE=
The end result is that the default hiera data for the node configures 0 memory for vswitch use (192.168.144.28 is compute-3):
[wrsroot@
Password:
/opt/platform/
/opt/platform/
/opt/platform/
/opt/platform/
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
2019-05-
Test Activity
-------------
Developer testing
Changed in starlingx: | |
assignee: | Forrest Zhao (forrest.zhao) → ChenjieXu (midone) |
Changed in starlingx: | |
status: | New → Incomplete |
The same issue is also reported in: https://bugs.launchpad.net/starlingx/+bug/1829403