Error locating any hugetable limits (cgroup/path issue) resulting in empty hypervisor list

Bug #1824580 reported by Wendy Mitchell
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Critical    Jim Gauld

Bug Description

Brief Description
-----------------
Failure locating any hugetable limits - cgroup/path issue. Results in empty openstack hypervisor list.

Severity
--------
major

Steps to Reproduce
------------------
1. Install and configure lab (running a reapply)
2. Check that the openstack-compute-node label key has value enabled
3. Confirm that worker nodes are unlocked, enabled and available and hypervisors are running on each worker node.
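
Step 2 can be checked from the command line. The sketch below is illustrative only: `nodes_with_label` is a hypothetical helper (not part of StarlingX), and it assumes the input is the output of `kubectl get nodes --show-labels --no-headers`, where the last column holds a comma-separated label list.

```shell
# Hypothetical helper: read `kubectl get nodes --show-labels --no-headers`
# output on stdin and print the names of nodes that carry the given label.
# Argument: $1 = label in key=value form, e.g. openstack-compute-node=enabled
nodes_with_label() {
    local label="$1"
    awk -v l="$label" '{
        # The last field is the comma-separated label list.
        n = split($NF, labels, ",")
        for (i = 1; i <= n; i++) {
            if (labels[i] == l) { print $1; break }
        }
    }'
}
```

Possible usage: `kubectl get nodes --show-labels --no-headers | nodes_with_label openstack-compute-node=enabled`. Every worker node expected to run a hypervisor should appear in the output.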

Expected Behavior
------------------
Expected hypervisors to exist on the worker node.

Actual Behavior
----------------

No hypervisors are listed. The error appears to be cgroup and hugepage related (the kubepods path does not exist):
cannot access /sys/fs/cgroup/hugetlb/kubepods/hugetlb

$ nova hypervisor-list
+----+---------------------+-------+--------+
| ID | Hypervisor hostname | State | Status |
+----+---------------------+-------+--------+
+----+---------------------+-------+--------+
[wrsroot@controller-0 ~(keystone_admin)]$ openstack hypervisor list

controller-0:~$ kubectl get pods -n openstack -o wide | grep libvirt
libvirt-libvirt-default-sgcns 0/1 CrashLoopBackOff 27 116m 192.168.204.96 compute-1 <none> <none>
libvirt-libvirt-default-vkkgw 0/1 CrashLoopBackOff 49 3h50m 192.168.204.79 compute-0 <none> <none>
libvirt-libvirt-default-wjhsv 0/1 CrashLoopBackOff 49 3h50m 192.168.204.247 compute-2 <none> <none>
controller-0:~$ kubectl logs -n openstack libvirt-libvirt-default-sgcns
++ grep libvirtd

....
+ '[' -n '' ']'
+ rm -f /var/run/libvirtd.pid
+ [[ -c /dev/kvm ]]
+ chmod 660 /dev/kvm
+ chown root:kvm /dev/kvm
+ CGROUPS=
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/cpu ']'
+ CGROUPS+=cpu,
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/rdma ']'
+ for CGROUP in cpu rdma hugetlb
+ '[' -d /sys/fs/cgroup/hugetlb ']'
+ CGROUPS+=hugetlb,
+ cgcreate -g cpu,hugetlb:/osh-libvirt
++ cat /proc/meminfo
++ grep HugePages_Total
++ tr -cd '[:digit:]'
INFO: Detected hugepage count of '108145'. Enabling hugepage settings for libvirt/qemu.
+ hp_count=108145
+ '[' 0108145 -gt 0 ']'
+ echo 'INFO: Detected hugepage count of '\''108145'\''. Enabling hugepage settings for libvirt/qemu.'
++ grep KVM_HUGEPAGES=0 /etc/default/qemu-kvm
grep: /etc/default/qemu-kvm: No such file or directory
+ '[' -n '' ']'
+ echo KVM_HUGEPAGES=1
+ '[' '!' -d /dev/hugepages ']'
+ '[' -d /sys/fs/cgroup/hugetlb ']'
++ ls '/sys/fs/cgroup/hugetlb/kubepods/hugetlb.*.limit_in_bytes'
ls: cannot access /sys/fs/cgroup/hugetlb/kubepods/hugetlb.*.limit_in_bytes: No such file or directory
+ limits=
+ echo 'ERROR: Failed to locate any hugetable limits. Did you set the correct cgroup in your values used for this chart?'
ERROR: Failed to locate any hugetable limits. Did you set the correct cgroup in your values used for this chart?
+ exit 1
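
The failing step in the trace above is the lookup of `hugetlb.*.limit_in_bytes` under the `kubepods` cgroup. To see on a worker node which cgroups actually exist under the hugetlb mount, something like the following sketch may help. Both function names are hypothetical and are not taken from the libvirt chart; the sketch only mirrors the check visible in the log.

```shell
# Hypothetical helper: print the hugetlb limit files found under one cgroup.
# Arguments: $1 = hugetlb cgroup mount (e.g. /sys/fs/cgroup/hugetlb)
#            $2 = cgroup name (e.g. kubepods)
# Returns 0 if at least one limit file exists, 1 otherwise.
find_hugetlb_limits() {
    local root="$1" name="$2" f found=1
    for f in "$root/$name"/hugetlb.*.limit_in_bytes; do
        # If the glob matches nothing, the shell leaves the pattern as-is,
        # so confirm the entry is a real file before reporting it.
        if [ -f "$f" ]; then
            echo "$f"
            found=0
        fi
    done
    return "$found"
}

# Hypothetical helper: print the child cgroups present under the mount,
# to show what exists in place of the expected cgroup name.
list_hugetlb_cgroups() {
    local root="$1" d
    for d in "$root"/*/; do
        if [ -d "$d" ]; then
            basename "$d"
        fi
    done
    return 0
}
```

Possible usage: `find_hugetlb_limits /sys/fs/cgroup/hugetlb kubepods || list_hugetlb_cgroups /sys/fs/cgroup/hugetlb`. If the first call fails, the second shows which cgroup directories are present instead.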

Reproducibility
---------------
yes

System Configuration
--------------------
2+3
(Optional Hyperthreaded, low-latency lab yow-cgcs-wildcat-92-98 )

Branch/Pull Time/Commit
-----------------------
BUILD_ID="20190412T013000Z"

Timestamp/Logs
--------------

summary: Error locating any hugetable limits (cgroup/path issue) resulting in
- empty hypevisor list
+ empty hypervisor list
Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Jim Gauld (jgauld) wrote :

Note that Austin first reported this issue when integrating with my feature: https://review.openstack.org/#/c/648511/

This issue did not show up on my system, but it did show up on hardware.
I think we can likely work around this issue without modifying the libvirt chart by disabling the cgroup controllers we don't need, since systemd enables them all by default.

E.g.,
mount|grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_prio,net_cls)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpuacct,cpu)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)

I am in the process of testing whether I can simply append the cgroup_disable=hugetlb GRUB argument.
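
For anyone trying the same workaround, here is a rough sketch of how to verify that the controller is really off after a reboot. The helper names are hypothetical. It assumes the usual `/proc/cgroups` layout (columns: subsys_name, hierarchy, num_cgroups, enabled) and that `cgroup_disable=` takes a comma-separated list of controllers.

```shell
# Hypothetical helper: succeed if a cgroup v1 controller is listed as enabled.
# Arguments: $1 = controller name, $2 = file in /proc/cgroups format (optional)
controller_enabled() {
    local name="$1" file="${2:-/proc/cgroups}"
    awk -v n="$name" '
        $1 == n { found = 1; ok = ($4 == 1) }
        END     { exit !(found && ok) }
    ' "$file"
}

# Hypothetical helper: succeed if cgroup_disable=<name> is on the kernel
# command line. Arguments: $1 = controller name, $2 = cmdline file (optional)
cmdline_disables() {
    local name="$1" file="${2:-/proc/cmdline}" arg
    for arg in $(cat "$file"); do
        case "$arg" in
            cgroup_disable=*)
                # Wrap the value in commas so each list entry matches exactly.
                case ",${arg#cgroup_disable=}," in
                    *",$name,"*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

Possible usage: `cmdline_disables hugetlb && ! controller_enabled hugetlb && echo "hugetlb controller is disabled"`.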

Al Bailey (albailey1974) wrote :
Ghada Khalil (gkhalil) wrote :
tags: added: stx.2.0 stx.containers
tags: added: stx.retestneeded
Changed in starlingx:
importance: Undecided → Critical
status: New → Fix Released
Wendy Mitchell (wmitchellwr) wrote :

closing as duplicate

tags: removed: stx.retestneeded