Master controller reboots due to kernel panic after ansible play

Bug #1831485 reported by Cristopher Lemus
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Alexander Kozyrev

Bug Description

Brief Description
-----------------
After executing the ansible playbook on a fresh install, the system restarts automatically. When it comes back it is unstable, and we are not able to continue with setup.

Severity
--------
Critical: Setup cannot be done.

Steps to Reproduce
------------------
Follow the wiki procedure to do a fresh install of StarlingX.
After the system boots, transfer localhost.yml and execute the ansible playbook. Shortly afterwards, the system reboots automatically, without any other command being run.

Once the system is back online, we cannot continue with setup because the system complains that it is not the active controller, probably caused by the missing MGMT/OAM configuration (not done because of the reboot).
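
For reference, the reproduction boils down to the following commands (the playbook path follows the wiki procedure of the time and is shown as an illustration, not verified against this exact build):

```shell
# Copy the ansible overrides onto the freshly installed controller,
# then run the bootstrap playbook; the reboot happens during this step.
scp localhost.yml wrsroot@<controller-oam-ip>:~/
ansible-playbook /usr/share/ansible/stx-ansible/playbooks/bootstrap/bootstrap.yml
```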

Expected Behavior
------------------
The main controller should not need to reboot until the unlock is done (after some pre-configuration).

Actual Behavior
----------------
Main controller reboots with incomplete setup.

Reproducibility
---------------
100% on Simplex, Duplex, Standard, Standard External Storage

System Configuration
--------------------
Simplex, Duplex, Standard, Standard External Storage

Branch/Pull Time/Commit
-----------------------
20190602T233000Z

Last Pass
---------
20190530T152953Z - Sanity execution passed with this ISO. With the 05/31 and 06/01 images, the setup also failed.

Timestamp/Logs
--------------
Full collect of a Standard (2+2) configuration attached.

controller-0:~$ sudo grep Error /var/log/puppet/latest/puppet.log
sudo: ldap_sasl_bind_s(): Can't contact LDAP server
Password:
2019-06-03T08:40:29.300 Error: 2019-06-03 08:40:29 +0000 Execution of '/usr/sbin/vgcreate cgts-vg' returned 3: No command with matching syntax recognised. Run 'vgcreate --help' for more information.
2019-06-03T08:40:29.372 Error: 2019-06-03 08:40:29 +0000 /Stage[main]/Platform::Lvm::Vg::Cgts_vg/Volume_group[cgts-vg]/ensure: change from absent to present failed: Execution of '/usr/sbin/vgcreate cgts-vg' returned 3: No command with matching syntax recognised. Run 'vgcreate --help' for more information.
controller-0:~$

No relevant errors found on dmesg, kern.log.

fm alarm-list and ceph -s are unresponsive; kubectl also returns the following error:

controller-0:~$ kubectl get pods --all-namespaces
Unable to connect to the server: dial tcp 192.168.206.2:6443: connect: network is unreachable
controller-0:~$
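
The unreachable 192.168.206.2 is the management-plane API endpoint that the playbook should have configured. A quick way to check whether the address was ever brought up (generic iproute2 commands, shown as a sketch):

```shell
# Is any interface carrying the management subnet, and is there a route
# toward the apiserver address?
ip -br addr | grep 192.168.206 || echo "management address not configured"
ip route get 192.168.206.2 2>&1
```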

Test Activity
-------------
Sanity

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue is blocking sanity and appears to have been introduced in the 2019-05-31 load.

Changed in starlingx:
importance: Undecided → Critical
status: New → Triaged
assignee: nobody → Don Penney (dpenney)
tags: added: stx.2.0 stx.config
Revision history for this message
Don Penney (dpenney) wrote :

Reviewing the attached logs, I can see that the initial puppet manifest applications went fine. At around 2019-06-03T08:35:10 (shortly thereafter), the controller appears to go for a cold reboot. We can see a corruption in the daemon.log.1.gz file at this point, a series of null bytes, which generally points to a reboot before the disk is flushed.

We also see in kern.log a series of oom-killer events occurring in the 40 seconds prior to this cold reboot, 22 of them. Each of these is similar, with the reboot happening while writing the logs for the final one. Here's the second-to-final event:

2019-06-03T08:35:05.926 localhost kernel: warning [ 745.113838] SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
2019-06-03T08:35:05.926 localhost kernel: warning [ 745.113842] cache: mm_struct(365:b3fa6deff73f10850792215e6cff5ade14eec169921acf0124b372de8e6f0502), object size: 952, buffer size: 960, default order: 3, min order: 0
2019-06-03T08:35:05.926 localhost kernel: warning [ 745.113844] node 0: slabs: 0, objs: 0, free: 0
2019-06-03T08:35:05.926 localhost kernel: warning [ 745.113845] node 1: slabs: 21, objs: 294, free: 0
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160351] runc:[0:PARENT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
2019-06-03T08:35:05.983 localhost kernel: info [ 745.160355] runc:[0:PARENT] cpuset=b3fa6deff73f10850792215e6cff5ade14eec169921acf0124b372de8e6f0502 mems_allowed=0-1
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160358] CPU: 5 PID: 182776 Comm: runc:[0:PARENT] Kdump: loaded Tainted: G O ------------ T 3.10.0-957.1.3.el7.1.tis.x86_64 #1
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160360] Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0015.110720180833 11/07/2018
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160361] Call Trace:
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160369] [<ffffffff8ec07861>] dump_stack+0x19/0x1b
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160374] [<ffffffff8ec02c9e>] dump_header+0x8e/0x23f
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160379] [<ffffffff8e592f45>] ? put_page+0x25/0x40
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160383] [<ffffffff8e587fbb>] ? find_lock_task_mm+0x6b/0xd0
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160385] [<ffffffff8e58846e>] oom_kill_process+0x24e/0x3d0
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160389] [<ffffffff8e489d97>] ? has_capability_noaudit+0x17/0x20
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160394] [<ffffffff8e5f58a1>] mem_cgroup_oom_synchronize+0x551/0x580
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160396] [<ffffffff8e5f4ca0>] ? mem_cgroup_charge_common+0x90/0x90
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160398] [<ffffffff8e588d24>] pagefault_out_of_memory+0x14/0x90
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160401] [<ffffffff8ec015d0>] mm_fault_error+0x6a/0x157
2019-06-03T08:35:05.983 localhost kernel: warning [ 745.160405] [<ffffffff8e45fe68>] __do_page_fault+0x378/0x460
2019-06-03T08:35:05...


Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Marcela Rosales and Erich Cordoba helped me to identify a list of files that changed between builds:

http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/controllerconfig-1.0-151.tis.x86_64.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/playbookconfig-1.0-2.tis.x86_64.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/puppet-manifests-1.0.0-89.tis.noarch.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/sysinv-1.0-320.tis.x86_64.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/workerconfig-1.0-12.tis.x86_64.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/workerconfig-standalone-1.0-12.tis.x86_64.rpm
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190531T013000Z/outputs/RPMS/std/workerconfig-subfunction-1.0-12.tis.x86_64.rpm

We manually downloaded them onto the main controller and applied them on a fresh install (20190602T233000Z):

localhost:~$ sudo rpm -ivh *rpm --force --nodeps
Password:
Preparing... ################################# [100%]
Updating / installing...
   1:sysinv-1.0-320.tis ################################# [ 14%]
   2:controllerconfig-1.0-151.tis ################################# [ 29%]
   3:workerconfig-subfunction-1.0-12.t################################# [ 43%]
   4:workerconfig-standalone-1.0-12.ti################################# [ 57%]
   5:workerconfig-1.0-12.tis ################################# [ 71%]
   6:puppet-manifests-1.0.0-89.tis ################################# [ 86%]
   7:playbookconfig-1.0-2.tis ################################# [100%]

After this, we ran ansible and the system is still running and available after ~20mins (without these rpms, the system went down in ~5mins).

system commands are available:

[wrsroot@localhost ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | locked | disabled | online |
+----+--------------+-------------+----------------+-------------+--------------+
[wrsroot@localhost ~(keystone_admin)]$ system host-if-list controller-0
+--------------------------------------+------+----------+---------+------+-------+------+------+------------+---------------+
| uuid | name | class | type | vlan | ports | uses | used | attributes | data networks |
| | | | | id | | i/f | by | | |
| | | | | | | | i/f | | |
+--------------------------------------+------+----------+---------+------+-------+------+------+------...


Revision history for this message
Don Penney (dpenney) wrote :

WRT your manual experiment, the RPMs you've installed are from the May 31st build, which you indicated also failed for you. Can you clarify whether the 20190531T013000Z ISO works for you or not?

Revision history for this message
Don Penney (dpenney) wrote :

Please also note that the workerconfig packages should not be installed on a standard controller. If you want to use this method of replacing RPMs to experiment, make sure each RPM is already installed on the node first, and do not unintentionally install RPMs that do not belong there.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Sorry for the confusion. The latest ISO that we installed that passed this stage is 20190531T013000Z. It was generated early Friday morning and is the one we used for Friday's sanity test.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

We set the kernel debug level to 8, then executed the ansible playbook; the same behavior occurred. Attached is a screenshot (image.png) of some outputs:

- Upper right corner is the serial console that captured the first kill, some IPv6 errors and memory cgroup oom.

- Bottom left corner is the output of dmesg -w.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Attaching vmcore-dmesg.txt from /var/crash/ directory.

Revision history for this message
Erich Cordoba (ericho) wrote :

The experiment of reverting https://review.opendev.org/#/c/661849/ succeeded, so something in this configuration triggers this behavior and the kernel crash.

From the attached kernel output there are three interesting messages:

SLUB: Unable to allocate memory on node -1 (gfp=0xd0)
  cache: dentry(290:40ab901bdcaa918cd1ebec7030d62197844774bf988797e32b47aa64b8558789), object size: 192, buffer size: 192, default order: 1, min order: 0
  node 0: slabs: 1, objs: 42, free: 0
  node 1: slabs: 5, objs: 210, free: 0

Not sure what "-1" means here (in SLUB messages, node -1 generally indicates no specific NUMA node was requested). Then the oom-killer starts killing pods as the cgroup limit is hit; notice that this process is on CPU 50:

 runc:[2:INIT] invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=-998
 runc:[2:INIT] cpuset=b3a5154f8aa591eca04e8cee5b3ee634697ff38232b0e164b00109a28bcde9cf mems_allowed=0-1
 CPU: 50 PID: 161036 Comm: runc:[2:INIT] Kdump: loaded Tainted: G O ------------ T 3.10.0-957.1.3.el7.1.tis.x86_64 #1
 Hardware name: Intel Corporation S2600WFT/S2600WFT, BIOS SE5C620.86B.00.01.0013.030920180427 03/09/2018
 Call Trace:
  [<ffffffffae007861>] dump_stack+0x19/0x1b
  [<ffffffffae002c9e>] dump_header+0x8e/0x23f
  [<ffffffffad992f45>] ? put_page+0x25/0x40
  [<ffffffffad987fbb>] ? find_lock_task_mm+0x6b/0xd0
  [<ffffffffad98846e>] oom_kill_process+0x24e/0x3d0
  [<ffffffffad889d97>] ? has_capability_noaudit+0x17/0x20
  [<ffffffffad9f58a1>] mem_cgroup_oom_synchronize+0x551/0x580
  [<ffffffffad9f4ca0>] ? mem_cgroup_charge_common+0x90/0x90
  [<ffffffffad988d24>] pagefault_out_of_memory+0x14/0x90
  [<ffffffffae0015d0>] mm_fault_error+0x6a/0x157
  [<ffffffffad85fe68>] __do_page_fault+0x378/0x460
  [<ffffffffad85ffc5>] do_page_fault+0x35/0x90
  [<ffffffffad898c54>] ? sys_setsid+0xb4/0xc0
  [<ffffffffae016638>] page_fault+0x28/0x30
 Task in /k8s-infra/kubepods/burstable/pod2a3337f7-861b-11e9-8f3d-3cfdfebedf30/b3a5154f8aa591eca04e8cee5b3ee634697ff38232b0e164b00109a28bcde9cf killed as a result of limit of /k8s-infra/kubepods/burstable/pod2a3337f7-861b-11e9-8f3d-3cfdfebedf30

Finally the kernel panics with a soft lockup on CPU 50:

Kernel panic - not syncing: softlockup: hung tasks
CPU: 50 PID: 169199 Comm: runc:[1:CHILD] Kdump: loaded Tainted: G O L ------------ T 3.10.0-957.1.3.el7.1.tis.x86_64 #1
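
The cgroup path named in the oom message above can be inspected directly to see how close the pod was to its limit (cgroup-v1 memory controller paths assumed; the pod ID is taken from the log):

```shell
# Inspect the memory cgroup reported by the oom-killer
CG=/sys/fs/cgroup/memory/k8s-infra/kubepods/burstable/pod2a3337f7-861b-11e9-8f3d-3cfdfebedf30
cat "$CG/memory.limit_in_bytes"       # the limit that was hit
cat "$CG/memory.kmem.usage_in_bytes"  # kernel memory charged to the cgroup
cat "$CG/memory.failcnt"              # number of times the limit was hit
```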

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Please provide your ansible playbook overrides

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

I am seeing in the logs a failure to pull the calico image.
2019-06-03T08:35:09.774 localhost kubelet[103166]: info E0603 08:35:09.774413 103166 kuberuntime_manager.go:719] init container start failed: ImagePullBackOff: Back-off pulling image "192.168.100.60/calico/cni:v3.6.2"

Are you using a private registry? If so, please add this image and retest.

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

Our lab depends on the mirror registry. Here's the override file content:

```
system_mode: duplex

dns_servers:
  - 192.168.100.60

docker_registries:
  unified: 192.168.100.60

is_secure_registry: False

external_oam_subnet: 192.168.200.0/24
external_oam_gateway_address: 192.168.200.1
external_oam_floating_address: 192.168.200.207
external_oam_node_0_address: 192.168.200.82
external_oam_node_1_address: 192.168.200.83

management_subnet: 10.10.54.0/24
```

We placed the image on the private registry that we have in our lab. However, even with that image, the system still reboots and comes back with the same symptoms of not being able to authenticate or use any platform commands.

Docker images pulled:
REPOSITORY TAG IMAGE ID CREATED SIZE
192.168.100.60/starlingx/k8s-cni-sriov master-centos-stable-latest b3bb9d6b5e37 5 days ago 228MB
192.168.100.60/calico/cni v3.6.2 14f1e7286a2d 2 weeks ago 84.3MB
192.168.100.60/airshipit/armada af8a9ffd0873c2fbc915794e235dbd357f2adab1 7e106423a07d 7 weeks ago 462MB
192.168.100.60/kube-proxy v1.13.5 2ee69cad74bf 2 months ago 80.2MB
192.168.100.60/kube-apiserver v1.13.5 90332c1b9a4b 2 months ago 181MB
192.168.100.60/kube-controller-manager v1.13.5 b6b315f4f34a 2 months ago 146MB
192.168.100.60/kube-scheduler v1.13.5 c629ac1dae38 2 months ago 79.6MB
192.168.100.60/nfvpe/multus v3.2 45da14a16acc 2 months ago 500MB
192.168.100.60/kubernetes-helm/tiller v2.13.1 cb5aea7d0466 2 months ago 82.1MB
192.168.100.60/coredns 1.2.6 f59dcacceff4 7 months ago 40MB
192.168.100.60/pause 3.1 da86e6ba6ca1 17 months ago 742kB

In which log did you find the error about the missing image? We might need to check whether there are other missing images, or, where can we find a list of the docker images used at this stage? I don't see them in http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/images-centos-stable-versioned.lst, nor in the helm-chart/ files.

Revision history for this message
Don Penney (dpenney) wrote :

The kubelet logs are found in daemon.log. In the collected logs, the one Brent pointed out is in the rotated logfile daemon.log.1.gz.

I took a look at an AIO-SX controller I have booted that hasn't gone further than ansible setup, and it has the following images in docker:

gcr.io/kubernetes-helm/tiller:v2.13.1
k8s.gcr.io/coredns:1.2.6
k8s.gcr.io/kube-apiserver:v1.13.5
k8s.gcr.io/kube-controller-manager:v1.13.5
k8s.gcr.io/kube-proxy:v1.13.5
k8s.gcr.io/kube-scheduler:v1.13.5
k8s.gcr.io/pause:3.1
nfvpe/multus:v3.2
quay.io/airshipit/armada:af8a9ffd0873c2fbc915794e235dbd357f2adab1
quay.io/calico/cni:v3.6.2
quay.io/calico/kube-controllers:v3.6.2
quay.io/calico/node:v3.6.2
starlingx/k8s-cni-sriov:master-centos-stable-latest
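
For anyone mirroring these into a private registry, the "unified" docker_registries override effectively maps each upstream image name onto a single registry prefix. A minimal sketch of that mapping (the sed expressions are illustrative, not the playbook's actual implementation):

```shell
REGISTRY=192.168.100.60

# Strip a known upstream registry prefix, then prepend the private one.
map_image() {
  echo "$1" | sed -E 's#^(k8s\.gcr\.io|gcr\.io|quay\.io|docker\.io)/##' \
            | sed -E "s#^#${REGISTRY}/#"
}

map_image quay.io/calico/cni:v3.6.2   # -> 192.168.100.60/calico/cni:v3.6.2
map_image k8s.gcr.io/pause:3.1        # -> 192.168.100.60/pause:3.1
```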

Revision history for this message
Erich Cordoba (ericho) wrote :

I think the next thing to try would be to set softlockup_panic=0 on the kernel cmdline to avoid the panic, then figure out which pod is being killed by the OOM killer and get logs from that pod.
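
A sketch of that experiment (the runtime sysctl is enough for a quick test; the persistent boot-argument change depends on the installer's boot setup and is shown only as an illustration):

```shell
# Runtime: log the soft lockup instead of panicking
sysctl -w kernel.softlockup_panic=0

# Persistent (illustrative; exact mechanism depends on the boot loader config):
# grubby --update-kernel=ALL --args="softlockup_panic=0"
```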

Revision history for this message
Matt Peters (mpeters-wrs) wrote :

A couple of more surgical, separate experiments could be run to determine whether any of the changes made under the https://review.opendev.org/#/c/661849/ update could be impacting it:

https://opendev.org/starlingx/config/src/branch/master/playbookconfig/playbookconfig/playbooks/bootstrap/roles/bringup-essential-services/tasks/bringup_kubemaster.yml#L43
- change to net.ipv6.conf.all.forwarding = 0

https://opendev.org/starlingx/config/src/branch/master/playbookconfig/playbookconfig/playbooks/bootstrap/roles/bringup-essential-services/templates/calico-cni.yaml.j2
- replace all the Calico image version tags to v3.6.1
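
The first experiment is a one-liner at runtime (the playbook normally sets this value to 1 during kube bring-up; this just flips it for the test):

```shell
sysctl -w net.ipv6.conf.all.forwarding=0
```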

Revision history for this message
Brent Rowsell (brent-rowsell) wrote :

Please provide the collect from your latest reproduction

Revision history for this message
Cristopher Lemus (cjlemusc) wrote :

I verified that we had all the images that Don posted above; we were missing the calico images.

With all images on our registry, I was able to do a full install of all 4 configs (sanity is being executed at the moment).

I understand that the failure, and even the kernel panic, is expected because of the missing images. However, a couple of things that I think would be helpful:

- With config_controller, if an image was missing, the execution of config_controller failed. It was relatively easy to identify, because the list of missing images was in puppet.log; it looked like a pre-check. Is it possible to have that pre-check in Ansible? In all cases it reported that the playbook completed without errors.

- We already have a list of images on http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/; can we add these platform images as well? That would be helpful not just for us, but for anyone trying to do an install using a private registry.

Revision history for this message
Don Penney (dpenney) wrote :

WRT the list of images, files like images-centos-stable-versioned.lst are records of the images that were built as part of the loadbuild, and are used as input to the helm chart application build.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

I opened https://bugs.launchpad.net/starlingx/+bug/1831664 as a follow-up to track if Ansible can return an error in this case.

This bug will be used to investigate why a kernel panic happened in this case. We wouldn't expect the host to reboot because there was an issue in one pod (oom-killer invoked on the coredns pod).

Lowering the priority given that installs can continue after the proper image is added to the registry.

Changed in starlingx:
importance: Critical → High
assignee: Don Penney (dpenney) → Alex Kozyrev (akozyrev)
Revision history for this message
Erich Cordoba (ericho) wrote :

The kernel panic occurred because the cmdline has softlockup_panic=1, so when the soft lockup is triggered we see the panic. We still don't know why coredns gets stuck.

Ghada Khalil (gkhalil)
tags: added: stx.distro.other
removed: stx.config
summary: - Master controller reboots after ansible play
+ Master controller reboots due to kernel panic after ansible play
Revision history for this message
Alexander Kozyrev (akozyrev) wrote :

Soft lockup in xfrm module is fixed by the following commit in Linux kernel:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/net/xfrm?id=c282222a45cb9503cbfbebfdb60491f06ae84b49

CentOS includes this patch starting with kernel-3.10.0-957.10.1, according to this article:
https://access.redhat.com/solutions/3661581

There is an open review for StarlingX kernel rebase to 3.10.0-957.12.2:
https://review.opendev.org/#/c/663071/

Please retest once it is merged.

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tools (master)

Fix proposed to branch: master
Review: https://review.opendev.org/664127

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tools (master)

Reviewed: https://review.opendev.org/664127
Committed: https://git.openstack.org/cgit/starlingx/tools/commit/?id=b6896898fc17459663a1027fc330ef4848eb8b1c
Submitter: Zuul
Branch: master

commit b6896898fc17459663a1027fc330ef4848eb8b1c
Author: Alex Kozyrev <email address hidden>
Date: Fri Jun 7 22:01:01 2019 -0400

    Update Docker to 18.09.6 in StarlingX

    kmem accounting feature was backported by RedHat and seems to be
    incomplete and prone to memory leaks (and possible deadlocks).
    This issue is triggered by Docker container runtime that activates
    the kmem accounting feature for any cgroup by default. As a result
    kmem accounting in runc is disabled on RHEL/CentOS starting 18.09.1.
    Updating Docker version to 18.09.6 in StarlingX to fix the issue.

    Note that the client and container runtime are in separate packages
    in Docker Engine 18.09. So, adding containerd.io and docker-cli to
    fulfill the dependencies for Docker Engine.

    Change-Id: Ib229eb7ac4db45dbdf1260c6505242147e06838f
    Closes-bug: 1831485
    Signed-off-by: Alex Kozyrev <email address hidden>
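
After the update lands, the engine and runtime versions can be confirmed on a node with the standard CLI (shown as a sketch; 18.09.6 is the version named in the commit):

```shell
docker version --format '{{.Server.Version}}'   # should report 18.09.6
runc --version
```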

Changed in starlingx:
status: In Progress → Fix Released