After DOR test, calico-node pods in CrashLoopBackOff never recovered

Bug #1844579 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Joseph Richard

Bug Description

Brief Description
-----------------
   After a DOR (Dead Office Recovery) test, the calico-node pods went into CrashLoopBackOff as shown below.
A lighttpd alarm also appeared, as reported in https://bugs.launchpad.net/starlingx/+bug/1844456 .

+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 log(keystone_admin)]$
controller-0:/var/log# cat lighttpd-error.log
2019-09-17 06:27:49: (server.c.1472) server started (lighttpd/1.4.52)
2019-09-17 06:40:13: (server.c.2067) server stopped by UID = 0 PID = 1
fm alarm-list
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| 400.001 | Service group web-services degraded; lighttpd(enabled-active, failed) | service_domain=controller. | major | 2019-09-17T20:53:19 |
| | | service_group=web-services. | | .183650 |
| | | host=controller-0 | | |
| | | | | |
| 100.114 | NTP address 2607:5300:60:92e7::1 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607: | minor | 2019-09-17T13:56:32 |
| | | 5300:60:92e7::1 | | .157376 |
| | | | | |
| 100.114 | NTP address 2607:5300:60:3308::1 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607: | minor | 2019-09-17T13:35:00 |
| | | 5300:60:3308::1 | | .785675 |
| | | | | |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 4m24s dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 5 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 Running 4 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 13h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 Running 5 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568752200-g68wf 0/1 Completed 0 24m dead:beef::a4ce:fec1:5423:e331 controller-1 <none> <none>
ceph-pools-audit-1568752500-xwrvw 0/1 Completed 0 19m dead:beef::8e22:765f:6121:eb4a controller-0 <none> <none>
ceph-pools-audit-1568753400-zw4bl 0/1 Completed 0 4m25s dead:beef::8e22:765f:6121:eb4e controller-0 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 7h53m dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 13h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 3m3s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 2m48s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 2m52s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 2m33s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 2m33s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 2m33s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 7h53m dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h22m face::4 controller-1 <none> <none>
kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 32m dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 12 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 CrashLoopBackOff 12 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 14h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 CrashLoopBackOff 13 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568754600-n8dtm 0/1 Completed 0 12m dead:beef::8e22:765f:6121:eb56 controller-0 <none> <none>
ceph-pools-audit-1568754900-dbdrq 0/1 Completed 0 7m19s dead:beef::a4ce:fec1:5423:e339 controller-1 <none> <none>
ceph-pools-audit-1568755200-pgbtq 0/1 Completed 0 2m19s dead:beef::a4ce:fec1:5423:e33a controller-1 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 8h dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 14h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 8h dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h49m face::4 controller-1 <none> <none>
[sysadmin@controller-0 ~(keystone_admin)]$
Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any alarms.
2. Power off all the nodes for 60 seconds.
3. Power on all the nodes.
4. Verify the nodes are up and available.
5. Verify the pods are up. As per the description, the calico-node pods go into CrashLoopBackOff (see the sketch after this list).
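
A minimal shell sketch of how steps 1, 4, and 5 are typically checked from the active controller. The commands mirror the ones already shown in the output above; the grep filter is only illustrative and not part of the original report.

# Step 1: check health and alarms before the test
source /etc/platform/openrc
fm alarm-list
system host-list

# Steps 4-5: after power-on, confirm hosts are available and watch the calico-node pods
system host-list
kubectl get pods -n kube-system -o wide | grep calico-node
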
System Configuration
--------------------
Regular system with IPv6 configuration
Expected Behavior
------------------
No crashes on the calico-node pods.

Actual Behavior
----------------
As per the description, the calico-node pods crash.
Reproducibility
---------------
100% reproducible. Seen in 2 different IPv6 labs: WCP71-75 & WCP63-66.
Not yet tested with IPv4.

Load
----
Build was on 2019-09-16_14-18-20
Last Pass
---------
Timestamp/Logs
--------------
2019-09-17T21:02:06
Test Activity
-------------
Regression test

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 gating - system doesn't fully recover after power outage test

description: updated
tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
importance: Undecided → High
assignee: nobody → Joseph Richard (josephrichard)
status: New → Triaged
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

The DOR test was done on IP-33-36 (IPv4 configuration); this issue was not seen.

Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
Joseph Richard (josephrichard) wrote :

Were logs collected?

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Incomplete as we need the logs to investigate this issue further.

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Revision history for this message
Joseph Richard (josephrichard) wrote :

From the logs:
{"log":"2019-09-17 21:04:19.272 [INFO][8] startup.go 331: Hit error connecting to datastore - retry error=Get https://[fd04::1]:443/api/v1/nodes/foo: dial tcp [fd04::1]:443: connect: network is unreachable\n","stream":"stdout","time":"2019-09-17T21:04:19.272839647Z"}

Looks to be a duplicate of https://bugs.launchpad.net/starlingx/+bug/1844192, which is fixed in the latest load.
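
A hedged sketch of how the same startup error could be pulled from one of the failing pods; the pod name is taken from the listing above, while the container name (calico-node) and the raw log path are assumptions rather than details from this report.

# Startup errors from a crashing calico-node pod (--previous shows the last crashed attempt)
kubectl logs -n kube-system calico-node-lrbxb -c calico-node --previous | grep "Hit error connecting to datastore"

# Or read the raw container log on the node itself (path assumed from the JSON log format quoted above)
grep "network is unreachable" /var/log/containers/calico-node-*.log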

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Duplicate bug is addressed by: https://review.opendev.org/685112
Merged on 2019-10-01

Marking as Fix Released

Changed in starlingx:
status: Incomplete → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

DOR verification failed; this issue was reproduced in load 2019-10-21_20-00-00 in lab wcp71-75.

Revision history for this message
Yang Liu (yliu12) wrote :

Removed the duplicate link and set this LP to Confirmed as per Jeyan's comments.
Joseph also confirmed that after the DOR, the default route was missing from the compute despite accept_ra being off.

Changed in starlingx:
status: Fix Released → Confirmed
Revision history for this message
Joseph Richard (josephrichard) wrote :

In the retest failures, DHCP fails on the compute node and then retries in the background. Unfortunately, this results in ifup-post not being run, and so route6 is not run either.

This is unrelated to the previously seen issue with accept_ra (which also caused the default route to be missing), but the calico logs appear the same, as both cases are missing the default route.
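
A quick diagnostic sketch for telling these two causes apart on the affected compute node; the interface name enp0s8 is a placeholder assumption, and the route6 file path reflects the standard CentOS initscripts layout rather than anything quoted in this report.

# Confirm the IPv6 default route is missing
ip -6 route show default

# Confirm accept_ra is off on the management interface (interface name is a placeholder)
sysctl net.ipv6.conf.enp0s8.accept_ra

# The static route file that ifup-post (via ifup-routes) would normally apply
cat /etc/sysconfig/network-scripts/route6-enp0s8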

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The workaround for this issue is to lock/unlock the worker nodes where the failed pods are residing.
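
A sketch of that workaround as it would typically be run from the active controller; compute-1 is taken from the pod listing above as an example, and the final check is illustrative.

# Lock and unlock the worker node hosting a failed calico-node pod
system host-lock compute-1
system host-unlock compute-1

# Once the node is unlocked and available again, confirm the pod recovered
kubectl get pods -n kube-system -o wide | grep calico-node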

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/692606

Changed in starlingx:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/692606
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=8a3722089d821573ce6766e6fa40d78a0fdbaded
Submitter: Zuul
Branch: master

commit 8a3722089d821573ce6766e6fa40d78a0fdbaded
Author: Joseph Richard <email address hidden>
Date: Tue Oct 29 10:54:58 2019 -0400

    Drop initscripts patch running ipv6 dhcp as daemon

    This commit rebases initscripts patch set, dropping
    run-dhclient-as-daemon-for-ipv6.patch

    Currently, ifup-eth tries running ipv6 dhclient with the one-shot
    option, and if that fails, then retries indefinitely in the background.
    That has the side-effect of causing the ifup-post script to not be run
    if the first dhclient attempt fails, which will prevent routes on that
    interface from being created. This is especially problematic in the
    case of a DOR, where the compute nodes may come up before dnsmasq is up
    on the controller.
    This is different from upstream centos, which will only try running
    dhclient with the one-shot option for ipv6.
    By reverting the initscripts patch to run as a daemon, ipv6 dhclient now
    runs as one-shot only, and if it fails, ifup-eth script exits without
    getting an address, and then the node fails to come up and reboot.
    While this may result in the compute node having an extra reboot in a DOR,
    that is preferable to the compute coming up incorrectly and requiring a
    lock/unlock to recover.

    Closes-bug: 1844579
    Change-Id: I5b7f6b7c878dc4e4737d986f11fae3301585fb1c
    Signed-off-by: Joseph Richard <email address hidden>
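
A simplified, hedged illustration of the behavior change described in the commit message above; this is not the actual initscripts patch, and the flags follow ISC dhclient (-6 for DHCPv6, -1 for a single one-shot attempt, -nw to background immediately). $DEVICE stands for the interface being brought up.

# Behavior with the dropped patch (simplified): a failed one-shot attempt fell back to a
# background retry, so ifup-eth "succeeded" and ifup-post / route6-$DEVICE never ran
dhclient -6 -1 "$DEVICE" || dhclient -6 -nw "$DEVICE"

# Behavior after the fix (simplified): one-shot only; on failure ifup-eth exits,
# the node fails to come up cleanly and reboots instead of running half-configured
dhclient -6 -1 "$DEVICE"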

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Elio Martinez (elio1979) wrote :

This problem is not present in an IPv4 environment.

calico-node-8qvsg 1/1 Running 6 5d1h 10.10.70.11 controller-0 <none> <none>
calico-node-lwscd 1/1 Running 4 4d15h 10.10.70.12 controller-1 <none> <none>

###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191122T023000Z"

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/698561

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (f/centos8)

Reviewed: https://review.opendev.org/698561
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=9035cd1be8aa3138691c6c99219030dfbe77ebaf
Submitter: Zuul
Branch: f/centos8

commit 4aa661ce5666220d6beb2a3a3fac987cba4feb74
Author: Martin, Chen <email address hidden>
Date: Thu Nov 21 10:28:13 2019 +0800

    Build layering
    Rebase tarball for i40e Driver
    Rebase srpm for systemd 219-67.el7
    Rebase srpm for sudo
    Rebase srpm for ntp

    Depends-On: https://review.opendev.org/#/c/695061/
    Depends-On: https://review.opendev.org/#/c/695560/
    Depends-On: https://review.opendev.org/#/c/695637/
    Depends-On: https://review.opendev.org/#/c/695983/

    Story: 2006166
    Task: 37570

    Change-Id: I7f33e0fb1319df3421318c4927d2a5675a490273
    Signed-off-by: Martin, Chen <email address hidden>

commit 5d854355d873702b78ff6aa8c6fddc025c45be2d
Author: Jim Somerville <email address hidden>
Date: Mon Nov 25 16:07:17 2019 -0500

    Uprev ntp to version 4.2.6p5-29.el7

    This solves:
    ntp: Stack-based buffer overflow in ntpq and ntpdc allows
    denial of service or code execution (CVE-2018-12327)

    See the announcement link:

    https://lists.centos.org/pipermail/centos-cr-announce/2019-August/006016.html

    for more details.

    Here we refresh the meta patches and correct the crime of
    "name of patch file differs from git format-patch". We
    also clean up the commit short logs.

    Change-Id: I263465d85f06096296fdd478a302eb110ab1259c
    Closes-Bug: 1849197
    Depends-On: https://review.opendev.org/#/c/695983
    Signed-off-by: Jim Somerville <email address hidden>

commit 11fd5d9cd48a1539b9c7a4ebc8aaad69ed24ae5b
Author: Dan Voiculeasa <email address hidden>
Date: Thu Nov 21 15:01:36 2019 +0200

    ceph-init-wrapper: Detect stuck peering OSDs and restart them

    OSDs might become stuck peering.
    Recover from such state.

    Closes-bug: 1851287

    Change-Id: I2ef1a0e93d38c3d041ee0c5c1e66a4ac42785a68
    Signed-off-by: Dan Voiculeasa <email address hidden>

commit f30cb74fef4b97721010ca9bc6a6b6dde03c4add
Author: Robin Lu <email address hidden>
Date: Fri Nov 22 11:01:27 2019 +0800

    Update sudo srpm patch for CVE bug

    To fix below CVE, we will use sudo-1.8.23-4.el7_7.1.src.rpm
    And we have to update some patches according to new srpm.
    https://lists.centos.org/pipermail/centos-announce/2019-October/023499.html

    CVE bug: CVE-2019-14287: sudo: can bypass certain policy blacklists

    Closes-Bug: 1852825
    Depends-On: https://review.opendev.org/#/c/695637/
    Change-Id: Ifc0a3423464fafce06cd504d9b427fc3433fb756
    Signed-off-by: Robin Lu <email address hidden>

commit 0231aba5cdcb96b15106591acfff280159050366
Author: Jim Somerville <email address hidden>
Date: Thu Nov 21 15:54:15 2019 -0500

    Uprev systemd to version 219-67.el7

    This solves:
    systemd: line splitting via fgets() allows for state injection
    during daemon-reexec (CVE-2018-15686)

    along with some other less critical issues. See the security
    announcement link:

   ...


tags: added: in-f-centos8
Revision history for this message
ayyappa (mantri425) wrote :

Tested in IPv6 standard and IPv4 AIO-duplex plus labs; the issue is not seen in the latest build:
SW_VERSION="19.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2019-12-11_21-20-16"
SRC_BUILD_ID="18"

tags: removed: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Joseph, please cherry-pick this fix to the r/stx.3.0 branch at your earliest convenience.
