IPv6: After DOR test, service group web-services degraded alarm never cleared
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Tao Liu |
Bug Description
Brief Description
-----------------
After a DOR (Dead Office Recovery) test, service group web-services degraded: lighttpd failed on controller-0, as reported by an fm alarm. The alarm never cleared.
Also observed calico-node pods going into CrashLoopBackOff, as shown below.
system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | degraded     |
| 2  | compute-0    | worker      | unlocked       | enabled     | available    |
| 3  | compute-1    | worker      | unlocked       | enabled     | available    |
| 4  | compute-2    | worker      | unlocked       | enabled     | available    |
| 5  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-
2019-09-17 06:27:49: (server.c.1472) server started (lighttpd/1.4.52)
2019-09-17 06:40:13: (server.c.2067) server stopped by UID = 0 PID = 1
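The "server stopped by UID = 0 PID = 1" line says lighttpd was stopped by PID 1 (the init/service-management layer) rather than crashing on its own. A minimal way to cross-check what the service manager thinks of web-services, assuming the standard StarlingX sm tooling on this load (a sketch, not verified against this build):

# On the active controller: dump service manager state and look for web-services
sudo sm-dump
# Confirm whether the lighttpd process is actually running
ps -ef | grep '[l]ighttpd'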
fm alarm-list
+----------+-------------------------------------------------+--------------------+----------+------------+
| Alarm ID | Reason Text                                     | Entity ID          | Severity | Time Stamp |
+----------+-------------------------------------------------+--------------------+----------+------------+
| 400.001  | Service group web-services degraded; lighttpd(  | service_           |          |            |
|          |                                                 | host=controller-0  |          |            |
|          |                                                 |                    |          |            |
| 100.114  | NTP address 2607:5300:                          | 5300:60:92e7::1    |          | .157376    |
|          |                                                 |                    |          |            |
| 100.114  | NTP address 2607:5300:                          | 5300:60:3308::1    |          | .785675    |
+----------+-------------------------------------------------+--------------------+----------+------------+
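To recover the full (untruncated) reason text and entity instance id for the stuck alarm, the fm client can show an individual alarm by UUID. A hedged sketch, assuming the standard fm CLI options on this load:

# List alarms including their UUIDs, then show the 400.001 alarm in detail
fm alarm-list --uuid
fm alarm-show <alarm-uuid>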
kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-
calico-node-8f4l2 0/1 Running 5 13h face::d299:
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 Running 4 13h face::fccf:
calico-node-hmrgr 1/1 Running 2 13h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 Running 5 13h face::29dd:
ceph-pools-
ceph-pools-
ceph-pools-
coredns-
coredns-
kube-apiserver-
kube-apiserver-
kube-controller
kube-controller
kube-multus-
kube-multus-
kube-multus-
kube-multus-
kube-multus-
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:
kube-proxy-gbvsx 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:
kube-scheduler-
kube-scheduler-
kube-sriov-
kube-sriov-
kube-sriov-
kube-sriov-
kube-sriov-
rbd-provisioner
rbd-provisioner
tiller-
kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-
calico-node-8f4l2 0/1 Running 12 13h face::d299:
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 CrashLoopBackOff 12 13h face::fccf:
calico-node-hmrgr 1/1 Running 2 14h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 CrashLoopBackOff 13 13h face::29dd:
ceph-pools-
ceph-pools-
ceph-pools-
coredns-
coredns-
kube-apiserver-
kube-apiserver-
kube-controller
kube-controller
kube-multus-
kube-multus-
kube-multus-
kube-multus-
kube-multus-
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:
kube-proxy-gbvsx 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:
kube-scheduler-
kube-scheduler-
kube-sriov-
kube-sriov-
kube-sriov-
kube-sriov-
kube-sriov-
rbd-provisioner
rbd-provisioner
tiller-
[sysadmin@
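To see why the calico-node pods keep restarting, the logs of the previous (failed) container run and the pod events are the first things to capture. A sketch using standard kubectl, taking one of the crashlooping pod names from the listing above; the container name calico-node is the usual one in upstream Calico manifests:

# Logs from the last failed run of the calico-node container
kubectl -n kube-system logs calico-node-d4tff -c calico-node --previous
# Events (probe failures, restart reasons) for the same pod
kubectl -n kube-system describe pod calico-node-d4tff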
Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any existing alarms.
2. Power off all the nodes for 60 seconds.
3. Power on all the nodes.
4. Verify the nodes come back up and available.
5. Check for new alarms (a verification sketch follows this list). Observed the lighttpd failed alarm.
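A minimal sketch of the verification in steps 4-5, assuming keystone admin credentials and the standard system/fm clients; the polling loop and interval are illustrative, not part of the original procedure:

#!/bin/bash
# Poll until no host reports disabled/offline, then list hosts and alarms.
source /etc/platform/openrc          # load keystone admin credentials
while system host-list | grep -qE 'disabled|offline'; do
    sleep 30
done
system host-list                     # all hosts should be unlocked/enabled/available
fm alarm-list                        # expect no new alarms after the DOR test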
System Configuration
--------------------
Regular system with IPv6 configuration
Expected Behavior
------------------
No new alarms after DOR test
Actual Behavior
----------------
New alarm as per description.
Reproducibility
---------------
100% reproducible. Seen in two different IPv6 labs: wcp71-75, wcp63-66.
Not tested on IPv4 systems.
Load
----
Build: 2019-09-16_14-18-20
Last Pass
---------
Unknown; first time testing DOR with IPv6
Timestamp/Logs
--------------
2019-09-17T21:02:06
Test Activity
-------------
Regression test
summary: After DOR Test Service group web-services degraded alarm → After DOR Test Service group web-services degraded alarm never cleared
description: updated
Changed in starlingx:
assignee: Tyler Smith (tyler.smith) → Tao Liu (tliu88)
tags: added: stx.retestneeded
@Jeyan, you are reporting two issues in this bug report. Please open a separate LP for the crashing pods, and please also provide logs from the two labs where you saw these issues.
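For gathering the requested logs, StarlingX ships a collect tool that bundles logs from the hosts into a tarball under /scratch on the active controller. A hedged sketch; option spelling can differ between releases:

# Run as sysadmin on the active controller
collect --all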