IPv6: After DOR Test Service group web-services degraded alarm never cleared

Bug #1844456 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tao Liu

Bug Description

Brief Description
-----------------
   After a DOR (Dead Office Recovery) test, the service group web-services was degraded because lighttpd failed on controller-0, as reported by an fm alarm. The alarm never cleared.
calico-node pods in CrashLoopBackOff were also observed, as shown below (tracked separately in https://bugs.launchpad.net/starlingx/+bug/1844579).

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 log(keystone_admin)]$

controller-0:/var/log# cat lighttpd-error.log
2019-09-17 06:27:49: (server.c.1472) server started (lighttpd/1.4.52)
2019-09-17 06:40:13: (server.c.2067) server stopped by UID = 0 PID = 1
fm alarm-list
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| 400.001 | Service group web-services degraded; lighttpd(enabled-active, failed) | service_domain=controller. | major | 2019-09-17T20:53:19 |
| | | service_group=web-services. | | .183650 |
| | | host=controller-0 | | |
| | | | | |
| 100.114 | NTP address 2607:5300:60:92e7::1 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607: | minor | 2019-09-17T13:56:32 |
| | | 5300:60:92e7::1 | | .157376 |
| | | | | |
| 100.114 | NTP address 2607:5300:60:3308::1 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607: | minor | 2019-09-17T13:35:00 |
| | | 5300:60:3308::1 | | .785675 |
| | | | | |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+

kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 4m24s dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 5 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 Running 4 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 13h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 Running 5 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568752200-g68wf 0/1 Completed 0 24m dead:beef::a4ce:fec1:5423:e331 controller-1 <none> <none>
ceph-pools-audit-1568752500-xwrvw 0/1 Completed 0 19m dead:beef::8e22:765f:6121:eb4a controller-0 <none> <none>
ceph-pools-audit-1568753400-zw4bl 0/1 Completed 0 4m25s dead:beef::8e22:765f:6121:eb4e controller-0 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 7h53m dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 13h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 3m3s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 2m48s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 2m52s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 2m33s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 2m33s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 2m33s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 7h53m dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h22m face::4 controller-1 <none> <none>

A second capture, roughly 30 minutes later, shows the calico-node pods in CrashLoopBackOff:

kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 32m dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 12 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 CrashLoopBackOff 12 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 14h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 CrashLoopBackOff 13 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568754600-n8dtm 0/1 Completed 0 12m dead:beef::8e22:765f:6121:eb56 controller-0 <none> <none>
ceph-pools-audit-1568754900-dbdrq 0/1 Completed 0 7m19s dead:beef::a4ce:fec1:5423:e339 controller-1 <none> <none>
ceph-pools-audit-1568755200-pgbtq 0/1 Completed 0 2m19s dead:beef::a4ce:fec1:5423:e33a controller-1 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 8h dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 14h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 8h dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h49m face::4 controller-1 <none> <none>
[sysadmin@controller-0 ~(keystone_admin)]$

Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any existing alarms (see the command sketch after this list).
2. Power off all the nodes for 60 seconds.
3. Power on all the nodes.
4. Verify that the nodes come back up and are available.
5. Check for new alarms. The lighttpd failed alarm was observed.
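
A minimal sketch of the checks in steps 1, 4 and 5, expressed as the same CLI commands captured in the output above; the openrc path is the usual StarlingX location but is an assumption here, and the exact lab procedure is not part of this report.

# Sketch only; assumed credentials path, commands as used elsewhere in this report.
source /etc/platform/openrc                # load keystone_admin credentials (assumed path)

system host-list                           # step 4: all hosts unlocked/enabled/available
fm alarm-list                              # steps 1 and 5: the 400.001 web-services alarm should not appear
kubectl get pod -n kube-system -o wide     # optional: calico-node pod status (see bug 1844579)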

System Configuration
--------------------
Regular system with IPv6 configuration

Expected Behavior
------------------
No new alarms after the DOR test.

Actual Behavior
---------------
New alarm raised as described above; it never clears.

Reproducibility
---------------
100% reproducible. Seen in two different IPv6 labs: wcp71-75 and wcp63-66.
Not tested on IPv4 systems.

Load
----
Build 2019-09-16_14-18-20

Last Pass
---------
Unknown; this is the first time DOR was tested with IPv6.

Timestamp/Logs
--------------
2019-09-17T21:02:06

Test Activity
-------------
Regression test

summary: - After DOR Test Service group web-services degraded alarm
+ After DOR Test Service group web-services degraded alarm never cleared.
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote : Re: After DOR Test Service group web-services degraded alarm never cleared.
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Jeyan, You are reporting two issues in this bug report. Please open a separate LP for the crashed pod. Please also provide logs from the two labs that you indicated as seeing these issues.

Changed in starlingx:
status: New → Incomplete
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Tyler to triage

description: updated
Changed in starlingx:
assignee: nobody → Tyler Smith (tyler.smith)
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

The DOR test was done on IP-33-36 with this load; the issue was not seen.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

IP-33-36 is an IPv4 configuration.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 gating; this appears to be an IPv6-specific issue.

summary: - After DOR Test Service group web-services degraded alarm never cleared.
+ IPv6: After DOR Test Service group web-services degraded alarm never
+ cleared
tags: added: stx.3.0 stx.gui
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Tyler Smith (tyler.smith) → Tao Liu (tliu88)
Revision history for this message
Tao Liu (tliu88) wrote :

A similar issue was reported in the DC lab on a subcloud: the 'Service group web-services degraded' alarm never cleared. In that case, the lighttpd process was running on controller-1 but the pid file was not created; as a result, SM determined the process was in a failed state.

2019-10-01T18:53:52.000 controller-1 sm: debug time[358816.750] log<704033> ERROR: sm[22247]: sm_service_fsm.c(1518): Failed to open pid file (/var/run/lighttpd.pid) for service (lighttpd).
2019-10-01T18:53:52.000 controller-1 sm: debug time[358817.326] log<6101> INFO: sm_alarm[22277]: sm_alarm_thread.c(638): Raising alarm (service-group-state) for node (controller-1) domain (controller) entity (web-services).
2019-10-01T18:53:52.000 controller-1 sm: debug time[358817.328] log<6102> INFO: sm_alarm[22277]: sm_alarm_thread.c(1083): Raised alarm (service-group-state) for node (controller-1) domain (controller) entity (web-services), fm_uuid=7fc34e30-6926-450e-9039-10e7e84de61b.

After adding the '--make-pidfile' option to the start-stop-daemon command in the lighttpd init script, the alarm was eventually cleared.
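
For illustration, a minimal sketch of the kind of start-stop-daemon invocation described above. '--make-pidfile', '--background' and '--pidfile' are standard start-stop-daemon options, but the paths and lighttpd arguments here are assumptions, not the actual StarlingX init script.

# Sketch only; paths and arguments are assumptions.
DAEMON=/usr/sbin/lighttpd
PIDFILE=/var/run/lighttpd.pid
CONF=/etc/lighttpd/lighttpd.conf

# --make-pidfile has start-stop-daemon write the pid file itself, so
# /var/run/lighttpd.pid exists even if lighttpd never creates one, and
# SM's pid-file audit can find the process instead of declaring it failed.
# Running lighttpd in the foreground (-D) under --background keeps the
# recorded pid pointing at the real server process.
start-stop-daemon --start --background \
    --make-pidfile --pidfile "$PIDFILE" \
    --exec "$DAEMON" -- -D -f "$CONF"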

Changed in starlingx:
status: Triaged → In Progress
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687289

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/687289
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=66d88788a8bc85e8b627ae44f8c5aca85e45d473
Submitter: Zuul
Branch: master

commit 66d88788a8bc85e8b627ae44f8c5aca85e45d473
Author: Tao Liu <email address hidden>
Date: Mon Oct 7 22:15:26 2019 -0400

    Fix the lighttpd process recovery failure

    When the lighttpd starts without creating a pid file, SM will restart
    the process after the audit detects the failure. The init script
    then kills the current running process and starts a new one.

    The kill method was using the default port 80, as a result the
    process that was running without a pid file was not killed. Therefore,
    ‘Service group web-services degraded’ alarm was never cleared due
    to recovery failure.

    This update changes the kill method to kill the process that is
    running on the configured lighttpd port.

    Change-Id: Id6cae43e315d0def771b9f1422fc7753d57e6710
    Closes-Bug: 1844456
    Signed-off-by: Tao Liu <email address hidden>
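
For illustration only, a rough sketch of the port-based kill the commit message describes; the merged change is in the review linked above, and the config path and use of fuser here are assumptions.

# Sketch only; the real change lives in the lighttpd init script.
CONF=/etc/lighttpd/lighttpd.conf

# Read the configured server.port, falling back to 80 if it is not set.
PORT=$(awk -F'=' '/^[[:space:]]*server\.port/ {gsub(/[[:space:]]/, "", $2); print $2}' "$CONF")
PORT=${PORT:-80}

# Kill whichever process is listening on that port (psmisc fuser),
# instead of assuming the default port 80 as the old kill method did.
fuser -k "${PORT}/tcp"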

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified in load 2019-10-21_20-00-00. The issue was not reproduced.

tags: removed: stx.retestneeded