IPv6: After DOR Test Service group web-services degraded alarm never cleared

Bug #1844456 reported by Anujeyan Manokeran
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tao Liu

Bug Description

Brief Description
-----------------
   After a DOR (Dead Office Recovery) test, the service group web-services was degraded because lighttpd failed on controller-0, as reported by an fm alarm. The alarm never cleared.
calico-node pods in CrashLoopBackOff were also observed, as shown below (tracked separately in https://bugs.launchpad.net/starlingx/+bug/1844579).

system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+
[sysadmin@controller-0 log(keystone_admin)]$

controller-0:/var/log# cat lighttpd-error.log
2019-09-17 06:27:49: (server.c.1472) server started (lighttpd/1.4.52)
2019-09-17 06:40:13: (server.c.2067) server stopped by UID = 0 PID = 1
fm alarm-list
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+
| 400.001 | Service group web-services degraded; lighttpd(enabled-active, failed) | service_domain=controller. | major | 2019-09-17T20:53:19 |
| | | service_group=web-services. | | .183650 |
| | | host=controller-0 | | |
| | | | | |
| 100.114 | NTP address 2607:5300:60:92e7::1 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607: | minor | 2019-09-17T13:56:32 |
| | | 5300:60:92e7::1 | | .157376 |
| | | | | |
| 100.114 | NTP address 2607:5300:60:3308::1 is not a valid or a reachable NTP server. | host=controller-1.ntp=2607: | minor | 2019-09-17T13:35:00 |
| | | 5300:60:3308::1 | | .785675 |
| | | | | |
+----------+----------------------------------------------------------------------------+------------------------------+----------+---------------------+

kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 4m24s dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 5 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 Running 4 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 13h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 Running 5 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568752200-g68wf 0/1 Completed 0 24m dead:beef::a4ce:fec1:5423:e331 controller-1 <none> <none>
ceph-pools-audit-1568752500-xwrvw 0/1 Completed 0 19m dead:beef::8e22:765f:6121:eb4a controller-0 <none> <none>
ceph-pools-audit-1568753400-zw4bl 0/1 Completed 0 4m25s dead:beef::8e22:765f:6121:eb4e controller-0 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 7h53m dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 13h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 3m3s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 2m48s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 2m52s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 13h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 2m33s face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 2m33s face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h15m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 7h46m face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 2m33s face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 7h53m dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h22m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h22m face::4 controller-1 <none> <none>

A second capture, roughly 30 minutes later, shows the calico-node pods in CrashLoopBackOff:

kubectl get pod -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
calico-kube-controllers-767467f9cf-jrx52 1/1 Running 1 32m dead:beef::8e22:765f:6121:eb4f controller-0 <none> <none>
calico-node-8f4l2 0/1 Running 12 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
calico-node-9zwtk 1/1 Running 3 14h face::3 controller-0 <none> <none>
calico-node-d4tff 0/1 CrashLoopBackOff 12 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
calico-node-hmrgr 1/1 Running 2 14h face::4 controller-1 <none> <none>
calico-node-lrbxb 0/1 CrashLoopBackOff 13 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
ceph-pools-audit-1568754600-n8dtm 0/1 Completed 0 12m dead:beef::8e22:765f:6121:eb56 controller-0 <none> <none>
ceph-pools-audit-1568754900-dbdrq 0/1 Completed 0 7m19s dead:beef::a4ce:fec1:5423:e339 controller-1 <none> <none>
ceph-pools-audit-1568755200-pgbtq 0/1 Completed 0 2m19s dead:beef::a4ce:fec1:5423:e33a controller-1 <none> <none>
coredns-7cf476b5c8-4vkzr 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4c controller-0 <none> <none>
coredns-7cf476b5c8-qcxph 1/1 Running 5 8h dead:beef::a4ce:fec1:5423:e332 controller-1 <none> <none>
kube-apiserver-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-apiserver-controller-1 1/1 Running 4 14h face::4 controller-1 <none> <none>
kube-controller-manager-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-controller-manager-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-5l4dj 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-multus-ds-amd64-9n2cj 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-multus-ds-amd64-jltqk 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-multus-ds-amd64-n5hb8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-multus-ds-amd64-p8zsf 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-proxy-8mc4r 1/1 Running 3 14h face::3 controller-0 <none> <none>
kube-proxy-b72qz 1/1 Running 2 13h face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
kube-proxy-g8k8n 1/1 Running 2 13h face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-proxy-gbvsx 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-proxy-l5qxx 1/1 Running 1 13h face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-scheduler-controller-0 1/1 Running 6 14h face::3 controller-0 <none> <none>
kube-scheduler-controller-1 1/1 Running 2 14h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-9hbm8 1/1 Running 0 30m face::fccf:ae0b:600a:f8a4 compute-2 <none> <none>
kube-sriov-cni-ds-amd64-9q8zk 1/1 Running 0 30m face::29dd:f3c5:7eb9:1d21 compute-1 <none> <none>
kube-sriov-cni-ds-amd64-d8fhg 1/1 Running 1 7h42m face::3 controller-0 <none> <none>
kube-sriov-cni-ds-amd64-dkm9z 1/1 Running 1 8h face::4 controller-1 <none> <none>
kube-sriov-cni-ds-amd64-hn5qc 1/1 Running 0 30m face::d299:e6a8:7070:7bb0 compute-0 <none> <none>
rbd-provisioner-65db585fd6-6w5r4 1/1 Running 4 8h dead:beef::a4ce:fec1:5423:e333 controller-1 <none> <none>
rbd-provisioner-65db585fd6-7kz9p 1/1 Running 3 7h49m dead:beef::8e22:765f:6121:eb4b controller-0 <none> <none>
tiller-deploy-7855f54f57-lrc6b 1/1 Running 1 7h49m face::4 controller-1 <none> <none>
[sysadmin@controller-0 ~(keystone_admin)]$

Severity
--------
Major
Steps to Reproduce
------------------
1. Verify the health of the system and check for any existing alarms (see the command sketch after this list).
2. Power off all the nodes for 60 seconds.
3. Power on all the nodes.
4. Verify that the nodes come back up and are available.
5. Check for new alarms. The lighttpd failed alarm was observed.
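
A minimal sketch of the checks in steps 1, 4 and 5, expressed as the same CLI commands captured in the output above; the openrc path is the usual StarlingX location but is an assumption here, and the exact lab procedure is not part of this report.

# Sketch only; assumed credentials path, commands as used elsewhere in this report.
source /etc/platform/openrc                # load keystone_admin credentials (assumed path)

system host-list                           # step 4: all hosts unlocked/enabled/available
fm alarm-list                              # steps 1 and 5: the 400.001 web-services alarm should not appear
kubectl get pod -n kube-system -o wide     # optional: calico-node pod status (see bug 1844579)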

System Configuration
--------------------
Regular system with IPv6 configuration

Expected Behavior
------------------
No new alarms after the DOR test.

Actual Behavior
---------------
New alarm raised as described above; it never clears.

Reproducibility
---------------
100% reproducible. Seen in two different IPv6 labs: wcp71-75 and wcp63-66.
Not tested on IPv4 systems.

Load
----
Build 2019-09-16_14-18-20

Last Pass
---------
Unknown; this is the first time DOR was tested with IPv6.

Timestamp/Logs
--------------
2019-09-17T21:02:06

Test Activity
-------------
Regression test

summary: - After DOR Test Service group web-services degraded alarm
+ After DOR Test Service group web-services degraded alarm never cleared.
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote : Re: After DOR Test Service group web-services degraded alarm never cleared.
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Jeyan, You are reporting two issues in this bug report. Please open a separate LP for the crashed pod. Please also provide logs from the two labs that you indicated as seeing these issues.

Changed in starlingx:
status: New → Incomplete
description: updated
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Tyler to triage

description: updated
Changed in starlingx:
assignee: nobody → Tyler Smith (tyler.smith)
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

The DOR test was done on IP-33-36 with this load; the issue was not seen.

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

IP-33-36 is an IPv4 configuration.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 gating; this appears to be an IPv6-specific issue.

summary: - After DOR Test Service group web-services degraded alarm never cleared.
+ IPv6: After DOR Test Service group web-services degraded alarm never
+ cleared
tags: added: stx.3.0 stx.gui
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Tyler Smith (tyler.smith) → Tao Liu (tliu88)
Revision history for this message
Tao Liu (tliu88) wrote :

A similar issue was reported in the DC lab on a subcloud: the 'Service group web-services degraded' alarm never cleared. In that case, the lighttpd process was running on controller-1 but the pid file was not created; as a result, SM determined the process was in a failed state.

2019-10-01T18:53:52.000 controller-1 sm: debug time[358816.750] log<704033> ERROR: sm[22247]: sm_service_fsm.c(1518): Failed to open pid file (/var/run/lighttpd.pid) for service (lighttpd).
2019-10-01T18:53:52.000 controller-1 sm: debug time[358817.326] log<6101> INFO: sm_alarm[22277]: sm_alarm_thread.c(638): Raising alarm (service-group-state) for node (controller-1) domain (controller) entity (web-services).
2019-10-01T18:53:52.000 controller-1 sm: debug time[358817.328] log<6102> INFO: sm_alarm[22277]: sm_alarm_thread.c(1083): Raised alarm (service-group-state) for node (controller-1) domain (controller) entity (web-services), fm_uuid=7fc34e30-6926-450e-9039-10e7e84de61b.

After adding the '--make-pidfile' option to the start-stop-daemon command in the lighttpd init script, the alarm was eventually cleared.
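
For illustration, a minimal sketch of the kind of start-stop-daemon invocation described above. '--make-pidfile', '--background' and '--pidfile' are standard start-stop-daemon options, but the paths and lighttpd arguments here are assumptions, not the actual StarlingX init script.

# Sketch only; paths and arguments are assumptions.
DAEMON=/usr/sbin/lighttpd
PIDFILE=/var/run/lighttpd.pid
CONF=/etc/lighttpd/lighttpd.conf

# --make-pidfile has start-stop-daemon write the pid file itself, so
# /var/run/lighttpd.pid exists even if lighttpd never creates one, and
# SM's pid-file audit can find the process instead of declaring it failed.
# Running lighttpd in the foreground (-D) under --background keeps the
# recorded pid pointing at the real server process.
start-stop-daemon --start --background \
    --make-pidfile --pidfile "$PIDFILE" \
    --exec "$DAEMON" -- -D -f "$CONF"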

Changed in starlingx:
status: Triaged → In Progress
Yang Liu (yliu12)
tags: added: stx.retestneeded
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687289

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/687289
Committed: https://git.openstack.org/cgit/starlingx/config-files/commit/?id=66d88788a8bc85e8b627ae44f8c5aca85e45d473
Submitter: Zuul
Branch: master

commit 66d88788a8bc85e8b627ae44f8c5aca85e45d473
Author: Tao Liu <email address hidden>
Date: Mon Oct 7 22:15:26 2019 -0400

    Fix the lighttpd process recovery failure

    When the lighttpd starts without creating a pid file, SM will restart
    the process after the audit detects the failure. The init script
    then kills the current running process and starts a new one.

    The kill method was using the default port 80, as a result the
    process that was running without a pid file was not killed. Therefore,
    ‘Service group web-services degraded’ alarm was never cleared due
    to recovery failure.

    This update changes the kill method to kill the process that is
    running on the configured lighttpd port.

    Change-Id: Id6cae43e315d0def771b9f1422fc7753d57e6710
    Closes-Bug: 1844456
    Signed-off-by: Tao Liu <email address hidden>
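
For illustration only, a rough sketch of the port-based kill the commit message describes; the merged change is in the review linked above, and the config path and use of fuser here are assumptions.

# Sketch only; the real change lives in the lighttpd init script.
CONF=/etc/lighttpd/lighttpd.conf

# Read the configured server.port, falling back to 80 if it is not set.
PORT=$(awk -F'=' '/^[[:space:]]*server\.port/ {gsub(/[[:space:]]/, "", $2); print $2}' "$CONF")
PORT=${PORT:-80}

# Kill whichever process is listening on that port (psmisc fuser),
# instead of assuming the default port 80 as the old kill method did.
fuser -k "${PORT}/tcp"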

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

Verified in load 2019-10-21_20-00-00. The issue was not reproduced.

tags: removed: stx.retestneeded