stx-openstack in apply-failed after lock/unlock standby controller

Bug #1837581 reported by Yang Liu
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Daniel Badea

Bug Description

Brief Description
-----------------
After locking and unlocking the standby controller (controller-1), stx-openstack ends up in apply-failed status.

Severity
--------
Major

Steps to Reproduce
------------------
- stx-openstack is applied
- system host-lock controller-1
- system host-unlock controller-1
- wait for controller-1 to become enabled/available in system host-list
- watch the output of system application-list (a scripted sketch of this sequence follows the list)
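
A minimal scripted sketch of the sequence above, assuming admin credentials are loaded from the standard /etc/platform/openrc and the system CLI is on the path:

source /etc/platform/openrc
system host-lock controller-1
system host-unlock controller-1

# poll until controller-1 reports unlocked/enabled/available
until system host-show controller-1 | grep -q 'availability.*available'; do sleep 30; done

# watch the reapply that the unlock triggers
watch -n 10 system application-list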

Expected Behavior
------------------
- stx-openstack is automatically reapplied after the unlock and reaches applied status

Actual Behavior
----------------
- stx-openstack transitions to apply-failed status a few minutes after the reapply starts

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Dedicated storage
Lab-name: wcp113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190720T013000Z

Last Pass
---------
Same system, same load; the issue appears to be intermittent.

Timestamp/Logs
--------------
# Unlock requested
[2019-07-21 03:12:57,041] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-1'

# controller-1 available and ready
[2019-07-21 03:19:01,711] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | compute-3 | worker | unlocked | enabled | available |
| 6 | compute-4 | worker | unlocked | enabled | available |
| 7 | controller-1 | controller | unlocked | enabled | available |
| 8 | storage-0 | storage | unlocked | enabled | available |
| 9 | storage-1 | storage | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

# apply-failed
[2019-07-21 03:19:29,459] 301 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne application-list'
+---------------------+--------------------------------+-------------------------------+--------------------+----------+---------------------------------------------------------------------+
| application | version | manifest name | manifest file | status | progress |
+---------------------+--------------------------------+-------------------------------+--------------------+----------+---------------------------------------------------------------------+
| platform-integ-apps | 1.0-7 | platform-integration-manifest | manifest.yaml | applied | completed |
| stx-openstack | 1.0-17-centos-stable-versioned | armada-manifest | stx-openstack.yaml | applying | processing chart: osh-kube-system-ingress, overall completion: 4.0% |
+---------------------+--------------------------------+-------------------------------+--------------------+----------+---------------------------------------------------------------------+

# mariadb pod is in CrashLoopBackOff for a few minutes; unclear whether this is related to the apply failure
[2019-07-21 03:21:34,134] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pods --all-namespaces -o wide | grep -v -e Running -e Completed'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack mariadb-server-1 0/1 CrashLoopBackOff 2 3m29s 172.16.166.141 controller-1 <none> <none>
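
To see why the container keeps restarting, the logs of the previous (crashed) container instance can be pulled with a standard kubectl call (pod name taken from the listing above):

kubectl logs -n openstack mariadb-server-1 --previous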

[2019-07-21 03:21:34,501] 301 DEBUG MainThread ssh.send :: Send 'kubectl get pods --all-namespaces -o wide | grep -v -e Running -e Completed -e NAMESPACE | awk '{system("kubectl describe pods -n "$1" "$2)}''
....
Tolerations: node.kubernetes.io/not-ready:NoExecute for 30s
                 node.kubernetes.io/unreachable:NoExecute for 30s
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 3m25s (x4 over 3m28s) default-scheduler 0/7 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taints that the pod didn't tolerate, 5 node(s) didn't match node selector.
  Normal Scheduled 2m35s default-scheduler Successfully assigned openstack/mariadb-server-1 to controller-1
  Normal SuccessfulAttachVolume 2m35s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-bb00e842-aaf7-11e9-a776-3cfdfeac4dd8"
  Normal Pulled 86s kubelet, controller-1 Container image "registry.local:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1" already present on machine
  Normal Created 86s kubelet, controller-1 Created container
  Normal Started 85s kubelet, controller-1 Started container
  Normal Pulled 79s kubelet, controller-1 Container image "registry.local:9001/docker.io/openstackhelm/mariadb:10.2.18" already present on machine
  Normal Created 79s kubelet, controller-1 Created container
  Normal Started 79s kubelet, controller-1 Started container
  Normal Pulled 24s (x3 over 70s) kubelet, controller-1 Container image "registry.local:9001/docker.io/openstackhelm/mariadb:10.2.18" already present on machine
  Normal Created 23s (x3 over 62s) kubelet, controller-1 Created container
  Normal Started 22s (x3 over 62s) kubelet, controller-1 Started container
  Warning BackOff 5s (x3 over 39s) kubelet, controller-1 Back-off restarting failed container
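
The FailedScheduling warning above is driven by pod anti-affinity; the rules actually in effect on the pod can be dumped with a standard kubectl jsonpath query (pod name from the listing above):

kubectl get pod -n openstack mariadb-server-1 -o jsonpath='{.spec.affinity.podAntiAffinity}'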

Test Activity
-------------
Regression Testing

Revision history for this message
Yang Liu (yliu12) wrote :

Logs are split into two parts; use the cat command to combine them, e.g.:
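
cat ALL_NODES.tgz.part1 ALL_NODES.tgz.part2 > ALL_NODES.tgz   # hypothetical file names; match them to the attachments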

Revision history for this message
Tyler Smith (tyler.smith) wrote :

While the apply is happening, ceph-mgr becomes unresponsive:

Controller-0:

2019-07-21 03:19:37,274 103010 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=5001): Read timed out. (read timeout=15)
2019-07-21 03:19:37,275 103010 INFO mgr-restful-plugin REST API ping failure count=0
2019-07-21 03:19:45,713 102997 INFO mgr-restful-plugin Restful plugin does not respond but failure count is within acceptable limits: ceph_mgr=0 < 3, ping=0 < 5. Report status OK
2019-07-21 03:19:45,713 102997 WARNING mgr-restful-plugin Failed to send response back. request=status, response=OK, reason=[Errno 32] Broken pipe
2019-07-21 03:19:45,714 103010 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2019-07-21 03:20:00,737 102997 INFO mgr-restful-plugin Restful plugin does not respond but failure count is within acceptable limits: ceph_mgr=0 < 3, ping=1 < 5. Report status OK
2019-07-21 03:20:00,738 102997 WARNING mgr-restful-plugin Failed to send response back. request=status, response=OK, reason=[Errno 32] Broken pipe
2019-07-21 03:20:00,738 102997 INFO mgr-restful-plugin Stop monitor with SIGTERM to process group 103010
2019-07-21 03:20:04,009 103010 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=5001): Read timed out. (read timeout=15)
2019-07-21 03:20:04,009 103010 INFO mgr-restful-plugin REST API ping failure count=1
2019-07-21 03:20:05,743 102997 INFO mgr-restful-plugin Stop monitor with SIGKILL to process group 103010
2019-07-21 03:20:05,746 102997 INFO mgr-restful-plugin Monitor stopped: pid=103010
2019-07-21 03:20:05,746 102997 INFO mgr-restful-plugin Remove service pid file: path=/var/run/ceph/mgr-restful-plugin.pid
2019-07-21 03:20:05,746 102997 INFO mgr-restful-plugin Close service socket and remove file: path=/var/run/ceph/mgr/mgr-restful-plugin.socket
2019-07-21 03:20:05,746 102997 INFO mgr-restful-plugin Release service lock: path=/var/run/ceph/mgr/mgr-restful-plugin.lock
2019-07-21 03:20:06,256 1360341 WARNING mgr-restful-plugin Disable urllib3 certifcates check
2019-07-21 03:20:06,256 1360341 INFO mgr-restful-plugin Take service lock: path=/var/run/ceph/mgr/mgr-restful-plugin.lock
2019-07-21 03:20:06,340 1360341 INFO mgr-restful-plugin Create service socket
2019-07-21 03:20:06,341 1360341 INFO mgr-restful-plugin Remove existing socket files
2019-07-21 03:20:06,341 1360341 INFO mgr-restful-plugin Bind service socket: path=/var/run/ceph/mgr/mgr-restful-plugin.socket
2019-07-21 03:20:06,341 1360341 INFO mgr-restful-plugin Update service pid file: path=/var/run/ceph/mgr-restful-plugin.pid
2019-07-21 03:20:06,341 1360341 INFO mgr-restful-plugin Start monitor loop
2019-07-21 03:20:06,343 1360416 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2019-07-21 03:20:06,625 1360416 INFO mgr-restful-plugin Run command: /usr/bin/ceph auth get mgr.controller-0 -o /var/run/ceph/mgr/ceph-controller-0/keyring
2019-07-21 03:20:06,901 1360416 INFO mgr-restful-plugin Stop unmanaged running ceph-mgr processes
2019-07-21 03:20:07,004 1360416 INFO mgr-restful-plugin Start ceph-mgr daemon
2019-07-21 03:20:22,021 1360416 INFO mgr-restful-plugin Run ...
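
The active mgr instance and the endpoint that mgr-restful-plugin is pinging can be cross-checked with standard Ceph commands (host and port below match the log above):

ceph -s                               # shows which mgr instance is currently active
ceph mgr services                     # shows the URL the restful plugin is registered on
curl -k https://controller-0:5001/    # probe the REST API endpoint directly (-k: self-signed cert)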

tags: added: stx.regression
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 / medium priority - intermittent issue affecting openstack application apply stability.

tags: added: stx.2.0 stx.storage
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Daniel Badea (daniel.badea)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/673817

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/673817
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=d409d78cc1e8d2cb3a9d3b15ef641656487fd078
Submitter: Zuul
Branch: master

commit d409d78cc1e8d2cb3a9d3b15ef641656487fd078
Author: Daniel Badea <email address hidden>
Date: Wed Jul 31 12:47:33 2019 +0000

    ceph: mgr-restful-plugin restarts on controller unlock

    When standby controller is unlocked its mgr-restful-plugin
    service starts and generates node specific self-signed
    certificates to be used by the restful plugin. This operation
    triggers a restart of the "active" mgr restful plugin
    which in turn causes Ceph REST API requests to fail.

    This failure is handled on the active controller by
    restarting the service. This happens while stx-openstack
    is reapplied and is the reason why mariadb pod fails to start.

    Change ceph-mgr and restful plugin config and startup
    procedure so a secondary ceph-mgr service doesn't disrupt
    the active one.

    Closes-Bug: 1837581
    Change-Id: Id8e5e56d48669498202ed319a9aad68365b51f23
    Signed-off-by: Daniel Badea <email address hidden>
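
For context, the upstream Ceph restful module supports per-instance certificates stored under mgr-name-scoped config keys, which is the mechanism that lets a standby controller bring up its own certificate without disturbing the active instance. A sketch of that mechanism (key names per the Ceph restful module documentation; not the literal content of the fix above):

ceph config-key set mgr/restful/controller-1/crt -i controller-1.crt   # cert scoped to mgr "controller-1" only
ceph config-key set mgr/restful/controller-1/key -i controller-1.key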

Changed in starlingx:
status: In Progress → Fix Released