Brief Description
-----------------
While an application-apply of stx-openstack was in progress, the sysinv-conductor process was observed to fail. This left the apply stuck in the 'retrieving docker images' state.
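The stuck state can be confirmed from the active controller with the standard system client, for example (a quick check; exact status/progress strings may vary by load):
  system application-list                  # progress stays at 'retrieving docker images'
  system application-show stx-openstack    # status remains 'applying' and does not advance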
Severity
--------
Major
Steps to Reproduce
------------------
1. Run system application-apply stx-openstack
2. tail /var/log/sysinv.log
3. Observe sysinv-conductor failure
4. The system is now stuck in the 'retrieving docker images' state
We need to investigate both why sysinv-conductor failed and how to recover from the failure.
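To confirm that the conductor process actually died (rather than only stalling), the SM and sysinv logs can be checked along the following lines (a rough sketch, assuming the default log locations):
  grep 'sysinv-conductor' /var/log/sm.log | tail      # SM's view of the service (process failure / restart attempts)
  grep 'Caught SIGTERM' /var/log/sysinv.log | tail    # whether the conductor was asked to stop
  ps -ef | grep '[s]ysinv-conductor'                  # whether a new conductor pid has been spawned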
Expected Behavior
------------------
The system should not experience sysinv-conductor failure. If failure does occur, it should be possible to recover from it.
Actual Behavior
----------------
The sysinv-conductor failure was observed, and the system did not recover automatically.
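A possible manual workaround (unverified; sketched here only as a starting point for the recovery investigation) is to restart the conductor under SM control once it has died and then retry the apply:
  sudo sm-restart service sysinv-conductor   # assumes SM has not already respawned it
  system application-apply stx-openstack     # retry the apply once the conductor is back up
Whether a re-apply is accepted while the application is still reported as applying is part of what needs to be verified.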
Reproducibility
---------------
Reproducible.
System Configuration
--------------------
AIO-DX system (in particular Supermicro hardware)
Branch/Pull Time/Commit
-----------------------
Designer build provided on April 22 (rebased to latest software); includes the Ceph Mimic upversion code
Timestamp/Logs
--------------
Around the time of the sysinv-conductor death, kern.log contains entries suggesting very sluggish etcd response times:
2019-04-23T14:18:12.071 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071519 W | etcdserver: read-only range request "key:\"/registry/jobs\" range_end:\"/registry/jobt\" count_only:true " with result "range_response_count:0 size:6" took too long (758.4762ms) to execute
2019-04-23T14:18:12.071 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071630 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices\" range_end:\"/registry/apiregistration.k8s.io/apiservicet\" count_only:true " with result "range_response_count:0 size:8" took too long (830.395705ms) to execute
2019-04-23T14:18:12.074 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071696 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/\" range_end:\"/registry/pods/kube-system0\" " with result "range_response_count:14 size:35010" took too long (739.355553ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071739 W | etcdserver: read-only range request "key:\"/registry/volumeattachments\" range_end:\"/registry/volumeattachmentt\" count_only:true " with result "range_response_count:0 size:6" took too long (538.18197ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071903 W | etcdserver: request "header:<ID:7587837798489357492 > txn:<compare:<target:MOD key:\"/registry/minions/controller-0\" mod_revision:95343 > success:<request_put:<key:\"/registry/minions/controller-0\" value_size:6434 >> failure:<request_range:<key:\"/registry/minions/controller-0\" > >>" with result "size:18" took too long (101.433392ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072054 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:456" took too long (552.215347ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072117 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:438" took too long (371.086371ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072154 W | etcdserver: read-only range request "key:\"/registry/replicasets\" range_end:\"/registry/replicasett\" count_only:true " with result "range_response_count:0 size:8" took too long (230.526471ms) to execute
2019-04-23T14:18:12.256 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.255991 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/tiller-deploy-744cffbb75-nk7bf\" " with result "range_response_count:1 size:1595" took too long (145.25869ms) to execute
2019-04-23T14:18:18.366 controller-0 kernel: warning [67557.839251] iptables-in-dropped: IN=eno1 OUT= MAC=ff:ff:ff:ff:ff:ff:5c:f3:fc:d7:cf:53:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=338 TOS=0x10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=318
2019-04-23T14:18:19.824 controller-0 kernel: info [67559.297232] docker0: port 1(vethac15f92) entered disabled state
2019-04-23T14:18:19.851 controller-0 kernel: info [67559.323975] docker0: port 1(vethac15f92) entered disabled state
2019-04-23T14:18:19.855 controller-0 kernel: info [67559.327950] device vethac15f92 left promiscuous mode
2019-04-23T14:18:19.855 controller-0 kernel: info [67559.327954] docker0: port 1(vethac15f92) entered disabled state
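The volume of slow etcd requests around the failure window can be gauged with a simple count (a sketch; it assumes the etcd output is forwarded into kern.log as shown above):
  grep -c 'etcdserver.*took too long' /var/log/kern.log                      # total slow-request warnings
  grep 'etcdserver.*took too long' /var/log/kern.log | cut -c1-16 | uniq -c  # slow requests per minute (lines are chronological)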
Here is the hung process (sysinv-conductor, pid 107974) in sysinv.log; note that it caught SIGTERM at 14:17:20 but continued downloading images rather than exiting:
2019-04-23 14:16:31.435 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 download started from local registry
2019-04-23 14:16:33.350 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 is not available in local registry, download started from public/private registry
2019-04-23 14:16:38.354 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/openstackhelm/openvswitch:v2.8.1 download succeeded in 60 seconds
2019-04-23 14:16:38.427 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-cinder:master-centos-stable-latest download started from local registry
2019-04-23 14:16:40.326 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-cinder:master-centos-stable-latest is not available in local registry, download started from public/private registry
2019-04-23 14:16:49.874 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/osixia/keepalived:1.4.5 download succeeded in 70 seconds
2019-04-23 14:17:10.095 108969 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-04-23-14-17-10 patch
2019-04-23 14:17:10.095 108969 INFO sysinv.api.controllers.v1.host [-] controller-0 1. delta_handle ['availability']
2019-04-23 14:17:10.229 108969 INFO sysinv.api.controllers.v1.host [-] host controller-0 ihost_patch_end_2019-04-23-14-17-10 patch
2019-04-23 14:17:20.915 108074 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-04-23 14:17:20.916 108074 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.916 108074 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-04-23 14:17:20.918 108969 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.944 108074 INFO oslo_service.service [-] Child 108969 exited with status 0
2019-04-23 14:17:20.950 108074 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-04-23 14:17:20.950 108074 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.951 108971 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.957 108074 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-04-23 14:17:20.957 108074 INFO oslo_service.service [-] Child 108971 exited with status 0
2019-04-23 14:17:20.976 107974 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
2019-04-23 14:17:24.195 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0 download succeeded in 106 seconds
2019-04-23 14:17:54.135 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-horizon:master-centos-stable-latest download succeeded in 136 seconds
2019-04-23 14:17:54.217 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/docker:17.07.0 download started from local registry
2019-04-23 14:17:54.313 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-nova:master-centos-stable-latest download started from local registry
2019-04-23 14:17:55.367 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/prom/memcached-exporter:v0.4.1 download succeeded in 136 seconds
2019-04-23 14:17:56.535 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/docker:17.07.0 is not available in local registry, download started from public/private registry
2019-04-23 14:17:56.618 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-nova:master-centos-stable-latest is not available in local registry, download started from public/private registry
2019-04-23 14:18:07.669 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/rabbitmq:3.7.13-management download succeeded in 149 seconds
2019-04-23 14:18:12.402 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download succeeded in 154 seconds
2019-04-23 14:18:12.431 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-libvirt:master-centos-stable-latest download started from local registry
2019-04-23 14:18:14.476 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-libvirt:master-centos-stable-latest is not available in local registry, download started from public/private registry
sysinv-conductor died according to the SM logs:
2019-04-23T14:18:18.000 controller-0 sm: debug time[67547.465] log<1161> INFO: sm[92977]: sm_service_fsm.c(1451): Service (sysinv-conductor) process failure, pid=107974, exit_code=-65533.
The suspicion is that the sysinv failure is tied to recent sanity failures. This LP will be used to focus on recovery when sysinv dies during an stx-openstack apply.