Application-apply stuck in retrieving images state if sysinv crashes during apply

Bug #1826047 reported by Maria Yousaf
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: chen haochuan

Bug Description

Brief Description
-----------------
While the system was in the application-apply state, sysinv-conductor was observed to fail. This resulted in the apply being stuck in the 'retrieving docker images' state.
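
For context, here is a minimal sketch of the kind of parallel image retrieval that sysinv.conductor.kube_app performs during this stage. It is illustrative only, not the actual sysinv code: the helper names (download_one, download_images), the thread-pool approach, and the use of the docker Python client are assumptions, and the local registry address is simply the one that appears in the logs further down. The point it illustrates is that the application status recorded in the database only advances once the pulls finish, so killing the conductor mid-download leaves the apply showing 'retrieving docker images' indefinitely.

# Illustrative sketch only -- not the actual sysinv implementation.
from concurrent.futures import ThreadPoolExecutor

import docker

LOCAL_REGISTRY = '192.168.204.2:9001'  # value taken from the logs below

def download_one(client, image):
    """Try the local registry first; fall back to the public/private
    registry if the image is not cached locally (any population of the
    local registry done by the real code is omitted here)."""
    local_ref = '%s/%s' % (LOCAL_REGISTRY, image)
    try:
        return client.images.pull(local_ref)
    except docker.errors.APIError:
        # Corresponds to 'is not available in local registry, download
        # started from public/private registry' in the logs below.
        return client.images.pull(image)

def download_images(images, workers=5):
    client = docker.from_env(timeout=300)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # If the conductor is killed (e.g. SIGTERM) while these pulls are
        # still in flight, the stored application status never advances
        # past 'retrieving docker images'.
        return list(pool.map(lambda img: download_one(client, img), images))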

Severity
--------
Major

Steps to Reproduce
------------------
1. Run system application-apply stx-openstack
2. tail /var/log/sysinv.log
3. Observe sysinv-conductor failure
4. Observe that the apply is now stuck in the 'retrieving docker images' state

Need to investigate why sysinv-conductor failed, as well as how to recover from the failure (a small status-polling helper is sketched below).
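
As a convenience while reproducing the steps above, a helper along these lines (not part of StarlingX; it assumes the 'system' CLI is available and the usual credentials have been sourced) can be used to watch the apply progress:

# Illustrative helper, not part of StarlingX.
import subprocess
import time

def watch_apply(interval=30):
    """Print 'system application-list' output periodically. When the bug
    reproduces, the progress column keeps showing 'retrieving docker
    images' indefinitely after sysinv-conductor dies."""
    while True:
        out = subprocess.run(['system', 'application-list'],
                             capture_output=True, text=True).stdout
        print(out)
        if 'retrieving docker images' not in out:
            break
        time.sleep(interval)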

Expected Behavior
------------------
The system should not experience sysinv-conductor failure. If failure does occur, it should be possible to recover from it.

Actual Behavior
----------------
The sysinv-conductor failure was observed and recovery was not automatic; the apply remained stuck.

Reproducibility
---------------
Reproducible.

System Configuration
--------------------
AIO-DX system (in particular Supermicro hardware)

Branch/Pull Time/Commit
-----------------------
Designer build provided on April 22 (rebased to latest software); includes the Ceph Mimic upversion code

Timestamp/Logs
--------------

Around the time of the sysinv-conductor death there are entries in kern.log which suggest very sluggish etcd response times:

2019-04-23T14:18:12.071 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071519 W | etcdserver: read-only range request "key:\"/registry/jobs\" range_end:\"/registry/jobt\" count_only:true " with result "range_response_count:0 size:6" took too long (758.4762ms) to execute
2019-04-23T14:18:12.071 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071630 W | etcdserver: read-only range request "key:\"/registry/apiregistration.k8s.io/apiservices\" range_end:\"/registry/apiregistration.k8s.io/apiservicet\" count_only:true " with result "range_response_count:0 size:8" took too long (830.395705ms) to execute
2019-04-23T14:18:12.074 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071696 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/\" range_end:\"/registry/pods/kube-system0\" " with result "range_response_count:14 size:35010" took too long (739.355553ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071739 W | etcdserver: read-only range request "key:\"/registry/volumeattachments\" range_end:\"/registry/volumeattachmentt\" count_only:true " with result "range_response_count:0 size:6" took too long (538.18197ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.071903 W | etcdserver: request "header:<ID:7587837798489357492 > txn:<compare:<target:MOD key:\"/registry/minions/controller-0\" mod_revision:95343 > success:<request_put:<key:\"/registry/minions/controller-0\" value_size:6434 >> failure:<request_range:<key:\"/registry/minions/controller-0\" > >>" with result "size:18" took too long (101.433392ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072054 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-controller-manager\" " with result "range_response_count:1 size:456" took too long (552.215347ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072117 W | etcdserver: read-only range request "key:\"/registry/services/endpoints/kube-system/kube-scheduler\" " with result "range_response_count:1 size:438" took too long (371.086371ms) to execute
2019-04-23T14:18:12.075 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.072154 W | etcdserver: read-only range request "key:\"/registry/replicasets\" range_end:\"/registry/replicasett\" count_only:true " with result "range_response_count:0 size:8" took too long (230.526471ms) to execute
2019-04-23T14:18:12.256 controller-0 forward-journal[97311]: info 2019-04-23 14:18:12.255991 W | etcdserver: read-only range request "key:\"/registry/pods/kube-system/tiller-deploy-744cffbb75-nk7bf\" " with result "range_response_count:1 size:1595" took too long (145.25869ms) to execute
2019-04-23T14:18:18.366 controller-0 kernel: warning [67557.839251] iptables-in-dropped: IN=eno1 OUT= MAC=ff:ff:ff:ff:ff:ff:5c:f3:fc:d7:cf:53:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=338 TOS=0x10 PREC=0x00 TTL=128 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=318
2019-04-23T14:18:19.824 controller-0 kernel: info [67559.297232] docker0: port 1(vethac15f92) entered disabled state
2019-04-23T14:18:19.851 controller-0 kernel: info [67559.323975] docker0: port 1(vethac15f92) entered disabled state
2019-04-23T14:18:19.855 controller-0 kernel: info [67559.327950] device vethac15f92 left promiscuous mode
2019-04-23T14:18:19.855 controller-0 kernel: info [67559.327954] docker0: port 1(vethac15f92) entered disabled state

Here is the hung conductor process (pid 107974) in sysinv.log:

2019-04-23 14:16:31.435 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 download started from local registry
2019-04-23 14:16:33.350 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/stackanetes/kubernetes-entrypoint:v0.3.1 is not available in local registry, download started from public/private registry
2019-04-23 14:16:38.354 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/openstackhelm/openvswitch:v2.8.1 download succeeded in 60 seconds
2019-04-23 14:16:38.427 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-cinder:master-centos-stable-latest download started from local registry
2019-04-23 14:16:40.326 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-cinder:master-centos-stable-latest is not available in local registry, download started from public/private registry
2019-04-23 14:16:49.874 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/osixia/keepalived:1.4.5 download succeeded in 70 seconds

2019-04-23 14:17:10.095 108969 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-04-23-14-17-10 patch
2019-04-23 14:17:10.095 108969 INFO sysinv.api.controllers.v1.host [-] controller-0 1. delta_handle ['availability']
2019-04-23 14:17:10.229 108969 INFO sysinv.api.controllers.v1.host [-] host controller-0 ihost_patch_end_2019-04-23-14-17-10 patch
2019-04-23 14:17:20.915 108074 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-04-23 14:17:20.916 108074 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.916 108074 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-04-23 14:17:20.918 108969 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.944 108074 INFO oslo_service.service [-] Child 108969 exited with status 0
2019-04-23 14:17:20.950 108074 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-04-23 14:17:20.950 108074 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.951 108971 INFO oslo.service.wsgi [-] Stopping WSGI server.
2019-04-23 14:17:20.957 108074 INFO oslo_service.service [-] Waiting on 1 children to exit
2019-04-23 14:17:20.957 108074 INFO oslo_service.service [-] Child 108971 exited with status 0

2019-04-23 14:17:20.976 107974 INFO sysinv.openstack.common.service [-] Caught SIGTERM, exiting
2019-04-23 14:17:24.195 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.9.0 download succeeded in 106 seconds
2019-04-23 14:17:54.135 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-horizon:master-centos-stable-latest download succeeded in 136 seconds
2019-04-23 14:17:54.217 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/docker:17.07.0 download started from local registry
2019-04-23 14:17:54.313 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-nova:master-centos-stable-latest download started from local registry
2019-04-23 14:17:55.367 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/prom/memcached-exporter:v0.4.1 download succeeded in 136 seconds
2019-04-23 14:17:56.535 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/docker:17.07.0 is not available in local registry, download started from public/private registry
2019-04-23 14:17:56.618 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-nova:master-centos-stable-latest is not available in local registry, download started from public/private registry
2019-04-23 14:18:07.669 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/rabbitmq:3.7.13-management download succeeded in 149 seconds
2019-04-23 14:18:12.402 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/quay.io/external_storage/rbd-provisioner:v2.1.1-k8s1.11 download succeeded in 154 seconds
2019-04-23 14:18:12.431 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-libvirt:master-centos-stable-latest download started from local registry
2019-04-23 14:18:14.476 107974 INFO sysinv.conductor.kube_app [-] Image 192.168.204.2:9001/docker.io/starlingx/stx-libvirt:master-centos-stable-latest is not available in local registry, download started from public/private registry

According to the SM logs, sysinv-conductor died:
2019-04-23T14:18:18.000 controller-0 sm: debug time[67547.465] log<1161> INFO: sm[92977]: sm_service_fsm.c(1451): Service (sysinv-conductor) process failure, pid=107974, exit_code=-65533.

Revision history for this message
Frank Miller (sensfan22) wrote :

The suspicion is that the sysinv failure is tied to recent sanity failures. This LP will be used to focus on recovery when sysinv dies during an stx-openstack apply.

Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking as release gating; the system should recover from an application-apply failure such that a second apply attempt is possible even if the first one is stuck in the 'retrieving docker images' state.
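
For illustration, the desired recovery could look roughly like the sketch below: when sysinv-conductor starts, any application left in a transient applying state by the previous conductor instance is moved to a failed state so that a second application-apply is accepted. The database method names (kube_app_get_all, kube_app_update) and the status strings are assumptions for the sketch, not the actual fix.

# Rough sketch of the desired recovery behaviour; method names and status
# constants are assumptions, not the actual sysinv change.
APP_APPLY_IN_PROGRESS = 'applying'
APP_APPLY_FAILURE = 'apply-failed'

def recover_interrupted_applies(dbapi, log):
    """Run once at conductor startup: fail any apply the previous
    conductor instance left in flight so a re-apply is possible."""
    for app in dbapi.kube_app_get_all():
        if app.status == APP_APPLY_IN_PROGRESS:
            log.warning("Apply of %s was interrupted by a conductor "
                        "restart; marking it %s", app.name, APP_APPLY_FAILURE)
            dbapi.kube_app_update(app.id,
                                  {'status': APP_APPLY_FAILURE,
                                   'progress': 'operation aborted by '
                                               'conductor restart'})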

Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Revision history for this message
Frank Miller (sensfan22) wrote :

Assigning to Cindy to identify a developer who can work on improving recovery when an application-apply fails, leaving the system in the 'retrieving docker images' state.

Changed in starlingx:
status: New → Triaged
tags: added: stx.2.0 stx.retestneeded
Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/664521

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (master)

Change abandoned by chen haochuan (<email address hidden>) on branch: master
Review: https://review.opendev.org/664521

Revision history for this message
chen haochuan (martin1982) wrote :
Changed in starlingx:
status: In Progress → Confirmed
status: Confirmed → Invalid
status: Invalid → Opinion
status: Opinion → Invalid
Cindy Xie (xxie1)
Changed in starlingx:
status: Invalid → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by chen haochuan (<email address hidden>) on branch: master
Review: https://review.opendev.org/664521

Revision history for this message
Ghada Khalil (gkhalil) wrote :

The duplicate bug is fixed in master and in the r/stx.2.0 branch. Marking this one as Fix Released to match.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This was marked as a duplicate of LP #1833323, which has already been retested and closed.
Closing this one.

tags: removed: stx.retestneeded