[Containers] application-apply stx-openstack stuck

Bug #1814142 reported by Jose Perez Carranza
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Austin Sun

Bug Description

Title
-----
Application-apply stx-openstack gets stuck on random pods until the 30-minute timeout is reached.

Brief Description
-----------------
When executing `system application-apply`, the apply freezes on a random pod at some point and stays there until the 30-minute timeout is reached. I found that killing the armada process and re-running `system application-apply` lets the application advance until it freezes again on another random pod. After several such cycles the application completes successfully.

Severity
--------
Major

Steps to Reproduce
------------------
Follow the steps to set up a StarlingX deployment as described in [1], then apply the stx-openstack application; a short command sketch follows the reference below.

1 https://wiki.openstack.org/wiki/StarlingX/Containers/Installation
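
Once the deployment is up, the failing step can be reproduced with the standard StarlingX application commands. A minimal sketch, assuming stx-openstack has already been uploaded per the wiki steps:

   # Apply the stx-openstack application; this is the step that hangs
   system application-apply stx-openstack

   # Monitor progress; in this bug the apply stalls on a random chart/pod
   # until the 30-minute timeout is reached
   watch -n 30 system application-list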

Expected Behavior
------------------
command `system application-apply stx-openstack` should be completed successfully

Actual Behavior
----------------
At some point the apply freezes on random pods and stays there until the 30-minute timeout is reached.

Reproducibility
---------------
100% Reproducible

Workaround
----------------
Kill the armada process and re-run `system application-apply`; the application advances until it freezes again on a random pod. After several such cycles the application completes successfully. A command-level sketch follows.
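
A rough sketch of the workaround cycle, assuming armada is visible as a host process (on some builds it may run inside a container and has to be stopped there instead):

   # Identify and kill the stuck armada process
   ps -ef | grep -i armada        # note the PID of the armada apply
   sudo kill <armada-pid>

   # Re-run the apply; it advances past the stuck pod before freezing again
   system application-apply stx-openstack

   # Repeat until the application is reported as applied
   system application-list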

System Configuration
--------------------
- Simplex configured with containers
- Virtual Environment (libvirt)

Branch/Pull Time/Commit
-----------------------
Master
- ISO: stx-2019-01-31.iso [2]

2 http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190131T060000Z/outputs/iso/bootimage.iso

Timestamp/Logs
--------------
Logs attached
- Config.log = System configuration
- conainters_sysinv.log = system inventory log
- stx-openstack-apply.log = armada partial log showing the failure
- pods.log = status of the pods at the moment the apply is frozen

Revision history for this message
Jose Perez Carranza (jgperezc) wrote :
Ghada Khalil (gkhalil)
tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

This failure is occurring because the reporter's lab is unable to access the public docker registry.

There is currently a story in progress to allow users to specify their own registry.
https://storyboard.openstack.org/#!/story/2004711

Assigning to the prime for the story to mark as Fix Released when this issue is addressed.

tags: added: stx.2019.05
Changed in starlingx:
importance: Undecided → High
assignee: nobody → Mingyuan Qi (myqi)
status: New → Triaged
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; related to the containers feature

Revision history for this message
Mingyuan Qi (myqi) wrote :
Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

Applying below workaround:

1. kubectl -n kube-system edit configmap coredns
2. remove "loop" line and save
3. kubectl -n kube-system delete rs {your-coredns-replicaset-name}
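
For reference, a command-level sketch of the same workaround (the replicaset placeholder is resolved with `kubectl get rs`):

   # Inspect the Corefile first; the "loop" plugin line is what step 2 removes
   kubectl -n kube-system get configmap coredns -o yaml | grep -n loop

   # Steps 1-3 above: edit the configmap, then delete the replicaset so the
   # coredns pods are recreated with the edited Corefile
   kubectl -n kube-system edit configmap coredns
   kubectl -n kube-system get rs
   kubectl -n kube-system delete rs {your-coredns-replicaset-name}

   # Confirm the new coredns pods come back up Running
   kubectl -n kube-system get pods | grep coredns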

With this workaround applied, a new behavior is observed: the apply always freezes at 40% on the glance pod until the 30-minute timeout is reached. The relevant log information is below:

- The glance-bootstrap pod cycles through CrashLoopBackOff several times before the timeout is reached.

====================
openstack glance-bootstrap-htn9q 0/1 CrashLoopBackOff 5 7m51s
====================

- The sysinv.log shows:

=============
2019-02-18 15:47:08.578 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-db-sync _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
2019-02-18 15:47:08.578 44 DEBUG armada.handlers.k8s [-] Job glance-db-sync complete (spec.completions=1, status.succeeded=1) _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:536
2019-02-18 15:47:19.870 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-ks-endpoints _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
2019-02-18 15:47:19.870 44 DEBUG armada.handlers.k8s [-] Job glance-ks-endpoints complete (spec.completions=1, status.succeeded=1) _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:536
2019-02-18 15:47:25.070 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-ks-user _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
2019-02-18 15:47:25.070 44 DEBUG armada.handlers.k8s [-] Job glance-ks-user complete (spec.completions=1, status.succeeded=1) _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:536
2019-02-18 15:47:33.262 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-storage-init _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
2019-02-18 15:47:33.263 44 DEBUG armada.handlers.k8s [-] Job glance-storage-init complete (spec.completions=1, status.succeeded=1) _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:536
2019-02-18 15:55:07.438 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-bootstrap _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
2019-02-18 15:55:07.439 44 DEBUG armada.handlers.k8s [-] Watch event MODIFIED on job glance-bootstrap _watch_job_completion /usr/local/lib/python3.5/site-packages/armada/handlers/k8s.py:523
============

Revision history for this message
Austin Sun (sunausti) wrote :

Hi Jose:
   Thanks. Is the glance-bootstrap pod still present?

Frank Miller (sensfan22)
Changed in starlingx:
assignee: Mingyuan Qi (myqi) → Austin Sun (sunausti)
Revision history for this message
Austin Sun (sunausti) wrote :

We have to change the glance chart to work around this when behind a proxy. A scripted sketch follows the steps below.
Steps:
1) tar xvf helm-charts-manifest-no-tests.tgz
2) tar xvf charts/glance-0.1.0.tgz
3) vim charts/glance/templates/bin/_bootstrap.sh.tpl +25
   Add your proxy address to the curl command; it should look like:
   { curl --proxy <proxy-addr> --fail -sSL -O {{ .source_url }}{{ .image_file }};
4) re-package glance-0.1.0.tgz
5) md5sum glance-0.1.0.tgz and update checksum.md5
6) re-package helm-charts-manifest-no-tests-proxy.tgz with the updated glance-0.1.0.tgz
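
A scripted sketch of the same repackaging flow; the tarball layout is assumed from the steps above, and the proxy edit in step 3 is still done by hand:

   # 1-2) Unpack the manifest tarball and the glance chart
   tar xvf helm-charts-manifest-no-tests.tgz
   tar xvf charts/glance-0.1.0.tgz -C charts

   # 3) Edit charts/glance/templates/bin/_bootstrap.sh.tpl (around line 25) and
   #    add "--proxy <proxy-addr>" to the curl command, as shown above

   # 4-5) Re-package the chart and take its new checksum; paste the printed
   #      hash into checksum.md5
   tar czf charts/glance-0.1.0.tgz -C charts glance
   md5sum charts/glance-0.1.0.tgz

   # 6) Re-build the manifest tarball with the updated chart and checksum file,
   #    keeping the rest of the original tarball contents
   tar czf helm-charts-manifest-no-tests-proxy.tgz charts/ checksum.md5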

Austin Sun (sunausti)
Changed in starlingx:
status: Triaged → Opinion
Revision history for this message
Jose Perez Carranza (jgperezc) wrote :

So for this issue we already have two workarounds, and I think the proper fix should address both of them.

==========================================================================
1. coredns WA

Steps:
1. kubectl -n kube-system edit configmap coredns
2. remove "loop" line and save
3. kubectl -n kube-system delete rs {your-coredns-replicaset-name}
==========================================================================
2. Glance WA

Steps:
1) tar xvf helm-charts-manifest-no-tests.tgz
2) tar xvf charts/glance-0.1.0.tgz
3) vim charts/glance/templates/bin/_bootstrap.sh.tpl +25
   Add your proxy address to the curl command; it should look like:
   { curl --proxy <proxy-addr> --fail -sSL -O {{ .source_url }}{{ .image_file }};
4) re-package glance-0.1.0.tgz
5) md5sum glance-0.1.0.tgz and update checksum.md5
6) re-package helm-charts-manifest-no-tests-proxy.tgz with the updated glance-0.1.0.tgz

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/638271
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=28766a8d43f579fb027f4152c3f6586418e1eb9d
Submitter: Zuul
Branch: master

commit 28766a8d43f579fb027f4152c3f6586418e1eb9d
Author: Irina Mihai <email address hidden>
Date: Wed Feb 20 21:11:56 2019 +0000

    Prevent download and creation of default Cirros glance image

    - downloading the Cirros image fails in glance-bootstrap if
      the hardcoded requested image is not found
    - to workaround this issue, we disable the download and creation
      of the Cirros image in glance-bootstrap through the overrides
      -> this has no other impact as the image can be created after
         the chart's installation using "openstack image create"

    Change-Id: I418eb236f5eceb0124eb73787fe12e2f0aa2d9e1
    Closes-Bug: 1814142
    Signed-off-by: Irina Mihai <email address hidden>
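
For reference, a hedged example of creating the image manually once the chart is installed; the image name and source file below are illustrative only:

   # Upload a Cirros image by hand after stx-openstack is applied
   openstack image create cirros \
     --disk-format qcow2 \
     --container-format bare \
     --file cirros-0.4.0-x86_64-disk.img \
     --public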

Changed in starlingx:
status: Opinion → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (f/stein)

Fix proposed to branch: f/stein
Review: https://review.openstack.org/638470

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (f/stein)

Reviewed: https://review.openstack.org/638470
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=cb45f9b3bdf68cbd6d6d21ccbe31d279dd4d05d1
Submitter: Zuul
Branch: f/stein

commit 28766a8d43f579fb027f4152c3f6586418e1eb9d
Author: Irina Mihai <email address hidden>
Date: Wed Feb 20 21:11:56 2019 +0000

    Prevent download and creation of default Cirros glance image

    - downloading the Cirros image fails in glance-bootstrap if
      the hardcoded requested image is not found
    - to workaround this issue, we disable the download and creation
      of the Cirros image in glance-bootstrap through the overrides
      -> this has no other impact as the image can be created after
         the chart's installation using "openstack image create"

    Change-Id: I418eb236f5eceb0124eb73787fe12e2f0aa2d9e1
    Closes-Bug: 1814142
    Signed-off-by: Irina Mihai <email address hidden>

commit 53b9e4661561c85aabe802d098e79c1c099e6bec
Author: SidneyAn <email address hidden>
Date: Mon Feb 18 22:03:57 2019 +0800

    retry func iconfig_update_file when host personality is None

    when we run "system dns-modify" command, the command will response after
    sysinv-db was updated, and file "/etc/resolv.conf" will be updated
    asynchronously by another process "sysinv-agent". Once the attr
    "_ihost_personality" of agent is None(initial value), it will not update
    file "/etc/resolv.conf" and will not inform sysinv client also,
    which will lead command dns-modify failed silently.

    This patch will retry function iconfig_update_file by which sysinv-agent
    update file "/etc/resolv.conf" when attr "_ihost_personality" is None.

    Closes-bug: 1812269

    Change-Id: I3a0437750a53607c04932c1b9b818e83903bb28b
    Signed-off-by: SidneyAn <email address hidden>

commit 0ce137a99a5fe04490dc23d2574beb6b1adbf343
Author: Kristine Bujold <email address hidden>
Date: Mon Feb 18 12:56:25 2019 -0500

    Move gnocchi and ceilometer static configs

    Move all gnocchi and ceilometer static configurations from the
    overrides to the Armada manifest.

    This is being done so we have a consistent way of managing
    containerized openstack configurations. Static configurations will
    be located in the Armada manifest and dynamic configuration will be
    located in the overrides files.

    Story: 2003909
    Task: 29535

    Change-Id: Ieab861cb1751146b70f722e70b8f89d81c0ed9a5
    Signed-off-by: Kristine Bujold <email address hidden>

commit 99e86fc151d326f4c26f9005d8cf84028078261c
Author: Kristine Bujold <email address hidden>
Date: Fri Feb 15 15:35:40 2019 -0500

    Move heat static configs to Armada manifest

    Move all heat static configurations from the overrides to the
    Armada manifest.

    This is being done so we have a consistent way of managing
    containerized openstack configurations. Static configurations will
    be located in the Armada manifest and dynamic configuration will be
    located in the overrides files.

    Story: 2003909
    Task: 29455

    Change-Id: Ie35b1696b9fce0458db724fc8163d5d181e0768a
    Sig...


tags: added: in-f-stein
Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05