stx-openstack apply-failed because of osh-openstack-rabbitmq install issue

Bug #1928949 reported by Alexandru Dimofte
Affects: StarlingX
Status: Fix Released
Importance: Critical
Assigned to: Dan Voiculeasa

Bug Description

Brief Description
-----------------
StarlingX installation failed during provisioning (no sanity test executed) on Standard bare metal, on both RC5.0 and Master. The stx-openstack apply failed because of:
 Error while installing release osh-openstack-rabbitmq: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-rabbitmq failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1621431871.635155914","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-rabbitmq failed: timed out waiting for the condition","grpc_status":2}"

Pod issue:
openstack osh-openstack-rabbitmq-cluster-wait-ht2ws 0/1 Init:0/2 0 45m
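
For reference, a typical way to dig further into a pod stuck in Init like the one above (a sketch only; the pod names are the instances from this report and will differ on another system):

controller-0:~$ kubectl -n openstack describe pod osh-openstack-rabbitmq-cluster-wait-ht2ws   # shows which init container is blocked and the recent events
controller-0:~$ kubectl -n openstack get pods | grep rabbitmq                                  # check the rabbitmq server pods it is waiting for
controller-0:~$ kubectl -n openstack logs osh-openstack-rabbitmq-rabbitmq-0                    # server pod logs (see the analysis in the comments below)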

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
Install latest RC5.0 image

Expected Behavior
------------------
StarlingX should install successfully

Actual Behavior
----------------
StarlingX installation failed:

2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller [-] [chart=openstack-rabbitmq]: Error while installing release osh-openstack-rabbitmq: grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "release osh-openstack-rabbitmq failed: timed out waiting for the condition"
        debug_error_string = "{"created":"@1621431871.635155914","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-rabbitmq failed: timed out waiting for the condition","grpc_status":2}"
>
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller Traceback (most recent call last):
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 473, in install_release
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller metadata=self.metadata)
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller return _end_unary_response_blocking(state, call, False, None)
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller raise _Rendezvous(state, None, None, deadline)
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller status = StatusCode.UNKNOWN
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller details = "release osh-openstack-rabbitmq failed: timed out waiting for the condition"
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller debug_error_string = "{"created":"@1621431871.635155914","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-rabbitmq failed: timed out waiting for the condition","grpc_status":2}"
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller >
2021-05-19 13:44:31.635 649 ERROR armada.handlers.tiller
2021-05-19 13:44:31.636 649 DEBUG armada.handlers.tiller [-] [chart=openstack-rabbitmq]: Helm getting release status for release=osh-openstack-rabbitmq, version=0 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:539
2021-05-19 13:44:31.803 649 DEBUG armada.handlers.tiller [-] [chart=openstack-rabbitmq]: GetReleaseStatus= name: "osh-openstack-rabbitmq"
info {
  status {
    code: FAILED
  }
  first_deployed {
    seconds: 1621430071
    nanos: 418993684
  }
  last_deployed {
    seconds: 1621430071
    nanos: 418993684
  }
  Description: "Release \"osh-openstack-rabbitmq\" failed: timed out waiting for the condition"
}
namespace: "openstack"
 get_release_status /usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py:547
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada [-] Chart deploy [openstack-rabbitmq] failed: armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-rabbitmq - Tiller Message: b'Release "osh-openstack-rabbitmq" failed: timed out waiting for the condition'
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada Traceback (most recent call last):
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 473, in install_release
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada metadata=self.metadata)
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 533, in __call__
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada return _end_unary_response_blocking(state, call, False, None)
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/grpc/_channel.py", line 467, in _end_unary_response_blocking
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada raise _Rendezvous(state, None, None, deadline)
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with:
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada status = StatusCode.UNKNOWN
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada details = "release osh-openstack-rabbitmq failed: timed out waiting for the condition"
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada debug_error_string = "{"created":"@1621431871.635155914","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"release osh-openstack-rabbitmq failed: timed out waiting for the condition","grpc_status":2}"
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada >
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada During handling of the above exception, another exception occurred:
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada Traceback (most recent call last):
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 225, in handle_result
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada result = get_result()
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 236, in <lambda>
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada if (handle_result(chart, lambda: deploy_chart(chart))):
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 214, in deploy_chart
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada chart, cg_test_all_charts, prefix, known_releases)
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/chart_deploy.py", line 239, in execute
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada timeout=timer)
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada File "/usr/local/lib/python3.6/dist-packages/armada/handlers/tiller.py", line 486, in install_release
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada raise ex.ReleaseException(release, status, 'Install')
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada armada.exceptions.tiller_exceptions.ReleaseException: Failed to Install release: osh-openstack-rabbitmq - Tiller Message: b'Release "osh-openstack-rabbitmq" failed: timed out waiting for the condition'
2021-05-19 13:44:31.804 649 ERROR armada.handlers.armada
2021-05-19 13:44:31.805 649 ERROR armada.handlers.armada [-] Chart deploy(s) failed: ['openstack-rabbitmq']
2021-05-19 13:44:32.474 649 INFO armada.handlers.lock [-] Releasing lock
2021-05-19 13:44:32.480 649 ERROR armada.cli [-] Caught internal exception: armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openstack-rabbitmq']
2021-05-19 13:44:32.480 649 ERROR armada.cli Traceback (most recent call last):
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/__init__.py", line 38, in safe_invoke
2021-05-19 13:44:32.480 649 ERROR armada.cli self.invoke()
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 213, in invoke
2021-05-19 13:44:32.480 649 ERROR armada.cli resp = self.handle(documents, tiller)
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/lock.py", line 81, in func_wrapper
2021-05-19 13:44:32.480 649 ERROR armada.cli return future.result()
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 425, in result
2021-05-19 13:44:32.480 649 ERROR armada.cli return self.__get_result()
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
2021-05-19 13:44:32.480 649 ERROR armada.cli raise self._exception
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
2021-05-19 13:44:32.480 649 ERROR armada.cli result = self.fn(*self.args, **self.kwargs)
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/cli/apply.py", line 256, in handle
2021-05-19 13:44:32.480 649 ERROR armada.cli return armada.sync()
2021-05-19 13:44:32.480 649 ERROR armada.cli File "/usr/local/lib/python3.6/dist-packages/armada/handlers/armada.py", line 252, in sync
2021-05-19 13:44:32.480 649 ERROR armada.cli raise armada_exceptions.ChartDeployException(failures)
2021-05-19 13:44:32.480 649 ERROR armada.cli armada.exceptions.armada_exceptions.ChartDeployException: Exception deploying charts: ['openstack-rabbitmq']
2021-05-19 13:44:32.480 649 ERROR armada.cli
command terminated with exit code 1

Reproducibility
---------------
This information is not available yet.

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
RC5.0 and Master

Last Pass
---------
yesterday

Timestamp/Logs
--------------
will be attached

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Austin Sun (sunausti) wrote :

This issue occurs because the osh-openstack-rabbitmq-rabbitmq-0 pod keeps crashing.

controller-0:~$ kubectl logs -n openstack osh-openstack-rabbitmq-rabbitmq-0
++ echo osh-openstack-rabbitmq-rabbitmq-0
++ awk -F - '{print $NF}'
+ POD_INCREMENT=0
+ '[' 0 -eq 0 ']'
+ exec rabbitmq-server
Failed to create thread: Resource temporarily unavailable (11)
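
That "Failed to create thread: Resource temporarily unavailable" error is the classic symptom of the pod hitting its pids cgroup limit rather than running out of memory. A quick way to confirm, assuming the container stays up long enough to exec into (the cgroup path below is typical for cgroup v1 and may differ with another cgroup driver or Kubernetes version):

controller-0:~$ kubectl -n openstack exec osh-openstack-rabbitmq-rabbitmq-0 -- cat /sys/fs/cgroup/pids/pids.max       # effective PID limit seen by the pod
controller-0:~$ kubectl -n openstack exec osh-openstack-rabbitmq-rabbitmq-0 -- cat /sys/fs/cgroup/pids/pids.current   # how many tasks it is actually using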

Another finding:
controller-0's build info is not the same as controller-1's; controller-0 is based on 5.0, but controller-1 is on master. This is very strange.

Revision history for this message
Austin Sun (sunausti) wrote :

For controller-1, the kubelet includes changes related to the pod PID limit:
https://review.opendev.org/c/starlingx/ansible-playbooks/+/791292/3

We suspect this impacts the rabbitmq pod's behavior.

Revision history for this message
Austin Sun (sunausti) wrote :

After changing --pod-max-pids to 2000 in /etc/sysconfig/kubelet and restarting kubelet, the rabbitmq pod on controller-1 starts and runs healthy.
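
For anyone hitting this before a fix lands, a rough sketch of the workaround Austin describes (how exactly --pod-max-pids is wired into /etc/sysconfig/kubelet can vary between loads, so check the file first; a manual edit like this is also likely to be reverted by the platform's puppet configuration on a later apply, so treat it as temporary):

controller-1:~$ sudo grep -- '--pod-max-pids' /etc/sysconfig/kubelet                               # see the currently configured limit
controller-1:~$ sudo sed -i 's/--pod-max-pids=[0-9]*/--pod-max-pids=2000/' /etc/sysconfig/kubelet  # raise it to 2000
controller-1:~$ sudo systemctl restart kubelet
controller-1:~$ kubectl -n openstack get pods | grep rabbitmq                                      # the rabbitmq pod should now start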

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Austin / Alexandru,
The commit mentioned above is not in the r/stx.5.0 branch, so it does not explain why this issue would be seen in the RC5.0 build.

Changed in starlingx:
importance: Undecided → Critical
tags: added: stx.5.0 stx.6.0 stx.distro.openstack
tags: added: stx.containers
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Austin has indicated that in the sanity lab controller-1 is running an stx master load while controller-0 is running r/stx.5.0. This would point to a lab setup issue and is perhaps the reason Alexandru is reporting the issue on r/stx.5.0.

@Alexandru, please investigate this and re-run the sanity with the r/stx.5.0 builds.

Revision history for this message
Frank Miller (sensfan22) wrote :

There are 2 issues reported in this LP. One is due to controller-1 not booting the right load, and the second is due to rabbitmq not recovering because of the pod PID limit issue. Assigning this LP to Dan to address the second issue, which needs to be fixed on the master branch only.

Changed in starlingx:
assignee: nobody → Dan Voiculeasa (dvoicule)
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/792565

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/792582

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/792565
Committed: https://opendev.org/starlingx/config/commit/8423e70fd04f07bbf6a22eb83d45c719663b0c51
Submitter: "Zuul (22348)"
Branch: master

commit 8423e70fd04f07bbf6a22eb83d45c719663b0c51
Author: Dan Voiculeasa <email address hidden>
Date: Fri May 21 12:56:47 2021 +0300

    Fix pod max pids service parameter default value

    Openstack installation fails for rabbit-mq pods.

    Change the approach of how the default value is selected.
    Document recommended minimum values for apps instead of using them.
    Select the default value as high as possible, protecting against a
    rogue pod, protecting against platform slowdowns created by high number
    of processes in the system, but low enough such that platform is still
    responsive even on older hardware.
    User is free to decrease the limit to increase the degree of protection
    against slowdowns.

    Initially it was observed that openstack pods reach ~450 processes
    in steady state.
    New tests show even with the 2/3 extra room, 750 pid limit is not
    sufficient when deploying rabbit-mq pods. But 2000 is.
    Recommended minimum value for openstack pods pid limit becomes 2000.

    Partial-Bug: 1928949
    Related-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I0d66173e2247fae15eda1ad0e83c7bcf858f0369
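
The fix above changes the default for the pod PID limit service parameter. If a value has been explicitly configured, it should be visible via the sysinv CLI; the parameter name below is inferred from the commit title and is an assumption:

controller-0:~$ source /etc/platform/openrc
controller-0:~$ system service-parameter-list | grep -i pod_max_pids   # no output means the default described in the commit applies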

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/792582
Committed: https://opendev.org/starlingx/stx-puppet/commit/4a9710600d7dfd1c12d92695dbfedff619ce482a
Submitter: "Zuul (22348)"
Branch: master

commit 4a9710600d7dfd1c12d92695dbfedff619ce482a
Author: Dan Voiculeasa <email address hidden>
Date: Fri May 21 14:03:30 2021 +0300

    Change pod pid limit default value

    Change the behavior of kubernetes pod pid limit in case the service
    parameter is missing.

    The initial change(I10c1684fe3145e0a46b011f8e87f7a23557ddd4a) proposed a
    value to protect the system by default in case the service parameter was
    missing. The value was aligned with what was believed to work for
    StarlingX apps. Some apps, openstack for example, are upstream and
    StarlingX doesn't control changes inside them. Instead of maintaining
    the value initially proposed here, change the approach.

    Change the behaviour to use the maximum value for the service parameter
    by default.

    Partial-Bug: 1928949
    Related-Bug: 1928353
    Signed-off-by: Dan Voiculeasa <email address hidden>
    Change-Id: I0f776d9a8be57363475b926242a6fa7192addd56
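
After picking up both changes, the limit actually applied on a node can be double-checked with something like the following (a sketch; the flag may be rendered into /etc/sysconfig/kubelet or passed straight on the kubelet command line):

controller-0:~$ grep -o 'pod-max-pids=[0-9-]*' /etc/sysconfig/kubelet   # value written into the kubelet config file
controller-0:~$ ps -o args= -C kubelet | tr ' ' '\n' | grep pod-max-pids   # value on the running kubelet's command line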

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Alexandru confirmed that this issue is only applicable to stx master. Once the lab config was fixed to boot the stx.5.0 load on both controllers, this was not seen on the 5.0 loads. Removing the stx.5.0 label.

tags: removed: stx.5.0
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as Fix Released. The code changes above should address the issue reported on the stx master branch.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/794611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (f/centos8)

Reviewed: https://review.opendev.org/c/starlingx/config/+/794906
Committed: https://opendev.org/starlingx/config/commit/75758b37a5a23c8811355b67e2a430a1713cd85b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 9e420d9513e5fafb1df4d29567bc299a9e04d58d
Author: Bin Qian <email address hidden>
Date: Mon May 31 14:45:52 2021 -0400

    Add more logging to run docker login

    Add error log for running docker login. The new log could
    help identify docker login failure.

    Closes-Bug: 1930310
    Change-Id: I8a709fb6665de8301fbe3022563499a92b2a0211
    Signed-off-by: Bin Qian <email address hidden>

commit 31c77439d2cea590dfcca13cfa646522665f8686
Author: albailey <email address hidden>
Date: Fri May 28 13:42:42 2021 -0500

    Fix controller-0 downgrade failing to kill ceph

    kill_ceph_storage_monitor tried to manipulate a pmon
    file that does not exist in an AIO-DX environment.

    We no longer invoke kill_ceph_storage_monitor in an
    AIO SX or DX env.

    This allows: "system host-downgrade controller-0"
    to proceed in an AIO-DX environment where that second
    controller (controller-0) was upgraded.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I633853f75317736084feae96b5b849c601204c13

commit 0dc99eee608336fe01b58821ea404286371f1408
Author: albailey <email address hidden>
Date: Fri May 28 11:05:43 2021 -0500

    Fix file permissions failure during duplex upgrade abort

    When issuing a downgrade for controller-0 in a duplex upgrade
    abort and rollback scenario, the downgrade command was failing
    because the sysinv API does not have root permissions to set
    a file flag.
    The fix is to use RPC so the conductor can create the flag
    and allow the downgrade for controller-0 to get further.

    Partial-Bug: 1929884
    Signed-off-by: albailey <email address hidden>
    Change-Id: I913bcad73309fe887a12cbb016a518da93327947

commit 7ef3724dad173754e40b45538b1cc726a458cc1c
Author: Chen, Haochuan Z <email address hidden>
Date: Tue May 25 16:16:29 2021 +0800

    Fix bug rook-ceph provision with multi osd on one host

    Test case:
    1, deploy simplex system
    2, apply rook-ceph with below override value
    value.yaml
    cluster:
      storage:
        nodes:
        - name: controller-0
          devices:
          - name: sdb
          - name: sdc
    3, reboot

    Without this fix, only osd pod could launch successfully after boot
    as vg start with ceph could not correctly add in sysinv-database

    Closes-bug: 1929511

    Change-Id: Ia5be599cd168d13d2aab7b5e5890376c3c8a0019
    Signed-off-by: Chen, Haochuan Z <email address hidden>

commit 23505ba77d76114cf8a0bf833f9a5bcd05bc1dd1
Author: Angie Wang <email address hidden>
Date: Tue May 25 18:49:21 2021 -0400

    Fix issue in partition data migration script

    The created partition dictonary partition_map is not
    an ordered dict so we need to sort it by its key -
    device node when iterating it to adjust the device
    nodes/paths for user created extra partitions to ensure
    the number of device node...

tags: added: in-f-centos8
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on config (f/centos8)

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793696

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Chuck Short <email address hidden>" on branch: f/centos8
Review: https://review.opendev.org/c/starlingx/config/+/793460
