DC system bootstrap failing because of RabbitMQ having timeout issues

Bug #1995518 reported by Samuel Presa Toledo
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Samuel Presa Toledo

Bug Description

Brief Description
-----------------
[1] describes a 10s timeout behavior when running "rabbitmqctl wait".
Analysis of the rabbit logs from a Distributed Cloud (DC) system that was
showing rabbit startup issues indicates that rabbit startup was taking
around 8.5s. Given the 10s timeout behavior described in [1], the rabbit
service stops working after controller-0 of a DC system is rebooted, a
reboot triggered by the deployment manager execution.

This was observed in some DC Systems.

[1] https://github.com/rabbitmq/rabbitmq-server-release/pull/129#issue-599125985
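
To illustrate the idea of passing an explicit timeout to "rabbitmqctl wait" instead of relying on the default, here is a minimal sketch. The function name, the RABBITMQ_CTL and RABBIT_WAIT_TIMEOUT variables, and the 30s default are assumptions made for illustration; they are not taken from the merged change.

```shell
# Sketch only: names and the 30 s default are illustrative assumptions,
# not the values used in the actual fix.
RABBITMQ_CTL="${RABBITMQ_CTL:-rabbitmqctl}"
RABBIT_WAIT_TIMEOUT="${RABBIT_WAIT_TIMEOUT:-30}"

# Wait for the node whose PID is stored in the given PID file ($1).
rabbit_wait_with_timeout() {
    # Pass an explicit timeout (in seconds) rather than using the default.
    "$RABBITMQ_CTL" -n rabbit@localhost wait \
        --timeout "$RABBIT_WAIT_TIMEOUT" "$1"
}
```

A call such as rabbit_wait_with_timeout /var/run/rabbitmq/rabbitmq.pid would then return the exit status of rabbitmqctl, so the caller can still distinguish success from a timeout.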

Severity
--------
<Critical: System/Feature is not usable due to the defect>

Steps to Reproduce
------------------
sm-unmanage service rabbit
sm-manage service rabbit
sm-restart service rabbit

Expected Behavior
------------------
The DC system is installed successfully.

Actual Behavior
----------------
When the DC system is rebooted, the rabbit service does not initialize correctly.

Reproducibility
---------------
Intermittent
The issue occurred 100% of the time on some DC systems. On other DC systems it does not happen at all.

System Configuration
--------------------
DC System

Timestamp/Logs
--------------
2022-10-28T14:29:01.110 controller-0 OCF_stx.rabbitmq-server(rabbit)[84843]: err ERROR: Unexpected return from rabbitmqctl -n rabbit@localhost wait /var/run/rabbitmq/rabbitmq.pid: 75
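
The return code 75 in the log line above matches EX_TEMPFAIL ("temporary failure") from sysexits.h. Reading it as a timeout of "rabbitmqctl wait" is an assumption that is consistent with the description in this report. A small sketch of how a caller might label the return code (describe_wait_rc is a hypothetical helper, not part of the OCF script):

```shell
# Hypothetical helper: turn a "rabbitmqctl wait" return code into a label.
describe_wait_rc() {
    case "$1" in
        0)  echo "node is up" ;;
        # 75 is EX_TEMPFAIL in sysexits.h; treating it as a timeout is an
        # assumption based on the behavior described in this report.
        75) echo "temporary failure (EX_TEMPFAIL), likely a wait timeout" ;;
        *)  echo "unexpected return code: $1" ;;
    esac
}
```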

Test Activity
-------------
Regression

Workaround
----------
Run sm-unmanage service rabbit
Wait until the rabbit-related processes are gone, or kill them (ps aux | grep rabbit)
Run sm-manage service rabbit
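
The workaround steps above can be sketched as a small script. The function names, the PGREP override, and the 30-poll limit are assumptions made for illustration. The sketch deliberately does not kill anything: if the processes do not exit in time, it stops and leaves that decision to the operator.

```shell
# Poll until no process matches the pattern ($1), or give up after
# max_tries polls ($2, default 30), sleeping 1 s between polls.
# PGREP can be overridden; it defaults to the standard pgrep.
wait_for_processes_gone() {
    pattern="$1"
    max_tries="${2:-30}"
    tries=0
    while "${PGREP:-pgrep}" -f "$pattern" >/dev/null 2>&1; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            return 1    # processes are still present
        fi
        sleep 1
    done
    return 0
}

# Hypothetical wrapper around the three workaround steps.
recover_rabbit() {
    sm-unmanage service rabbit || return 1
    if ! wait_for_processes_gone rabbit; then
        echo "rabbit processes still running; inspect or kill them first" >&2
        return 1
    fi
    sm-manage service rabbit
}
```

Note that "pgrep -f rabbit" matches against full command lines, so a script whose own name or arguments contain "rabbit" would match itself; a narrower pattern may be needed in practice.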

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config-files (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/863427
Committed: https://opendev.org/starlingx/config-files/commit/3d3e721a3cb403731b6c0a69c712d0222de92c71
Submitter: "Zuul (22348)"
Branch: master

commit 3d3e721a3cb403731b6c0a69c712d0222de92c71
Author: Samuel Toledo <email address hidden>
Date: Wed Nov 2 15:07:45 2022 -0300

    Add timeout parameter in rabbit_wait function

    [1] describes a 10s timeout behavior running "rabbitmqctl wait".
    Analyzing the rabbit logs from a Distributed Cloud system that was
    presenting rabbit startup issues, it looks the startup time for rabbit
    was taking around 8.5s. Based on the 10s timeout behavior issue from
    [1], the rabbit service stop working after reboot controller-0 from a
    DC system triggered by deployment manager execution.

    This review adds the "timeout" parameter in the "rabbitmqctl wait"
    command enabling again a clean installation.

    NOTE: this issue was observed in DC Systems.

    1 - https://github.com/rabbitmq/rabbitmq-server-release/pull/129#
    issue-599125985

    Test Plan:
    PASS - Run sm-restart service rabbit successfully check rabbit was
      running as expected. This test was used to recreate the bug.
    PASS - Reboot the host successfully and check rabbit was running as
      expected.

    Regression test:
    PASS - Install and bootstrap AIO-SX
    PASS - Install and bootstrap AIO-DX
    PASS - Install and bootstrap DC

    Closes-bug: 1995518

    Signed-off-by: Samuel Toledo <email address hidden>
    Change-Id: I2117205c0fcb5d92d30ee30ac280abcb66205d19

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.8.0 stx.config
Changed in starlingx:
assignee: nobody → Samuel Presa Toledo (spresato)
importance: Undecided → Medium