Swact in duplex system failed and escalated to a controller reboot

Bug #2018346 reported by Adriano Oliveira
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Adriano Oliveira

Bug Description

Brief Description
-----------------
On rare occasions, under heavy load, the swact operation takes a very long time to complete (around 37 minutes).

Severity
--------
Major

Steps to Reproduce
------------------
Only seen after the system had been running under heavy load for around 15 days.

Expected Behavior
-----------------
Swact operation should always complete within 10-20 seconds.

Actual Behavior
---------------
Swact operation took around 37 minutes.

Reproducibility
---------------
Intermittent (only seen in some environments)

System Configuration
--------------------
Two-node system

Timestamp/Logs
--------------
2023-01-20T06:29:05.000 controller-1 fmManager: info { "event_log_id" : "200.021", "reason_text" : "controller-1 manual 'controller switchover' request", "entity_instance_id" : "host=controller-1.command=swact", "severity" : "not-applicable", "state" : "msg", "timestamp" : "2023-01-20 06:29:05.960524" }

Test Activity
-------------
Feature Testing

Workaround
----------
Lock/Unlock the failing node

Changed in starlingx:
assignee: nobody → Adriano Oliveira (aoliveir)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config-files (master)

Reviewed: https://review.opendev.org/c/starlingx/config-files/+/882181
Committed: https://opendev.org/starlingx/config-files/commit/0dee292bef5028a21c0c4a556d16ad1af3c5e188
Submitter: "Zuul (22348)"
Branch: master

commit 0dee292bef5028a21c0c4a556d16ad1af3c5e188
Author: Adriano Oliveira <email address hidden>
Date: Wed May 3 16:30:53 2023 -0400

    Replace lsof by ss in RabbitMQ ocf script

    It has been noted on heavy load test conditions that lsof
    can hang for a considerable time and cause timeouts on the
    RabbitMQ stop path triggered from Service Manager on a
    swact scenario.

    To avoid that, both netstat or ss commands could be used to
    check for listening process on the amqp port (5672).

    The ss command has been chosen since man page of netstat mark
    it as obsolete and points ss as replacement for the major part
    of it.

    Also, note that ss uses Netlink which uses socket API.

    Closes-Bug: 2018346

    Test Plan:

    PASS: Verify, using ss, the listening amqp socket
    PASS: Verify AIO-DX is properly deployed
    PASS: Restart RabbitMQ service successfully using sm-restart
    PASS: Swact successfully on DX system
    PASS: Lock/unlock successfully

    Change-Id: I929b2a1b7a61eb70154c00177aa0b7f2fc46890a
    Signed-off-by: Adriano Oliveira <email address hidden>
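
The check described in the commit message (is anything listening on the amqp port, determined with ss instead of lsof) can be illustrated with a minimal sketch. This is not the code merged in the review above: the real change is in the RabbitMQ ocf shell script, while this example uses Python for illustration. The function name, the 5-second timeout, and the choice to return None when the probe hangs are assumptions made for the example. It relies on the ss options -H (no header), -l (listening), -t (TCP), -n (numeric) and the ss filter expression "sport = :PORT".

```python
import subprocess

AMQP_PORT = 5672  # amqp port named in the commit message


def amqp_port_is_listening(port=AMQP_PORT, timeout=5.0, run=subprocess.run):
    """Report whether a TCP socket is listening on `port`, according to ss.

    Returns True or False, or None if the probe itself timed out.
    The name, timeout value and None-on-timeout policy are hypothetical.
    """
    # -H: no header, -l: listening sockets, -t: TCP, -n: numeric ports.
    # The trailing arguments form an ss filter on the local (source) port.
    cmd = ["ss", "-H", "-l", "-t", "-n", "sport", "=", f":{port}"]
    try:
        result = run(cmd, capture_output=True, text=True,
                     timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        # Bound the wait so a slow probe cannot stall the caller.
        return None
    # With the header suppressed, any non-blank line is a matching socket.
    return any(line.strip() for line in result.stdout.splitlines())


if __name__ == "__main__":
    print(amqp_port_is_listening())
```

The `run` parameter exists only so the function can be exercised without a live system; in normal use it defaults to `subprocess.run`.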

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.config stx.ha