Mtce heartbeat log flooding when cluster host network is set to lo in IPv6

Bug #1884585 reported by Eric MacDonald
Affects      Status          Importance    Assigned to       Milestone
StarlingX    Fix Released    Medium        Eric MacDonald

Bug Description

Multicast heartbeat messaging is seen to fail when the cluster host network is over the 'lo' interface.
The resulting failure causes hbsAgent log flooding that leads to frequent log rotation.

Severity
--------
Minor: hbsAgent.log flooding

Steps to Reproduce
------------------
Configure the cluster host network of an AIO SX controller on the 'lo' interface in an IPv6 configuration.

Expected Behavior
------------------
Normal operation.

Actual Behavior
----------------
Socket errors that lead to log flooding.

Reproducibility
---------------
Frequent, could be 100% but exact percentage unknown.
Has been reproduced multiple times internally.

System Configuration
--------------------
AIO SX IPv6

Branch/Pull Time/Commit
-----------------------
stx 3.0

Last Pass
---------
Did this test scenario pass previously? If so, please indicate the load/pull time info of the last pass.
Use this section to also indicate if this is a new test scenario.

Timestamp/Logs
--------------
2020-06-22T18:36:39.129 [89995.4027710] controller-0 hbsAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2020-06-22T18:36:39.129 [89995.4027711] controller-0 hbsAgent hbs hbsAgent.cpp ( 956) hbs_pulse_request :Error : Failed to send Pulse request: 1964712:cgts pulse req:controller-0 to ff05::1b:2.2106 (rc:-1 ; 101:Network is unreachable)
2020-06-22T18:36:39.234 [89995.4027712] controller-0 hbsAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101
2020-06-22T18:36:39.234 [89995.4027713] controller-0 hbsAgent hbs hbsAgent.cpp ( 956) hbs_pulse_request :Error : Failed to send Pulse request: 1964713:cgts pulse req:controller-0 to ff05::1b:2.2106 (rc:-1 ; 101:Network is unreachable)
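
To gauge how fast the log fills, the excerpt above can be summarized per error site. The following is a minimal Python sketch, not part of the maintenance code; the regular expressions are inferred only from the four lines quoted in this report, and the helper names (`parse_line`, `summarize`) are hypothetical. The errno value 101 corresponds to "Network is unreachable", as the log text itself states.

```python
import re
from collections import Counter
from datetime import datetime

# The four lines quoted in the Timestamp/Logs section of this report.
SAMPLE = [
    "2020-06-22T18:36:39.129 [89995.4027710] controller-0 hbsAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101",
    "2020-06-22T18:36:39.129 [89995.4027711] controller-0 hbsAgent hbs hbsAgent.cpp ( 956) hbs_pulse_request :Error : Failed to send Pulse request: 1964712:cgts pulse req:controller-0 to ff05::1b:2.2106 (rc:-1 ; 101:Network is unreachable)",
    "2020-06-22T18:36:39.234 [89995.4027712] controller-0 hbsAgent --- msgClass.cpp ( 737) write :Error : Failed to send with errno=101",
    "2020-06-22T18:36:39.234 [89995.4027713] controller-0 hbsAgent hbs hbsAgent.cpp ( 956) hbs_pulse_request :Error : Failed to send Pulse request: 1964713:cgts pulse req:controller-0 to ff05::1b:2.2106 (rc:-1 ; 101:Network is unreachable)",
]

# Line layout inferred from the sample only (an assumption, not a format spec).
LOG_RE = re.compile(
    r"^(?P<ts>\S+)\s+\[[\d.]+\]\s+(?P<host>\S+)\s+(?P<proc>\S+)\s+"
    r".*?(?P<file>\w+\.cpp)\s*\(\s*(?P<line>\d+)\)\s+(?P<func>\w+)\s*:"
    r"\s*(?P<level>\w+)\s*:\s*(?P<msg>.*)$"
)

# The errno appears either as "errno=101" or as "; 101:Network is unreachable".
ERRNO_RE = re.compile(r"errno=(\d+)|;\s*(\d+):")


def parse_line(text):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_RE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["line"] = int(record["line"])
    err = ERRNO_RE.search(record["msg"])
    record["errno"] = int(err.group(1) or err.group(2)) if err else None
    return record


def summarize(lines):
    """Count errors per (source file, source line, errno) and measure the time span."""
    records = [r for r in map(parse_line, lines) if r is not None]
    counts = Counter((r["file"], r["line"], r["errno"]) for r in records)
    stamps = [datetime.fromisoformat(r["ts"]) for r in records]
    span = (max(stamps) - min(stamps)).total_seconds() if stamps else 0.0
    return counts, span


if __name__ == "__main__":
    counts, span = summarize(SAMPLE)
    for site, n in counts.items():
        print(site, n)
    print("span in seconds:", span)
```

On the excerpt, each failed pulse request produces two log entries (one from `msgClass.cpp`, one from `hbsAgent.cpp`), and the pair repeats within roughly a tenth of a second.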

Test Activity
-------------
Evaluation

Workaround
----------
Not needed, hbsAgent has log rotation.

Eric MacDonald (rocksolidmtce) wrote :

Issue not observed in AIO SX IPv4 systems.

Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

stx.5.0 - would be nice to fix to avoid log clutter

tags: added: stx.metal
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.5.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/737863

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/737863
Committed: https://git.openstack.org/cgit/starlingx/metal/commit/?id=55d5f43edb171206debc29710b86c6b26855442e
Submitter: Zuul
Branch: master

commit 55d5f43edb171206debc29710b86c6b26855442e
Author: Eric MacDonald <email address hidden>
Date: Wed Jun 24 15:53:33 2020 -0400

    Fix heartbeat messaging when interface is set to 'lo'

    Maintenance heartbeat service should not be multicast
    messaging over an 'lo' interface which in IPv6 leads
    to socket failures, log flooding and the inability to
    detect and report pmond process failure.

    To fix that this update
     - configures pulse messaging to unicast for monitored
       networks configured as 'lo'.
     - prevents heartbeating over the cluster network if both
       it and the management network are both configured on
       the 'lo' interface.
     - improves logging to avoid flooding in the presence of
       socket setup or access errors.
     - stops logging netlink events (interface state changes)
       on unmonitored network interfaces.
     - maintains heartbeat disabled state until the management
       network is up.
     - modifies hbsAgent socket failure handling and its pmon
       conf file so that a persistent socket failure during
       startup is alarmed as an hbsAgent process failure.

    Test Plan:

    PASS: Verify logging over system install and socket errors
    PASS: Verify unicast messaging when cluster is set to 'lo'
    PASS: Verify no cluster network heartbeat when it and mgmnt
          are set to 'lo'.

    Regression:

    PASS: Verify heartbeat messaging and cluster info
    PASS: Verify pmond process failure alarm management
    PASS: Verify heartbeat failure detection and graceful recovery
    PASS: Verify AIO SX IPv6 system install and run
    PASS: Verify AIO DX IPv6 system install and run
    PASS: Verify Standard IPv6 system install and run
    PASS: Verify Storage system IPv6 install and run
    PASS: Verify Storage system IPv4 install and run
    PASS: Verify MNFA handling in IPv6 storage system

    Change-Id: I5a2a0b2dee0c690617c4e0b0e2ab8b1172b2dc49
    Closes-Bug: 1884585
    Signed-off-by: Eric MacDonald <email address hidden>
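
The first two items of the commit message describe a per-network decision: use unicast on a network configured on 'lo', and skip the cluster network entirely when it and the management network are both on 'lo'. A minimal Python sketch of that decision follows. It is an illustration only and not the hbsAgent implementation (which is C++); the names `PulsePlan` and `plan_heartbeat`, the interface name `enp0s8`, and the assumption that non-'lo' networks keep using multicast are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

LOOPBACK = "lo"


@dataclass(frozen=True)
class PulsePlan:
    network: str          # "mgmt" or "cluster"
    interface: str        # interface the network is configured on
    enabled: bool         # whether heartbeat runs on this network
    mode: Optional[str]   # "unicast", "multicast", or None when disabled


def _mode_for(interface: str) -> str:
    # Assumption: multicast remains the mode for any interface other than 'lo'.
    return "unicast" if interface == LOOPBACK else "multicast"


def plan_heartbeat(mgmt_iface: str, cluster_iface: str) -> List[PulsePlan]:
    """Decide the pulse messaging mode for each monitored network."""
    plans = [PulsePlan("mgmt", mgmt_iface, True, _mode_for(mgmt_iface))]

    if mgmt_iface == LOOPBACK and cluster_iface == LOOPBACK:
        # Both networks on 'lo': do not heartbeat over the cluster network.
        plans.append(PulsePlan("cluster", cluster_iface, False, None))
    else:
        plans.append(
            PulsePlan("cluster", cluster_iface, True, _mode_for(cluster_iface))
        )
    return plans


if __name__ == "__main__":
    for plan in plan_heartbeat("enp0s8", "lo"):
        print(plan)
```

Keeping the decision in one pure function makes the three cases from the test plan (cluster on 'lo', both on 'lo', neither on 'lo') easy to check in isolation.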

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

Changing the release tag to stx.4.0 since the fix made it in for that release.

tags: added: stx.4.0
removed: stx.5.0
Eric MacDonald (rocksolidmtce) wrote :

This issue also prevents the heartbeat service from monitoring the pmond process, and therefore from raising an alarm if pmond fails.

Eric MacDonald (rocksolidmtce) wrote :

The following merged update fixes this issue.

https://review.opendev.org/c/starlingx/metal/+/737863
