DC subcloud mtclient logs repeatedly printing same lines frequently

Bug #1897334 reported by Anujeyan Manokeran
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Low
Eric MacDonald

Bug Description

Brief Description
---------------------------------
         Distributed cloud sub cloud mtclient logs are showing below lines of entry printed very frequently. These logs should be verified and remove the frequency .

orker
2020-09-25T11:41:11.461 [100486.29133] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:11.461 [100486.29134] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:16.461 [100486.29135] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:16.461 [100486.29136] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:16.461 [100486.29137] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:16.461 [100486.29138] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:21.461 [100486.29139] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:21.461 [100486.29140] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:21.462 [100486.29141] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:21.462 [100486.29142] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:26.462 [100486.29143] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:26.462 [100486.29144] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:26.462 [100486.29145] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:26.462 [100486.29146] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:31.462 [100486.29147] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:31.462 [100486.29148] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:31.462 [100486.29149] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:31.462 [100486.29150] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:36.462 [100486.29151] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:36.462 [100486.29152] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:36.463 [100486.29153] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:36.463 [100486.29154] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:41.463 [100486.29155] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:41.463 [100486.29156] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:41.463 [100486.29157] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:41.463 [100486.29158] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:46.463 [100486.29159] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:46.463 [100486.29160] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controll_tx_socket_c1_clstr not ok

Steps to Reproduce
------------------
1. Install central cloud and sub cloud as per installation procedure.

System Configuration
--------------------
DC-3 with subclouds

Expected Behavior
------------------
No too frequently printing same lines writing in short time
Actual Behavior
----------------
As description says.
Reproducibility
---------------
Seen in all the DC labs

Load
----
2020-09-21_20-00-06
Last Pass
---------
Timestamp/Logs
--------------
orker
2020-09-25T11:41:11.461 [100486.29133] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:11.461 [100486.29134] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:16.461 [100486.29135] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:16.461 [100486.29136] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:16.461 [100486.29137] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:16.461 [100486.29138] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:21.461 [100486.29139] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:21.461 [100486.29140] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:21.462 [100486.29141] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:21.462 [100486.29142] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:26.462 [100486.29143] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:26.462 [100486.29144] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:26.462 [100486.29145] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:26.462 [100486.29146] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:31.462 [100486.29147] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:31.462 [100486.29148] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:31.462 [100486.29149] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:31.462 [100486.29150] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:36.462 [100486.29151] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:36.462 [100486.29152] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:36.463 [100486.29153] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:36.463 [100486.29154] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:41.463 [100486.29155] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:41.463 [100486.29156] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controller-0-cluster-host
2020-09-25T11:41:41.463 [100486.29157] controller-0 mtcClient --- mtcNodeComp.cpp ( 595) _self_identify : Info : Identity : {"hostname":"controller-0","personality":"controller,worker","mgmt_ip":"fd01:13::3","cluster_host_ip":"aefd::2","mgmt_mac":"3c:fd:fe:b5:7e:ec"
2020-09-25T11:41:41.463 [100486.29158] controller-0 mtcClient msg mtcCompMsg.cpp ( 903) send_mtcAlive_msg :Error : mtc_client_tx_socket_c1_clstr not ok
2020-09-25T11:41:46.463 [100486.29159] controller-0 mtcClient --- mtcNodeComp.cpp (1299) daemon_service_run : Warn : calling setup_clstr_tx_sockets (auto-recovery)
2020-09-25T11:41:46.463 [100486.29160] controller-0 mtcClient --- msgClass.cpp (1057) msgClassTx : Info : Creating vlan49 socket on port 2101 with address: controll_tx_socket_c1_clstr not ok

Test Activity
-------------
Regression

Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :
tags: added: stx.retestneeded
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil)
tags: added: stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Would be nice to fix to have cleaner logs, but will not gate the next stx release

summary: - DC subcloud mtclient mtc clinet logs repeatedly printing same lines
- frequently
+ DC subcloud mtclient logs repeatedly printing same lines frequently
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/metal/+/787398

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (master)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/787398
Committed: https://opendev.org/starlingx/metal/commit/7539d36c3f01a338acfa449204c6034dc43f45df
Submitter: "Zuul (22348)"
Branch: master

commit 7539d36c3f01a338acfa449204c6034dc43f45df
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 21 10:12:30 2021 -0400

    Prevent mtcClient from sending to uninitialized socket in AIO SX

    The mtcClient will perform a socket reinit if it detects a socket
    failure. The mtcClient also avoids setting up its controller-1
    cluster network socket for the AIO SX system type ; because there
    is no controller-1 provisioned.

    Most AIO SX systems have the management/cluster networks set to
    the 'loopback' interface. However, when an AIO SX system is setup
    with its management and cluster networks on physical interfaces,
    with or without vlan, the mtcAlive send message utility will try
    to send to the uninitialized controller-1 cluster socket. This
    leads to a socket error that triggers a socket reinitialization
    loop which causes log flooding.

    This update adds a check to the mtcAlive send utility to avoid
    sending mtcAlive to controller-1 for AIO SX system type where
    there is no controller-1 provisioned; no send,no error,no flood.

    Since this update needed to add a system type check, this update
    also implemented a system type definition rename from CPE to AIO.
    Other related definitions and comments were also changed to make
    the code base more understandable and maintainable

    Test Plan:

    PASS: Verify AIO SX with mgmnt/clstr on physical (failure mode)
    PASS: Verify AIO SX Install with mgmnt/clstr on 'lo'
    PASS: Verify AIO SX Lock msg and ack over mgmnt and clstr
    PASS: Verify AIO SX locked-disabled-online state
    PASS: Verify mtcClient clstr socket error detect/auto-recovery (fit)
    PASS: Verify mtcClient mgmnt socket error detect/auto-recovery (fit)

    Regression:

    PASS: Verify AIO SX Lock and Unlock (lazy reboot)
    PASS: Verify AIO DX and DC install with pv regression and sanity
    PASS: Verify Standard system install with pv regression and sanity

    Change-Id: I658d33a677febda6c0e3fcb1d7c18e5b76cb3762
    Closes-Bug: 1897334
    Signed-off-by: Eric MacDonald <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to metal (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/c/starlingx/metal/+/792250

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to metal (f/centos8)
Download full text (34.9 KiB)

Reviewed: https://review.opendev.org/c/starlingx/metal/+/792250
Committed: https://opendev.org/starlingx/metal/commit/6c2905e665ceeebfa7717c9cbccc1c277d10966b
Submitter: "Zuul (22348)"
Branch: f/centos8

commit 5942a56ec6f0b265ca6d1c8c800fe84c4a22860f
Author: Eric MacDonald <email address hidden>
Date: Thu May 13 15:57:43 2021 +0000

    Revert "Align partitions created by kickstarters"

    This reverts commit 0e89acc83c616741952a068a3ff07ba91440eff8.

    Reason for revert: Review should have been abandoned rather than merged.

    Change-Id: I95f1e151183f122d93b834ab2a785736e5a8ef12
    Closes-Bug: 1928341

commit c7c341b198e79bb98f443c7c07f671c6387075af
Author: Don Penney <email address hidden>
Date: Fri May 7 08:56:06 2021 -0400

    Add /pxeboot/grubx64.efi symlink for UEFI pxeboot

    UEFI pxeboot with shim.efi looks for the grubx64.efi in the tftpboot
    root directory. This update creates a symlink to the
    /pxeboot/EFI/grubx64.efi file in /pxeboot.

    Change-Id: Iabf8ec89d0af6e6b1a62e20159ecdfa16729444e
    Partial-Bug: 1927730
    Signed-off-by: Don Penney <email address hidden>

commit ce7529964932a9fd1cc10ce18dbe11e89ee02223
Author: Eric MacDonald <email address hidden>
Date: Wed May 5 19:05:55 2021 -0400

    Fix enabling heartbeat of self from the peer controller

    This issue only occurs over an hbsAgent process restart
    where the ready event response does not include the
    heartbeat start of the peer controller.

    This update reverts a small code change that was
    introduced by the following update.

    https://review.opendev.org/c/starlingx/metal/+/788495

    Remove the my_hostname gate introduced at line 1267 of
    mtcCtrlMsg.cpp because it prevents enabling heartbeat
    of self by the peer controller.

    Change-Id: Id72c35f25e2a5231a8a8363a35a81e042f00085e
    Closes-Bug: 1922584
    Signed-off-by: Eric MacDonald <email address hidden>

commit 48978d804d6f22130d0bd8bd17f361441024bc6c
Author: Eric MacDonald <email address hidden>
Date: Wed Apr 28 09:39:19 2021 -0400

    Improved maintenance handling of spontaneous active controller reboot

    Performing a forced reboot of the active controller sometimes
    results in a second reboot of that controller. The cause of the
    second reboot was due to its reported uptime in the first mtcAlive
    message, following the reboot, as greater than 10 minutes.

    Maintenance has a long standing graceful recovery threshold of
    10 minutes. Meaning that if a host looses heartbeat and enters
    Graceful Recovery, if the uptime value extracted from the first
    mtcAlive message following the recovery of that host exceeds 10
    minutes, then maintenance interprets that the host did not reboot.
    If a host goes absent for longer than this threshold then for
    reasons not limited to security, maintenance declares the host
    as 'failed' and force re-enables it through a reboot.

    With the introduction of containers and addition of new features
    over the last few releases, boot times on some servers are
    approaching the 10 minute threshold an...

tags: added: in-f-centos8
Ghada Khalil (gkhalil)
tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.