Intermittent Coredump in snmpSubAgent component

Bug #2015408 reported by Agustin Carranza
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Agustin Carranza

Bug Description

Brief Description

A coredump has been identified in the snmpSubAgent component.

The issue is intermittent and has only been detected on DX and Std environments.

Severity

<Minor: System/Feature is usable with minor issue>

Steps to Reproduce

After configuring SNMP as described in
https://docs.starlingx.io/fault-mgmt/kubernetes/enabling-snmp-support.html

1.- Perform any SNMP get/walk/bulk request.

2.- Perform a swact to controller-1 and repeat the previous operation.

3.- A coredump for snmpAgent is detected at ....

Note: if the coredump is not found, repeat the previous steps until the file is generated.
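
The steps above can be sketched as a small Python helper. This is only an illustration, not the test framework's actual snmp_helper code: the function names (classify_snmp_result, snmpget_until_ok, new_coredumps) are hypothetical, the result strings and snmpget options are copied from the log in this report, and the defaults of 30 retries with a 10 second pause are assumptions based on the "Retry N/30" entries there. The swact itself is not automated here.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Optional


def classify_snmp_result(output: str) -> str:
    """Classify snmpget output using the result strings seen in this report."""
    if "Timeout: No Response" in output:
        return "timeout"
    if "No Such Object" in output:
        return "no_such_object"
    if " = " in output:
        return "ok"
    return "error"


def run(cmd: List[str]) -> str:
    """Run a command and return stdout and stderr as one string."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return proc.stdout + proc.stderr


def snmpget_until_ok(
    target: str,
    oid: str,
    community: str,
    retries: int = 30,          # assumption: mirrors "Retry N/30" in the log
    delay: float = 10.0,        # assumption: about 10 s between attempts
    runner: Callable[[List[str]], str] = run,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Repeat snmpget until a value is returned; return the attempt number."""
    for attempt in range(1, retries + 1):
        output = runner(["snmpget", "-v2c", "-c", community, target, oid])
        if classify_snmp_result(output) == "ok":
            return attempt
        sleep(delay)
    return None


def new_coredumps(
    before: Iterable[str],
    after: Iterable[str],
    component: str = "snmpSubAgent",
) -> List[str]:
    """Return core files for the component present in 'after' but not 'before'."""
    known = set(before)
    prefix = f"core.{component}."
    return sorted(n for n in after if n not in known and n.startswith(prefix))
```

Listing /var/lib/systemd/coredump/ before step 1 and again after step 2, then passing both listings to new_coredumps, would show whether a new snmpSubAgent core appeared during the run.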

Expected Behavior

The system shall respond to snmp requests WITHOUT a coredump.

Actual Behavior

The system responds to snmp requests WITH a coredump.

Reproducibility

Intermittent: the coredump sometimes occurs and sometimes does not.

System Configuration

AIO-DX

Last Pass

Timestamp/Logs

The problem appears to occur while performing the snmpget requests shown below:

[2022-11-24 23:35:43,843] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = No Such Object available on this agent at this OID
[2022-11-24 23:35:53,852] 409 INFO MainThread snmp_helper._execute_snmp_command:: Executing snmp cmd snmpget -v2c -c guanacloud [2620:10a:a001:a103::1085]:161 WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
[2022-11-24 23:35:53,852] 410 INFO MainThread snmp_helper._execute_snmp_command:: Retry 5/30 to give time to pods to get up appropriately.
[2022-11-24 23:35:54,072] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = No Such Object available on this agent at this OID
[2022-11-24 23:36:04,080] 409 INFO MainThread snmp_helper._execute_snmp_command:: Executing snmp cmd snmpget -v2c -c guanacloud [2620:10a:a001:a103::1085]:161 WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
[2022-11-24 23:36:04,081] 410 INFO MainThread snmp_helper._execute_snmp_command:: Retry 6/30 to give time to pods to get up appropriately.
[2022-11-24 23:36:04,249] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = No Such Object available on this agent at this OID
[2022-11-24 23:36:14,260] 409 INFO MainThread snmp_helper._execute_snmp_command:: Executing snmp cmd snmpget -v2c -c guanacloud [2620:10a:a001:a103::1085]:161 WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
[2022-11-24 23:36:14,260] 410 INFO MainThread snmp_helper._execute_snmp_command:: Retry 7/30 to give time to pods to get up appropriately.
[2022-11-24 23:36:20,390] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result Timeout: No Response from [2620:10a:a001:a103::1085]:161.
[2022-11-24 23:36:30,400] 409 INFO MainThread snmp_helper._execute_snmp_command:: Executing snmp cmd snmpget -v2c -c guanacloud [2620:10a:a001:a103::1085]:161 WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
[2022-11-24 23:36:30,400] 410 INFO MainThread snmp_helper._execute_snmp_command:: Retry 8/30 to give time to pods to get up appropriately.
[2022-11-24 23:36:36,652] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result Timeout: No Response from [2620:10a:a001:a103::1085]:161.
[2022-11-24 23:36:46,660] 409 INFO MainThread snmp_helper._execute_snmp_command:: Executing snmp cmd snmpget -v2c -c guanacloud [2620:10a:a001:a103::1085]:161 WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
[2022-11-24 23:36:46,660] 410 INFO MainThread snmp_helper._execute_snmp_command:: Retry 9/30 to give time to pods to get up appropriately.
[2022-11-24 23:36:46,916] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = STRING: "200.001"
The snmpget operation is requested as follows:

[2022-11-24 23:36:04,249] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = No Such Object available on this agent at this OID
The test case (TC) is considered passed once the response is validated:

[2022-11-24 23:36:46,916] 416 INFO MainThread snmp_helper._execute_snmp_command:: Result WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1 = STRING: "200.001"
However, the coredump is detected here:

[2022-11-24 23:37:11,380] 534 DEBUG MainThread ssh.exec_cmd:: Executing command...
[2022-11-24 23:37:11,380] 541 DEBUG MainThread ssh.exec_cmd:: Running command: ls -l --time-style=+%Y-%m-%d_%H-%M-%S /var/lib/systemd/coredump/ | grep --color=never -v total
[2022-11-24 23:37:11,381] 351 DEBUG MainThread ssh.send :: Send 'ls -l --time-style=+%Y-%m-%d_%H-%M-%S /var/lib/systemd/coredump/ | grep --color=never -v total'
[2022-11-24 23:37:11,431] 548 DEBUG MainThread ssh.exec_cmd:: Expecting .*controller-0[:| ].*\$ in prompt
[2022-11-24 23:37:11,436] 473 DEBUG MainThread ssh.expect :: Output:
-rw-r----- 1 root root 435509 2022-11-24_23-36-08 core.snmpSubAgent.0.01387990b64e47cf805d1bebb03d7d14.3364306.1669332967000000.zst
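
The core file name itself carries timing information that can be compared with the log above. The following is a minimal sketch, assuming the usual systemd-coredump naming scheme core.COMM.UID.BOOTID.PID.TIMESTAMP (timestamp in microseconds since the epoch) followed by an optional compression suffix. The CoreInfo type and parse_core_name function are hypothetical names introduced only for this illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class CoreInfo(NamedTuple):
    comm: str            # name of the crashed process
    uid: int             # user ID the process ran as
    boot_id: str         # boot ID of the host
    pid: int             # PID of the crashed process
    timestamp: datetime  # crash time in UTC


def parse_core_name(name: str) -> CoreInfo:
    """Parse a core file name of the assumed form core.COMM.UID.BOOTID.PID.USEC[.ext]."""
    stem = name
    for suffix in (".zst", ".xz", ".lz4"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break
    # Split from the right because the process name may itself contain dots.
    prefix, uid, boot_id, pid, usec = stem.rsplit(".", 4)
    if not prefix.startswith("core."):
        raise ValueError(f"unexpected core file name: {name}")
    micros = int(usec)
    timestamp = datetime.fromtimestamp(micros // 1_000_000, tz=timezone.utc)
    timestamp += timedelta(microseconds=micros % 1_000_000)
    return CoreInfo(prefix[len("core."):], int(uid), boot_id, int(pid), timestamp)


info = parse_core_name(
    "core.snmpSubAgent.0.01387990b64e47cf805d1bebb03d7d14"
    ".3364306.1669332967000000.zst"
)
print(info.comm, info.pid, info.timestamp.isoformat())
```

For the file listed above this gives process snmpSubAgent, PID 3364306 and a crash time of 2022-11-24 23:36:07 UTC, one second before the file's modification time. Assuming the test host and controller clocks are aligned, that places the crash shortly after the 23:36:04 result and before the first timeout at 23:36:20.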

Alarms

+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------+----------+----------------------------+
| 4feda790-89c2-46a3-975e-0835c95a96c7 | 300.005 | | CGCSAuto | critical | 2022-11-24T23:30:03.703709 |
| 630d66b2-f7ff-49d6-96e0-a695796d0bf2 | 250.001 | compute-1 Configuration is out-of-date. (applied: 0e49008c-3d0f-4773-b0d2-98ce60cebc89 target: 4a89fae9-f80c-426a-8945-92d6669cce5b) | host=compute-1 | major | 2022-11-24T18:56:23.613180 |
| 06e5972a-c89b-4bb8-8a51-4f3599b03245 | 200.001 | compute-0 was administratively locked to take it out-of-service. | host=compute-0 | warning | 2022-11-24T18:53:32.658091 |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------------------+----------------+----------+----------------------------+
Test Activity

Regression Testing

Workaround

Not identified

Changed in starlingx:
assignee: nobody → Agustin Carranza (acarranz)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to snmp-armada-app (master)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to snmp-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/snmp-armada-app/+/879689
Committed: https://opendev.org/starlingx/snmp-armada-app/commit/8a6171fce0abdc84fc1488bb78b40a5046fdc4ef
Submitter: "Zuul (22348)"
Branch: master

commit 8a6171fce0abdc84fc1488bb78b40a5046fdc4ef
Author: Agustin Carranza <email address hidden>
Date: Wed Apr 5 17:23:15 2023 -0300

    Refresh DB session after swact in fm-subagent

    After running a swact command, the snmp pod crashed and generated a
    coredump. That behavior was due to an invalid reference for the DB
    handler.
    This change adds a new function to renew the session after the
    first failure (this happens after swact) in the connection.
    During the retries, the session is renewed and the reference is
    restored. So the coredump is now avoided.

    Test plan
    PASS: * Deploy a multinode configuration (e.g. AIO-DX).
          * Install the SNMP application.
          * Perform a snmpget/snmpwalk/snmpbulk operation related to FM,
            to the floating IP, e.g. WRS-ALARM-MIB::wrsAlarmActiveAlarmId.1
          * Perform a swact operation.
          * Repeat the snmp operation.
          * Both operations succeed and no coredumps are found in the
            filesystem of all controllers.

    Closes-bug: 2015408

    Signed-off-by: Agustin Carranza <email address hidden>
    Change-Id: Idf5e8717c4f9605a3a7c1eaee1eb87e38bc305e4
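
The approach described in the commit message above (renew the session after the first failed attempt, then retry) can be illustrated with a short Python sketch. This is not the fm-subagent code, which is not shown in this report: StaleSessionError, open_session, query and query_with_renewal are all hypothetical names, and the retry count is an assumption.

```python
from typing import Callable, TypeVar

S = TypeVar("S")  # session type
T = TypeVar("T")  # query result type


class StaleSessionError(Exception):
    """Raised by the hypothetical query when the DB session is no longer valid."""


def query_with_renewal(
    open_session: Callable[[], S],
    query: Callable[[S], T],
    max_retries: int = 3,  # assumption: the real retry count is not given here
) -> T:
    """Run query; on a stale session, open a fresh one and try again."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    session = open_session()
    for attempt in range(max_retries + 1):
        try:
            return query(session)
        except StaleSessionError:
            if attempt == max_retries:
                raise  # out of retries: report the failure to the caller
            # Drop the stale reference and obtain a new session before retrying.
            session = open_session()
    raise AssertionError("unreachable")
```

The point of the sketch is that the stale reference is replaced rather than reused, so a failure after a swact leads to a retry on a fresh session instead of an access through an invalid handle.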

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to root (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/root/+/880335

OpenStack Infra (hudson-openstack) wrote : Fix proposed to snmp-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to root (master)

Reviewed: https://review.opendev.org/c/starlingx/root/+/880335
Committed: https://opendev.org/starlingx/root/commit/a70958b4a64b14b462d765e2df78f5bc91c7e4c1
Submitter: "Zuul (22348)"
Branch: master

commit a70958b4a64b14b462d765e2df78f5bc91c7e4c1
Author: Agustin Carranza <email address hidden>
Date: Thu Apr 13 12:19:02 2023 -0300

    Update static tag for stx-fm-subagent image

    The content of the stx-fm-subagent image has been updated due to the
    fix of a bug:
    https://review.opendev.org/c/starlingx/snmp-armada-app/+/879689

    It is updated the yaml file with new tag using existing
    master-debian-stable-20230412T060000Z.0 tag.

    Partial-Bug: 2015408

    Change-Id: I81a77bdc3427b70080f87eff97f09c795b4e0863
    Signed-off-by: Agustin Carranza <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to snmp-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/snmp-armada-app/+/880347
Committed: https://opendev.org/starlingx/snmp-armada-app/commit/196330bfd511cb1fe92fadf5147e8d44eb9b0cfc
Submitter: "Zuul (22348)"
Branch: master

commit 196330bfd511cb1fe92fadf5147e8d44eb9b0cfc
Author: Agustin Carranza <email address hidden>
Date: Thu Apr 13 14:20:40 2023 -0300

    Update stx-fm-subagent image tag to stx.9.0-v1.0.1

    This change updates the stx-fm-subagent image tag to stx.9.0-v1.0.1
    due to the fix of the following bug:
    https://review.opendev.org/c/starlingx/snmp-armada-app/+/879689

    Test Plan:
    PASS: Apply snmp app with new tags introduced. Verify pod is up and
    running. Describe pod and verify that the new tag is used.

    Depends-on: https://review.opendev.org/c/starlingx/root/+/880335

    Closes-bug: 2015408

    Signed-off-by: Agustin Carranza <email address hidden>
    Change-Id: I383a747867a8460702e199c928789638b7257ed6

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.fault