No alarm is triggered after pulling the data cable on a compute

Bug #1834512 reported by sathish subramanian
This bug affects 2 people
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  High        ChenjieXu

Bug Description

Brief Description
-----------------
 When I removed the data cables from the computes and checked on controller-0
 with fm alarm-list, no alarm was reported.

 Please suggest how I can get an alarm triggered after removing the data
 cable on a compute.

Severity
--------
Minor: System/Feature is usable with minor issue

Steps to Reproduce
------------------
 1. Manually remove the data cables from the computes.
  a. Example: data cables on eno3 and eno4 removed
  $ system host-port-list compute-0
  +--------------------------------------+------+----------+--------------+
  | uuid                                 | name | type     | pci address  |
  +--------------------------------------+------+----------+--------------+
  | 5cd6efe4-ef13-48b1-aee9-d3aa5271e677 | eno1 | ethernet | 0000:03:00.0 |
  | 5d96704c-dec2-4f4a-b950-57c30efc5894 | eno2 | ethernet | 0000:03:00.1 |
  | 783b90fd-a959-491a-a9f6-3c177a074e85 | eno3 | ethernet | 0000:03:00.2 |
  | 690d57c5-70dc-43f8-9262-e2483d8875aa | eno4 | ethernet | 0000:03:00.3 |
  +--------------------------------------+------+----------+--------------+

 2. Go to Controller-0 and check alarm list
  $ fm alarm-list
  $ fm event-list
 3. No alarms related to the removed data cables are triggered

Note:
1. Captured dmesg and the alarm list; please find the attached logs.
2. Pod statuses are healthy
3. After a lock/unlock, the relevant alarm_id is raised
4. All hosts are in the active state

Expected Behavior
------------------
 After the data cable is removed from a compute, the relevant alarm should be triggered

Reproducibility
---------------
Reproducible

System Configuration
--------------------
 - MN-External (2+2+2)
 - Bare metal

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-06-21 01:30:00 +0000"

Last Pass
--------------
Not passed

sathish subramanian (sathis5x) wrote :
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.regression
Ghada Khalil (gkhalil) wrote :

Currently, StarlingX does not have the capability to raise alarms when data links are pulled. The test case you are running is outdated. There is a story in the networking backlog that tracks the introduction of this functionality:
https://storyboard.openstack.org/#!/story/2002948

However, this story is currently not committed for an stx release.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.networking
Changed in starlingx:
assignee: nobody → Forrest Zhao (forrest.zhao)
Ghada Khalil (gkhalil) wrote :

Assigning to Forrest -- we should discuss in the networking team whether this is something we want to add in stx.3.0. We'll need to get input from the networking TLs

Wendy Mitchell (wmitchellwr) wrote :

Seeing the same issue in manual regression in stx

HW lab: PV-1
"20190731T013000Z"

When the data port is down on a worker node (e.g. compute-5), no data port failure alarm is triggered.

Expectation:
1. data port failure alarm for that compute
e.g. 300.001 'Data' Port failed.
2. hypervisor status should become 'disabled' for that host
3. instances that can be scheduled should migrate from the compute
4. the host availability state should be 'degraded' due to the data interface alarm.

Changed in starlingx:
assignee: Forrest Zhao (forrest.zhao) → ChenjieXu (midone)
Gongjun Song (songgongjun) wrote :

I reproduced this bug on both a VM and a host with the same result: no alarm is reported by the following command:
    $ fm alarm-list

The VM ran a simplex environment installed from the July 15, 2019 image.
The host ran a duplex environment installed from the July 15, 2019 image.

Forrest Zhao (forrest.zhao) wrote :

According to the feedback from Victor Rodriguez: "I was wondering if we could discuss the possibility to change the priority to the launchpad since some use cases might want to measure/test the network link failure detection." We have decided to change the priority to High and will fix it in STX 3.0.

Changed in starlingx:
importance: Low → High
Ghada Khalil (gkhalil)
tags: added: stx.3.0
Juan Carlos Alonso (juancarlosa) wrote :

Is there any progress on the implementation of this issue?

ChenjieXu (midone) wrote :

Hi all,

The OAM, MGMT and CLUSTER-HOST interfaces are already monitored by collectd. After the cables for those interfaces are pulled, the alarms are raised, and after the cables are plugged back in, the alarms clear. The alarms are listed below:

| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 100.107 | 'OAM' interface failed | host=controller-1.interface=oam | critical | 2019-09-12T00:23:55 |
| 100.106 | 'OAM' port failed | host=controller-1.port=b95a080e-810e-4d54-a3c1-aa969c20b780 | major | 2019-09-12T00:23:55 |
| 100.111 | 'CLUSTER-HOST' interface failed | host=controller-1.interface=cluster-host | critical | 2019-09-12T00:21:46 |
| 100.109 | 'MGMT' interface failed | host=controller-1.interface=mgmt | critical | 2019-09-12T00:21:46 |
| 100.108 | 'MGMT' port failed | host=controller-1.port=d81e3b28-c9d8-4093-9528-7825abd1dcc2 | major | 2019-09-12T00:21:46 |
| 100.110 | 'CLUSTER-HOST' port failed | host=controller-1.port=d81e3b28-c9d8-4093-9528-7825abd1dcc2 | major | 2019-09-12T00:21:46 |
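
Alarm tables pasted from the fm CLI often wrap long cells (entity IDs, timestamps) onto continuation rows, which makes them awkward to grep or compare. The following is a minimal sketch, not part of StarlingX, for merging such continuation rows back into one row per alarm. The function name merge_wrapped_rows is hypothetical, and the sketch assumes wrapped fragments can be joined without a separator, which holds for IDs and timestamps but would drop a space in wrapped free text.

```python
def merge_wrapped_rows(table_text):
    """Merge continuation rows of a wrapped ASCII table into full rows.

    A row whose first cell is empty is treated as a continuation of the
    previous row. Fragments are concatenated with no separator (assumption:
    the wrap happened in the middle of an ID or timestamp).
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders, blank lines and other text
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if cells[0]:
            rows.append(cells)  # a new alarm row starts here
        elif rows and any(cells):
            # Continuation row: glue each fragment onto the matching cell.
            rows[-1] = [a + b for a, b in zip(rows[-1], cells)]
        # Rows that are entirely empty are padding and are ignored.
    return rows


if __name__ == "__main__":
    # Two rows copied from the alarm output quoted in this thread.
    sample = """
| 100.111 | 'CLUSTER-HOST' interface failed | host=controller-1.interface=cluster- | critical | 2019-09-12T00:21: |
| | | host | | 46 |
| | | | | |
"""
    for row in merge_wrapped_rows(sample):
        print(" | ".join(row))
```

Treating "first cell empty" as the continuation marker works here because every alarm row starts with an alarm ID.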

ChenjieXu (midone) wrote :

As aligned with Matt,

The below commands can be used to retrieve the interfaces state:
   ovs-ofctl dump-ports-desc $ovs_bridge
   ovs-appctl bond/show $bond_name
   ovs-appctl lacp/show $bond_name

I will submit a patch to implement this function.
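
To illustrate how the first command above could feed a link-state check, here is a minimal Python sketch. It is not the code from the patch; the function names, the sample text and the MAC addresses are illustrative assumptions. It also assumes the usual layout of ovs-ofctl dump-ports-desc output, where each port starts with a "N(name): addr:..." line followed by indented "config:" and "state:" lines, and where a pulled cable shows up as LINK_DOWN in the state line.

```python
import re
import subprocess

# Matches port header lines such as " 1(eth0): addr:3c:fd:fe:00:00:01".
PORT_HEADER = re.compile(r"^\s*\w+\((?P<name>[^)]+)\): addr:")

# Illustrative sample only (assumed layout, made-up MAC addresses).
SAMPLE = """\
OFPST_PORT_DESC reply (xid=0x2):
 1(eth0): addr:3c:fd:fe:00:00:01
     config:     0
     state:      0
     speed: 10000 Mbps now, 10000 Mbps max
 2(eth1): addr:3c:fd:fe:00:00:02
     config:     0
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(br-phy0): addr:3c:fd:fe:00:00:03
     config:     PORT_DOWN
     state:      LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
"""


def ports_with_link_down(dump_text):
    """Return the names of ports whose state line contains LINK_DOWN."""
    down = []
    current = None
    for line in dump_text.splitlines():
        header = PORT_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        stripped = line.strip()
        if current and stripped.startswith("state:"):
            flags = stripped.split(":", 1)[1].split()
            if "LINK_DOWN" in flags:
                down.append(current)
    return down


def dump_ports_desc(bridge):
    """Run ovs-ofctl on the given bridge and return its text output."""
    result = subprocess.run(
        ["ovs-ofctl", "dump-ports-desc", bridge],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    print(ports_with_link_down(SAMPLE))
```

On a real host, ports_with_link_down(dump_ports_desc("br-phy0")) would perform the same check against live data, assuming a bridge with that name exists.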

ChenjieXu (midone)
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to monitoring (master)

Fix proposed to branch: master
Review: https://review.opendev.org/687025

OpenStack Infra (hudson-openstack) wrote : Fix merged to monitoring (master)

Reviewed: https://review.opendev.org/687025
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=92233ff68ecbfe08ed2d69cf4abf947d785261f5
Submitter: Zuul
Branch: master

commit 92233ff68ecbfe08ed2d69cf4abf947d785261f5
Author: chenjie1 <email address hidden>
Date: Tue Oct 8 04:58:24 2019 +0800

    OVS collectd interface/port state monitoring

    Use collectd to monitor OVS interface and port. Some
    host interfaces will be added to OVS port. Only these
    ports and interfaces will be monitored.

    Change-Id: Icbf3d4c47afd177392f023720c114783332b143b
    Story: #2002948
    Task: #22944
    Closes-Bug: #1834512
    Signed-off-by: Chenjie Xu <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Elio Martinez (elio1979) wrote :

Using a virtual environment, I was able to get the alarm for the node where I brought down the data interface:

+----------+-----------------------+-----------------------------+----------+----------------------------+
| Alarm ID | Reason Text           | Entity ID                   | Severity | Time Stamp                 |
+----------+-----------------------+-----------------------------+----------+----------------------------+
| 300.004  | 'br-phy0' port failed | host=compute-0.port=br-phy0 | critical | 2019-11-30T17:12:55.094458 |
+----------+-----------------------+-----------------------------+----------+----------------------------+

Tested on :
###
### StarlingX
### Built from master
###

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191122T023000Z"

JOB="STX_build_master_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="327"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-11-22 02:30:00 +0000"

Can we check that the alarm is raised in the mtclog as well?

ChenjieXu (midone) wrote :

Hi Elio,

Which mtclog are you asking about? Could you please give the path of the mtclog?

Maybe you are asking how to check the alarm in the log files for performance testing? If so, you can try the following log files:
   /var/log/mtcAgent.log
   /var/log/fm-manager.log
   /var/log/fm-event.log
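
For anyone who wants to script that check, here is a minimal Python sketch that searches those log files for an alarm ID such as 300.004. The function names are hypothetical, and because the exact log line format is not shown in this thread, the sketch only does a plain substring match rather than parsing fields.

```python
import os

# Log files suggested above.
LOG_PATHS = [
    "/var/log/mtcAgent.log",
    "/var/log/fm-manager.log",
    "/var/log/fm-event.log",
]


def find_alarm_lines(lines, alarm_id):
    """Return (line_number, text) pairs for lines mentioning alarm_id."""
    matches = []
    for number, line in enumerate(lines, start=1):
        if alarm_id in line:
            matches.append((number, line.rstrip("\n")))
    return matches


def scan_logs(paths, alarm_id):
    """Search each existing log file and return {path: matches}."""
    results = {}
    for path in paths:
        if not os.path.isfile(path):
            continue  # the file may be absent on some node types
        with open(path, errors="replace") as handle:
            results[path] = find_alarm_lines(handle, alarm_id)
    return results


if __name__ == "__main__":
    for path, matches in scan_logs(LOG_PATHS, "300.004").items():
        print(path, len(matches), "matching lines")
```

Reading these files normally requires sufficient privileges on the host.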

Elio Martinez (elio1979) wrote :

mtcAgent.log, since we are considering the log for performance.
