StarlingX

No alarm is triggered after pulling the data cable on a compute

Bug #1834512 reported by sathish subramanian on 2019-06-27

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	StarlingX	Fix Released	High	ChenjieXu

Bug Description

Brief Description
-----------------
When I removed the data cables from the computes and checked on controller-0
with fm alarm-list, no alarm was reported.

Please advise how I can get an alarm triggered after removing the data
cable on a compute.

Severity
--------
<Minor: System/Feature is usable with minor issue>

Steps to Reproduce
------------------
1. Manually remove the data cables from the computes.
  a. Example: data cables for eno3 and eno4 removed
  $ system host-port-list compute-0
  +--------------------------------------+------+----------+--------------+
  | uuid                                 | name | type     | pci address  |
  +--------------------------------------+------+----------+--------------+
  | 5cd6efe4-ef13-48b1-aee9-d3aa5271e677 | eno1 | ethernet | 0000:03:00.0 |
  | 5d96704c-dec2-4f4a-b950-57c30efc5894 | eno2 | ethernet | 0000:03:00.1 |
  | 783b90fd-a959-491a-a9f6-3c177a074e85 | eno3 | ethernet | 0000:03:00.2 |
  | 690d57c5-70dc-43f8-9262-e2483d8875aa | eno4 | ethernet | 0000:03:00.3 |
  +--------------------------------------+------+----------+--------------+

2. Go to controller-0 and check the alarm and event lists:
$ fm alarm-list
$ fm event-list
3. No alarms related to the removed data cables are triggered.
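
The check in step 2 could be scripted along the following lines. This is only a rough sketch: it assumes the alarm list is a pipe-delimited table whose first three columns are alarm ID, reason text and entity, with wrapped cells continued on rows whose first column is empty (the layout of the alarm listings quoted later in this report). The function name alarms_for_host is hypothetical.

```python
def alarms_for_host(table_text, host):
    """Return (alarm_id, reason, entity) tuples whose entity names `host`.

    Assumption: a pipe-delimited table with alarm ID, reason text and
    entity as the first three columns; wrapped cells continue on rows
    whose first column is empty.
    """
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip +----+ borders, shell prompts and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if cells[0] == "" and rows:
            # Continuation row: glue each fragment onto the previous row.
            # Fragments are joined without a space, which suits IDs and
            # timestamps but may fuse words in wrapped reason text.
            for i, fragment in enumerate(cells[:len(rows[-1])]):
                rows[-1][i] += fragment
        else:
            rows.append(cells)
    needle = "host=" + host
    return [
        (r[0], r[1], r[2])
        for r in rows
        if len(r) >= 3 and (r[2] == needle or r[2].startswith(needle + "."))
    ]
```

Feeding it the text printed by fm alarm-list on controller-0 and getting an empty list for compute-0 after the cable pull would correspond to the behavior reported here.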

Note:
1. Captured dmesg and the alarm list; please find the attached log.
2. Pod status is healthy.
3. On lock/unlock, the relevant alarm_id is raised.
4. All hosts are in the active state.

Expected Behavior
------------------
After removing the data cable on a compute, the relevant alarm should be triggered.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
- MN-External (2+2+2)
- Bare metal

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2019-06-21 01:30:00 +0000"

Last Pass
--------------
Not passed

sathish subramanian (sathis5x) wrote on 2019-06-27:

alarmlist.zip (45.9 KiB, application/zip)

description:

updated

Ghada Khalil (gkhalil) on 2019-06-27

tags:

added: stx.regression

Ghada Khalil (gkhalil) wrote on 2019-06-28:

Currently, starlingx does not have the capability to raise alarms when data links are pulled. The test-case you are running is outdated. There is a story in the networking backlog that tracks the introduction of this functionality:
https://storyboard.openstack.org/#!/story/2002948

However, this story is currently not committed for an stx release.

Changed in starlingx:
importance:	Undecided → Low
status:	New → Triaged
tags:	added: stx.networking
Changed in starlingx:
assignee:	nobody → Forrest Zhao (forrest.zhao)

Ghada Khalil (gkhalil) wrote on 2019-06-28:

Assigning to Forrest -- we should discuss in the networking team whether this is something we want to add in stx.3.0. We'll need to get input from the networking TLs

Wendy Mitchell (wmitchellwr) wrote on 2019-08-07:

Seeing the same issue in manual regression in stx

HW lab: PV-1
"20190731T013000Z"

When the data port is down on a worker node (e.g. compute-5), no data port failure alarm is triggered.

Expectation:
1. data port failure alarm for that compute,
e.g. 300.001 'Data' Port failed.
2. hypervisor status should become 'disabled' for that host
3. instances that can be scheduled should migrate from the compute
4. the host availability state should be 'degraded' due to the data interface alarm.

Forrest Zhao (forrest.zhao) on 2019-08-08

Changed in starlingx:
assignee:	Forrest Zhao (forrest.zhao) → ChenjieXu (midone)

Gongjun Song (songgongjun) wrote on 2019-08-09:

I reproduced this bug on both a VM and a host and got the same result: no alarm is listed by the following command:
$ fm alarm-list

The VM ran a simplex environment using the July 15, 2019 image.
The host ran a duplex environment using the July 15, 2019 image.

Forrest Zhao (forrest.zhao) wrote on 2019-08-15:

Following feedback from Victor Rodriguez ("I was wondering if we could discuss the possibility to change the priority to the launchpad since some use cases might want to measure/test the network link failure detection."), we have decided to change the priority to High and will fix this in STX 3.0.

Changed in starlingx:
importance:	Low → High

Ghada Khalil (gkhalil) on 2019-08-16

tags:

added: stx.3.0

Juan Carlos Alonso (juancarlosa) wrote on 2019-09-03:

Is there any progress on the implementation of this issue?

ChenjieXu (midone) wrote on 2019-09-12:

Hi all,

The OAM, MGMT and CLUSTER-HOST interfaces are already monitored by collectd. After the cables for those interfaces are pulled, alarms are raised; after the cables are plugged back in, the alarms clear. The alarms are listed below:

| 100.107 | 'OAM' interface failed          | host=controller-1.interface=oam                             | critical | 2019-09-12T00:23:55 |
| 100.106 | 'OAM' port failed               | host=controller-1.port=b95a080e-810e-4d54-a3c1-aa969c20b780 | major    | 2019-09-12T00:23:55 |
| 100.111 | 'CLUSTER-HOST' interface failed | host=controller-1.interface=cluster-host                    | critical | 2019-09-12T00:21:46 |
| 100.109 | 'MGMT' interface failed         | host=controller-1.interface=mgmt                            | critical | 2019-09-12T00:21:46 |
| 100.108 | 'MGMT' port failed              | host=controller-1.port=d81e3b28-c9d8-4093-9528-7825abd1dcc2 | major    | 2019-09-12T00:21:46 |
| 100.110 | 'CLUSTER-HOST' port failed      | host=controller-1.port=d81e3b28-c9d8-4093-9528-7825abd1dcc2 | major    | 2019-09-12T00:21:46 |

ChenjieXu (midone) wrote on 2019-09-18:

As agreed with Matt, the commands below can be used to retrieve the interface state:
   ovs-ofctl dump-ports-desc $ovs_bridge
   ovs-appctl bond/show $bond_name
   ovs-appctl lacp/show $bond_name

I will submit a patch to implement this function.
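
As a rough illustration only (not the patch itself), the output of the first command could be scanned for ports that report LINK_DOWN. The sketch assumes the usual ovs-ofctl dump-ports-desc layout, where a "N(name): addr:..." line is followed by indented "state:" lines; that layout may differ between OVS versions, and the function name link_down_ports is hypothetical.

```python
import re

# Assumed port header layout, e.g. " 2(eth1): addr:3c:fd:fe:00:00:02"
PORT_RE = re.compile(r"^\s*([^()\s]+)\(([^()\s]+)\):\s*addr:")


def link_down_ports(dump_text):
    """Return the names of ports whose state line contains LINK_DOWN."""
    down = []
    current = None  # name of the port block being read
    for line in dump_text.splitlines():
        match = PORT_RE.match(line)
        if match:
            current = match.group(2)
            continue
        stripped = line.strip()
        if current and stripped.startswith("state:"):
            flags = stripped.split(":", 1)[1].split()
            if "LINK_DOWN" in flags:
                down.append(current)
    return down
```

The text passed in would be whatever ovs-ofctl dump-ports-desc $ovs_bridge prints; a port whose cable has been pulled would be expected to show up in the returned list.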

ChenjieXu (midone) on 2019-09-18

Changed in starlingx:
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2019-10-07: Fix proposed to monitoring (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/687025

OpenStack Infra (hudson-openstack) wrote on 2019-11-13: Fix merged to monitoring (master)

#11

Reviewed: https://review.opendev.org/687025
Committed: https://git.openstack.org/cgit/starlingx/monitoring/commit/?id=92233ff68ecbfe08ed2d69cf4abf947d785261f5
Submitter: Zuul
Branch: master

commit 92233ff68ecbfe08ed2d69cf4abf947d785261f5
Author: chenjie1 <email address hidden>
Date: Tue Oct 8 04:58:24 2019 +0800

OVS collectd interface/port state monitoring

    Use collectd to monitor OVS interface and port. Some
    host interfaces will be added to OVS port. Only these
    ports and interfaces will be monitored.

    Change-Id: Icbf3d4c47afd177392f023720c114783332b143b
    Story: #2002948
    Task: #22944
    Closes-Bug: #1834512
    Signed-off-by: Chenjie Xu <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Elio Martinez (elio1979) wrote on 2019-11-30:

#12

Using a virtual environment, I was able to get the alarm for the node where I brought down the data interface.

OS="centos"
SW_VERSION="19.09"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="20191122T023000Z"

JOB="STX_build_master_master"
<email address hidden>"
BUILD_NUMBER="327"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-11-22 02:30:00 +0000"

Can we check for the raised alarm in the mtclog as well?

ChenjieXu (midone) wrote on 2019-12-02:

#13

Hi Elio,

Which mtclog are you asking about? Could you please give the path of the mtclog?

Maybe you are asking how to check the alarm in the log files for performance testing? If so, you can try the following log files:
   /var/log/mtcAgent.log
   /var/log/fm-manager.log
   /var/log/fm-event.log
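
A minimal sketch of how those files could be searched for a given alarm ID, for example 300.001. The log line format is not documented in this thread, so the sketch does a plain substring match and assumes nothing about the layout; the helper names are hypothetical.

```python
import os

LOG_FILES = [
    "/var/log/mtcAgent.log",
    "/var/log/fm-manager.log",
    "/var/log/fm-event.log",
]


def matching_log_lines(lines, alarm_id, host=None):
    """Yield lines that mention alarm_id (and host, if one is given)."""
    for line in lines:
        if alarm_id in line and (host is None or host in line):
            yield line.rstrip("\n")


def scan_files(paths, alarm_id, host=None):
    """Map each existing log file to its matching lines; skip missing files."""
    hits = {}
    for path in paths:
        if not os.path.isfile(path):
            continue
        with open(path, errors="replace") as handle:
            found = list(matching_log_lines(handle, alarm_id, host))
        if found:
            hits[path] = found
    return hits
```

For example, scan_files(LOG_FILES, "300.001", "compute-0") would return only the files that contain lines mentioning both strings.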