IPv6: 100.113 alarm "'DATA-NETWORK0' interface failed" raised on hosts with two data interfaces after installation

Bug #1862651 reported by Peng Peng
This bug affects 1 person
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Thomas Gao
Milestone: (none)

Bug Description

Brief Description
-----------------
After a Distributed Cloud installation, the following alarms were raised:
100.113 alarm "'DATA-NETWORK0' interface failed"
100.112 alarm "'DATA-NETWORK1' port failed"

Severity
--------
Major

Steps to Reproduce
------------------
Install the Distributed Cloud (DC) labs.

Expected Behavior
------------------
No 100.113 or 100.112 alarms are raised.

Actual Behavior
----------------
100.113 and 100.112 alarms are raised.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC system, all labs

Lab-name: wcp_80-91

Branch/Pull Time/Commit
-----------------------
"2020-02-06_00-10-00"

Last Pass
---------
stx.3.0

Timestamp/Logs
--------------
system controller:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list --uuid
+--------------------------------------+-------+---------------------------------------------------------------------+---------------------------------------+----------+-------------------+
| UUID | Alarm | Reason Text | Entity ID | Severity | Time Stamp |
| | ID | | | | |
+--------------------------------------+-------+---------------------------------------------------------------------+---------------------------------------+----------+-------------------+
| bc1ade8a-5dce-4b17-9f93-2f6b72067e74 | 100. | 'DATA-NETWORK1' interface failed | host=controller-1.interface=data- | critical | 2020-02-06T15:58: |
| | 113 | | network1 | | 47 |
| | | | | | |
| 6601a0ee-7dc6-4bc2-b46a-ae868e78fa0a | 100. | 'DATA-NETWORK1' port failed | host=controller-1.port=aab53a28-357b- | major | 2020-02-06T15:58: |
| | 112 | | 4b23-a9bb-c49afbc7dd45 | | 47 |
| | | | | | |
| 5123e01c-7ea4-4307-a6f1-ac6131136980 | 200. | controller-1 experienced a configuration failure. | host=controller-1 | critical | 2020-02-06T15:52: |
| | 011 | | | | 34.959153 |
| | | | | | |
| 2c828b54-401c-48d9-a22b-456b235ce239 | 100. | 'DATA-NETWORK1' interface failed | host=controller-0.interface=data- | critical | 2020-02-06T15:24: |
| | 113 | | network1 | | 25 |
| | | | | | |
| 44aaffdf-edc0-4684-9142-a9dc004045a5 | 100. | 'DATA-NETWORK1' port failed | host=controller-0.port= | major | 2020-02-06T15:24: |
| | 112 | | 02ac70b0-be2b-4540-a0b9-b34823a99eff | | 25 |
| | | | | | |
+--------------------------------------+-------+---------------------------------------------------------------------+---------------------------------------+----------+-------------------+

subcloud6:
[sysadmin@controller-0 ~(keystone_admin)]$ fm alarm-list
+----------+----------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------+--------------------------------------+----------+-------------------+
| 100.113 | 'DATA-NETWORK1' interface failed | host=controller-0.interface=data- | critical | 2020-02-07T18:41: |
| | | network1 | | 09 |
| | | | | |
| 100.112 | 'DATA-NETWORK1' port failed | host=controller-0.port= | major | 2020-02-07T18:41: |
| | | 294d22b7-bfea-4871-9e5c-ad901f68c4b6 | | 09 |
| | | | | |
+----------+----------------------------------+--------------------------------------+----------+-------------------+
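
The wrapped cells in the `fm alarm-list` output above (for example an entity ID or time stamp split across two lines) can be rejoined mechanically. The following is a minimal Python sketch, not part of StarlingX: `parse_wrapped_table` is a hypothetical helper, and it assumes that continuation lines directly follow the row they belong to and that logical rows are separated by a border line or an all-blank row, as in the subcloud6 output.

```python
# Sketch only: parse_wrapped_table is a hypothetical helper, not a StarlingX API.
SAMPLE = """\
+----------+----------------------------------+--------------------------------------+----------+-------------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+----------------------------------+--------------------------------------+----------+-------------------+
| 100.113 | 'DATA-NETWORK1' interface failed | host=controller-0.interface=data- | critical | 2020-02-07T18:41: |
| | | network1 | | 09 |
| | | | | |
| 100.112 | 'DATA-NETWORK1' port failed | host=controller-0.port= | major | 2020-02-07T18:41: |
| | | 294d22b7-bfea-4871-9e5c-ad901f68c4b6 | | 09 |
| | | | | |
+----------+----------------------------------+--------------------------------------+----------+-------------------+
"""


def parse_wrapped_table(text):
    """Return logical rows, concatenating cells that were wrapped across lines."""
    rows, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("|"):
            cells = [c.strip() for c in line.strip("|").split("|")]
        else:
            cells = []  # border line or non-table text
        if not any(cells):
            # A border or an all-blank row closes the row being assembled.
            if current is not None:
                rows.append(current)
                current = None
            continue
        if current is None:
            current = cells
        else:
            # Continuation line: append each fragment to its column.
            # Fragments are joined without a space because the wrap
            # happens mid-token in the data rows.
            current = [a + b for a, b in zip(current, cells)]
    if current is not None:
        rows.append(current)
    return rows


if __name__ == "__main__":
    for row in parse_wrapped_table(SAMPLE):
        print(row)
```

The first returned row is the header. Joining without a space suits the data rows here; a header that wraps between words (as in the `--uuid` output, where "Alarm" and "ID" are on separate lines) would need separate handling.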

Test Activity
-------------
installation

Revision history for this message
Peng Peng (ppeng) wrote :
Ghada Khalil (gkhalil)
description: updated
tags: added: stx.distcloud
tags: added: stx.4.0
tags: removed: stx.4.0
Peng Peng (ppeng)
tags: added: stx.retestneeded
Yang Liu (yliu12) wrote :

I noticed this issue on a non-DC system where two data networks are configured; it displays the same alarms on one of the data ports. So I think this is not DC-specific.

Ghada Khalil (gkhalil) wrote :

stx.4.0 / medium priority - Yang confirmed that this appears to be an issue with systems that have multiple data ports. It is not just a stale alarm; one of the ports is actually down.

tags: added: stx.networking
removed: stx.distcloud
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Ghada Khalil (gkhalil)
Yang Liu (yliu12) wrote : Re: IPv6: 100.113 alarm "'DATA-NETWORK0' interface failed" raised on hosts with two data ports after installation

This is seen on a regular system (wfp3-7) as well:

On this system compute-0 has only 1 data interface configured, while compute-1 and compute-2 have two data interfaces configured. Only compute-1 and compute-2 have the data interface/port alarms raised.

[sysadmin@controller-1 ~(keystone_admin)]$ system host-if-list compute-0
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
| 1502d4f4-ede9-4a63-9b33-cb4f9d51983d | cluster0 | platform | vlan | 170 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 40172c0b-3056-4db8-9e26-19f7a3ac0a26 | data0 | data | ethernet | None | [u'enp24s0f0'] | [] | [] | MTU=1500,accelerated=True |
| a099d953-e183-490b-ad49-90214fe77bec | mgmt0 | platform | vlan | 66 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| cedd2f11-9870-4680-866b-24779a1521b8 | pxeboot0 | platform | ethernet | None | [u'enp134s0f0'] | [] | [u'mgmt0', u'cluster0'] | MTU=9216 |
+--------------------------------------+----------+----------+----------+---------+-----------------+---------------+-------------------------+---------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$
[sysadmin@controller-1 ~(keystone_admin)]$ system host-if-list compute-1
+--------------------------------------+----------+-----------+----------+---------+-----------------+---------------+-------------------+---------------------------+
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
+--------------------------------------+----------+-----------+----------+---------+-----------------+---------------+-------------------+---------------------------+
| 2abf22a9-9d1c-47e3-a65c-04ef8e967865 | mgmt0 | platform | vlan | 66 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 54b2c780-2105-49ff-8585-8eb2061555df | data1 | data | ethernet | None | [u'enp24s0f1'] | [] | [] | MTU=1500,accelerated=True |
| 6f36762a-2935-4b0f-ab5d-da672f540658 | data0 | data | ethernet | None | [u'enp24s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 71b48e6c-42ec-40f0-8ec6-f76abb9be416 | sriov1 | pci-sriov | ethernet | None | [u'enp134s0f1'] | [] | [] | MTU=1500 |
| 94cdee14-7d8a-4814-ad9f-8f65c4970101 | sriov0 | pci-sriov | ethernet | None | [u'enp134s0f0'] | [] | [] | MTU=1500 |
| b4864e66-be0b-4380-a7da-c859361238b9 | c...
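
The comparison made in this comment (one data interface on compute-0, two on compute-1) can be checked by counting rows whose class column is `data` in the `system host-if-list` output. This is a minimal Python sketch under stated assumptions: `data_interfaces` is a hypothetical helper, and it assumes the column order shown above (uuid, name, class, ...). The sample contains only rows visible in the excerpt.

```python
# Sketch only: data_interfaces is a hypothetical helper, not a StarlingX API.
COMPUTE_0 = """\
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
| 1502d4f4-ede9-4a63-9b33-cb4f9d51983d | cluster0 | platform | vlan | 170 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 40172c0b-3056-4db8-9e26-19f7a3ac0a26 | data0 | data | ethernet | None | [u'enp24s0f0'] | [] | [] | MTU=1500,accelerated=True |
| a099d953-e183-490b-ad49-90214fe77bec | mgmt0 | platform | vlan | 66 | [] | [u'pxeboot0'] | [] | MTU=1500 |
"""

COMPUTE_1 = """\
| uuid | name | class | type | vlan id | ports | uses i/f | used by i/f | attributes |
| 2abf22a9-9d1c-47e3-a65c-04ef8e967865 | mgmt0 | platform | vlan | 66 | [] | [u'pxeboot0'] | [] | MTU=1500 |
| 54b2c780-2105-49ff-8585-8eb2061555df | data1 | data | ethernet | None | [u'enp24s0f1'] | [] | [] | MTU=1500,accelerated=True |
| 6f36762a-2935-4b0f-ab5d-da672f540658 | data0 | data | ethernet | None | [u'enp24s0f0'] | [] | [] | MTU=1500,accelerated=True |
| 71b48e6c-42ec-40f0-8ec6-f76abb9be416 | sriov1 | pci-sriov | ethernet | None | [u'enp134s0f1'] | [] | [] | MTU=1500 |
"""


def data_interfaces(host_if_list_output):
    """Return the names of interfaces whose class column is 'data'."""
    names = []
    for line in host_if_list_output.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Borders, truncated lines, and the header row fall through this test.
        if len(cells) >= 3 and cells[2] == "data":
            names.append(cells[1])
    return names


if __name__ == "__main__":
    print("compute-0:", data_interfaces(COMPUTE_0))
    print("compute-1:", data_interfaces(COMPUTE_1))
```

On the visible rows this finds one data interface for compute-0 and two for compute-1, which matches the observation that only the hosts with two data interfaces raise the alarms.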


summary: - IPv6 DC: 100.113 alarm "'DATA-NETWORK0' interface failed" raised at both
- system controller and subclouds after installation
+ IPv6: 100.113 alarm "'DATA-NETWORK0' interface failed" raised on hosts
+ with two data ports after installation
summary: IPv6: 100.113 alarm "'DATA-NETWORK0' interface failed" raised on hosts
- with two data ports after installation
+ with two data interfaces after installation
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Ghada Khalil (gkhalil) → Thomas Gao (tgao)
Thomas Gao (tgao)
Changed in starlingx:
status: Triaged → In Progress
Thomas Gao (tgao) wrote :

Steps to reproduce the alarm-raising scenario:

On wp3-7, assume compute-1 and compute-2 are the hosts raising alarms 100.112 and 100.113.

[sysadmin@controller-1 ~(keystone_admin)]$ sudo -S -u postgres psql -d fm -c "select uuid, alarm_id, entity_instance_id, reason_text from alarm;"
                 uuid | alarm_id | entity_instance_id | reason_text
--------------------------------------+----------+----------------------------------------+--------------------------------------
 2238d41c-b2fe-423a-8f0b-6c4fa50d5bab | 100.112 | host=compute-2.port=enp24s0f1 | 'DATA-NETWORK0' port failed
 0b88e7e8-50b3-4e75-88fb-26c85b3a085d | 100.113 | host=compute-2.interface=data-network0 | 'DATA-NETWORK0' interface failed
 57d10b89-3c77-464d-96af-ad32c6a699a7 | 100.112 | host=compute-1.port=enp24s0f1 | 'DATA-NETWORK0' port failed
 ce79cfd3-3198-49f8-9489-5a17f52543d2 | 100.113 | host=compute-1.interface=data-network0 | 'DATA-NETWORK0' interface failed
[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-delete 57d10b89-3c77-464d-96af-ad32c6a699a7
[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-delete ce79cfd3-3198-49f8-9489-5a17f52543d2
[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
#verify the two alarms are gone

compute-1:~$ sudo systemctl restart collectd.service

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
#verify the two alarms are back

compute-1:~$ cat /var/log/daemon.log | grep collectd > ~/collectd.log

If you tail the log promptly, you should see something like this:

2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin Link Status Query Response:2:
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin enp24s0f1 100.112 port alarm raised
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network0 100.113 critical interface alarm raised
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin mgmt 100% ; link one 'enp136s0f0' went Up at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin cluster-host 100% ; link one 'enp136s0f0' went Up at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network0 0% ; link one 'enp24s0f1' went Down at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network1 100% ; link one 'enp24s0f0' went Up at 2020-05-05 20:10:41
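
The link-state lines above can be filtered to show only the networks whose port is down. This is a minimal Python sketch, not part of collectd or StarlingX: `down_links` is a hypothetical helper, and the regular expression is inferred solely from the log excerpt above, so other message variants may not match.

```python
import re

# Sketch only: the pattern is inferred from the log excerpt, not from the plugin source.
LINK_RE = re.compile(
    r"interface plugin (?P<network>\S+) (?P<pct>\d+)% ; "
    r"link one '(?P<port>[^']+)' went (?P<state>Up|Down) at (?P<since>.+)$"
)

SAMPLE_LOG = """\
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin enp24s0f1 100.112 port alarm raised
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network0 100.113 critical interface alarm raised
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin mgmt 100% ; link one 'enp136s0f0' went Up at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin cluster-host 100% ; link one 'enp136s0f0' went Up at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network0 0% ; link one 'enp24s0f1' went Down at 2020-05-05 20:10:41
2020-05-08T19:23:44.000 compute-1 collectd[92218]: info interface plugin data-network1 100% ; link one 'enp24s0f0' went Up at 2020-05-05 20:10:41
"""


def down_links(log_text):
    """Return (network, port, since) tuples for links reported as Down."""
    result = []
    for line in log_text.splitlines():
        match = LINK_RE.search(line.strip())
        if match and match.group("state") == "Down":
            result.append(
                (match.group("network"), match.group("port"), match.group("since"))
            )
    return result


if __name__ == "__main__":
    for network, port, since in down_links(SAMPLE_LOG):
        print(f"{network}: port {port} down since {since}")
```

For the excerpt above, the only link reported as Down is `enp24s0f1` on `data-network0`, which is the same port and interface named in the 100.112 and 100.113 alarm lines.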

Difu Hu (difuhu) wrote :

Issue was reproduced on
Lab: DC-1
Load: 2020-05-15_20-00-00

Log added on:
https://files.starlingx.kube.cengn.ca/launchpad/1862651

Thomas Gao (tgao) wrote :

This appears to be a lab configuration issue.

Changed in starlingx:
status: In Progress → Invalid
Peng Peng (ppeng) wrote :

This issue has not been seen recently.

tags: removed: stx.retestneeded