After uncontrolled swact "Alarm id 100.104 File System threshold exceeded" was not seen on new active controller

Bug #1814334 reported by Anujeyan Manokeran
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Eric MacDonald

Bug Description

A major alarm "File System threshold exceeded; 80%, actual 81%" (alarm ID 100.104) was raised on the active controller (controller-0). After an uncontrolled swact, the alarm was not displayed on controller-1.

Before the swact, the alarm and database samples below were seen.

Database sample
--------------
time host instance type type_instance value
1549046510149831000 controller-1 root percent_bytes used 46.05934143066406
1549046507409139000 compute-1 root percent_bytes used 27.08646011352539
1549046498961410000 controller-0 root percent_bytes used 81.13135528564453
1549046494668312000 compute-0 root percent_bytes used 27.009244918823242

$ fm alarm-list
+----------+-------------------------------------------------+------------------------+----------+----------------+
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+----------+-------------------------------------------------+------------------------+----------+----------------+
| 100.104 | File System threshold exceeded; 80%, actual 81% | host=controller-0. | major | 2019-02-01T18: |
| | | filesystem=/ | | 11:38.834446 |
| | | | | |
+----------+-------------------------------------------------+------------------------+----------+----------------+
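The alarm amounts to a simple threshold comparison on the collected usage figure. A minimal sketch of that logic (the 80%/90% thresholds and severity names are assumptions taken from the alarm text, not the actual collectd/FM implementation):

```python
# Illustrative sketch of the filesystem-usage check implied by alarm 100.104.
# Threshold values and severity names are assumptions drawn from the alarm
# text above, not the real plugin configuration.
def classify_usage(percent_used, major_threshold=80.0, critical_threshold=90.0):
    """Return the alarm severity a usage level would map to, or None."""
    if percent_used >= critical_threshold:
        return "critical"
    if percent_used >= major_threshold:
        return "major"
    return None  # below all thresholds: no alarm

# The controller-0 database sample (81.13% used) maps to "major",
# matching the fm alarm-list entry shown above.
print(classify_usage(81.13135528564453))  # major
```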

After the swact, the samples below show the file system usage is unchanged, but the alarm is no longer shown.
controller-0:/var/log$ df
Filesystem 1K-blocks Used Available Use% Mounted on
/dev/sda3 20027216 16270376 2716456 86% /
devtmpfs 49343708 0 49343708 0% /dev
tmpfs 49362692 0 49362692 0% /dev/shm
tmpfs 49362692 11516 49351176 1% /run
tmpfs 49362692 0 49362692 0% /sys/fs/cgroup
tmpfs 1048576 176 1048400 1% /tmp
/dev/mapper/cgts--vg-gnocchi--lv 4947584 120952 4548104 3% /opt/gnocchi
/dev/mapper/cgts--vg-img--conversions--lv 10190100 36896 9612532 1% /opt/img-conversions
/dev/mapper/cgts--vg-backup--lv 41153760 49176 38991048 1% /opt/backups
/dev/mapper/cgts--vg-scratch--lv 8126904 586920 7104172 8% /scratch
/dev/sda2 487634 114065 343873 25% /boot
/dev/mapper/cgts--vg-log--lv 7932336 720432 6785920 10% /var/log
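Note that df reports 86% for /dev/sda3 while the database sample reports 81%: GNU df computes Use% as used / (used + available), rounded up, and reserved blocks make that ratio higher than used / total. A small sketch of df's calculation using the /dev/sda3 row above:

```python
# df's Use% is used / (used + available), rounded up; reserved root blocks
# make Used + Available (18,986,832 KB here) smaller than 1K-blocks
# (20,027,216 KB), so this differs from used / total.
import math

def df_use_percent(used_kb, available_kb):
    """Reproduce df's Use% column from its Used and Available columns."""
    return math.ceil(100 * used_kb / (used_kb + available_kb))

print(df_use_percent(16270376, 2716456))  # 86, matching the df output
```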

Database sample
--------------
time host instance type type_instance value
1549046510149831000 controller-1 root percent_bytes used 46.05934143066406
1549046507409139000 compute-1 root percent_bytes used 27.08646011352539
1549046498961410000 controller-0 root percent_bytes used 81.13135528564453
1549046494668312000 compute-0 root percent_bytes used 27.009244918823242

Controller-1's fm-manager.log shows that the alarm was deleted.

2019-02-01T19:38:48.346 fmMsgServer.cpp(414): Send response for create log, uuid:(32e1ab61-2f83-4d82-95bf-00d5bc09a2bd) (0)
2019-02-01T19:38:48.999 fmMsgServer.cpp(486): Deleted alarm: (200.005) (host=controller-0.network=Management)
2019-02-01T19:38:48.999 fmMsgServer.cpp(503): Response to delete fault: 0
2019-02-01T19:38:49.039 fmMsgServer.cpp(486): Deleted alarm: (200.009) (host=controller-0.network=Infrastructure)
2019-02-01T19:38:49.039 fmMsgServer.cpp(503): Response to delete fault: 0
2019-02-01T19:38:49.319 fmMsgServer.cpp(486): Deleted alarm: (100.104) (host=controller-0.filesystem=/)
2019-02-01T19:38:49.319 fmMsgServer.cpp(503): Response to delete fault: 0
2019-02-01T19:38:50.078 fmMsgServer.cpp(500): Deleted alarm failed: (100.106) (host=controller-0.port=enp10s0f0) (FM_ERR_ENTITY_NOT_FOUND)
2019-02-01T19:38:50.078 fmMsgServer.cpp(503): Response to delete fault: 10
2019-02-01T19:38:50.118 fmMsgServer.cpp(500): Deleted alarm failed: (100.107) (host=controller-0.interface=oam) (FM_ERR_ENTITY_NOT_FOUND)
2019-02-01T19:38:50.118 fmMsgServer.cpp(503): Response to delete fault: 10
2019-02-01T19:38:50.118 fmMsgServer.cpp(398): Raising Alarm/Log, (401.001) (service_domain=controller.service_group=vim-services.host=controller-0)
2019-02-01T19:38:
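The log suggests fm-manager keys alarms by (alarm ID, entity instance) and returns FM_ERR_ENTITY_NOT_FOUND (response code 10) when a delete targets an entry that does not exist, as with 100.106 and 100.107 above. A hypothetical model of that bookkeeping (the class and method names are illustrative, not the actual fmMsgServer implementation; only the two response codes come from the log):

```python
# Hypothetical model of the alarm bookkeeping visible in the log: alarms are
# keyed by (alarm_id, entity_instance_id), and deleting a missing key fails
# with FM_ERR_ENTITY_NOT_FOUND. Class and method names are illustrative.
FM_ERR_OK = 0
FM_ERR_ENTITY_NOT_FOUND = 10

class AlarmStore:
    def __init__(self):
        self._alarms = {}

    def raise_alarm(self, alarm_id, entity, severity):
        self._alarms[(alarm_id, entity)] = severity
        return FM_ERR_OK

    def delete_alarm(self, alarm_id, entity):
        if (alarm_id, entity) not in self._alarms:
            return FM_ERR_ENTITY_NOT_FOUND
        del self._alarms[(alarm_id, entity)]
        return FM_ERR_OK

store = AlarmStore()
store.raise_alarm("100.104", "host=controller-0.filesystem=/", "major")
print(store.delete_alarm("100.104", "host=controller-0.filesystem=/"))   # 0
print(store.delete_alarm("100.106", "host=controller-0.port=enp10s0f0")) # 10
```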

Severity
--------
Major

Steps to Reproduce
------------------
1. Generate the alarm by filling up the disk on the active controller.
2. Reboot the active controller (sudo reboot) to force an uncontrolled swact.
3. Verify the alarm on the new active controller.
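The usage level from step 1 can be checked before the reboot and again in step 3; a small sketch using Python's shutil (the 80% figure is the major threshold from the alarm text; note shutil's used/total can differ slightly from df's Use% because of reserved blocks):

```python
# Check whether a filesystem is above the 80% major-alarm threshold from
# alarm 100.104, e.g. before rebooting in step 2 and again in step 3.
import shutil

def percent_used(path):
    """Percent of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

pct = percent_used("/")
print(f"/ is {pct:.1f}% used; above major threshold: {pct >= 80.0}")
```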

Expected Behavior
------------------
Alarm is shown on the new active controller.

Actual Behavior
----------------
As per description

Reproducibility
---------------

System Configuration
--------------------
storage system

Branch/Pull Time/Commit
-----------------------
StarlingX_Upstream_build release branch build as of 2019-01-24_20-18-00

Timestamp/Logs
--------------
2019-01-24_20-18-00

description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating - issue related to collectd feature

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
tags: added: stx.2019.05 stx.metal
Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-metal (master)

Change abandoned by Eric MacDonald (<email address hidden>) on branch: master
Review: https://review.openstack.org/634496

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-metal (master)

Fix proposed to branch: master
Review: https://review.openstack.org/642854

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-metal (master)

Change abandoned by Eric MacDonald (<email address hidden>) on branch: master
Review: https://review.openstack.org/642854
Reason: Rmon is being removed from the load in an update only days away.

Ken Young (kenyis)
Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
chen haochuan (martin1982) wrote :

Confirmed not reproducible on the latest build. Rmon has already been removed; alarm notifications are now raised from collectd.

Ken Young (kenyis)
tags: added: stx.2.0
removed: stx.2019.05
Changed in starlingx:
status: In Progress → Fix Committed
status: Fix Committed → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
chen haochuan (martin1982) wrote :

Confirmed not reproduced on the latest code. I built the latest code and deployed it in a VM with a duplex configuration.

Changed in starlingx:
status: In Progress → Confirmed
status: Confirmed → Invalid
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: chen haochuan (martin1982) → Eric MacDonald (rocksolidmtce)
Eric MacDonald (rocksolidmtce) wrote :

This issue is fixed by the following feature update.

https://review.openstack.org/#/c/643739/

Changed in starlingx:
status: Invalid → Fix Released
Anujeyan Manokeran (anujeyan) wrote :

Verified in load "20190506T233000Z"

tags: removed: stx.2.0 stx.metal stx.retestneeded
Ghada Khalil (gkhalil)
tags: added: stx.2.0 stx.metal