collectd: monitoring incorrect CPU list after AIO-DX install

Bug #1837424 reported by Bart Wensley
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Bin Qian
Milestone: (none)

Bug Description

Brief Description
-----------------
After installing an AIO-DX system, collectd on controller-1 monitors the wrong CPU list. As a result, CPU alarms for controller-1 are not raised when they should be.

Severity
--------
Major: user will not see CPU usage alarms for controller-1

Steps to Reproduce
------------------
Install an AIO-DX system. Cause high CPU usage on controller-1.

Expected Behavior
------------------
The collectd on controller-1 should be monitoring the platform CPUs.

Actual Behavior
----------------
The collectd on controller-1 is monitoring all CPUs:
2019-07-19T19:13:29.201 controller-1 collectd[12353]: info platform cpu usage plugin init function for controller-1
2019-07-19T19:13:29.203 controller-1 collectd[12353]: info platform cpu usage plugin has found 36 cpus total
2019-07-19T19:13:29.203 controller-1 collectd[12353]: info platform cpu usage plugin monitoring 36 cpus [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
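
A quick way to spot this condition from the logs is to compare the "found N cpus total" line with the "monitoring N cpus [...]" line. The following is only a rough sketch, assuming the log wording shown above; the function name and regular expressions are hypothetical and not part of the collectd plugin:

```python
import re

# Patterns assume the exact log wording quoted in this report.
TOTAL_RE = re.compile(r"plugin has found (\d+) cpus total")
MONITORED_RE = re.compile(r"plugin monitoring (\d+) cpus \[([\d, ]*)\]")

# The two relevant log lines from controller-1 (CPU list rebuilt for brevity).
CPU_LIST = "[" + ", ".join(str(i) for i in range(36)) + "]"
LOG_LINES = [
    "2019-07-19T19:13:29.203 controller-1 collectd[12353]: info platform "
    "cpu usage plugin has found 36 cpus total",
    "2019-07-19T19:13:29.203 controller-1 collectd[12353]: info platform "
    "cpu usage plugin monitoring 36 cpus " + CPU_LIST,
]


def monitors_all_cpus(lines):
    """Return True if the monitored list covers every CPU found,
    False if it is a subset, or None if either log line is missing."""
    total = None
    monitored = None
    for line in lines:
        match = TOTAL_RE.search(line)
        if match:
            total = int(match.group(1))
        match = MONITORED_RE.search(line)
        if match:
            monitored = [int(x) for x in match.group(2).split(",") if x.strip()]
    if total is None or monitored is None:
        return None
    return len(monitored) == total


if __name__ == "__main__":
    print(monitors_all_cpus(LOG_LINES))
```

On an AIO node, where only a subset of cores is reserved for the platform, a True result suggests the plugin did not pick up the reserved CPU list.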

I suspect this might be because the /etc/platform/worker_reserved.conf file on controller-1 was updated after collectd was started. Collectd was last started here:
2019-07-19T19:13:26.711 controller-1 collectd[12353]: info plugin_load: plugin "network" successfully loaded.

But it looks like the file was updated after that:
[root@controller-1 ~(keystone_admin)]# stat /etc/platform/worker_reserved.conf
  File: ‘/etc/platform/worker_reserved.conf’
  Size: 3229 Blocks: 8 IO Block: 4096 regular file
Device: 823h/2083d Inode: 796754 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2019-07-21 19:17:28.806585159 +0000
Modify: 2019-07-19 19:16:15.893321091 +0000
Change: 2019-07-19 19:16:15.895321091 +0000
Birth: -
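
The ordering of the two events can be checked mechanically. This is a minimal sketch, assuming the collectd log timestamp and the stat output are both in UTC (stat shows +0000; the log timezone is an assumption); the function name is hypothetical:

```python
from datetime import datetime


def modified_after_start(start_stamp, modify_stamp):
    """Return True if the file's Modify time is later than the daemon start.

    start_stamp:  log timestamp, e.g. "2019-07-19T19:13:26.711"
    modify_stamp: stat Modify value, e.g.
                  "2019-07-19 19:16:15.893321091 +0000"
    """
    start = datetime.strptime(start_stamp, "%Y-%m-%dT%H:%M:%S.%f")
    # stat prints nanoseconds; %f accepts at most 6 digits, so trim the rest.
    date_part, time_part = modify_stamp.split()[:2]
    whole, frac = time_part.split(".")
    modified = datetime.strptime(
        f"{date_part} {whole}.{frac[:6]}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return modified > start


if __name__ == "__main__":
    print(modified_after_start(
        "2019-07-19T19:13:26.711",
        "2019-07-19 19:16:15.893321091 +0000",
    ))
```

With the values from this report, the file was modified a little under three minutes after collectd started, which is consistent with collectd having read a stale copy.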

This may have something to do with recent changes to update the list of platform CPUs on the fly.

Eric MacDonald (collectd SME) indicated two options to fix the problem:
1. Force restart of collectd if the reserved file is changed
2. Have collectd re-read the reserved file every monitor interval

Option 1 is preferred, because option 2 might require additional enhancements to the collectd plugin to handle on-the-fly core allocation changes and contention over a file that may be changing while it is read.
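
As an illustration of option 1 only, the idea can be sketched as "remember the file's modification time and trigger a restart when it differs". Everything here is hypothetical: the function name, the `restart` callback, and the use of mtime as the change signal are assumptions, not the mechanism used in StarlingX:

```python
import os


def restart_if_changed(path, last_mtime_ns, restart):
    """Call restart() if the file's mtime differs from last_mtime_ns.

    path:          file to watch, e.g. /etc/platform/worker_reserved.conf
    last_mtime_ns: mtime recorded on the previous check (None on first call)
    restart:       zero-argument callable that restarts the service
                   (hypothetical; supplied by the caller)
    Returns the current mtime so the caller can pass it to the next check.
    """
    current_mtime_ns = os.stat(path).st_mtime_ns
    if current_mtime_ns != last_mtime_ns:
        restart()
    return current_mtime_ns
```

A caller would keep the returned value and pass it back on the next check. Restarting once at the point where the file is written avoids polling altogether, which is simpler when a single component owns the file.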

Reproducibility
---------------
Unsure

System Configuration
--------------------
AIO-DX (two node system)

Branch/Pull Time/Commit
-----------------------
Designer built load:
BUILD_DATE="2019-07-19 09:53:25 -0500"

Last Pass
---------
Unsure

Timestamp/Logs
--------------
Collect logs will be attached

Test Activity
-------------
Developer testing

Ghada Khalil (gkhalil) wrote :

Marking as stx.2.0 - cpu alarms not monitored as expected on controller-1

tags: added: stx.2.0 stx.metal
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Eric MacDonald (rocksolidmtce)
Ghada Khalil (gkhalil) wrote :

As per Brent, we want to proceed with option 1.

Changed in starlingx:
assignee: Eric MacDonald (rocksolidmtce) → Bin Qian (bqian20)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/672344

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/672344
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ba7423c911ea5ce50bf3795ad479ffe322fb4841
Submitter: Zuul
Branch: master

commit ba7423c911ea5ce50bf3795ad479ffe322fb4841
Author: Bin Qian <email address hidden>
Date: Tue Jul 23 15:02:26 2019 -0400

    Restart collectd at the end of configuring cpu

    Restart collectd after configuring cpu to ensure collectd loads
    updated configuration

    Closes-Bug: 1837424
    Change-Id: I10e0f431dfd01637f38319d506559aa3927f11ff
    Signed-off-by: Bin Qian <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released