Configuration out-of-date alarms on storage nodes since fresh install

Bug #1838652 reported by Yang Liu
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
StarlingX
Fix Released
Medium
Wei Zhou

Bug Description

Brief Description
-----------------
Configuration out-of-date alarms are not cleared on storage nodes after fresh install

Severity
--------
Minor

Steps to Reproduce
------------------
Install and configure a storage system

Expected Behavior
------------------
- system is alarm free after all hosts are unlocked and available

Actual Behavior
----------------
- All hosts are unlocked and available, but the config out-of-date status on storage nodes are not cleared
Work around: After manual lock/unlock, the alarm is cleared.

Reproducibility
---------------
Likely intermittent.
This issue was not seen on 20190728T013000Z load on multiple storage systems including this system. But we don't have other data yet on 20190801T013000Z where the issue is seen.

System Configuration
--------------------
Dedicated storage
Lab-name: wcp_7-12

Branch/Pull Time/Commit
-----------------------
stx master as of 20190801T013000Z

Last Pass
---------
20190728T013000Z on same system

Timestamp/Logs
--------------

[sysadmin@controller-1 ~(keystone_admin)]$ fm event-list -q "entity_instance_id=storage-1;event_log_id=250.001"
+-------------------+-------+--------------+-----------------------------------------+--------------------+----------+
| Time Stamp | State | Event Log ID | Reason Text | Entity Instance ID | Severity |
+-------------------+-------+--------------+-----------------------------------------+--------------------+----------+
| 2019-08-01T07:40: | set | 250.001 | storage-1 Configuration is out-of-date. | host=storage-1 | major |
| 35.977782 | | | | | |
| | | | | | |
| 2019-08-01T07:40: | set | 250.001 | storage-1 Configuration is out-of-date. | host=storage-1 | major |
| 35.797176 | | | | | |
| | | | | | |
| 2019-08-01T07:34: | set | 250.001 | storage-1 Configuration is out-of-date. | host=storage-1 | major |
| 14.464493 | | | | | |
| | | | | | |
| 2019-08-01T07:34: | set | 250.001 | storage-1 Configuration is out-of-date. | host=storage-1 | major |
| 10.889443 | | | | | |
| | | | | | |
+-------------------+-------+--------------+-----------------------------------------+--------------------+----------+

Test Activity
-------------
Sanity

Revision history for this message
Yang Liu (yliu12) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

Seen only once and the workaround is a lock/unlock. As such marking as low priority.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Low
tags: added: stx.helpwanted
Revision history for this message
Yang Liu (yliu12) wrote :

Here's the history of the storage config-out-of-date issue. It is seen on the same storage system on following loads:
20190804T233000Z
20190804T013000Z
20190801T013000Z

This issue was not seen on same storage system with 20190728T233000Z load.
Unfortunately I don't have history for previous loads to confirm the frequency.

Revision history for this message
Frank Miller (sensfan22) wrote :

Moving priority up to Medium since issue is seen multiple times. Appears to have been introduced by a commit or change sometime after July 28. Assigning to the storage PL to triage and determine a prime to investigate this issue.

Changed in starlingx:
importance: Low → Medium
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → chen haochuan (martin1982)
Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Marking this stx.2.0 gating since this has been seen multiple times.

tags: added: stx.2.0
Revision history for this message
chen haochuan (martin1982) wrote :
Download full text (6.2 KiB)

There is such exception in all nodes' log file

2019-08-01 07:40:35.883 31972 INFO sysinv.agent.manager [-] iscsi initiator name = iqn.1994-05.com.redhat:6aa17f74fa21
2019-08-01 07:40:36.024 31972 INFO sysinv.agent.manager [-] Sysinv Agent platform update by host: {'config_applied': '81394040-4457-42aa-ab56-611953a7500c', 'first_report': True, 'availability': 'available', 'iscsi_initiator_name': 'iqn.1994-05.com.redhat:6aa17f74fa21'}
2019-08-01 09:04:56.939 31972 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common Traceback (most recent call last):
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 564, in ensure
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return method(*args, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 644, in _consume
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return connection.drain_events(**kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.blocking_read(timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.on_inbound_frame(frame)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common callback(channel, method_sig, buf, None)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common method_sig, payload, content,
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common listener(...

Read more...

Revision history for this message
Yang Liu (yliu12) wrote :

Looking at the time stamps from above exceptions, they should be unrelated to this issue.
Note that the config out-of-date alarm was raised against both storage nodes and never cleared itself.
The time stamp for those alarms to be raised are around 7:40am. The reason storage-0's alarm was cleared was because a manual lock/unlock was performed on storage-0 around 8:53.

[2019-08-01 08:28:27,174] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------+----------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------+----------------+----------+----------------------------+
| 680035bf-9314-4b6c-8df4-b908de9f5a7f | 250.001 | storage-0 Configuration is out-of-date. | host=storage-0 | major | 2019-08-01T07:41:15.129265 |
| e35c641b-1335-4700-b34e-f54dc2eb2564 | 250.001 | storage-1 Configuration is out-of-date. | host=storage-1 | major | 2019-08-01T07:40:35.977782 |
+--------------------------------------+------

Changed in starlingx:
assignee: chen haochuan (martin1982) → nobody
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
Anujeyan Manokeran (anujeyan) wrote :

This issue was seen in IP20-27 load 2019-08-09_20-59-00.

Revision history for this message
Cindy Xie (xxie1) wrote :

According to Wei Zhou and Martin Chen, this seems like an issue caused by patch: https://review.opendev.org/#/c/673336/. Wei and Martin are working together for a fix.

Wei Zhou (wzhou007)
Changed in starlingx:
assignee: chen haochuan (martin1982) → Wei Zhou (wzhou007)
Wei Zhou (wzhou007)
Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/676262

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/676262
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=1ab1aba4b0b90fd9205ebbc14847796636e5725b
Submitter: Zuul
Branch: master

commit 1ab1aba4b0b90fd9205ebbc14847796636e5725b
Author: Wei Zhou <email address hidden>
Date: Tue Aug 13 14:04:45 2019 -0400

    Configuration out-of-date alarms on storage nodes since fresh install

    After fresh install of a Standard configuration with dedicated storage
    nodes, the storage node configuration out-of-date alarms are not
    cleared. This regression is caused by commit
    https://review.opendev.org/#/c/673336/. In check_unlock_storage()
    function a call to restore_ceph_config() was replaced by
    update_ceph_osd_config() when osdmap is empty. This will generate a
    config uuid and invoke config_apply_runtime_manifest before the
    .initial_config_complete flag file is created on the storage node.
    In this case sysinv-agent won't send response to sysinv-conductor to
    clear the configuration out-of-date alarm. The solution is to remove
    the code that calls update_ceph_osd_config() in check_unlock_storage()
    because ceph config is restored and osds are created automatically
    when storage node is unlocked.

    Following tests are performed:
      1. Install a storage lab and verify that ceph is running ok without
         storage node configuration out-of-date alarm
      2. Backup and restore a standard storage configuration

    Change-Id: If05adc5d30f7ea73cc5e630b4508be0351e4f6e4
    Closes-Bug: 1838652
    Signed-off-by: Wei Zhou <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Wei, please cherrypick the changes to the r/stx.2.0 branch

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (r/stx.2.0)

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676526

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (r/stx.2.0)

Reviewed: https://review.opendev.org/676526
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cb0edea899af9234f92d25dfa1540069132b7e9a
Submitter: Zuul
Branch: r/stx.2.0

commit cb0edea899af9234f92d25dfa1540069132b7e9a
Author: Wei Zhou <email address hidden>
Date: Tue Aug 13 14:04:45 2019 -0400

    Configuration out-of-date alarms on storage nodes since fresh install

    After fresh install of a Standard configuration with dedicated storage
    nodes, the storage node configuration out-of-date alarms are not
    cleared. This regression is caused by commit
    https://review.opendev.org/#/c/673336/. In check_unlock_storage()
    function a call to restore_ceph_config() was replaced by
    update_ceph_osd_config() when osdmap is empty. This will generate a
    config uuid and invoke config_apply_runtime_manifest before the
    .initial_config_complete flag file is created on the storage node.
    In this case sysinv-agent won't send response to sysinv-conductor to
    clear the configuration out-of-date alarm. The solution is to remove
    the code that calls update_ceph_osd_config() in check_unlock_storage()
    because ceph config is restored and osds are created automatically
    when storage node is unlocked.

    Following tests are performed:
      1. Install a storage lab and verify that ceph is running ok without
         storage node configuration out-of-date alarm
      2. Backup and restore a standard storage configuration

    Change-Id: If05adc5d30f7ea73cc5e630b4508be0351e4f6e4
    Closes-Bug: 1838652
    Signed-off-by: Wei Zhou <email address hidden>
    (cherry picked from commit 1ab1aba4b0b90fd9205ebbc14847796636e5725b)

Ghada Khalil (gkhalil)
tags: added: in-r-stx20
Revision history for this message
Yang Liu (yliu12) wrote :

This issue is no longer seen in recent dedicated storage sanity.
Load: stx2.0 2019-08-19_20-59-00

tags: removed: stx.retestneeded
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.