Bug #1838652 “Configuration out-of-date alarms on storage nodes ...” : Bugs : StarlingX

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-01:

#1

ALL_NODES_20190801.133526.tar Edit (97.3 MiB, application/x-tar)

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-08-02:

#2

Seen only once and the workaround is a lock/unlock. As such marking as low priority.

Changed in starlingx:
status:	New → Triaged
importance:	Undecided → Low
tags:	added: stx.helpwanted

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-06:

#3

Here's the history of the storage config-out-of-date issue. It is seen on the same storage system on following loads:
20190804T233000Z
20190804T013000Z
20190801T013000Z

This issue was not seen on same storage system with 20190728T233000Z load.
Unfortunately I don't have history for previous loads to confirm the frequency.

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-08-06:

#4

Moving priority up to Medium since issue is seen multiple times. Appears to have been introduced by a commit or change sometime after July 28. Assigning to the storage PL to triage and determine a prime to investigate this issue.

Changed in starlingx:
importance:	Low → Medium

Frank Miller (sensfan22) on 2019-08-06

Changed in starlingx:
assignee:	nobody → Cindy Xie (xxie1)

Cindy Xie (xxie1) on 2019-08-07

Changed in starlingx:
assignee:	Cindy Xie (xxie1) → chen haochuan (martin1982)

Numan Waheed (nwaheed) on 2019-08-07

tags:

added: stx.retestneeded

Revision history for this message

Frank Miller (sensfan22) wrote on 2019-08-07:

#5

Marking this stx.2.0 gating since this has been seen multiple times.

tags:

added: stx.2.0

Revision history for this message

chen haochuan (martin1982) wrote on 2019-08-08:

#6

Download full text (6.2 KiB)

There is such exception in all nodes' log file

2019-08-01 07:40:35.883 31972 INFO sysinv.agent.manager [-] iscsi initiator name = iqn.1994-05.com.redhat:6aa17f74fa21
2019-08-01 07:40:36.024 31972 INFO sysinv.agent.manager [-] Sysinv Agent platform update by host: {'config_applied': '81394040-4457-42aa-ab56-611953a7500c', 'first_report': True, 'availability': 'available', 'iscsi_initiator_name': 'iqn.1994-05.com.redhat:6aa17f74fa21'}
2019-08-01 09:04:56.939 31972 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common Traceback (most recent call last):
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 564, in ensure
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return method(*args, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 644, in _consume
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return connection.drain_events(**kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.blocking_read(timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common return self.on_inbound_frame(frame)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common callback(channel, method_sig, buf, None)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common method_sig, payload, content,
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common listener(...

There is such exception in all nodes' log file

2019-08-01 07:40:35.883 31972 INFO sysinv.agent.manager [-] iscsi initiator name = iqn.1994-05.com.redhat:6aa17f74fa21
2019-08-01 07:40:36.024 31972 INFO sysinv.agent.manager [-] Sysinv Agent platform update by host: {'config_applied': '81394040-4457-42aa-ab56-611953a7500c', 'first_report': True, 'availability': 'available', 'iscsi_initiator_name': 'iqn.1994-05.com.redhat:6aa17f74fa21'}
2019-08-01 09:04:56.939 31972 ERROR sysinv.openstack.common.rpc.common [-] Failed to consume message from queue: [Errno 104] Connection reset by peer
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common Traceback (most recent call last):
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 564, in ensure
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return method(*args, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/impl_kombu.py", line 644, in _consume
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return self.connection.drain_events(timeout=timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 301, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return self.transport.drain_events(self.connection, **kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 103, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return connection.drain_events(**kwargs)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 464, in drain_events
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return self.blocking_read(timeout)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 469, in blocking_read
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return self.on_inbound_frame(frame)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 68, in on_frame
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     callback(channel, method_sig, buf, None)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 473, in on_inbound_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     method_sig, payload, content,
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 142, in dispatch_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     listener(*args)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 593, in _on_close
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     self._x_close_ok()
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 608, in _x_close_ok
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     self.send_method(spec.Connection.CloseOk, callback=self._on_close_ok)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 64, in send_method
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     conn.frame_writer(1, self.channel_id, sig, args, content)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 174, in write_frame
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     write(view[:offset])
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 301, in write
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     self._write(s)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 385, in sendall
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     tail = self.send(data, flags)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 379, in send
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return self._send_loop(self.fd.send, data, flags)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common   File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 366, in _send_loop
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common     return send_method(data, *args)
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common error: [Errno 104] Connection reset by peer
2019-08-01 09:04:56.939 31972 TRACE sysinv.openstack.common.rpc.common
2019-08-01 09:04:56.944 31972 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.144.2:5672
2019-08-01 09:04:56.950 31972 ERROR sysinv.openstack.common.rpc.common [-] AMQP server on 192.168.144.2:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 1 seconds.
2019-08-01 09:04:57.952 31972 INFO sysinv.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.144.2:5672

For storage-1, such exception happens early, before send "configuration updated" to controller-0, which makes controller-0 always think storage-1 configuration out of date.

For storage-0, it is lucky, such exception happens after message "configuration updated" send to controller-0, so why no alarm "configuration out of data" for storage-0

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-08:

#7

Looking at the time stamps from above exceptions, they should be unrelated to this issue.
Note that the config out-of-date alarm was raised against both storage nodes and never cleared itself.
The time stamp for those alarms to be raised are around 7:40am. The reason storage-0's alarm was cleared was because a manual lock/unlock was performed on storage-0 around 8:53.

chen haochuan (martin1982) on 2019-08-12

Changed in starlingx:
assignee:	chen haochuan (martin1982) → nobody
assignee:	nobody → chen haochuan (martin1982)

Revision history for this message

Anujeyan Manokeran (anujeyan) wrote on 2019-08-12:

#8

This issue was seen in IP20-27 load 2019-08-09_20-59-00.

Revision history for this message

Cindy Xie (xxie1) wrote on 2019-08-13:

#9

According to Wei Zhou and Martin Chen, this seems like an issue caused by patch: https://review.opendev.org/#/c/673336/. Wei and Martin are working together for a fix.

Wei Zhou (wzhou007) on 2019-08-13

Changed in starlingx:
assignee:	chen haochuan (martin1982) → Wei Zhou (wzhou007)

Wei Zhou (wzhou007) on 2019-08-13

Changed in starlingx:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-13: Fix proposed to config (master)

#10

Fix proposed to branch: master
Review: https://review.opendev.org/676262

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-14: Fix merged to config (master)

#11

Reviewed: https://review.opendev.org/676262
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=1ab1aba4b0b90fd9205ebbc14847796636e5725b
Submitter: Zuul
Branch: master

commit 1ab1aba4b0b90fd9205ebbc14847796636e5725b
Author: Wei Zhou <email address hidden>
Date: Tue Aug 13 14:04:45 2019 -0400

Configuration out-of-date alarms on storage nodes since fresh install

    After fresh install of a Standard configuration with dedicated storage
    nodes, the storage node configuration out-of-date alarms are not
    cleared. This regression is caused by commit
    https://review.opendev.org/#/c/673336/. In check_unlock_storage()
    function a call to restore_ceph_config() was replaced by
    update_ceph_osd_config() when osdmap is empty. This will generate a
    config uuid and invoke config_apply_runtime_manifest before the
    .initial_config_complete flag file is created on the storage node.
    In this case sysinv-agent won't send response to sysinv-conductor to
    clear the configuration out-of-date alarm. The solution is to remove
    the code that calls update_ceph_osd_config() in check_unlock_storage()
    because ceph config is restored and osds are created automatically
    when storage node is unlocked.

    Following tests are performed:
      1. Install a storage lab and verify that ceph is running ok without
         storage node configuration out-of-date alarm
      2. Backup and restore a standard storage configuration

    Change-Id: If05adc5d30f7ea73cc5e630b4508be0351e4f6e4
    Closes-Bug: 1838652
    Signed-off-by: Wei Zhou <email address hidden>

Changed in starlingx:
status:	In Progress → Fix Released

Revision history for this message

Ghada Khalil (gkhalil) wrote on 2019-08-14:

#12

@Wei, please cherrypick the changes to the r/stx.2.0 branch

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-14: Fix proposed to config (r/stx.2.0)

#13

Fix proposed to branch: r/stx.2.0
Review: https://review.opendev.org/676526

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2019-08-15: Fix merged to config (r/stx.2.0)

#14

Reviewed: https://review.opendev.org/676526
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=cb0edea899af9234f92d25dfa1540069132b7e9a
Submitter: Zuul
Branch: r/stx.2.0

commit cb0edea899af9234f92d25dfa1540069132b7e9a
Author: Wei Zhou <email address hidden>
Date: Tue Aug 13 14:04:45 2019 -0400

Configuration out-of-date alarms on storage nodes since fresh install

    After fresh install of a Standard configuration with dedicated storage
    nodes, the storage node configuration out-of-date alarms are not
    cleared. This regression is caused by commit
    https://review.opendev.org/#/c/673336/. In check_unlock_storage()
    function a call to restore_ceph_config() was replaced by
    update_ceph_osd_config() when osdmap is empty. This will generate a
    config uuid and invoke config_apply_runtime_manifest before the
    .initial_config_complete flag file is created on the storage node.
    In this case sysinv-agent won't send response to sysinv-conductor to
    clear the configuration out-of-date alarm. The solution is to remove
    the code that calls update_ceph_osd_config() in check_unlock_storage()
    because ceph config is restored and osds are created automatically
    when storage node is unlocked.

    Following tests are performed:
      1. Install a storage lab and verify that ceph is running ok without
         storage node configuration out-of-date alarm
      2. Backup and restore a standard storage configuration

    Change-Id: If05adc5d30f7ea73cc5e630b4508be0351e4f6e4
    Closes-Bug: 1838652
    Signed-off-by: Wei Zhou <email address hidden>
    (cherry picked from commit 1ab1aba4b0b90fd9205ebbc14847796636e5725b)

Ghada Khalil (gkhalil) on 2019-08-15

tags:

added: in-r-stx20

Revision history for this message

Yang Liu (yliu12) wrote on 2019-08-20:

#15

This issue is no longer seen in recent dedicated storage sanity.
Load: stx2.0 2019-08-19_20-59-00

tags:

removed: stx.retestneeded

StarlingX

Configuration out-of-date alarms on storage nodes since fresh install

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches