controller-0 is always degraded when unlocked on multi-node

Bug #1852155 reported by Austin Sun
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
controller-0 remains degraded after the first unlock on a multi-node system.

Severity
--------
Major: system/feature is usable but degraded (controller-0 remains degraded).

Steps to Reproduce
------------------
1) Install and provision controller-0
2) Unlock controller-0
3) controller-0 remains degraded

Expected Behavior
------------------
controller-0 becomes available after the unlock.

Actual Behavior
----------------
controller-0 remains degraded.

Reproducibility
---------------
Reproducible (100%)
System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
2019-11-11 daily build

Last Pass
---------
2019-11-04 daily build

Additionally, reverting https://review.opendev.org/#/c/691714/ on the latest code base makes the unlock pass.

Timestamp/Logs
--------------
See the log snippets and timestamps in the comments below.

Test Activity
-------------
Sanity

Revision history for this message
Austin Sun (sunausti) wrote :

If https://review.opendev.org/#/c/691714 and https://review.opendev.org/#/c/692439 are reverted, this issue is gone.
The owner needs to double-check this issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Austin, what is the reason for the degraded condition? Can you provide a list of the system alarms raised?
I've also subscribed Bin Qian, as he is the author of the reviews that you reference above.

Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Austin Sun (sunausti) wrote :

The alarm ID was 200.004, from the mtcAgent log:
2019-11-11T13:49:12.314 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ed83aa22-fbb9-4fbc-b018-026aec20812e) alarm id (200.004) instant id (host=controller-0)
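
For reference, mtcAgent raises this alarm through the FM fault-management API (fmAPI.cpp is its C++ client). Below is a minimal Python sketch of raising an equivalent 200.004 alarm via the fm-api package; mtcAgent's real code is C++, and the constant names and field values here are assumptions for illustration only.

# Illustrative sketch only; constant names are assumed, not verified.
from fm_api import fm_api
from fm_api import constants

fault = fm_api.Fault(
    alarm_id='200.004',                      # same alarm id as in the log
    alarm_state=constants.FM_ALARM_STATE_SET,
    entity_type_id='host',
    entity_instance_id='host=controller-0',  # matches the "instant id" above
    severity=constants.FM_ALARM_SEVERITY_CRITICAL,
    reason_text='controller-0 experienced a service-affecting failure. '
                'Auto-recovery in progress.',
    alarm_type=constants.FM_ALARM_TYPE_4,    # assumed alarm type
    probable_cause=constants.ALARM_PROBABLE_CAUSE_UNKNOWN,
    proposed_repair_action='Manual Lock and Unlock may be required '
                           'if auto-recovery is unsuccessful.',
    service_affecting=True)

fm_api.FaultAPIs().set_fault(fault)          # enqueues the raise request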

The sysinv log also shows a store_default_config failure (an RPC timeout traceback). I checked with the test team; they did not hit this issue, so I think it is only raised under certain conditions.

sysinv 2019-11-11 13:48:12.162 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1762, in patch
    return self._patch_sys(uuid, patch, profile_uuid)

  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1768, in _patch_sys
    return self._patch(uuid, patch, profile_uuid)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1958, in _patch
    self.stage_administrative_update(hostupdate)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 4531, in stage_administrative_update
    pecan.request.context)

  File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1823, in store_default_config
    return self.call(context, self.make_msg('store_default_config'))

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
    exc.info, real_topic, msg.get('method'))

Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"
: Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"
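
For context, sysinv-api's request to the conductor (rpcapi.call in the traceback) is a blocking RPC with a fixed timeout, so a long-running job on the single conductor worker starves every queued call until it times out. A minimal, self-contained Python sketch of that failure mode, with plain threads and queues standing in for the real oslo RPC machinery (all names and durations are illustrative, scaled down from the real ~2 minute task and 60 s timeout):

import queue
import threading
import time

request_q = queue.Queue()

def conductor_worker():
    # Single worker loop, like the sysinv-conductor RPC handler.
    while True:
        method, reply_q = request_q.get()
        if method == 'long_playbook_task':
            time.sleep(2)  # stands in for the ~2 minute playbook run
        reply_q.put('done: ' + method)

def call(method, timeout=1.0):
    # Blocking call with a timeout, like rpcapi.call().
    reply_q = queue.Queue()
    request_q.put((method, reply_q))
    try:
        return reply_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError('Timeout while waiting on RPC response - '
                           'RPC method: "%s"' % method)

threading.Thread(target=conductor_worker, daemon=True).start()
request_q.put(('long_playbook_task', queue.Queue()))  # conductor goes busy
call('store_default_config')  # raises TimeoutError: worker is still busy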

Revision history for this message
Ghada Khalil (gkhalil) wrote :

What is the output of "fm alarm-list" when the controller is degraded?

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Ghada:
    The deployment has been destroyed for other purposes. I think the fm alarm database already provided in the logs covers it.
    Thanks.
    BR,
Austin Sun

Revision history for this message
Austin Sun (sunausti) wrote :

Another setup hit the same issue.
The fm alarm-list output is below:
[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
|----------|-------------|-----------|----------|------------|
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2019-11-20T05:30:28.726949 |
| 400.005 | Communication failure detected with peer over port ens6 on host controller-1 | host=controller-1.network=oam | major | 2019-11-20T05:10:01.354775 |
| 100.114 | NTP address 10.104.195.152 is not a valid or a reachable NTP server. | host=controller-1.ntp=10.104.195.152 | minor | 2019-11-18T07:20:28.689707 |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-11-18T06:52:54.604071 |
| 100.114 | NTP address 10.104.192.16 is not a valid or a reachable NTP server. | host=controller-0.ntp=10.104.192.16 | minor | 2019-11-18T06:52:54.598642 |
| 200.004 | controller-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-0 | critical | 2019-11-18T06:42:09.134928 |
(output truncated)

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
status: Incomplete → Triaged
tags: added: stx.config stx.metal
Changed in starlingx:
importance: Undecided → Low
Revision history for this message
Bin Qian (bqian20) wrote :

The log shows that sysinv-conductor was blocked by a long-running task, _upgrade_downgrade_kube_networking:
sysinv 2019-11-11 13:47:06.144 101124 INFO sysinv.conductor.manager [-] _upgrade_downgrade_kube_networking executing playbook: /usr/share/ansible/stx-ansible/playbooks/upgrade-k8s-networking.yml
This task blocked sysinv-conductor for about 2 minutes.

During that period, mtcAgent sent an update to sysinv-api to update the host state after enabling the node, which triggered a request from sysinv-api to sysinv-conductor; that request was blocked:

sysinv 2019-11-11 13:48:12.162 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:
sysinv 2019-11-11 13:49:12.266 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:

mtcAgent degraded controller-0 after both of its attempts failed because the sysinv-api calls to sysinv-conductor timed out.

sysinv-conductor was designed to run short tasks and should not be blocked for long. A blocking task that lasts 2 minutes should not be executed synchronously in the sysinv-conductor process.
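
In other words, the remedy is to move the long playbook run off the conductor's RPC loop. A minimal sketch of that pattern with hypothetical names (the actual fix is the review referenced in a later comment, not this code):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: one background worker for long-running jobs so
# the conductor's RPC loop stays free for short calls.
_long_task_executor = ThreadPoolExecutor(max_workers=1)

def _run_k8s_networking_playbook():
    # The ~2 minute playbook run that previously blocked the conductor.
    subprocess.check_call([
        'ansible-playbook',
        '/usr/share/ansible/stx-ansible/playbooks/upgrade-k8s-networking.yml',
    ])

def upgrade_downgrade_kube_networking():
    # Return immediately; short RPCs such as store_default_config are
    # then served on time while the playbook runs in the background.
    _long_task_executor.submit(_run_k8s_networking_playbook)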

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_63_66
Load: 2019-12-04_14-56-46

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-1 | worker | locked | disabled | offline |
| 3 | compute-0 | worker | locked | disabled | offline |
| 4 | controller-1 | controller | locked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+

Revision history for this message
Bin Qian (bqian20) wrote :

The issue that caused sysinv-conductor to be blocked has been fixed in:
https://review.opendev.org/#/c/695543/.
This issue is therefore considered fixed. Please retest with a build that includes the above commit.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Bin Qian:
   From Peng Peng's report, this issue was reproduced in the 2019-12-04 daily build, but the commit you mentioned was merged in the 2019-12-03 build (http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20191203T000000Z/outputs/CHANGELOG.txt, commit 535879dfd0e88bd45006daceea090eb0cb2b5d50). Would you please double-check?

Revision history for this message
Bin Qian (bqian20) wrote :

Austin,

Peng's build did not include the change mentioned above.
