controller-0 is always degraded when unlocked on multi-node

Bug #1852155 reported by Austin Sun
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Bin Qian

Bug Description

Brief Description
-----------------
controller-0 remains degraded after the first unlock on a multi-node system.

Severity
--------
Major: system/feature is usable but degraded (controller-0 remains degraded).

Steps to Reproduce
------------------
1) Install and provision controller-0
2) Unlock controller-0
3) controller-0 remains degraded

Expected Behavior
------------------
controller-0 becomes available after the unlock.

Actual Behavior
----------------
controller-0 remains degraded.

Reproducibility
---------------
Reproducible (100%)
System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------
2019-11-11 daily build

Last Pass
---------
2019-11-04 daily build

Additionally, reverting https://review.opendev.org/#/c/691714/ on the latest code base makes the unlock pass.

Timestamp/Logs
--------------
See the log snippets and timestamps in the comments below.

Test Activity
-------------
Sanity

Revision history for this message
Austin Sun (sunausti) wrote :

If https://review.opendev.org/#/c/691714 and https://review.opendev.org/#/c/692439 are reverted, this issue is gone.
The owner needs to double-check this issue.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

@Austin, what is the reason for the degraded condition? Can you provide a list of the system alarms raised?
I've also subscribed Bin Qian, as he is the author of the reviews that you reference above.

Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Incomplete
Revision history for this message
Austin Sun (sunausti) wrote :

The alarm ID was 200.004, from the mtcAgent log:
2019-11-11T13:49:12.314 fmAPI.cpp(490): Enqueue raise alarm request: UUID (ed83aa22-fbb9-4fbc-b018-026aec20812e) alarm id (200.004) instant id (host=controller-0)
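
For reference, mtcAgent raises this alarm through the FM fault-management API (fmAPI.cpp is its C++ client). Below is a minimal Python sketch of raising an equivalent 200.004 alarm via the fm-api package; mtcAgent's real code is C++, and the constant names and field values here are assumptions for illustration only.

# Illustrative sketch only; constant names are assumed, not verified.
from fm_api import fm_api
from fm_api import constants

fault = fm_api.Fault(
    alarm_id='200.004',                      # same alarm id as in the log
    alarm_state=constants.FM_ALARM_STATE_SET,
    entity_type_id='host',
    entity_instance_id='host=controller-0',  # matches the "instant id" above
    severity=constants.FM_ALARM_SEVERITY_CRITICAL,
    reason_text='controller-0 experienced a service-affecting failure. '
                'Auto-recovery in progress.',
    alarm_type=constants.FM_ALARM_TYPE_4,    # assumed alarm type
    probable_cause=constants.ALARM_PROBABLE_CAUSE_UNKNOWN,
    proposed_repair_action='Manual Lock and Unlock may be required '
                           'if auto-recovery is unsuccessful.',
    service_affecting=True)

fm_api.FaultAPIs().set_fault(fault)          # enqueues the raise request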

The sysinv log also shows a store_default_config failure (an RPC timeout traceback). I checked with the test team; they did not hit this issue, so I think it is only raised under certain conditions.

sysinv 2019-11-11 13:48:12.162 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:
Traceback (most recent call last):

  File "/usr/lib/python2.7/site-packages/wsmeext/pecan.py", line 85, in callfunction
    result = f(self, *args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1762, in patch
    return self._patch_sys(uuid, patch, profile_uuid)

  File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 274, in inner
    return f(*args, **kwargs)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1768, in _patch_sys
    return self._patch(uuid, patch, profile_uuid)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 1958, in _patch
    self.stage_administrative_update(hostupdate)

  File "/usr/lib64/python2.7/site-packages/sysinv/api/controllers/v1/host.py", line 4531, in stage_administrative_update
    pecan.request.context)

  File "/usr/lib64/python2.7/site-packages/sysinv/conductor/rpcapi.py", line 1823, in store_default_config
    return self.call(context, self.make_msg('store_default_config'))

  File "/usr/lib64/python2.7/site-packages/sysinv/openstack/common/rpc/proxy.py", line 126, in call
    exc.info, real_topic, msg.get('method'))

Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"
: Timeout: Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"
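
For context, sysinv-api's request to the conductor (rpcapi.call in the traceback) is a blocking RPC with a fixed timeout, so a long-running job on the single conductor worker starves every queued call until it times out. A minimal, self-contained Python sketch of that failure mode, with plain threads and queues standing in for the real oslo RPC machinery (all names and durations are illustrative, scaled down from the real ~2 minute task and 60 s timeout):

import queue
import threading
import time

request_q = queue.Queue()

def conductor_worker():
    # Single worker loop, like the sysinv-conductor RPC handler.
    while True:
        method, reply_q = request_q.get()
        if method == 'long_playbook_task':
            time.sleep(2)  # stands in for the ~2 minute playbook run
        reply_q.put('done: ' + method)

def call(method, timeout=1.0):
    # Blocking call with a timeout, like rpcapi.call().
    reply_q = queue.Queue()
    request_q.put((method, reply_q))
    try:
        return reply_q.get(timeout=timeout)
    except queue.Empty:
        raise TimeoutError('Timeout while waiting on RPC response - '
                           'RPC method: "%s"' % method)

threading.Thread(target=conductor_worker, daemon=True).start()
request_q.put(('long_playbook_task', queue.Queue()))  # conductor goes busy
call('store_default_config')  # raises TimeoutError: worker is still busy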

Revision history for this message
Ghada Khalil (gkhalil) wrote :

What is the output of "fm alarm-list" when the controller is degraded?

Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Ghada:
    The deployment has been destroyed for other purposes. I think the fm alarm database already provided in the logs covers it.
    Thanks.
    BR,
Austin Sun

Revision history for this message
Austin Sun (sunausti) wrote :

Another setup hit the same issue.
The fm alarm-list output is below:
[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
|----------|-------------|-----------|----------|------------|
| 100.114 | NTP cannot reach external time source; syncing with peer controller only | host=controller-1.ntp | minor | 2019-11-20T05:30:28.726949 |
| 400.005 | Communication failure detected with peer over port ens6 on host controller-1 | host=controller-1.network=oam | major | 2019-11-20T05:10:01.354775 |
| 100.114 | NTP address 10.104.195.152 is not a valid or a reachable NTP server. | host=controller-1.ntp=10.104.195.152 | minor | 2019-11-18T07:20:28.689707 |
| 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-0.ntp | major | 2019-11-18T06:52:54.604071 |
| 100.114 | NTP address 10.104.192.16 is not a valid or a reachable NTP server. | host=controller-0.ntp=10.104.192.16 | minor | 2019-11-18T06:52:54.598642 |
| 200.004 | controller-0 experienced a service-affecting failure. Auto-recovery in progress. Manual Lock and Unlock may be required if auto-recovery is unsuccessful. | host=controller-0 | critical | 2019-11-18T06:42:09.134928 |
(output truncated)

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Bin Qian (bqian20)
status: Incomplete → Triaged
tags: added: stx.config stx.metal
Changed in starlingx:
importance: Undecided → Low
Revision history for this message
Bin Qian (bqian20) wrote :

The log shows that sysinv-conductor was blocked by a long-running task, _upgrade_downgrade_kube_networking:
sysinv 2019-11-11 13:47:06.144 101124 INFO sysinv.conductor.manager [-] _upgrade_downgrade_kube_networking executing playbook: /usr/share/ansible/stx-ansible/playbooks/upgrade-k8s-networking.yml
This task blocked sysinv-conductor for about 2 minutes.

During that period, mtcAgent sent an update to sysinv-api to update the host state after enabling the node, which triggered a request from sysinv-api to sysinv-conductor; that request was blocked:

sysinv 2019-11-11 13:48:12.162 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:
sysinv 2019-11-11 13:49:12.266 103024 ERROR wsme.api [-] Server-side error: "Timeout while waiting on RPC response - topic: "sysinv.conductor_manager", RPC method: "store_default_config" info: "<unknown>"". Detail:

mtcAgent degraded controller-0 after both of its attempts failed because the sysinv-api calls to sysinv-conductor timed out.

sysinv-conductor was designed to run short tasks and should not be blocked for long. A blocking task that lasts 2 minutes should not be executed synchronously in the sysinv-conductor process.
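
In other words, the remedy is to move the long playbook run off the conductor's RPC loop. A minimal sketch of that pattern with hypothetical names (the actual fix is the review referenced in a later comment, not this code):

import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: one background worker for long-running jobs so
# the conductor's RPC loop stays free for short calls.
_long_task_executor = ThreadPoolExecutor(max_workers=1)

def _run_k8s_networking_playbook():
    # The ~2 minute playbook run that previously blocked the conductor.
    subprocess.check_call([
        'ansible-playbook',
        '/usr/share/ansible/stx-ansible/playbooks/upgrade-k8s-networking.yml',
    ])

def upgrade_downgrade_kube_networking():
    # Return immediately; short RPCs such as store_default_config are
    # then served on time while the playbook runs in the background.
    _long_task_executor.submit(_run_k8s_networking_playbook)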

Revision history for this message
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: WCP_63_66
Load: 2019-12-04_14-56-46

[sysadmin@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-1 | worker | locked | disabled | offline |
| 3 | compute-0 | worker | locked | disabled | offline |
| 4 | controller-1 | controller | locked | disabled | offline |
+----+--------------+-------------+----------------+-------------+--------------+

Revision history for this message
Bin Qian (bqian20) wrote :

The issue that caused sysinv-conductor to be blocked has been fixed in:
https://review.opendev.org/#/c/695543/.
This issue is therefore considered fixed. Please retest with a build that includes the above commit.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Austin Sun (sunausti) wrote :

Hi, Bin Qian:
   From Peng Peng's report, this issue was reproduced in the 2019-12-04 daily build, but the commit you mentioned was merged in the 2019-12-03 build (http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20191203T000000Z/outputs/CHANGELOG.txt, commit 535879dfd0e88bd45006daceea090eb0cb2b5d50). Would you please double-check?

Revision history for this message
Bin Qian (bqian20) wrote :

Austin,

Peng's build did not include the change mentioned above.
