DC: mgr-restful-plugin restarting every few minutes, causing a number of other services to restart as well
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Triaged | Medium | Stefan Dinescu | | |
Bug Description
Brief Description
-----------------
A DC system had stability issues when bootstrapping subclouds (for example, the bootstrap of a subcloud failed prematurely, or the subcloud stayed stuck at bootstrapping even after the bootstrap was done). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also caused a number of other services to restart. Even though those services recover quickly, the repeated restarts make the system unstable.
The following services are likely affected, according to Gerry Kopec:
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
Severity
--------
Major
Steps to Reproduce
------------------
Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling every few minutes because its audit failed.
Not sure about steps to reproduce.
Automated regression was running when the issue started, but the events appear unrelated. See details in the timestamps and logs.
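The log check described above can be sketched as a small filter. This is only a rough sketch: the exact line layout of sm-customer.log is not shown in this report, so the code assumes nothing more than that a transition line mentions the service name and both state names. The function name `find_disabling_transitions` is made up for illustration.

```python
from typing import Iterable, List

# Service name as it appears in this report.
SERVICE = "mgr-restful-plugin"


def find_disabling_transitions(lines: Iterable[str],
                               service: str = SERVICE) -> List[str]:
    """Return lines that mention the service and both states.

    Assumption: a transition line contains the service name plus the
    strings "enabled-active" and "disabling". No column layout is assumed.
    """
    hits = []
    for line in lines:
        if (service in line
                and "enabled-active" in line
                and "disabling" in line):
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, pass an open file handle for sm-customer.log (any iterable of lines works) and compare the timestamps of the returned lines to see how often the transition repeats.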
Expected Behavior
------------------
system is stable
Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes along with a number of other services
Reproducibility
---------------
Intermittent
System Configuration
-------
DC system controller
Lab-name: DC-4
Branch/Pull Time/Commit
-------
2020-06-20_20-00-00
Last Pass
---------
2020-06-18_20-00-00 - but the issue is likely intermittent, so the exact last pass is uncertain.
Timestamp/Logs
--------------
https:/
Here's what was done before the first restart:
# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81:
[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81:
[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
Authorization failed: Unable to establish connection to http://[fd01:88:
controller-0:~$
[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88:
+------
| Property | Value |
+------
| contact | None |
| created_at | 2020-06-
| description | None |
| distributed_
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-
| uuid | aa2c9682-
| vswitch_type | none |
+------
controller-0:~$
# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
| 2020-06-
Test Activity
-------------
Regression Testing, Developer Testing
Workaround
----------
Performing a controller swact appears to fix the issue
description: | updated |
tags: | added: stx.storage |
Changed in starlingx: | |
status: | New → Triaged |
assignee: | nobody → Stefan Dinescu (stefandinescu) |
importance: | Undecided → Medium |
tags: | added: stx.4.0 |
tags: | added: stx.5.0 removed: stx.4.0 |
tags: | added: stx.retestneeded |
Comments from Bart Wensley:
I don’t know anything about the mgr-restful-plugin, but it seems to be failing to connect to the ceph-mgr process on port 7999 when the issue is happening (from mgr-restful-plugin.log):
2020-06-22 20:12:27,194 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:12:27,446 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph auth get mgr.controller-0 -o /var/lib/ceph/mgr/ceph-controller-0/keyring
2020-06-22 20:12:27,714 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/server_port
2020-06-22 20:12:28,002 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:12:28,274 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/controller-0/crt
2020-06-22 20:12:28,563 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/key
2020-06-22 20:12:28,842 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get /mgr/restful/controller-0/key
2020-06-22 20:12:29,118 3728723 INFO mgr-restful-plugin Create restful plugin self signed certificate
2020-06-22 20:12:29,279 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,561 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,845 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,123 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,393 3728723 INFO mgr-restful-plugin Stop unmanaged running ceph-mgr processes
2020-06-22 20:12:30,491 3728723 INFO mgr-restful-plugin Start ceph-mgr daemon
2020-06-22 20:12:45,509 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr module ls --format json
2020-06-22 20:12:48,781 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/keys/admin
2020-06-22 20:12:49,057 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr services --format json
2020-06-22 20:12:49,320 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:13:28,726 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin REST API ping failure count=0
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:13:47,003 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin REST API ping failure count=1
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:15:22,719 3728723 WA...
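To quantify how often the plugin cycles, the log excerpt above can be reduced to a list of ceph-mgr start times and ping failure counts. The following is a minimal sketch, not part of the plugin. It assumes the line layout seen in the excerpt (timestamp, PID, level, `mgr-restful-plugin`, message) and treats each "Start ceph-mgr daemon" message as one restart, which is an interpretation of the excerpt rather than documented behavior. The names `summarize` and `intervals_seconds` are hypothetical.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Layout assumed from the excerpt: "<date> <time>,<ms> <pid> <LEVEL> mgr-restful-plugin <msg>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) mgr-restful-plugin (?P<msg>.*)$"
)
COUNT_RE = re.compile(r"REST API ping failure count=(\d+)")


def parse_ts(text: str) -> datetime:
    """Parse a timestamp such as '2020-06-22 20:13:28,726'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def summarize(lines: Iterable[str]) -> Tuple[List[datetime], List[Tuple[datetime, int]]]:
    """Return (daemon start times, [(time, ping failure count), ...])."""
    starts: List[datetime] = []
    failures: List[Tuple[datetime, int]] = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip truncated or unrelated lines
        ts = parse_ts(match.group("ts"))
        msg = match.group("msg")
        if msg == "Start ceph-mgr daemon":
            starts.append(ts)  # assumed to mark one restart of ceph-mgr
            continue
        count = COUNT_RE.search(msg)
        if count:
            failures.append((ts, int(count.group(1))))
    return starts, failures


def intervals_seconds(starts: List[datetime]) -> List[float]:
    """Seconds between consecutive daemon starts."""
    return [(b - a).total_seconds() for a, b in zip(starts, starts[1:])]
```

Feeding the full mgr-restful-plugin.log through `summarize` and then `intervals_seconds` should show whether the daemon starts are really spaced a few minutes apart, as the report describes.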