2020-06-23 03:45:49 |
Yang Liu |
bug |
|
|
added bug |
2020-06-23 03:53:44 |
Yang Liu |
description |
Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (subcloud bootstrap fails prematurely, or stays stuck at "bootstrapping" even after it has completed, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, the repeated restarts make the system unstable.
According to Gerry Kopec, the following services are likely affected:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
Severity
--------
Major
Steps to Reproduce
------------------
Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling with reason "audit failed" every few minutes.
The exact steps to reproduce are not known.
Automated regression was running when the issue started, but the events seem unrelated. See the details under Timestamp/Logs.
Expected Behavior
------------------
system is stable
Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes along with a number of other services
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC system controller
Lab-name: DC-4
Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00
Last Pass
---------
2020-06-18_20-00-00, but the issue is likely intermittent, so the exact last pass is uncertain.
Timestamp/Logs
--------------
Here's what was done before the first restart:
# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'
[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'
# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'
[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'
# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'
[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'
# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$
[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$
# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
Test Activity
-------------
Regression Testing, Developer Testing |
Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (subcloud bootstrap fails prematurely, or stays stuck at "bootstrapping" even after it has completed, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, the repeated restarts make the system unstable.
According to Gerry Kopec, the following services are likely affected:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
Severity
--------
Major
Steps to Reproduce
------------------
Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling with reason "audit failed" every few minutes.
The exact steps to reproduce are not known.
Automated regression was running when the issue started, but the events seem unrelated. See the details under Timestamp/Logs.
Expected Behavior
------------------
system is stable
Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes along with a number of other services
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC system controller
Lab-name: DC-4
Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00
Last Pass
---------
2020-06-18_20-00-00, but the issue is likely intermittent, so the exact last pass is uncertain.
Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1884704
Here's what was done before the first restart:
# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'
[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'
# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'
[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'
# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'
[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'
# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$
[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$
# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
Test Activity
-------------
Regression Testing, Developer Testing |
|
2020-06-23 11:43:52 |
Bart Wensley |
tags |
|
stx.storage |
|
2020-06-23 18:37:45 |
Ghada Khalil |
bug |
|
|
added subscriber Daniel Badea |
2020-06-23 18:38:00 |
Ghada Khalil |
starlingx: status |
New |
Triaged |
|
2020-06-23 18:38:13 |
Ghada Khalil |
starlingx: assignee |
|
Stefan Dinescu (stefandinescu) |
|
2020-06-23 18:38:20 |
Ghada Khalil |
starlingx: importance |
Undecided |
Medium |
|
2020-06-23 18:38:30 |
Ghada Khalil |
tags |
stx.storage |
stx.4.0 stx.storage |
|
2020-06-24 15:55:25 |
Frank Miller |
description |
Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (subcloud bootstrap fails prematurely, or stays stuck at "bootstrapping" even after it has completed, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, the repeated restarts make the system unstable.
According to Gerry Kopec, the following services are likely affected:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
Severity
--------
Major
Steps to Reproduce
------------------
Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling with reason "audit failed" every few minutes.
The exact steps to reproduce are not known.
Automated regression was running when the issue started, but the events seem unrelated. See the details under Timestamp/Logs.
Expected Behavior
------------------
system is stable
Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes along with a number of other services
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC system controller
Lab-name: DC-4
Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00
Last Pass
---------
2020-06-18_20-00-00, but the issue is likely intermittent, so the exact last pass is uncertain.
Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1884704
Here's what was done before the first restart:
# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'
[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'
# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'
[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'
# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'
[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'
# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$
[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$
# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
Test Activity
-------------
Regression Testing, Developer Testing |
Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (subcloud bootstrap fails prematurely, or stays stuck at "bootstrapping" even after it has completed, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, the repeated restarts make the system unstable.
According to Gerry Kopec, the following services are likely affected:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested
Severity
--------
Major
Steps to Reproduce
------------------
Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling with reason "audit failed" every few minutes.
The exact steps to reproduce are not known.
Automated regression was running when the issue started, but the events seem unrelated. See the details under Timestamp/Logs.
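The sm-customer.log check described above could be scripted roughly as follows. This is only a sketch, not part of the original report: the function names are hypothetical, and it assumes every entry uses the pipe-delimited layout shown in this report (timestamp, log id, type, service, old state, new state, reason). It collects the "audit failed" transitions for one service and computes the time between consecutive failures.

```python
from datetime import datetime


def audit_failures(lines, service="mgr-restful-plugin"):
    """Return timestamps of enabled-active -> disabling 'audit failed' events."""
    stamps = []
    for line in lines:
        # Drop the leading/trailing pipe, then split into trimmed fields.
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) < 7:
            continue  # not a state-change entry in the assumed layout
        ts, _log_id, kind, name, old, new, reason = fields[:7]
        if (kind == "service-scn" and name == service
                and old == "enabled-active" and new == "disabling"
                and reason == "audit failed"):
            stamps.append(datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f"))
    return stamps


def intervals_seconds(stamps):
    """Seconds elapsed between consecutive failure timestamps."""
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Sample entries copied from this report; in practice the lines would be
# read from sm-customer.log on the active controller.
sample = [
    "| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed",
    "| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed",
    "| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested",
]

failures = audit_failures(sample)
print(len(failures), intervals_seconds(failures))
```

Several failures only minutes apart for the same service would match the behavior reported here; dependent services that log "disable state requested" are deliberately ignored by the filter.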
Expected Behavior
------------------
system is stable
Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes along with a number of other services
Reproducibility
---------------
Intermittent
System Configuration
--------------------
DC system controller
Lab-name: DC-4
Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00
Last Pass
---------
2020-06-18_20-00-00, but the issue is likely intermittent, so the exact last pass is uncertain.
Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1884704
Here's what was done before the first restart:
# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'
[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'
# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'
[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'
# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'
[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'
# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$
[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$
# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
Test Activity
-------------
Regression Testing, Developer Testing
Workaround
----------
Performing a controller swact appears to fix the issue |
|
2020-06-24 17:46:45 |
Ghada Khalil |
removed subscriber Daniel Badea |
|
|
|
2020-06-24 17:47:04 |
Ghada Khalil |
tags |
stx.4.0 stx.storage |
stx.5.0 stx.storage |
|
2020-06-24 17:47:25 |
Ghada Khalil |
bug |
|
|
added subscriber Allain Legacy |
2020-06-28 01:25:16 |
Ghada Khalil |
tags |
stx.5.0 stx.storage |
stx.5.0 stx.retestneeded stx.storage |
|
2020-07-01 12:44:59 |
Stefan Dinescu |
marked as duplicate |
|
1885582 |
|