DC: mgr-restful-plugin restarting every few minutes, causing a number of other services to restart as well

Bug #1884704 reported by Yang Liu
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (the subcloud bootstrap fails prematurely, or remains stuck in the bootstrapping state even after it completes, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, this causes instability in the system.

The following services are likely affected, according to Gerry Kopec:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested

Severity
--------
Major

Steps to Reproduce
------------------

Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling due to an audit failure every few minutes.

Exact steps to reproduce are not known. An automated regression run was in progress when the issue started, but those events appear unrelated; see details in the Timestamp/Logs section.
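To spot these transitions quickly, the pipe-delimited service-scn entries in sm-customer.log (as shown earlier) can be scanned with a short script. This is an illustrative helper based on the log excerpts in this report, not a StarlingX tool, and the assumed field layout is inferred from those excerpts:

```python
# Count "audit failed" disabling transitions per service from
# sm-customer.log-style lines (pipe-delimited, as excerpted above).
# count_audit_failures() is a hypothetical helper, not part of StarlingX.
from collections import Counter

def count_audit_failures(lines):
    counts = Counter()
    for line in lines:
        # Assumed layout: | timestamp | seq | record-type | service |
        #                 | from-state | to-state | reason
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) >= 7 and fields[2] == "service-scn" and fields[6] == "audit failed":
            counts[fields[3]] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed",
        "| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested",
    ]
    print(count_audit_failures(sample))  # Counter({'mgr-restful-plugin': 1})
```

Run against the full log, a repeated count for mgr-restful-plugin with no matching counts for other services would point at it as the trigger rather than a victim.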

Expected Behavior
------------------
The system remains stable.

Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes, along with a number of other services.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC system controller
Lab-name: DC-4

Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00

Last Pass
---------
2020-06-18_20-00-00 - but this issue is likely intermittent, so the exact last pass is uncertain.

Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1884704

Here's what was done before the first restart:

# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'

[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'

# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'

[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'

# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'

[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'

# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$

[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$

# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed

Test Activity
-------------
Regression Testing, Developer Testing

Workaround
----------
Performing a controller swact appears to fix the issue.

Revision history for this message
Yang Liu (yliu12) wrote :

Comments from Bart Wensley:
I don’t know anything about the mgr-restful-plugin, but it seems to be failing to connect to the ceph-mgr process on port 7999 when the issue is happening (from mgr-restful-plugin.log):

2020-06-22 20:12:27,194 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:12:27,446 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph auth get mgr.controller-0 -o /var/lib/ceph/mgr/ceph-controller-0/keyring
2020-06-22 20:12:27,714 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/server_port
2020-06-22 20:12:28,002 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:12:28,274 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/controller-0/crt
2020-06-22 20:12:28,563 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/key
2020-06-22 20:12:28,842 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get /mgr/restful/controller-0/key
2020-06-22 20:12:29,118 3728723 INFO mgr-restful-plugin Create restful plugin self signed certificate
2020-06-22 20:12:29,279 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,561 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,845 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,123 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,393 3728723 INFO mgr-restful-plugin Stop unmanaged running ceph-mgr processes
2020-06-22 20:12:30,491 3728723 INFO mgr-restful-plugin Start ceph-mgr daemon
2020-06-22 20:12:45,509 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr module ls --format json
2020-06-22 20:12:48,781 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/keys/admin
2020-06-22 20:12:49,057 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr services --format json
2020-06-22 20:12:49,320 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:13:28,726 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin REST API ping failure count=0
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:13:47,003 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin REST API ping failure count=1
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:15:22,719 3728723 WA...
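The log above shows the plugin pinging the REST API with a 15-second timeout and tracking a consecutive-failure count ("REST API ping failure count=N"). A minimal sketch of that supervision pattern, with the ping call injected so the example stands alone; the class name and threshold are illustrative assumptions, not the actual mgr-restful-plugin implementation:

```python
# Sketch of a ping-with-failure-count supervisor, mirroring the
# "REST API ping failure count=N" pattern in the log above.
# RestPingMonitor and PING_FAILURE_THRESHOLD are illustrative names.

PING_FAILURE_THRESHOLD = 3  # assumed value, for illustration only

class RestPingMonitor:
    def __init__(self, ping, threshold=PING_FAILURE_THRESHOLD):
        self.ping = ping            # injected callable; True on success
        self.threshold = threshold
        self.failures = 0           # consecutive failures, reset on success

    def check(self):
        """Return 'ok', 'degraded', or 'restart' for one audit cycle."""
        if self.ping():
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0
            return "restart"  # a supervisor would restart ceph-mgr here
        return "degraded"
```

In the logged incident the pings keep timing out, so a monitor like this keeps reaching its restart action, which matches the repeated "audit failed" disables that SM records every few minutes.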


Yang Liu (yliu12)
description: updated
tags: added: stx.storage
Revision history for this message
Frank Miller (sensfan22) wrote :

Stefan took a look as well. Capturing his analysis here:

Initial analysis is that the ceph restful service (the internal ceph service that handles REST API requests) simply stopped responding to requests. I can't find any relevant logs to determine why the restful service would stop responding.

mgr-restful-plugin is just a python wrapper over the REST API. SM uses a GET request to the restful service to determine its status, so when the restful service stopped responding, SM also stopped mgr-restful-plugin. Since mgr-restful-plugin is a wrapper, its state mirrors that of the restful service inside ceph.

The ceph restful service runs on one controller (not always the active one), chosen automatically by ceph, and listens on port 7999. I checked with netstat on both controllers and saw that no other service currently uses this port. I can't determine whether that port was in use by another service at the time of the issue, but ceph-mgr didn't seem to complain about anything.
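The netstat check described above can also be done programmatically with a plain TCP connect; port 7999 comes from the analysis above, while the function name is an illustrative assumption:

```python
import socket

def port_in_use(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    A rough stand-in for scanning `netstat -tlnp` output: it only
    tells you the port is answering, not which process owns it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_in_use("controller-0", 7999) to confirm the restful
# service is still accepting connections during the audit failures.
```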

The first errors appeared at:
36517:2020-06-22 00:06:59.142 7f225704e700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 33982) UID: 0
36518:2020-06-22 00:06:59.142 7f225704e700 -1 mgr handle_signal *** Got signal Terminated ***

Looking through other logs the only thing note-worthy that seems to happen at this moment in time is disabling https:
2020-06-22T00:05:13.000 controller-0 -sh: info HISTORY: PID=372020 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --https_enabled="false"

Not sure if this is related to the issue or not, but it's the only thing of interest that I could find.

The system stabilized after this last error:
113745:2020-06-23 02:29:39.405 7fdcdc71d700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 1023405) UID: 0
113746:2020-06-23 02:29:39.405 7fdcdc71d700 -1 mgr handle_signal *** Got signal Terminated ***

There is no obvious reason why it stabilized after this.

I saw that timezone changes were done on that setup. Were such tests done before? This is just a theory, but sudden timezone changes might impact ceph in certain ways.

This seems like a weird ceph-internal issue, but the cause is unclear because no obvious error logs are present.

Revision history for this message
Frank Miller (sensfan22) wrote :

Follow-up questions:
1) Can the issue be reproduced?
2) Is the issue seen when https is disabled?
3) Is the issue seen when the timezone is changed?

Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
assignee: nobody → Stefan Dinescu (stefandinescu)
importance: Undecided → Medium
tags: added: stx.4.0
Revision history for this message
Frank Miller (sensfan22) wrote :

The trigger for this issue is:
- enable https on a system
- perform a controller swact
- disable https

As this is not a common scenario, and as the workaround appears to be performing an additional swact, this issue should not be tagged for stx.4.0.

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.5.0
removed: stx.4.0
Revision history for this message
Frank Miller (sensfan22) wrote :

Update -- The trigger for this issue appears to be a swact while the stx-monitor app is running. There seems to be at least one other factor, not yet known.

Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The root cause of this issue is calico using the floating IP instead of the unit IP. This is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1885582

Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

This does seem to be the same issue as #1885582.

In order to reproduce this issue, you need an IPv6 lab with Ceph configured and the stx-monitor app running on the system. The stx-monitor app makes periodic queries to the ceph REST API to gather various statistics. Since stx-monitor uses calico networking for those queries, on a swact the IP binding became invalid, resulting in various network connectivity issues.

Matt tested the calico fix on a setup where mgr-restful-plugin was having issues and said that after the fix, the issues were no longer observed.

Marking this LP as a duplicate.
