DC: mgr-restful-plugin restarting every few minutes, causing a number of other services to restart as well

Bug #1884704 reported by Yang Liu
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Stefan Dinescu

Bug Description

Brief Description
-----------------
A DC system was having stability issues when bootstrapping subclouds (the subcloud bootstrap fails prematurely, or remains stuck in the bootstrapping state even after it completes, etc.). It was then noticed that mgr-restful-plugin had been restarting every few minutes, which also causes a number of other services to restart. Even though those services recover quickly, this causes instability in the system.

The following services are likely affected, according to Gerry Kopec:
| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.553 | 12982 | service-scn | sysinv-conductor | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:57.554 | 12983 | service-scn | sysinv-inv | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.056 | 12984 | service-scn | dcorch-sysinv-api-proxy | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.057 | 12985 | service-scn | dcmanager-manager | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12986 | service-scn | dnsmasq | enabled-active | disabling | disable state requested
| 2020-06-22T20:16:58.058 | 12987 | service-scn | mtc-agent | enabled-active | disabling | disable state requested

Severity
--------
Major

Steps to Reproduce
------------------

Check sm-customer.log and observe that mgr-restful-plugin goes from enabled-active to disabling due to an audit failure every few minutes.

Exact steps to reproduce are not known. An automated regression run was in progress when the issue started, but those events appear unrelated; see details in the Timestamp/Logs section.
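To spot these transitions quickly, the pipe-delimited service-scn entries in sm-customer.log (as shown earlier) can be scanned with a short script. This is an illustrative helper based on the log excerpts in this report, not a StarlingX tool, and the assumed field layout is inferred from those excerpts:

```python
# Count "audit failed" disabling transitions per service from
# sm-customer.log-style lines (pipe-delimited, as excerpted above).
# count_audit_failures() is a hypothetical helper, not part of StarlingX.
from collections import Counter

def count_audit_failures(lines):
    counts = Counter()
    for line in lines:
        # Assumed layout: | timestamp | seq | record-type | service |
        #                 | from-state | to-state | reason
        fields = [f.strip() for f in line.strip().strip("|").split("|")]
        if len(fields) >= 7 and fields[2] == "service-scn" and fields[6] == "audit failed":
            counts[fields[3]] += 1
    return counts

if __name__ == "__main__":
    sample = [
        "| 2020-06-22T20:16:57.174 | 12980 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed",
        "| 2020-06-22T20:16:57.552 | 12981 | service-scn | ceph-manager | enabled-active | disabling | disable state requested",
    ]
    print(count_audit_failures(sample))  # Counter({'mgr-restful-plugin': 1})
```

Run against the full log, a repeated count for mgr-restful-plugin with no matching counts for other services would point at it as the trigger rather than a victim.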

Expected Behavior
------------------
The system remains stable.

Actual Behavior
----------------
mgr-restful-plugin restarts every few minutes, along with a number of other services.

Reproducibility
---------------
Intermittent

System Configuration
--------------------
DC system controller
Lab-name: DC-4

Branch/Pull Time/Commit
-----------------------
2020-06-20_20-00-00

Last Pass
---------
2020-06-18_20-00-00 - but this issue is likely intermittent, so the exact last pass is uncertain.

Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1884704

Here's what was done before the first restart:

# Modify timezone on system controller and subcloud7
[2020-06-22 14:04:15,729] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="Europe/Berlin"'

[2020-06-22 14:04:51,483] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="Canada/Central"'

# Time zone reverted
[2020-06-22 14:24:02,950] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --timezone="UTC"'

[2020-06-22 14:24:38,673] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 modify --timezone="UTC"'

# lock/unlock subcloud7 host
[2020-06-22 14:26:00,195] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-lock controller-0'

[2020-06-22 14:27:20,656] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 host-unlock controller-0'

# subcloud7 was just recovered at 14:35:
[2020-06-22 14:35:16,714] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
Authorization failed: Unable to establish connection to http://[fd01:88::2]:5000/v3/auth/tokens
controller-0:~$

[2020-06-22 14:35:27,685] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:88::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name subcloud7 show'
+------------------------+--------------------------------------+
| Property | Value |
+------------------------+--------------------------------------+
| contact | None |
| created_at | 2020-06-21T06:59:44.523373+00:00 |
| description | None |
| distributed_cloud_role | subcloud |
| https_enabled | False |
| location | None |
| name | dc-subcloud7 |
| region_name | subcloud7 |
| sdn_enabled | False |
| security_feature | spectre_meltdown_v1 |
| service_project_name | services |
| shared_services | ['identity', ] |
| software_version | 20.06 |
| system_mode | simplex |
| system_type | All-in-one |
| timezone | UTC |
| updated_at | 2020-06-22T14:24:40.061605+00:00 |
| uuid | aa2c9682-a3c0-4a41-9b4a-c303914c6a89 |
| vswitch_type | none |
+------------------------+--------------------------------------+
controller-0:~$

# There were no explicit operations on the system controller at 14:35, when the issue started.
| 2020-06-22T14:35:49.783 | 6069 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:37:05.359 | 6140 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:47:01.390 | 6222 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T14:56:56.845 | 6295 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:02:12.548 | 6367 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:04:48.142 | 6439 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:07:24.001 | 6510 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:08:39.621 | 6582 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed
| 2020-06-22T15:11:56.083 | 6658 | service-scn | mgr-restful-plugin | enabled-active | disabling | audit failed

Test Activity
-------------
Regression Testing, Developer Testing

Workaround
----------
Performing a controller swact appears to fix the issue.

Revision history for this message
Yang Liu (yliu12) wrote :

Comments from Bart Wensley:
I don’t know anything about the mgr-restful-plugin, but it seems to be failing to connect to the ceph-mgr process on port 7999 when the issue is happening (from mgr-restful-plugin.log):

2020-06-22 20:12:27,194 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:12:27,446 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph auth get mgr.controller-0 -o /var/lib/ceph/mgr/ceph-controller-0/keyring
2020-06-22 20:12:27,714 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/server_port
2020-06-22 20:12:28,002 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:12:28,274 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/controller-0/crt
2020-06-22 20:12:28,563 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/key
2020-06-22 20:12:28,842 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get /mgr/restful/controller-0/key
2020-06-22 20:12:29,118 3728723 INFO mgr-restful-plugin Create restful plugin self signed certificate
2020-06-22 20:12:29,279 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,561 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/crt -i /tmp/tmp4XO7c4/crt
2020-06-22 20:12:29,845 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set config/mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,123 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key set mgr/restful/controller-0/key -i /tmp/tmp4XO7c4/key
2020-06-22 20:12:30,393 3728723 INFO mgr-restful-plugin Stop unmanaged running ceph-mgr processes
2020-06-22 20:12:30,491 3728723 INFO mgr-restful-plugin Start ceph-mgr daemon
2020-06-22 20:12:45,509 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr module ls --format json
2020-06-22 20:12:48,781 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get mgr/restful/keys/admin
2020-06-22 20:12:49,057 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph mgr services --format json
2020-06-22 20:12:49,320 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph config-key get config/mgr/restful/controller-0/crt
2020-06-22 20:13:28,726 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin REST API ping failure count=0
2020-06-22 20:13:28,726 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:13:47,003 3728723 WARNING mgr-restful-plugin REST API ping failed: reason=HTTPSConnectionPool(host='controller-0', port=7999): Read timed out. (read timeout=15)
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin REST API ping failure count=1
2020-06-22 20:13:47,004 3728723 INFO mgr-restful-plugin Run command: /usr/bin/ceph fsid
2020-06-22 20:15:22,719 3728723 WA...
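The log above shows the plugin pinging the REST API with a 15-second timeout and tracking a consecutive-failure count ("REST API ping failure count=N"). A minimal sketch of that supervision pattern, with the ping call injected so the example stands alone; the class name and threshold are illustrative assumptions, not the actual mgr-restful-plugin implementation:

```python
# Sketch of a ping-with-failure-count supervisor, mirroring the
# "REST API ping failure count=N" pattern in the log above.
# RestPingMonitor and PING_FAILURE_THRESHOLD are illustrative names.

PING_FAILURE_THRESHOLD = 3  # assumed value, for illustration only

class RestPingMonitor:
    def __init__(self, ping, threshold=PING_FAILURE_THRESHOLD):
        self.ping = ping            # injected callable; True on success
        self.threshold = threshold
        self.failures = 0           # consecutive failures, reset on success

    def check(self):
        """Return 'ok', 'degraded', or 'restart' for one audit cycle."""
        if self.ping():
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= self.threshold:
            self.failures = 0
            return "restart"  # a supervisor would restart ceph-mgr here
        return "degraded"
```

In the logged incident the pings keep timing out, so a monitor like this keeps reaching its restart action, which matches the repeated "audit failed" disables that SM records every few minutes.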


Yang Liu (yliu12)
description: updated
tags: added: stx.storage
Revision history for this message
Frank Miller (sensfan22) wrote :

Stefan took a look as well. Capturing his analysis here:

Initial analysis is that the ceph restful service (the internal ceph service that handles REST API requests) simply stopped responding to requests. I can't find any relevant logs to determine why the restful service would stop responding.

mgr-restful-plugin is just a python wrapper over the REST API. SM uses a GET request to the restful service to determine its status, so when the restful service stopped responding, SM also stopped mgr-restful-plugin. Since mgr-restful-plugin is a wrapper, its state mirrors that of the restful service inside ceph.

The ceph restful service runs on one controller (not always the active one), chosen automatically by ceph, and listens on port 7999. I checked with netstat on both controllers and saw that no other service currently uses this port. I can't determine whether that port was in use by another service at the time of the issue, but ceph-mgr didn't seem to complain about anything.
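The netstat check described above can also be done programmatically with a plain TCP connect; port 7999 comes from the analysis above, while the function name is an illustrative assumption:

```python
import socket

def port_in_use(host, port, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    A rough stand-in for scanning `netstat -tlnp` output: it only
    tells you the port is answering, not which process owns it.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_in_use("controller-0", 7999) to confirm the restful
# service is still accepting connections during the audit failures.
```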

The first errors appeared at:
36517:2020-06-22 00:06:59.142 7f225704e700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 33982) UID: 0
36518:2020-06-22 00:06:59.142 7f225704e700 -1 mgr handle_signal *** Got signal Terminated ***

Looking through other logs the only thing note-worthy that seems to happen at this moment in time is disabling https:
2020-06-22T00:05:13.000 controller-0 -sh: info HISTORY: PID=372020 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --https_enabled="false"

Not sure if this is related to the issue or not, but it's the only thing of interest that I could find.

The system stabilized after this last error:
113745:2020-06-23 02:29:39.405 7fdcdc71d700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 1023405) UID: 0
113746:2020-06-23 02:29:39.405 7fdcdc71d700 -1 mgr handle_signal *** Got signal Terminated ***

There is no obvious reason why it stabilized after this.

I saw that timezone changes were done on that setup. Were such tests done before? This is just a theory, but sudden timezone changes might impact ceph in certain ways.

This seems like a weird ceph-internal issue, but the cause is unclear because no obvious error logs are present.

Revision history for this message
Frank Miller (sensfan22) wrote :

Follow-up questions:
1) Can the issue be reproduced?
2) Is the issue seen when https is disabled?
3) Is the issue seen when the timezone is changed?

Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
assignee: nobody → Stefan Dinescu (stefandinescu)
importance: Undecided → Medium
tags: added: stx.4.0
Revision history for this message
Frank Miller (sensfan22) wrote :

The trigger for this issue is:
- enable https on a system
- perform a controller swact
- disable https

As this is not a common scenario, and as the workaround appears to be performing an additional swact, this issue should not be tagged for stx.4.0.

description: updated
Ghada Khalil (gkhalil)
tags: added: stx.5.0
removed: stx.4.0
Revision history for this message
Frank Miller (sensfan22) wrote :

Update -- The trigger for this issue appears to be a swact while the stx-monitor app is running. There seems to be at least one other factor, not yet known.

Ghada Khalil (gkhalil)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

The root cause of this issue is calico using the floating IP instead of the unit IP. This is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1885582

Revision history for this message
Stefan Dinescu (stefandinescu) wrote :

This does seem to be the same issue as #1885582.

In order to reproduce this issue, you need an IPv6 lab with Ceph configured and the stx-monitor app running on the system. The stx-monitor app makes periodic queries to the ceph REST API to gather various statistics. Since stx-monitor uses calico networking for those queries, on a swact the IP binding became invalid, resulting in various network connectivity issues.

Matt tested the calico fix on a setup where mgr-restful-plugin was having issues and said that after the fix, the issues were no longer observed.

Marking this LP as a duplicate.
