Comment 2 for bug 1884704

Frank Miller (sensfan22) wrote :

Stefan took a look as well. Capturing his analysis here:

The initial analysis is that the ceph restful service (the internal ceph service that handles REST API requests) simply stopped responding to requests. I could not find any relevant logs that explain why it stopped responding.

mgr-restful-plugin is just a Python wrapper around the REST API. SM uses a GET request to the restful service to determine its status, so when the restful service stopped responding, SM also stopped mgr-restful-plugin. Since mgr-restful-plugin is a wrapper, its state mirrors that of the restful service inside ceph.
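For anyone trying to reproduce this, the kind of GET-based status check described above can be sketched roughly as follows. This is only an illustration, not the actual SM or mgr-restful-plugin code: the function name restful_is_responding, the URL passed to it and the timeout value are all assumptions. If the service is served over HTTPS with a self-signed certificate, an SSL context would also have to be passed to urlopen.

```python
import http.client
import urllib.request


def restful_is_responding(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200 within `timeout` seconds.

    Hypothetical helper: the real status check may use a different
    endpoint, timeout, and success criteria.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, socket timeouts and refused connections are
        # all OSError subclasses; any of them counts as "not responding".
        return False
```

A hung service (accepts the connection but never answers) and a dead one (connection refused) both come back as False here, which matches the behaviour described above: from the caller's point of view the service is simply not responding.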

The ceph restful service runs on one controller (not always the active one), which ceph selects automatically, and listens on port 7999. I checked with netstat on both controllers and saw that no other service is currently using this port. I don't think I can find out whether the port was used by another service at the time of the failure, but ceph-mgr didn't seem to complain about anything.
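As a complement to netstat, a quick check of whether anything is accepting connections on the port could look like the sketch below. The helper name port_is_listening and the host value are assumptions. Like netstat, it only shows the current state and cannot tell what held the port in the past.

```python
import socket


def port_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, etc.
        return False


if __name__ == "__main__":
    # 7999 is the restful port mentioned above; the host is a placeholder.
    print(port_is_listening("127.0.0.1", 7999))
```

Running this periodically on both controllers during a reproduction attempt would at least show which controller is serving the port at the moment the failure starts.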

The first errors appeared at:
36517:2020-06-22 00:06:59.142 7f225704e700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 33982) UID: 0
36518:2020-06-22 00:06:59.142 7f225704e700 -1 mgr handle_signal *** Got signal Terminated ***

Looking through the other logs, the only noteworthy thing that happened around this time was https being disabled:
2020-06-22T00:05:13.000 controller-0 -sh: info HISTORY: PID=372020 UID=42425 system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[fd01:81::2]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne modify --https_enabled="false"

Not sure if this is related to the issue or not, but it's the only thing of interest that I could find.

The system stabilized after this last error:
113745:2020-06-23 02:29:39.405 7fdcdc71d700 -1 received signal: Terminated from /usr/bin/python /etc/init.d/mgr-restful-plugin start (PID: 1023405) UID: 0
113746:2020-06-23 02:29:39.405 7fdcdc71d700 -1 mgr handle_signal *** Got signal Terminated ***

There is no obvious reason why it stabilized after this.
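To look at the pattern between the first and last occurrence, the "received signal" entries can be pulled out of the ceph-mgr log with something like the sketch below. It assumes the line format shown in the excerpts above, and that the leading number followed by a colon is a line-number prefix added by the search tool (for example grep -n). The names SIGNAL_RE and parse_signal_events are made up for this example.

```python
import re
from datetime import datetime

SIGNAL_RE = re.compile(
    r"^(?:\d+:)?"                                             # optional line-number prefix
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"    # timestamp
    r"\S+\s+-?\d+\s+"                                         # thread id and log level
    r"received signal: (?P<signal>\w+) from (?P<sender>.+?) "
    r"\(PID: (?P<pid>\d+)\) UID: (?P<uid>\d+)"
)


def parse_signal_events(lines):
    """Return one dict per 'received signal' log line, in input order."""
    events = []
    for line in lines:
        match = SIGNAL_RE.match(line.strip())
        if match is None:
            continue  # not a 'received signal' entry
        events.append({
            "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S.%f"),
            "signal": match.group("signal"),
            "sender": match.group("sender"),
            "pid": int(match.group("pid")),
        })
    return events
```

Feeding it the whole log and printing the time difference between consecutive events would show whether the restarts happened at a fixed interval or irregularly, which may help narrow down what was triggering them.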

I saw that timezone changes were made on that setup. Were such tests done before? This is just a theory, but sudden timezone changes might impact ceph in certain ways.

This seems like a weird internal ceph issue, but the cause is unclear because there are no obvious error logs.