Hosts couldn't be locked and the operation failed with a timeout error

Bug #1828863 reported by Gopinath
This bug affects 2 people
Affects     Status         Importance   Assigned to     Milestone
StarlingX   Fix Released   High         Daniel Badea

Bug Description

Brief Description
-----------------
Locking a controller or compute host fails with a timed-out error.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
1. After provisioning the platform
2. Lock the controller or compute (system host-lock <controller>)

Expected Behavior
------------------
- system host-lock <controller> should lock the controller or compute

Actual Behavior
----------------
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock controller-1
timed out
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-0    | worker      | unlocked       | enabled     | available    |
| 3  | compute-1    | worker      | unlocked       | enabled     | available    |
| 4  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

Reproducibility
---------------
The issue is 100% reproducible

System Configuration
--------------------
Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
bootimage.iso 2019-May-04 00:37:20 1.8G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190504T013000Z/outputs/iso/
Helm chart: stx-openstack-1.0-11-centos-stable-latest.tgz

Last Pass
---------
This issue was not observed with an earlier ISO:
bootimage.iso 2019-Apr-28 00:19:54 1.9G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/iso/

Timestamp/Logs
--------------
Attached dmesg and sysinv logs.

Revision history for this message
Gopinath (gprabakx) wrote :
Revision history for this message
Gopinath (gprabakx) wrote :

dmesg log

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; locks are consistently failing according to the reporter.

tags: added: stx.2.0 stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Eric MacDonald to triage.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The dmesg and sysinv logs are not enough to debug this issue.
Was there a 'collect' run, and if so, where can I find that tarball?

Just an FYI: maintenance is designed to not fail a lock.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Please include the following information:
 - a tarball of /var/log on the active controller
 - the time the failed lock was attempted

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :

It's possible this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1828729.
The requested data is required to confirm.

Revision history for this message
Gopinath (gprabakx) wrote :

@<email address hidden>
I have attached a tarball of /var/log.
Please find the attachment.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

You attached the logs for the controller being locked (controller-1), but I was expecting the logs for the active controller, controller-0, because that is where the management processes that perform the lock run.

However, I looked at bash.log on C1 and it seems you also tried to lock C0 from C1, so I'll debug that case.

2019-05-16T13:46:03.000 controller-1 -sh: info HISTORY: PID=2575133 UID=1875 system host-lock controller-0

The mtcAgent log shows that it did not receive the lock request, so it must have been rejected by sysinv, which has the following logs:

2019-05-16 13:45:27.183 1004430 INFO sysinv.api.controllers.v1.host [-] controller-1 1. delta_handle ['action']
2019-05-16 13:45:27.275 1004430 WARNING wsme.api [-] Client-side error: controller-0 target Config 5f7ee8b0-0e97-4a40-8cd5-3cddab0814cd not yet applied. Apply target Config via Lock/Unlock prior to Swact
2019-05-16 13:46:05.929 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-05-16-13-46-05 patch
2019-05-16 13:46:05.929 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 1. delta_handle ['action']
2019-05-16 13:46:05.930 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock
2019-05-16 13:46:05.930 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock_controller
2019-05-16 13:46:06.085 1004431 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0', u'controller-1', u'storage-0']

Still investigating, consulting with peers.

Revision history for this message
Gopinath (gprabakx) wrote :

Controller-1 was the active controller and controller-0 was in config-out-of-date. To change the status of controller-0 from config-out-of-date to None we need to lock and unlock it, and this is where we cannot lock controller-0.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

OK, so I'm on the right path.

What threw me off was that your bug report above indicates a timeout on 'system host-lock controller-1' as the actual behavior.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I compared the sysinv log stream to that issue and they both stop after this:

2019-05-12 12:37:32.174 3421085 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost check_lock
2019-05-12 12:37:32.175 3421085 INFO sysinv.api.controllers.v1.host [-] compute-1 host check_lock_worker
2019-05-12 12:37:32.226 3421085 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0', u'controller-1', u'compute-1']

After monitoring the logs in a working case I'm confident that the two issues have the same root cause and are therefore duplicates.

Dups:
https://bugs.launchpad.net/starlingx/+bug/1828863
https://bugs.launchpad.net/starlingx/+bug/1828729

Revision history for this message
Daniel Badea (daniel.badea) wrote :

I looked at the status of the Ceph REST API and it looks like it's down:

1. ceph-mgr (up for 07:09:55 hours)
   - standby until 09:53:27, when it becomes active
       2019-05-16 09:53:27.601 7f3f9b386700 1 mgr handle_mgr_map I am now activating
       2019-05-16 09:53:27.774 7f3f88a45700 1 mgr load Constructed class from module: restful
   - starts reporting "Too many open files"
       2019-05-16 09:53:30.823 7f3f9fb8f700 -1 Processor -- accept open file descriptions limit reached sd = 28 errno -24 (24) Too many open files.
   - continues to report "Too many open files" until 10:08:30.708
       2019-05-16 09:53:38.042 7f3f97b7f700 1 mgr send_beacon active
       2019-05-16 09:53:38.046 7f3f97b7f700 -1 os_release_parse - failed to open /etc/os-release: (24) Too many open files
       ...
       2019-05-16 10:08:30.708 7f3f97b7f700 1 mgr send_beacon active

2. mgr-restful-plugin (up for 19:04:44 hours)
   - active until 07:46:14 when ceph cluster stops responding
   - restarts ceph-mgr and restful plugin at 07:47:08 (REST API is ok)
   - REST API stops responding at 09:53:43 and doesn't recover
       2019-05-16 09:53:43,151 94824 INFO mgr-restful-plugin restful plugin not found. services={}
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin setup service monitor
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin setup signal handler
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin take service lock

3. SM gets mgr-restful-plugin status failure and tries to restart the service
     | 2019-05-16T09:54:33.958 | 764 | service-scn | mgr-restful-plugin | enabled-active | disabled | audit state mismatch

The init script used to restart mgr-restful-plugin finds the service already running and leaves it alone, waiting for it to recover on its own (sketched below).
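
For illustration, a minimal Python sketch of that "already running, leave it alone" behavior; this is not the real mgr-restful-plugin init script, and the pidfile path and function names are assumptions:

    import os

    PIDFILE = "/var/run/mgr-restful-plugin.pid"  # assumed path, illustration only

    def is_running(pidfile=PIDFILE):
        """Return True if the pidfile points at a live process."""
        try:
            with open(pidfile) as f:
                pid = int(f.read().strip())
            os.kill(pid, 0)  # signal 0: existence check, no signal delivered
            return True
        except (IOError, ValueError, OSError):
            return False

    def start():
        if is_running():
            # Service already running: report success and leave it alone,
            # even if the REST API it manages is not actually responding.
            return 0
        # ... actual daemon startup would happen here ...
        return 0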

Steps mgr-restful-plugin takes to recover (based on source code):

  ceph mgr module enable restful
  ceph config set mgr mgr/restful/server_port 5001
  ceph restful create-key admin
  ceph restful create-self-signed-cert

  consecutive_restful_plugin_restarts=0
  (incremented only if restful is configured but API does not respond)

  ceph mgr services --format json
  -> "restful plugin not found"

The issue here is that the ceph-mgr service is started and encounters a temporary "too many open files" condition, but doesn't fully recover after that. Attempts to enable the restful plugin after mgr-restful-plugin finds it missing fail silently.

A possible recovery from this would be to restart the ceph-mgr process.

Now that we know the REST API was not working at the time of the host lock, it is possible that a semantic check calls a REST API and gets stuck forever waiting for a response (the restful plugin is configured but ceph-mgr does not report it as available). A fix for this problem is to ensure all Ceph REST API calls in sysinv around host lock time out after a grace period.
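
As a rough sketch of that fix direction (not the actual sysinv change; ceph_api, the function name, and the 15-second grace period are illustrative assumptions), the quorum query could be bounded by a hard wall-clock deadline:

    import concurrent.futures

    def quorum_names_with_deadline(ceph_api, deadline=15):
        """Return quorum names, or [] if the call exceeds `deadline` seconds."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(ceph_api.quorum_status, body='json', timeout=10)
        try:
            response, body = future.result(timeout=deadline)
            return body['output']['quorum_names']
        except concurrent.futures.TimeoutError:
            # Cluster unresponsive: report "no quorum" so the semantic
            # check can fail fast instead of blocking the host lock.
            return []
        except Exception:
            return []
        finally:
            # Do not wait for a worker thread that may be stuck in the
            # hung call; it is leaked until that call eventually returns.
            pool.shutdown(wait=False)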

Revision history for this message
Bob Church (rchurch) wrote :

Focusing on the following from Daniel's comment:

"accept open file descriptions limit reached sd = 28 errno -24 (24) Too many open files."

The following issue, https://tracker.ceph.com/issues/36219, should change the behavior of ceph-mgr to abort the process after encountering a fixed number of these failures. This should allow for easier recovery.

The Ceph v.13.2.3 commit for this change is https://github.com/ceph/ceph/commit/8b9898d01efb8ead8c903c458318eda61bc9b7b8

Revision history for this message
Daniel Badea (daniel.badea) wrote :

Added support for restarting ceph-mgr when too many consecutive failures are reported: https://review.opendev.org/#/c/659820/

(when ceph-mgr fails too many times, mgr-restful-plugin stops together with ceph-mgr and is restarted by SM)
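
A minimal sketch of that policy (illustrative only; the threshold value, names, and structure are assumptions, not the code in the review):

    RESTART_THRESHOLD = 3  # assumed value, for illustration

    class RestfulPluginMonitor(object):
        def __init__(self, threshold=RESTART_THRESHOLD):
            self.threshold = threshold
            self.consecutive_failures = 0

        def on_probe(self, restful_api_ok, restart_ceph_mgr):
            """Call after each 'ceph mgr services' health probe."""
            if restful_api_ok:
                self.consecutive_failures = 0
                return
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                # Too many consecutive failures: restart ceph-mgr and reset
                # the counter (SM restarts mgr-restful-plugin with it, as
                # described above).
                restart_ceph_mgr()
                self.consecutive_failures = 0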

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote : RE: [Bug 1828863] Re: Hosts couldn't be locked and the operation failed with a timeout error

Hi Daniel,

Do you think these two issues have the same root cause, or are they simply both storage related, with or without a similar failure mode?

https://bugs.launchpad.net/starlingx/+bug/1828863
https://bugs.launchpad.net/starlingx/+bug/1828729

Eric.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

I can't find any logs attached to https://bugs.launchpad.net/starlingx/+bug/1828729, so it's impossible to say what the state of the Ceph REST API was.

Since we're seeing a similar pattern in the sysinv log, I assume it's stuck at the same point.

Zooming in on that area of code:

./sysinv/sysinv/sysinv/sysinv/common/ceph.py

        LOG.info("Active ceph monitors in inventory = %s" % str(inventory_monitor_names))

        # check that the cluster is actually operational.
        # if we can get the monitor quorum from ceph, then
        # the cluster is truly operational
        if num_inv_monitors >= required_monitors:
            try:
                quorum_names = self._osd_quorum_names()
            except Exception:
                # if the cluster is not responding to requests
                # we set the quorum_names to an empty list , indicating a problem
                quorum_names = []
                LOG.error("Ceph cluster not responding to requests.")

        LOG.info("Active ceph monitors in ceph cluster = %s" % str(quorum_names))

You can see there's a call to self._osd_quorum_names() guarded by two LOG.info() calls. One of them is present in the logs ("Active ceph monitors in inventory") while the other is not ("Active ceph monitors in ceph cluster", or in case of an error, "Ceph cluster not responding to requests"), which means sysinv is stuck in the REST API call:

    def _osd_quorum_names(self, timeout=10):
        quorum_names = []
        try:
            response, body = self._ceph_api.quorum_status(body='json',
                                                          timeout=timeout)
            quorum_names = body['output']['quorum_names']
        except Exception as ex:
            LOG.exception(ex)
            return quorum_names

        return quorum_names

It's interesting that the API call has a timeout of 10 seconds but does not return. The "timeout" parameter is passed on to the requests library in python-cephclient/cephclient/client.py:

    def _r...
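
For illustration only (this is not the truncated python-cephclient code; the function and URL are placeholders): in the requests library the timeout value bounds the TCP connect and each individual socket read, not the total wall-clock duration of the call, so a request can still outlive the nominal 10 seconds if data keeps arriving slowly or retries happen above this layer.

    import requests

    def quorum_status(url, timeout=10):
        # timeout=(connect, read): each value limits one socket operation,
        # so this is not a hard cap on how long the whole call can take.
        resp = requests.get(url, timeout=(timeout, timeout))
        resp.raise_for_status()
        return resp.json()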

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Daniel Badea since he appears to have linked this bug to this review:
https://review.opendev.org/#/c/659820/

Changed in starlingx:
status: Incomplete → In Progress
assignee: Eric MacDonald (rocksolidmtce) → Daniel Badea (daniel.badea)
Ghada Khalil (gkhalil)
tags: added: stx.storage
removed: stx.metal
Revision history for this message
Daniel Badea (daniel.badea) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

After a discussion with Daniel, the approach for this LP is to fix the initial issue, which is that ceph-mgr hits the "Too many open files" condition when starting up and then becomes unresponsive. This will be addressed by this stx-ceph pull request: https://github.com/starlingx-staging/stx-ceph/pull/32

Daniel is also working to determine why ceph did not detect that ceph-mgr was not fully initialized. The mgr-restful-plugin process is designed to monitor ceph-mgr but needs to be enhanced to detect when it is not providing service and restart the ceph-mgr process. But this work is being tracked under a different LP: https://bugs.launchpad.net/starlingx/+bug/1828024

As such, the commit message for Gerrit review https://review.opendev.org/#/c/659820/ was updated to remove LP 1828863 and only reference LP 1828024.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

Tested with a custom build on a VBox dedicated storage setup (2+2+2):
1. create volume from raw and qcow2 images
2. start instances from volumes
3. lock/unlock controller-1
4. lock/unlock compute-0
5. instances migrate to compute-1
6. lock/unlock compute-1
7. instances migrate to compute-0
8. stop instances
9. delete volumes

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Pull request https://github.com/starlingx-staging/stx-ceph/pull/32 was merged on May 22, 2019. It will be available in the CENGN builds from May 23.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Gopinath (gprabakx) wrote :