Hosts couldn't be locked and the operation failed with a timeout error

Bug #1828863 reported by Gopinath
This bug affects 2 people
Affects     Status         Importance   Assigned to     Milestone
StarlingX   Fix Released   High         Daniel Badea

Bug Description

Brief Description
-----------------
Locking a controller or compute host fails with a timed-out error.

Severity
--------
Critical: System/Feature is not usable due to the defect

Steps to Reproduce
------------------
1. After provisioning the platform
2. Lock the controller or compute (system host-lock <controller>)

Expected Behavior
------------------
- system host-lock <controller> should lock the controller or compute

Actual Behavior
----------------
[wrsroot@controller-0 ~(keystone_admin)]$ system host-lock controller-1
timed out
[wrsroot@controller-0 ~(keystone_admin)]$ system host-list
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname     | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1  | controller-0 | controller  | unlocked       | enabled     | available    |
| 2  | compute-0    | worker      | unlocked       | enabled     | available    |
| 3  | compute-1    | worker      | unlocked       | enabled     | available    |
| 4  | controller-1 | controller  | unlocked       | enabled     | available    |
+----+--------------+-------------+----------------+-------------+--------------+

Reproducibility
---------------
The issue is 100% reproducible

System Configuration
--------------------
Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
bootimage.iso 2019-May-04 00:37:20 1.8G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190504T013000Z/outputs/iso/
Helm chart: stx-openstack-1.0-11-centos-stable-latest.tgz

Last Pass
---------
This issue was not observed with an earlier ISO:
bootimage.iso 2019-Apr-28 00:19:54 1.9G application/x-iso9660-image
ISO:
http://mirror.starlingx.cengn.ca/mirror/starlingx/master/centos/20190428T013000Z/outputs/iso/

Timestamp/Logs
--------------
Attached dmesg and sysinv logs.

Revision history for this message
Gopinath (gprabakx) wrote :
Revision history for this message
Gopinath (gprabakx) wrote :

dmesg log

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as release gating; locks are consistently failing according to the reporter.

tags: added: stx.2.0 stx.metal
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Eric MacDonald to triage.

Changed in starlingx:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Eric MacDonald (rocksolidmtce)
Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

The dmesg and sysinv logs are not enough to debug this issue.
Was there a 'collect' run, and if so, where can I find that tarball?

Just an FYI: maintenance is designed to not fail a lock.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

Please include the following information:
 - a tarball of /var/log on the active controller
 - the time the failed lock was attempted

Changed in starlingx:
status: Triaged → Incomplete
Revision history for this message
Ghada Khalil (gkhalil) wrote :

It's possible this is a duplicate of https://bugs.launchpad.net/starlingx/+bug/1828729.
The requested data is required to confirm.

Revision history for this message
Gopinath (gprabakx) wrote :

@<email address hidden>
I have attached a tarball of /var/log.
Please find the attachment.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

You attached the logs for the controller being locked (controller-1), but I was expecting the logs for the active controller, controller-0, because that is where the management processes that perform the lock run.

However, I looked at bash.log on C1 and it seems you also tried to lock C0 from C1, so I'll debug that case.

2019-05-16T13:46:03.000 controller-1 -sh: info HISTORY: PID=2575133 UID=1875 system host-lock controller-0

The mtcAgent log shows that it did not receive the lock request, so it must have been rejected by sysinv, which has the following logs:

2019-05-16 13:45:27.183 1004430 INFO sysinv.api.controllers.v1.host [-] controller-1 1. delta_handle ['action']
2019-05-16 13:45:27.275 1004430 WARNING wsme.api [-] Client-side error: controller-0 target Config 5f7ee8b0-0e97-4a40-8cd5-3cddab0814cd not yet applied. Apply target Config via Lock/Unlock prior to Swact
2019-05-16 13:46:05.929 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost_patch_start_2019-05-16-13-46-05 patch
2019-05-16 13:46:05.929 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 1. delta_handle ['action']
2019-05-16 13:46:05.930 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock
2019-05-16 13:46:05.930 1004431 INFO sysinv.api.controllers.v1.host [-] controller-0 ihost check_lock_controller
2019-05-16 13:46:06.085 1004431 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0', u'controller-1', u'storage-0']

Still investigating, consulting with peers.

Revision history for this message
Gopinath (gprabakx) wrote :

Controller-1 was the active controller and controller-0 was in config-out-of-date. To change the status of controller-0 from config-out-of-date to None we need to lock and unlock it, and this is where we cannot lock controller-0.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

OK, so I'm on the right path.

What threw me off was that your bug report above indicates a timeout on 'system host-lock controller-1' as the actual behavior.

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote :

I compared the sysinv log stream to that issue and they both stop after this:

2019-05-12 12:37:32.174 3421085 INFO sysinv.api.controllers.v1.host [-] compute-1 ihost check_lock
2019-05-12 12:37:32.175 3421085 INFO sysinv.api.controllers.v1.host [-] compute-1 host check_lock_worker
2019-05-12 12:37:32.226 3421085 INFO sysinv.common.ceph [-] Active ceph monitors in inventory = [u'controller-0', u'controller-1', u'compute-1']

After monitoring the logs in a working case I'm confident that the two issues have the same root cause and are therefore duplicates.

Dups:
https://bugs.launchpad.net/starlingx/+bug/1828863
https://bugs.launchpad.net/starlingx/+bug/1828729

Revision history for this message
Daniel Badea (daniel.badea) wrote :

I looked at the status of the Ceph REST API and it looks like it's down:

1. ceph-mgr (up for 07:09:55 hours)
   - standby until 09:53:27, when it becomes active
       2019-05-16 09:53:27.601 7f3f9b386700 1 mgr handle_mgr_map I am now activating
       2019-05-16 09:53:27.774 7f3f88a45700 1 mgr load Constructed class from module: restful
   - starts reporting "Too many open files"
       2019-05-16 09:53:30.823 7f3f9fb8f700 -1 Processor -- accept open file descriptions limit reached sd = 28 errno -24 (24) Too many open files.
   - continues to report "Too many open files" until 10:08:30.708
       2019-05-16 09:53:38.042 7f3f97b7f700 1 mgr send_beacon active
       2019-05-16 09:53:38.046 7f3f97b7f700 -1 os_release_parse - failed to open /etc/os-release: (24) Too many open files
       ...
       2019-05-16 10:08:30.708 7f3f97b7f700 1 mgr send_beacon active

2. mgr-restful-plugin (up for 19:04:44 hours)
   - active until 07:46:14 when ceph cluster stops responding
   - restarts ceph-mgr and restful plugin at 07:47:08 (REST API is ok)
   - REST API stops responding at 09:53:43 and doesn't recover
       2019-05-16 09:53:43,151 94824 INFO mgr-restful-plugin restful plugin not found. services={}
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin setup service monitor
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin setup signal handler
       2019-05-16 09:53:43,463 1005248 INFO mgr-restful-plugin take service lock

3. SM gets mgr-restful-plugin status failure and tries to restart the service
     | 2019-05-16T09:54:33.958 | 764 | service-scn | mgr-restful-plugin | enabled-active | disabled | audit state mismatch

The init script used to restart mgr-restful-plugin finds the service already running and leaves it alone, waiting for it to recover on its own (sketched below).
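
For illustration, a minimal Python sketch of that "already running, leave it alone" behavior; this is not the real mgr-restful-plugin init script, and the pidfile path and function names are assumptions:

    import os

    PIDFILE = "/var/run/mgr-restful-plugin.pid"  # assumed path, illustration only

    def is_running(pidfile=PIDFILE):
        """Return True if the pidfile points at a live process."""
        try:
            with open(pidfile) as f:
                pid = int(f.read().strip())
            os.kill(pid, 0)  # signal 0: existence check, no signal delivered
            return True
        except (IOError, ValueError, OSError):
            return False

    def start():
        if is_running():
            # Service already running: report success and leave it alone,
            # even if the REST API it manages is not actually responding.
            return 0
        # ... actual daemon startup would happen here ...
        return 0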

Steps mgr-restful-plugin takes to recover (based on source code):

  ceph mgr module enable restful
  ceph config set mgr mgr/restful/server_port 5001
  ceph restful create-key admin
  ceph restful create-self-signed-cert

  consecutive_restful_plugin_restarts=0
  (incremented only if restful is configured but API does not respond)

  ceph mgr services --format json
  -> "restful plugin not found"

The issue here is that the ceph-mgr service is started and encounters a temporary "too many open files" condition, but doesn't fully recover after that. Attempts to enable the restful plugin after mgr-restful-plugin finds it missing fail silently.

A possible recovery from this would be to restart the ceph-mgr process.

Now that we know the REST API was not working at the time of the host lock, it is possible that a semantic check calls a REST API and gets stuck forever waiting for a response (the restful plugin is configured but ceph-mgr does not report it as available). A fix for this problem is to ensure all Ceph REST API calls in sysinv around host lock time out after a grace period.
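
As a rough sketch of that fix direction (not the actual sysinv change; ceph_api, the function name, and the 15-second grace period are illustrative assumptions), the quorum query could be bounded by a hard wall-clock deadline:

    import concurrent.futures

    def quorum_names_with_deadline(ceph_api, deadline=15):
        """Return quorum names, or [] if the call exceeds `deadline` seconds."""
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(ceph_api.quorum_status, body='json', timeout=10)
        try:
            response, body = future.result(timeout=deadline)
            return body['output']['quorum_names']
        except concurrent.futures.TimeoutError:
            # Cluster unresponsive: report "no quorum" so the semantic
            # check can fail fast instead of blocking the host lock.
            return []
        except Exception:
            return []
        finally:
            # Do not wait for a worker thread that may be stuck in the
            # hung call; it is leaked until that call eventually returns.
            pool.shutdown(wait=False)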

Revision history for this message
Bob Church (rchurch) wrote :

Focusing on the following from Daniel's comment:

"accept open file descriptions limit reached sd = 28 errno -24 (24) Too many open files."

The following issue, https://tracker.ceph.com/issues/36219, should change the behavior of ceph-mgr to abort the process after encountering a fixed number of these failures. This should allow for easier recovery.

The Ceph v.13.2.3 commit for this change is https://github.com/ceph/ceph/commit/8b9898d01efb8ead8c903c458318eda61bc9b7b8

Revision history for this message
Daniel Badea (daniel.badea) wrote :

Added support for restarting ceph-mgr when too many consecutive failures are reported: https://review.opendev.org/#/c/659820/

(when ceph-mgr fails too many times, mgr-restful-plugin stops together with ceph-mgr and is restarted by SM)
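
A minimal sketch of that policy (illustrative only; the threshold value, names, and structure are assumptions, not the code in the review):

    RESTART_THRESHOLD = 3  # assumed value, for illustration

    class RestfulPluginMonitor(object):
        def __init__(self, threshold=RESTART_THRESHOLD):
            self.threshold = threshold
            self.consecutive_failures = 0

        def on_probe(self, restful_api_ok, restart_ceph_mgr):
            """Call after each 'ceph mgr services' health probe."""
            if restful_api_ok:
                self.consecutive_failures = 0
                return
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                # Too many consecutive failures: restart ceph-mgr and reset
                # the counter (SM restarts mgr-restful-plugin with it, as
                # described above).
                restart_ceph_mgr()
                self.consecutive_failures = 0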

Revision history for this message
Eric MacDonald (rocksolidmtce) wrote : RE: [Bug 1828863] Re: Hosts couldn't be locked and the operation failed with a timeout error

Hi Daniel,

Do you think these two issues have the same root cause, or are they simply both storage related, with or without a similar failure mode?

https://bugs.launchpad.net/starlingx/+bug/1828863
https://bugs.launchpad.net/starlingx/+bug/1828729

Eric.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

I can't find any logs attached to https://bugs.launchpad.net/starlingx/+bug/1828729, so it's impossible to say what the state of the Ceph REST API was.

Since we're seeing a similar pattern in the sysinv log, I assume it's stuck at the same point.

Zooming in on that area of code:

./sysinv/sysinv/sysinv/sysinv/common/ceph.py

        LOG.info("Active ceph monitors in inventory = %s" % str(inventory_monitor_names))

        # check that the cluster is actually operational.
        # if we can get the monitor quorum from ceph, then
        # the cluster is truly operational
        if num_inv_monitors >= required_monitors:
            try:
                quorum_names = self._osd_quorum_names()
            except Exception:
                # if the cluster is not responding to requests
                # we set the quorum_names to an empty list , indicating a problem
                quorum_names = []
                LOG.error("Ceph cluster not responding to requests.")

        LOG.info("Active ceph monitors in ceph cluster = %s" % str(quorum_names))

You can see there's a call to self._osd_quorum_names() guarded by two LOG.info() calls. One of them is present in the logs ("Active ceph monitors in inventory") while the other is not ("Active ceph monitors in ceph cluster", or in case of an error, "Ceph cluster not responding to requests"), which means sysinv is stuck in the REST API call:

    def _osd_quorum_names(self, timeout=10):
        quorum_names = []
        try:
            response, body = self._ceph_api.quorum_status(body='json',
                                                          timeout=timeout)
            quorum_names = body['output']['quorum_names']
        except Exception as ex:
            LOG.exception(ex)
            return quorum_names

        return quorum_names

It's interesting that the API call has a timeout of 10 seconds but does not return. The "timeout" parameter is passed on to the requests library in python-cephclient/cephclient/client.py:

    def _r...
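
For illustration only (this is not the truncated python-cephclient code; the function and URL are placeholders): in the requests library the timeout value bounds the TCP connect and each individual socket read, not the total wall-clock duration of the call, so a request can still outlive the nominal 10 seconds if data keeps arriving slowly or retries happen above this layer.

    import requests

    def quorum_status(url, timeout=10):
        # timeout=(connect, read): each value limits one socket operation,
        # so this is not a hard cap on how long the whole call can take.
        resp = requests.get(url, timeout=(timeout, timeout))
        resp.raise_for_status()
        return resp.json()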

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Assigning to Daniel Badea since he appears to have linked this bug to this review:
https://review.opendev.org/#/c/659820/

Changed in starlingx:
status: Incomplete → In Progress
assignee: Eric MacDonald (rocksolidmtce) → Daniel Badea (daniel.badea)
Ghada Khalil (gkhalil)
tags: added: stx.storage
removed: stx.metal
Revision history for this message
Daniel Badea (daniel.badea) wrote :
Revision history for this message
Frank Miller (sensfan22) wrote :

After a discussion with Daniel, the approach for this LP is to fix the initial issue, which is that ceph-mgr hits the "Too many open files" condition when starting up and then becomes unresponsive. This will be addressed by this stx-ceph pull request: https://github.com/starlingx-staging/stx-ceph/pull/32

Daniel is also working to determine why ceph did not detect that ceph-mgr was not fully initialized. The mgr-restful-plugin process is designed to monitor ceph-mgr but needs to be enhanced to detect when it is not providing service and restart the ceph-mgr process. But this work is being tracked under a different LP: https://bugs.launchpad.net/starlingx/+bug/1828024

As such, the commit message for Gerrit review https://review.opendev.org/#/c/659820/ was updated to remove LP 1828863 and only reference LP 1828024.

Revision history for this message
Daniel Badea (daniel.badea) wrote :

Tested with a custom build on a VBox dedicated storage setup (2+2+2):
1. create volume from raw and qcow2 images
2. start instances from volumes
3. lock/unlock controller-1
4. lock/unlock compute-0
5. instances migrate to compute-1
6. lock/unlock compute-1
7. instances migrate to compute-0
8. stop instances
9. delete volumes

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Pull request https://github.com/starlingx-staging/stx-ceph/pull/32 was merged on May 22, 2019. It will be available in the CENGN builds from May 23.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Gopinath (gprabakx) wrote :