oslo.service ProcessLauncher fails to call stop()

Bug #1773070 reported by jun923.gu on 2018-05-24
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
oslo.messaging
Undecided
Unassigned
oslo.service
Undecided
jun923.gu

Bug Description

Version:OpenStack pike version
Cinder: 11.0.0-1.el7
oslo-service: 1.25.1-1.el7
oslo-messaging: 5-30.2-1.el7

    When I startup openstack-cinder-volume serivce, I found that cinder-volume serivce is running, but it can't report service status to db. So I run cinder service-list command, the cinder-volume service is down.
    Next, I print the stack information of cinder-volume process and found that cinder-volume block at the _wait funciton of _OrderedTask class.
    After analyzing the log,I found that when startup cinder-volume service, cinder/service.py:Service:start function raise a exception( Can't connected to db ).According to the log, I can reproduce the problem:
    1. Add a raise code after self.manager.ini_host_with_rpc() in cinder/service.py:Service:start function.
    2. restart the cinder-volume service.

    After analyzing, the cause of the problem is that wait function is called and stop funtion is not called in the oslo_messaing/server.py:MessageHandlingServer class. wait function need to wait stop function to become COMPLETE status, but the stop function is not called and is INIT status.

jun923.gu (gujun1989) wrote :

I pay attention to the below code for oslo.service.

    def _child_wait_for_exit_or_signal(self, launcher):
        status = 0
        signo = 0

        # NOTE(johannes): All exceptions are caught to ensure this
        # doesn't fallback into the loop spawning children. It would
        # be bad for a child to spawn more children.
        try:
            launcher.wait()
        except SignalExit as exc:
            signame = self.signal_handler.signals_to_name[exc.signo]
            LOG.info('Child caught %s, exiting', signame)
            status = exc.code
            signo = exc.signo
        except SystemExit as exc:
            status = exc.code
        except BaseException:
            LOG.exception('Unhandled exception')
            status = 2

        return status, signo

    def _start_child(self, wrap):
        if len(wrap.forktimes) > wrap.workers:
            # Limit ourselves to one process a second (over the period of
            # number of workers * 1 second). This will allow workers to
            # start up quickly but ensure we don't fork off children that
            # die instantly too quickly.
            if time.time() - wrap.forktimes[0] < wrap.workers:
                LOG.info('Forking too fast, sleeping')
                time.sleep(1)

            wrap.forktimes.pop(0)

        wrap.forktimes.append(time.time())

        pid = os.fork()
        if pid == 0:
            self.launcher = self._child_process(wrap.service)
            while True:
                self._child_process_handle_signal()
                status, signo = self._child_wait_for_exit_or_signal(
                    self.launcher)
                if not _is_sighup_and_daemon(signo):
                    self.launcher.wait()
                    break
                self.launcher.restart()

            os._exit(status)

        LOG.debug('Started child %d', pid)

        wrap.children.add(pid)
        self.children[pid] = wrap

        return pid

When the exception occurred, _child_wait_for_exit_or_signal function would catch the exception. The exception is SystemExit. The function just get the code of exception, don't do other operations. I think the function miss the stop function call.

Ken Giusti (kgiusti) on 2018-06-01
Changed in oslo.messaging:
status: New → Invalid
Ken Giusti (kgiusti) wrote :

From your description I don't believe this is a bug in oslo.messaging.

If I understand you correctly, the RPC servers have been started *before* the exception was thrown. In that case I would expect any call to the RPC server's wait() method to block until the RPC's stop() method is invoked.

What I don't understand is how the exception you are hitting is preventing stop from being called. Can you explain were the service is expected to be stopped in the normal case (without the exception being thrown).

I've attached a simple RPC service example - can you create a re-producer from this? I've had it raise a SystemExit exception in the start() method, but that didn't prevent the services from being stopped once stop was called on the launcher.

Changed in oslo.service:
status: New → Incomplete
jun923.gu (gujun1989) wrote :

Thanks Ken Giusti (kgiusti) for reply, I had run your example and everything is OK. Your example creates a thread to call stop function. When wait function to wait stop function to complete, the os will switch the thread that call stop function and run stop function. So your example would run OK. But cinder-volume don't do this.So I found a similar bug on cinder: https://bugs.launchpad.net/cinder/+bug/1644751

Ken Giusti (kgiusti) wrote :

Thanks for the cinder link.

I think this is a bug in oslo.service. See the attached reproducer, which does not use oslo.messaging.

If the Service.start() method manages to start a background thread _then_ raise an error, calling ProcessLauncher.stop() will not call Service.stop(). However calling ProcessLauncher.wait() *does* invoke Service.wait(), which results in a hang since the background thread was never stopped.

One (ugly) work-around for this would be to wrap the Service.start() call in a try: except: block, and clean up any running RPC servers (by calling rpc_server.stop()) in the except block.

Changed in oslo.service:
status: Incomplete → Confirmed
Ken Giusti (kgiusti) on 2018-06-06
summary: - oslo.messaging _OrderedTask wait condition lock
+ oslo.service ProcessLauncher fails to call stop()
jun923.gu (gujun1989) on 2018-06-19
Changed in oslo.service:
assignee: nobody → jun923.gu (gujun1989)

Fix proposed to branch: master
Review: https://review.openstack.org/576327

Changed in oslo.service:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/576327
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=612c6507dc226f99d95bd22e1b68548b40275f21
Submitter: Zuul
Branch: master

commit 612c6507dc226f99d95bd22e1b68548b40275f21
Author: jun923.gu <gujun1989@126.com>
Date: Tue Jun 19 10:25:07 2018 +0800

    Fix oslo.service ProcessLauncher fails to call stop

    Change-Id: Ie8fd5f1da991f9b7edf43dbe15ad088599fef053
    Closes-Bug: #1773070

Changed in oslo.service:
status: In Progress → Fix Released

This issue was fixed in the openstack/oslo.service 1.31.3 release.

Reviewed: https://review.openstack.org/645203
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=7bd23cff925c48831fcf61593b9cf7e286e5ff1d
Submitter: Zuul
Branch: stable/queens

commit 7bd23cff925c48831fcf61593b9cf7e286e5ff1d
Author: jun923.gu <gujun1989@126.com>
Date: Tue Jun 19 10:25:07 2018 +0800

    Fix oslo.service ProcessLauncher fails to call stop

    Change-Id: Ie8fd5f1da991f9b7edf43dbe15ad088599fef053
    Closes-Bug: #1773070
    (cherry picked from commit 612c6507dc226f99d95bd22e1b68548b40275f21)

tags: added: in-stable-queens

This issue was fixed in the openstack/oslo.service 1.29.1 release.

Reviewed: https://review.opendev.org/645208
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=af82db7e4314aca7930eb23da9fa13f664fe2072
Submitter: Zuul
Branch: stable/pike

commit af82db7e4314aca7930eb23da9fa13f664fe2072
Author: jun923.gu <gujun1989@126.com>
Date: Tue Jun 19 10:25:07 2018 +0800

    Fix oslo.service ProcessLauncher fails to call stop

    Change-Id: Ie8fd5f1da991f9b7edf43dbe15ad088599fef053
    Closes-Bug: #1773070
    (cherry picked from commit 612c6507dc226f99d95bd22e1b68548b40275f21)

tags: added: in-stable-pike

Reviewed: https://review.opendev.org/645214
Committed: https://git.openstack.org/cgit/openstack/oslo.service/commit/?id=7db3f687212e570c6906801a60757c7c7a693463
Submitter: Zuul
Branch: stable/ocata

commit 7db3f687212e570c6906801a60757c7c7a693463
Author: jun923.gu <gujun1989@126.com>
Date: Tue Jun 19 10:25:07 2018 +0800

    Fix oslo.service ProcessLauncher fails to call stop

    Change-Id: Ie8fd5f1da991f9b7edf43dbe15ad088599fef053
    Closes-Bug: #1773070
    (cherry picked from commit 612c6507dc226f99d95bd22e1b68548b40275f21)

tags: added: in-stable-ocata
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers