Neutron_api (unhealthy) after few days

Bug #1809823 reported by Yossi Ovadia on 2018-12-26
This bug affects 2 people
Affects        Importance   Assigned to
neutron        Undecided    Ahmed Zaid
oslo.service   High         Ahmed Zaid

Bug Description

Description
===========
On the undercloud (I'm fairly sure we have also seen it on the overcloud; I'll update when sure).
Without any action on our side, the neutron_api service goes into an "unhealthy" state and stops functioning.
The log shows:
2018-12-26 00:00:35.774 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-12-26 00:00:36.077 40997 ERROR oslo_service.service [-] Error starting thread.: RuntimeError: A fixed interval looping call can only run one function at a time

openstack commands that need neutron fail (e.g. openstack server list).

Restarting the Docker container (neutron_api) resolves the problem.
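
One plausible reading of the log above, not confirmed in this report, is that SIGHUP causes a service's start() to run again while its periodic loop from the first start is still active. The sketch below is a simplified stand-in written for illustration only: FixedIntervalLoop and Service are hypothetical names, not the oslo.service classes, and only the error text is taken from the log.

```python
class FixedIntervalLoop:
    """Simplified, hypothetical stand-in for a fixed-interval looping call."""

    def __init__(self, func):
        self.func = func
        self._running = False

    def start(self):
        # Refuse a second start while the first loop is still active.
        if self._running:
            raise RuntimeError(
                "A fixed interval looping call can only run one function at a time")
        self._running = True

    def stop(self):
        self._running = False


class Service:
    """Hypothetical service that starts its periodic loop in start()."""

    def __init__(self):
        self.loop = FixedIntervalLoop(lambda: None)

    def start(self):
        self.loop.start()


svc = Service()
svc.start()  # initial start succeeds
try:
    svc.start()  # a restart without a prior stop() fails
    error = None
except RuntimeError as exc:
    error = str(exc)
print(error)
```

In this toy model, calling svc.loop.stop() before the second start() avoids the error, which is why the ordering of stop and start during a SIGHUP-triggered restart is worth checking.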

Steps to reproduce
==================
Deploy.
Wait 4 days.

Expected result
===============
The service should remain healthy.

Actual result
=============
The service becomes unhealthy.

Environment
===========
Rocky, container-based.

Logs & Configs
==============

Logs: http://paste.openstack.org/show/738658/

More info:
==========
A search turned up this bug:
https://bugs.launchpad.net/oslo.service/+bug/1547029
followed by this paste:
http://paste.openstack.org/show/487420/

It seems that adding "eventlet.sleep(0)" at the point marked HERE below might resolve the issue:

    def run_service(service, done):
        """Service start wrapper.

        :param service: service to run
        :param done: event to wait on until a shutdown is triggered
        :returns: None

        """
        try:
            # <<<<< HERE >>>>> proposed location for eventlet.sleep(0)
            service.start()
        except Exception:
            LOG.exception('Error starting thread.')
            raise SystemExit(1)
        else:
            done.wait()
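
For readers unfamiliar with the idiom: a zero-length sleep in a cooperative scheduler does not wait, it only hands control to other ready tasks before continuing. The sketch below illustrates that effect with the standard library's asyncio.sleep(0), used here as an analogue of eventlet.sleep(0) so that the example runs without eventlet installed. The worker and main names are hypothetical, and this does not show that the proposed change fixes the bug.

```python
import asyncio


async def worker(name, log, yield_first):
    if yield_first:
        # Analogue of eventlet.sleep(0): let other ready tasks run first.
        await asyncio.sleep(0)
    log.append(name + " start")


async def main():
    log = []
    # Task "a" yields before starting; task "b" does not.
    await asyncio.gather(worker("a", log, True), worker("b", log, False))
    return log


order = asyncio.run(main())
print(order)  # ['b start', 'a start']
```

Although "a" is scheduled first, it starts after "b" because it yields once before doing its work.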

The problem is that I haven't found an easy way to reproduce the issue in order to confirm it.

Any suggestions?

Yossi Ovadia (jabadia) wrote :

The code above is taken from "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 794.

Changed in tripleo:
assignee: nobody → Yossi Ovadia (jabadia)
Ben Nemec (bnemec) on 2019-01-10
Changed in oslo.service:
status: New → Confirmed
importance: Undecided → High

Change abandoned by Yossi Ovadia (<email address hidden>) on branch: master
Review: https://review.openstack.org/629289
Reason: after a lot of research, this fix does not resolve the originally reported neutron issue.

Yossi Ovadia (jabadia) wrote :

UPDATE:
To reproduce the bug, log into the neutron_api container:

docker exec -it --user root neutron_api bash
ps fax| grep neutron_api
()[root@undercloud /]# ps fax
  PID TTY STAT TIME COMMAND
  115 ? Ss 0:00 bash
  140 ? S+ 0:00 \_ top
   44 ? Ss 0:00 bash
  241 ? R+ 0:00 \_ ps fax
    1 ? Ss 0:00 /usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_sta
    7 ? Ss 0:05 /usr/bin/python2 /usr/bin/neutron-server --config-file /usr
   27 ? S 0:08 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   28 ? S 0:00 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   29 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   30 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   31 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   32 ? R 5:27 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file

Send SIGHUP to the last PID (32):
kill -1 32

Check server.log after a few seconds:
2018-12-26 00:00:36.077 40997 ERROR oslo_service.service [-] Error starting thread.: RuntimeError: A fixed interval looping call can only run one function at a time

In our environment this occurs without anyone issuing kill -1: after roughly 4 days a SIGHUP arrives and the container becomes unhealthy.
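
The kill -1 step can also be scripted when trying to reproduce this repeatedly. The sketch below is a self-contained illustration for a POSIX system: it sends SIGHUP to its own process rather than to PID 32, and on_sighup is a hypothetical handler that only records the signal. To target the neutron-server worker instead, the PID passed to os.kill would be the one shown by ps inside the container.

```python
import os
import signal
import time

received = []


def on_sighup(signum, frame):
    # Record the signal instead of restarting anything.
    received.append(signum)


# Install the handler, then send SIGHUP (signal 1) to this process.
signal.signal(signal.SIGHUP, on_sighup)
os.kill(os.getpid(), signal.SIGHUP)

# Give the interpreter a moment to run the Python-level handler.
time.sleep(0.1)
print(received)
```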

affects: tripleo → neutron
Ahmed Zaid (ahmedzaid10) on 2019-01-31
Changed in neutron:
assignee: Yossi Ovadia (jabadia) → Ahmed Zaid (ahmedzaid10)
Changed in oslo.service:
assignee: nobody → Ahmed Zaid (ahmedzaid10)

Fix proposed to branch: master
Review: https://review.openstack.org/634290

Changed in neutron:
status: New → In Progress

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/634290
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
