Neutron_api (unhealthy) after few days

Bug #1809823 reported by Yossi Ovadia
This bug affects 2 people
Affects       Status     Importance  Assigned to  Milestone
neutron       Won't Fix  Undecided   Ahmed Zaid
oslo.service  Confirmed  High        Ahmed Zaid

Bug Description

Description
===========
On the undercloud (we are fairly sure we have also seen it on the overcloud; I'll update when sure).
Without any action on our part, the neutron_api service goes into the "unhealthy" state and stops functioning.
The log shows:
2018-12-26 00:00:35.774 7 INFO oslo_service.service [-] Caught SIGHUP, stopping children
2018-12-26 00:00:36.077 40997 ERROR oslo_service.service [-] Error starting thread.: RuntimeError: A fixed interval looping call can only run one function at a time

OpenStack commands that need neutron fail (e.g. openstack server list).

Restarting the neutron_api Docker container resolves the problem.

Steps to reproduce
==================
Deploy.
Wait 4 days.

Expected result
===============
The service should remain healthy.

Actual result
=============
The service becomes unhealthy.

Environment
===========
Rocky, container-based.

Logs & Configs
==============

Logs : http://paste.openstack.org/show/738658/

More info:
==========
A search turned up this bug:
https://bugs.launchpad.net/oslo.service/+bug/1547029
followed by this paste:
http://paste.openstack.org/show/487420/

It seems that adding "eventlet.sleep(0)" at the point marked HERE below might resolve the issue:

    def run_service(service, done):
        """Service start wrapper.

        :param service: service to run
        :param done: event to wait on until a shutdown is triggered
        :returns: None

        """
        try:
            # <<<<< HERE >>>>> (proposed: eventlet.sleep(0))
            service.start()
        except Exception:
            LOG.exception('Error starting thread.')
            raise SystemExit(1)
        else:
            done.wait()

The problem is that I haven't come up with an easy way to reproduce the issue in order to confirm this.

Any suggestions?
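
For reference, here is a minimal sketch of the failure mode that the log message suggests: a looping call is started again while its previous run is still active. This is a simplified stand-in built on the standard library, not oslo.service code. The names SketchLoopingCall and restart are hypothetical, and the guard condition is my assumption about how such a check typically works.

```python
import threading

RUN_ONLY_ONE_MESSAGE = (
    "A fixed interval looping call can only run one function at a time")


class SketchLoopingCall:
    """Hypothetical stand-in for a fixed-interval looping call."""

    def __init__(self, func, interval):
        self._func = func
        self._interval = interval
        self._thread = None
        self._stop = threading.Event()

    def start(self):
        # Assumed guard: refuse to start while a previous run is still tracked.
        if self._thread is not None:
            raise RuntimeError(RUN_ONLY_ONE_MESSAGE)
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        # Call the function, then sleep for the interval or until stopped.
        while not self._stop.is_set():
            self._func()
            self._stop.wait(self._interval)

    def stop(self):
        self._stop.set()

    def wait(self):
        # The call becomes startable again only after the worker has finished.
        if self._thread is not None:
            self._thread.join()
            self._thread = None


def restart(call, stop_first):
    """Mimic a SIGHUP-style restart; return the error message, or None on success."""
    if stop_first:
        call.stop()
        call.wait()
    try:
        call.start()
    except RuntimeError as exc:
        return str(exc)
    return None
```

In this sketch, restart(call, stop_first=False) on an already running call returns the same message seen in the log, while restart(call, stop_first=True) succeeds. Whether the real restart path skips or races the stop step is exactly what still needs to be confirmed.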

Yossi Ovadia (jabadia) wrote :

The code above is taken from "/usr/lib/python2.7/site-packages/oslo_service/service.py", line 794.

Changed in tripleo:
assignee: nobody → Yossi Ovadia (jabadia)
Yossi Ovadia (jabadia) wrote :
Ben Nemec (bnemec)
Changed in oslo.service:
status: New → Confirmed
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Change abandoned on oslo.service (master)

Change abandoned by Yossi Ovadia (<email address hidden>) on branch: master
Review: https://review.openstack.org/629289
Reason: after a lot of research, this fix does not fix the initially reported neutron issue.

Yossi Ovadia (jabadia) wrote :

UPDATE:
To reproduce the bug, log into the neutron_api container and list its processes:

docker exec -it --user root neutron_api bash
ps fax| grep neutron_api
()[root@undercloud /]# ps fax
  PID TTY STAT TIME COMMAND
  115 ? Ss 0:00 bash
  140 ? S+ 0:00 \_ top
   44 ? Ss 0:00 bash
  241 ? R+ 0:00 \_ ps fax
    1 ? Ss 0:00 /usr/local/bin/dumb-init /bin/bash /usr/local/bin/kolla_sta
    7 ? Ss 0:05 /usr/bin/python2 /usr/bin/neutron-server --config-file /usr
   27 ? S 0:08 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   28 ? S 0:00 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   29 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   30 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   31 ? S 0:03 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file
   32 ? R 5:27 \_ /usr/bin/python2 /usr/bin/neutron-server --config-file

Send SIGHUP to the last PID (32):
kill -1 32

Check server.log after a few seconds:
2018-12-26 00:00:36.077 40997 ERROR oslo_service.service [-] Error starting thread.: RuntimeError: A fixed interval looping call can only run one function at a time

In our environment this occurs without anyone issuing kill -1: after roughly 4 days a SIGHUP arrives and the container becomes unhealthy.
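
To check for the failure after sending the signal, something like the following could scan the log. It is a rough sketch that assumes every line follows the "timestamp pid LEVEL logger message" layout of the two log lines quoted in this report. The names LOG_LINE, ERROR_TEXT and find_restart_failures are hypothetical.

```python
import re

# Assumed layout, based on the log lines quoted in this report:
# "<date> <time.millis> <pid> <LEVEL> <logger> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) (?P<message>.*)$"
)
ERROR_TEXT = "A fixed interval looping call can only run one function at a time"


def find_restart_failures(lines):
    """Return (timestamp, pid) for each ERROR line with the looping-call message."""
    hits = []
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # Skip tracebacks and other lines that do not fit the layout.
        if match.group("level") == "ERROR" and ERROR_TEXT in match.group("message"):
            hits.append((match.group("timestamp"), int(match.group("pid"))))
    return hits
```

The function accepts any iterable of strings, so an open file object for server.log can be passed in directly.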

affects: tripleo → neutron
Ahmed Zaid (ahmedzaid10)
Changed in neutron:
assignee: Yossi Ovadia (jabadia) → Ahmed Zaid (ahmedzaid10)
Changed in oslo.service:
assignee: nobody → Ahmed Zaid (ahmedzaid10)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/634290

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/634290
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Closed due to inactivity. Please feel free to reopen if needed.

Changed in neutron:
status: In Progress → Won't Fix