neutron-server API doesn't shut down gracefully

Bug #1815871 reported by Jesse
This bug affects 1 person
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Terry Wilson

Bug Description

When neutron-server is stopped, the API workers shut down immediately, regardless of whether there are ongoing requests.
Those ongoing requests are aborted immediately.

After testing, going through the code, and comparing with the nova and cinder code, the cause is an issue in the stop() and wait() functions of WorkerService in neutron/wsgi.py.

    def wait(self):
        if isinstance(self._server, eventlet.greenthread.GreenThread):
            # By the time wait() runs, stop() has already set self._server
            # to None, so this branch is skipped and wait() returns at once.
            self._server.wait()

    def stop(self):
        if isinstance(self._server, eventlet.greenthread.GreenThread):
            self._server.kill()
            # Dropping the reference here leaves wait() with nothing to wait on.
            self._server = None

Check the neutron code above.
After kill() in the stop() function, self._server is forcibly set to None, which leaves the wait() function with nothing to do. As a result, the API worker shuts down immediately without waiting for in-flight requests.

Nova has the correct logic; see: https://github.com/openstack/nova/blob/master/nova/wsgi.py#L197
Cinder uses oslo_service.wsgi, which has the same code as nova.
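
For illustration, here is a minimal sketch of that pattern applied to the snippet above (the attribute names and overall structure are assumed from that snippet; this is not the actual proposed patch): stop() signals the wsgi greenthread but keeps the reference, so wait() can block until eventlet has drained the in-flight requests via pool.waitall().

import eventlet
from eventlet import wsgi
import greenlet


class WorkerService(object):
    """Illustrative only; mirrors the attribute names used above."""

    def __init__(self, app, sock):
        self._app = app
        self._sock = sock
        self._server = None

    def start(self):
        # Run the eventlet wsgi server in a greenthread and keep the handle.
        self._server = eventlet.spawn(wsgi.server, self._sock, self._app)

    def stop(self):
        if isinstance(self._server, eventlet.greenthread.GreenThread):
            # Raises GreenletExit in the wsgi loop, which stops accepting and
            # enters its shutdown path (pool.waitall()). Keep the reference
            # so wait() still has something to join.
            self._server.kill()

    def wait(self):
        if isinstance(self._server, eventlet.greenthread.GreenThread):
            try:
                # Blocks until the wsgi greenthread finishes, i.e. until
                # in-flight requests have completed.
                self._server.wait()
            except greenlet.GreenletExit:
                pass
        self._server = None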

My debugging is as follows.
I added a log statement at line 978:

https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L979

serv.log.info('({0}) wsgi exiting, {1}'.format(serv.pid, pool.__dict__))

I updated a neutron API to sleep for 10s, then curled this API and, at the same time, killed neutron-server.
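
For reference, a rough Python sketch of that reproduction (the endpoint URL, token, and PID lookup below are placeholders I added, not part of the original setup):

import os
import signal
import subprocess
import threading

import requests

# Placeholder endpoint patched to sleep ~10s, and a placeholder token.
SLOW_URL = 'http://127.0.0.1:9696/v2.0/networks'
TOKEN = '<keystone-token>'

def slow_request():
    resp = requests.get(SLOW_URL, headers={'X-Auth-Token': TOKEN})
    print('request finished with status', resp.status_code)

t = threading.Thread(target=slow_request)
t.start()

# While the request is in flight, SIGTERM the parent neutron-server
# (equivalent to "kill <pid>"). Without a fix, the request is aborted.
pid = int(subprocess.check_output(['pgrep', '-o', '-f', 'neutron-server']))
os.kill(pid, signal.SIGTERM)
t.join()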

Below is the neutron-server log; I have 4 API workers. You can see that process 329 has an entry in coroutines_running, but it does not log 'wsgi exited' because of pool.waitall() in https://github.com/eventlet/eventlet/blob/master/eventlet/wsgi.py#L979, while the other 3 processes have no running coroutines and so do log 'wsgi exited'.
In the end, all 4 child processes exited with status 0.

2019-02-13 17:37:31.193 319 INFO oslo_service.service [-] Caught SIGTERM, stopping children
2019-02-13 17:37:31.194 319 DEBUG oslo_concurrency.lockutils [-] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2019-02-13 17:37:31.194 319 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:225
2019-02-13 17:37:31.195 319 DEBUG oslo_service.service [-] Stop services. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:611
2019-02-13 17:37:31.195 319 DEBUG oslo_service.service [-] Killing children. stop /usr/lib/python2.7/site-packages/oslo_service/service.py:616
2019-02-13 17:37:31.195 319 INFO oslo_service.service [-] Waiting on 4 children to exit
2019-02-13 17:37:31.196 328 INFO neutron.wsgi [-] (328) wsgi exiting, {'sem': <Semaphore at 0x5a97810 c=100 _w[0]>, 'coroutines_running': set([]), 'no_coros_running': <eventlet.event.Event object at 0x5a97a10>, 'size': 100}
2019-02-13 17:37:31.196 329 INFO neutron.wsgi [-] (329) wsgi exiting, {'sem': <Semaphore at 0x5a97810 c=99 _w[0]>, 'coroutines_running': set([<greenlet.greenlet object at 0x5477cd0>]), 'no_coros_running': <eventlet.event.Event object at 0x5a97b10>, 'size': 100}
2019-02-13 17:37:31.196 331 INFO neutron.wsgi [-] (331) wsgi exiting, {'sem': <Semaphore at 0x5a97810 c=100 _w[0]>, 'coroutines_running': set([]), 'no_coros_running': <eventlet.event.Event object at 0x5a97a10>, 'size': 100}
2019-02-13 17:37:31.197 330 INFO neutron.wsgi [-] (330) wsgi exiting, {'sem': <Semaphore at 0x5a97810 c=100 _w[0]>, 'coroutines_running': set([]), 'no_coros_running': <eventlet.event.Event object at 0x5a97b10>, 'size': 100}
2019-02-13 17:37:31.210 329 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2019-02-13 17:37:31.211 329 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:225
2019-02-13 17:37:31.212 328 INFO neutron.wsgi [-] (328) wsgi exited, is_accepting=True
2019-02-13 17:37:31.216 328 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2019-02-13 17:37:31.217 331 INFO neutron.wsgi [-] (331) wsgi exited, is_accepting=True
2019-02-13 17:37:31.218 328 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:225
2019-02-13 17:37:31.218 330 INFO neutron.wsgi [-] (330) wsgi exited, is_accepting=True
2019-02-13 17:37:31.219 331 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2019-02-13 17:37:31.220 331 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:225
2019-02-13 17:37:31.220 330 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Acquired semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:212
2019-02-13 17:37:31.221 330 DEBUG oslo_concurrency.lockutils [req-d813d601-8563-4d0f-8b16-1418f81ddcc1 - - - - -] Releasing semaphore "singleton_lock" lock /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:225
2019-02-13 17:37:31.215 319 INFO oslo_service.service [-] Child 329 exited with status 0
2019-02-13 17:37:31.224 319 INFO oslo_service.service [-] Child 328 exited with status 0
2019-02-13 17:37:31.224 319 INFO oslo_service.service [-] Child 331 exited with status 0
2019-02-13 17:37:31.226 319 INFO oslo_service.service [-] Child 330 exited with status 0

If my patch is used, neutron-server waits for these running coroutines to finish (the parent neutron-server will still exit after 60s, because graceful_shutdown_timeout defaults to 60s in oslo.service), and only then stops the child process.
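
As a standalone illustration of why the busy worker blocks: a GreenPool with a running coroutine does not return from waitall() until that coroutine finishes. A minimal demo, independent of the neutron code itself:

import time

import eventlet

pool = eventlet.GreenPool(size=100)
pool.spawn_n(eventlet.sleep, 10)   # stands in for the 10s "slow API" request

start = time.time()
pool.waitall()                     # blocks until the coroutine finishes
print('waitall returned after %.1fs' % (time.time() - start))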

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/636855

Changed in neutron:
assignee: nobody → Jesse (jesse-5)
status: New → In Progress
Jesse (jesse-5)
description: updated
Changed in neutron:
importance: Undecided → Medium
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Jesse:

As I commented in https://review.openstack.org/636855/1, you should not use the first version of your patch, because you can't block the worker's wait() function waiting for other threads to stop. The WorkerService.wait() function should wait only for the specific thread spawned and stored in self._server. The worker should only be aware of its own thread, not the service pool.

As I asked you in the patch, do you have a log report of this error?

Revision history for this message
Jesse (jesse-5) wrote :

Hi Rodolfo,
I have put the log in the description of this bug.
I referred to the nova-side implementation in: https://github.com/openstack/nova/blob/master/nova/wsgi.py#L197
I tested it, and it works.

I will look into the details of why it works later, thanks.

Revision history for this message
Jesse (jesse-5) wrote :

I found an issue with implementing a graceful stop for the neutron-server API.
When neutron-server is stopped, the API workers stop first [1] in a graceful way, and WSGI waits for all running requests to neutron-server to close.
But there may be persistent-connection requests on the neutron-server API workers, which prevent the API workers from stopping. For HTTP persistent connections, refer to [2].
I checked my environment: nova-api and nova-compute use neutronclient to connect to neutron-server, and neutronclient uses persistent connections.
When the API workers do not stop, then after 60s (graceful_shutdown_timeout in oslo.service) the parent neutron-server stops anyway, which makes the RPC workers stop immediately.
There is no time for the RPC workers to stop gracefully.

If we change the stop order and let the RPC workers stop first, then the API workers, there will be issues when an API worker accepts, for example, a net-create request while stopping.

I think the better way is to stop the API workers with a timeout; half of graceful_shutdown_timeout is fine, which leaves time for the RPC workers to stop.
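
A rough sketch of what I mean, assuming graceful_shutdown_timeout is readable from the worker and reusing the self._server attribute from the snippets above (hypothetical code, not a tested patch):

import eventlet
from oslo_config import cfg

CONF = cfg.CONF


class WorkerService(object):
    def wait(self):
        # Give in-flight requests half of graceful_shutdown_timeout, then
        # give up so the RPC workers still get time to stop gracefully.
        timeout = CONF.graceful_shutdown_timeout / 2.0
        if isinstance(self._server, eventlet.greenthread.GreenThread):
            try:
                with eventlet.Timeout(timeout):
                    self._server.wait()
            except eventlet.Timeout:
                self._server.kill()
        self._server = None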

When there are persistent connections to the neutron-server API, the WSGI graceful stop does not seem to handle them gracefully, e.g. by closing the connection once the current HTTP request on that TCP connection has finished. nginx seems to implement this behavior. [3]

[1] https://github.com/openstack/neutron/blob/master/neutron/server/wsgi_eventlet.py#L33
[2] https://en.wikipedia.org/wiki/HTTP_persistent_connection
[3] https://trac.nginx.org/nginx/ticket/1022

Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Jesse (jesse-5) → nobody
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/636855
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Bug closed due to lack of activity, please feel free to reopen if needed.

Changed in neutron:
status: New → Won't Fix
Changed in neutron:
assignee: nobody → Terry Wilson (otherwiseguy)