Bad initialization sequence in Neutron agents (and maybe somewhere else)
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | Won't Fix | Medium | Unassigned | |
Bug Description
TL;DR: running background threads that modify global state (which includes any RPC call or other library call) while forking happens can cause weird, hard-to-debug errors, so background jobs should be initialized after forking (or, even better, run in a separate process).
Currently at least the metadata and L3 agents start background threads that periodically report state before forking and running the main loop in child processes. Those threads can modify global state so that it is inconsistent at the moment of forking, which leads to bad service behavior. In an ideal world the main process wouldn't do anything except manage child processes, to avoid any global state leaking into them (open fds, held locks, etc.).
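A quick illustration of why the main process should stay "clean": after `fork()`, only the calling thread exists in the child, so whatever the other threads were doing is frozen mid-operation. This is a minimal POSIX-only sketch (CPython behavior), not Neutron code:

```python
import os
import threading
import time

def background_job():
    while True:
        time.sleep(0.1)  # pretend to report state periodically

threading.Thread(target=background_job, daemon=True).start()
print("threads before fork:", threading.active_count())  # 2

pid = os.fork()
if pid == 0:
    # Only the forking thread survives in the child; any global state
    # the background thread was mutating stays frozen as it was.
    print("threads in child:", threading.active_count())  # 1
    os._exit(0)
os.waitpid(pid, 0)
```

The background thread simply vanishes in the child, but every lock it held and every half-written structure it touched is inherited as-is.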
This bug was uncovered while investigating a weird, seemingly unrelated error in the grenade job in Ann's CR that was switching to the new oslo.db EngineFacade [0]. The symptom was an SSHTimeout while the instance's cloud-init was busy trying (and failing) to get metadata. In the q-meta logs we noticed there were no INFO messages about incoming HTTP requests, but there were some DEBUG ones, which means requests were reaching the agent but not being responded to. Digging deeper, we noted that in normal operation the metadata agent should:
- receive the request;
- do an RPC call to neutron-server;
- do an HTTP request to Nova;
- send the response.
There were no RPC "CALL" entries in the logs, only CASTs from state reporting and the very first CALL for state reporting.
Since gate jobs are hard to reproduce, especially multinode ones, we created another CR [1] that added tracing of every executed Python line to see what really happens. You can find the long log with all the tracing at [2] or in the attachment (logs don't live forever). It uncovered the following chain of events:
- the main thread in the main process starts a background thread for state reporting;
- some time later that thread starts reporting and wants to do its first CALL (it does one CALL, then only CASTs);
- to get 'reply_q' (essentially, the connection shared for replies, IIUC), it acquires a lock;
- since no connections are available at that point (it's the first time RPC is used), oslo_messaging starts connecting to RabbitMQ;
- the background thread yields execution to the main thread;
- the main thread forks a bunch of WSGI workers;
- in the WSGI workers, when a request comes in, the handler tries to do a CALL to neutron-server;
- to get 'reply_q' it tries to acquire the lock, but it has already been taken by the background thread in the main process;
- the handler hangs forever, which can be seen in the Guru Meditation Report.
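The chain above boils down to a lock acquired before fork and never released in the child. Here is a minimal, self-contained sketch of that failure mode (the names `reply_q_lock` and `state_reporter` are illustrative stand-ins, not oslo_messaging's actual internals; requires a POSIX system for `os.fork()`):

```python
import os
import threading
import time

# Illustrative stand-in for the lock guarding oslo_messaging's reply_q.
reply_q_lock = threading.Lock()

def state_reporter():
    # The background thread grabs the lock, then blocks while
    # "connecting to RabbitMQ" (simulated with a sleep).
    with reply_q_lock:
        time.sleep(5)

threading.Thread(target=state_reporter, daemon=True).start()
time.sleep(0.2)  # let the reporter acquire the lock first

pid = os.fork()  # the main process forks a "WSGI worker"
if pid == 0:
    # Child: only the forking thread was cloned, so no thread here will
    # ever release the inherited lock. A blocking acquire() would hang
    # forever, exactly as seen in the Guru Meditation Report.
    print("lock held in child:", reply_q_lock.locked())  # True
    os._exit(0)
os.waitpid(pid, 0)
```

A blocking `reply_q_lock.acquire()` in the child would reproduce the hang; the non-blocking check just makes the broken state visible without deadlocking.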
There are several problems here (including eventlet's locks not being fork-aware), but the one fix that addresses them all is to start such background threads only after all forking has happened. I published CR [3] to verify that changing the initialization order fixes this issue, and it did.
Note that, from what I've been told, forks can still happen later if a child worker unexpectedly dies and the main process re-forks it. To properly fix this issue we should not do anything in the main process that can spoil global state visible to child processes. That means we'll need to either run state reporting in a separate process or give it an isolated oslo_messaging environment (a context? I am not an expert in oslo_messaging).
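The reordering fix can be sketched as follows. This is an illustrative outline under the assumptions above, not Neutron's actual service code; `start_state_reporting` and `serve_forever` are hypothetical placeholders:

```python
import os
import threading

def start_state_reporting():
    # Hypothetical reporter thread; safe to start only once all
    # forking is done, so children never inherit its locks.
    threading.Thread(target=lambda: None, daemon=True).start()

def serve_forever():
    pass  # placeholder for a worker's main loop

# Fixed ordering: fork all workers first...
children = []
for _ in range(2):
    pid = os.fork()
    if pid == 0:         # child worker: no extra threads or held
        serve_forever()  # locks exist yet, so nothing is inherited
        os._exit(0)
    children.append(pid)

# ...and only then start RPC/state-reporting threads in the parent.
start_state_reporting()

for pid in children:
    os.waitpid(pid, 0)
```

As the report notes, this is only a partial fix: if a dead worker is re-forked later, the reporter thread exists again at fork time, which is why a separate reporter process is the more robust design.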
[0] https:/
[1] https:/
[2] http://
[3] https:/
Changed in neutron: | |
assignee: | nobody → Yuriy Taraday (yorik-sar) |
status: | New → In Progress |
Changed in neutron: | |
assignee: | Yuriy Taraday (yorik-sar) → Ann Kamyshnikova (akamyshnikova) |
Changed in neutron: | |
assignee: | Ann Kamyshnikova (akamyshnikova) → Yuriy Taraday (yorik-sar) |
tags: | added: needs-attention |
Changed in neutron: | |
importance: | Undecided → Medium |
Changed in neutron: | |
assignee: | Yuriy Taraday (yorik-sar) → nobody |
status: | In Progress → Confirmed |
tags: | added: neutron-proactive-backport-potential |
Changed in neutron: | |
assignee: | nobody → Sindhu Devale (sindhu-devale-3) |
Changed in neutron: | |
assignee: | Sindhu Devale (sindhu-devale-3) → nobody |
tags: | removed: neutron-proactive-backport-potential |
tags: | removed: needs-attention |
Reviewed: https://review.openstack.org/331672
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7eeeb99a55385b5fc3d4ce1a3b390e41be252b17
Submitter: Jenkins
Branch: master
commit 7eeeb99a55385b5fc3d4ce1a3b390e41be252b17
Author: Yuriy Taraday <email address hidden>
Date: Mon Jun 20 16:39:03 2016 +0300
Move state reporting initialization to after worker forking
We start the state reporting thread before forking child processes. This
leads to the possibility of a fork in the middle of the reporting
process, which can cause child processes to start with corrupted global
state. In [0] it manifested as the metadata agent being unable to make
an RPC call because of a lock acquired in the main process before the
fork and released after it. See the bug description for details.
[0] https://review.openstack.org/312393
Partial-Bug: 1594439
Change-Id: Id5079a296bc402 c6f5b6cdb2df728 11eab5bc6ed