Neutron multiple API workers can't send cast messages to agents when using ZeroMQ

Bug #1364814 reported by Dongcan Ye
Affects                  Status        Importance  Assigned to    Milestone
neutron                  Invalid       Medium      Elena Ezhova
oslo.messaging           Fix Released  Undecided   Elena Ezhova   1.5.0
oslo.messaging (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

When I set api_workers > 0 in the Neutron configuration and then delete or add a router interface, the Neutron L3 agent can't receive the message from the Neutron server.
In this situation, the L3 agent's state report can still be cast to the Neutron server, and the agent can still receive messages that the Neutron server sends with the call method.

So the Neutron server evidently can use the cast method to send messages to the L3 agent; why, then, does casting routers_updated fail? The same thing happens with other Neutron agents.

As a test, I wrote some code in the Neutron server startup and in l3_router_plugin that casts a periodic message to the L3 agent directly. The L3 agent's rpc-zmq-receiver log file shows that it receives the message from the Neutron server.

By the way, everything works well when api_workers = 0.
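
For reference, the trigger setting lives in neutron.conf (the value 2 below is just an example; any value > 0 reproduces the issue):

[DEFAULT]
api_workers = 2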

Test environment:
neutron (master) + oslo.messaging (master) + ZeroMQ

Tags: zmq
Changed in neutron:
importance: Undecided → Medium
Elena Ezhova (eezhova)
Changed in neutron:
assignee: nobody → Elena Ezhova (eezhova)
Revision history for this message
Elena Ezhova (eezhova) wrote :

Could you please specify how you configured Neutron to use zmq? I am especially interested in the matchmaker driver you used (MatchMakerRedis or MatchMakerRing).

Dongcan Ye (hellochosen)
description: updated
Revision history for this message
Dongcan Ye (hellochosen) wrote :

OK, eezhova. I use only MatchMakerRing, because MatchMakerRedis doesn't work here.

In a three-node setup (controller, network, compute), the matchmaker in neutron.conf is set like this:
rpc_zmq_matchmaker = oslo.messaging._drivers.matchmaker_ring.MatchMakerRing
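
For reference, a minimal zmq block in neutron.conf would look roughly like this (the rpc_backend alias and the ringfile path are assumptions based on oslo.messaging defaults, not taken from this setup):

[DEFAULT]
rpc_backend = zmq
rpc_zmq_matchmaker = oslo.messaging._drivers.matchmaker_ring.MatchMakerRing

[matchmaker_ring]
ringfile = /etc/neutron/neutron_matchmaker_ring.json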

Here I use an haproxy node to listen for the neutron service.

Compute and network nodes:
In neutron_matchmaker_ring.json, I map every Neutron topic to a host (the host points to the ha node; you can point to the controller node instead if haproxy is disabled). Here is a slice of it:
{
 "network": ["ha"],
    "subnet": ["ha"],
    "port": ["ha"],
    "security_group": ["ha"],
    "l2population": ["ha"],
    "create": ["ha"],
    "delete": ["ha"],
    "update": ["ha"],
    "q-agent-notifier": ["ha"],
    "q-plugin": ["ha"]
}

Controller node:
In neutron_matchmaker_ring.json, I likewise map every Neutron topic to hosts (the hosts point to the network and compute nodes). Here is a slice of it:
{
    "network": ["network","compute"],
    "subnet": ["network","compute"],
    "port": ["network","compute"],
    "security_group": ["network","compute"],
    "l2population": ["network","compute"],
    "create": ["network","compute"],
    "delete": ["network","compute"],
    "update": ["network","compute"],
    "q-agent-notifier": ["network","compute"],
    "q-plugin": ["network","compute"]
}
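
For what it's worth, a rough sketch of how a ring-style matchmaker resolves one of these topics into addressable "<topic>.<host>" keys (lookup is a hypothetical illustration, not the actual MatchMakerRing API):

import json

def lookup(ringfile, topic):
    # Map a bare topic to ("topic.host", host) pairs for every host
    # listed under that topic in the ring file.
    with open(ringfile) as f:
        ring = json.load(f)
    return [(topic + '.' + host, host) for host in ring.get(topic, [])]

With the controller's ring file above, lookup(path, 'l2population') would yield [('l2population.network', 'network'), ('l2population.compute', 'compute')].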

Revision history for this message
Elena Ezhova (eezhova) wrote :

Dongcan, thank you for providing the details.

Changed in neutron:
status: New → Confirmed
Revision history for this message
Elena Ezhova (eezhova) wrote :

As I found out, the problem is in the zmq context, which is a singleton and is thus created only once. [1] This leads to problems when more than one process works with it. [2]
The solution is to make the zmq context thread-local by using the threading.local class [3]; a rough sketch follows the references below.

I have a working fix that I will upload shortly.

[1] https://github.com/openstack/oslo.messaging/blob/master/oslo/messaging/_drivers/impl_zmq.py#L813
[2] http://lists.zeromq.org/pipermail/zeromq-dev/2011-December/014900.html
[3] https://docs.python.org/2/library/threading.html#threading.local
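
A rough sketch of that thread-local idea, assuming pyzmq (the _get_ctx helper and _local name are illustrative, not the actual oslo.messaging code):

import threading

import zmq

_local = threading.local()

def _get_ctx():
    # Lazily create one zmq.Context per thread instead of sharing a
    # single module-level singleton across threads and forked workers.
    if not hasattr(_local, 'ctx'):
        _local.ctx = zmq.Context()
    return _local.ctx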

Changed in neutron:
status: Confirmed → Opinion
Changed in oslo.messaging:
status: New → Confirmed
assignee: nobody → Elena Ezhova (eezhova)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.messaging (master)

Fix proposed to branch: master
Review: https://review.openstack.org/126914

Changed in oslo.messaging:
status: Confirmed → In Progress
James Page (james-page)
tags: added: zmq
Revision history for this message
Elena Ezhova (eezhova) wrote :

If we look at the problem from the Neutron side, we can see that when the Neutron server starts, it first loads the core plugin and the service plugins, which start the message handling server, and only then forks to create the API workers. As a result, all child processes get the same copy of the context.

In this case, instead of making the singleton ZeroMQ Context thread-local, which is intended for threads rather than processes, it is better to create a new Context for each socket (see the sketch below). This prevents such situations and guarantees that each process works with its own Context.

I have updated the proposed fix accordingly.
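
A minimal sketch of the per-socket approach, again assuming pyzmq (ZmqSocket here is a simplified stand-in for the driver's socket wrapper, not the actual patch):

import zmq

class ZmqSocket(object):
    # Simplified stand-in: each socket owns a private Context, so a
    # worker forked after startup never shares ZeroMQ state with its
    # parent process.
    def __init__(self, addr, zmq_type):
        self.ctx = zmq.Context()
        self.sock = self.ctx.socket(zmq_type)
        self.sock.connect(addr)

    def close(self):
        self.sock.close()
        self.ctx.term()

For example, ZmqSocket('tcp://127.0.0.1:9501', zmq.PUSH) would get its own Context even inside a freshly forked api-worker (the address and port are arbitrary examples).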

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.messaging (master)

Reviewed: https://review.openstack.org/126914
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0d49793e340728416c0c7b1bf964b54efd7e5acb
Submitter: Jenkins
Branch: master

commit 0d49793e340728416c0c7b1bf964b54efd7e5acb
Author: Elena Ezhova <email address hidden>
Date: Wed Oct 8 18:18:20 2014 +0400

    Create ZeroMQ Context per socket

    ZeroMQ Context is a singleton and thus is created only once. This leads
    to problems when there is more than one process working with it.
    For example, while Neutron server starts, it firstly loads core
    plugin and service plugins, which start message handling server,
    and only then forks to create api-workers. As a result, all child
    processes get the same copy of the context.

    Creating new Context for each socket will prevent such situations
    from happening and will guarantee that each process works with its
    own Context.

    Change-Id: I56912e39b119c20f6f23311fc2c7c4b9e9e480d0
    Closes-Bug: #1364814

Changed in oslo.messaging:
status: In Progress → Fix Committed
Revision history for this message
Dongcan Ye (hellochosen) wrote :

Good job, thanks eezhova.

Elena Ezhova (eezhova)
Changed in neutron:
status: Opinion → Invalid
Mehdi Abaakouk (sileht)
Changed in oslo.messaging:
milestone: none → 1.5.0
status: Fix Committed → Fix Released
Revision history for this message
James Page (james-page) wrote :

Released in Ubuntu in 1.5.1.

Changed in oslo.messaging (Ubuntu):
status: New → Fix Released