AgentNotFoundByTypeHost exception logged when L3-agent starts up

Bug #1456822 reported by Stephen Ma
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Low
Stephen Ma
Juno
Fix Released
Undecided
Unassigned
Kilo
New
Undecided
Unassigned

Bug Description

On my single-node devstack setup running the latest neutron code, there is one AgentNotFoundByTypeHost exception found for the L3-agent. However, the AgentNotFoundByTypeHost exception is not logged for the DHCP, OVS, or metadata agents. This fact would point to a problem with how the L3-agent is starting up.

Exception found in the L3-agent log:

2015-05-19 11:27:57.490 23948 DEBUG oslo_messaging._drivers.amqpdriver [-] MSG_ID is 1d0f3e0a8a6744c9a9fc43eb3fdc5153 _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:311^M
2015-05-19 11:27:57.550 23948 ERROR neutron.agent.l3.agent [-] Failed synchronizing routers due to RPC error^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent Traceback (most recent call last):^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/agent.py", line 517, in fetch_and_sync_all_routers^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent routers = self.plugin_rpc.get_routers(context)^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/opt/stack/neutron/neutron/agent/l3/agent.py", line 91, in get_routers^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent router_ids=router_ids)^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 156, in call^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent retry=self.retry)^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent timeout=timeout, retry=retry)^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 350, in send^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent retry=retry)^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 341, in _send^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent raise result^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent RemoteError: Remote error: AgentNotFoundByTypeHost Agent with agent_type=L3 agent and host=DVR-Ctrl2 could not be found^M
2015-05-19 11:27:57.550 23948 TRACE neutron.agent.l3.agent [u'Traceback (most recent call last):\n', u' File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 142, in _dispatch_and_reply\n executor_callback))\n', u' File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 186, in _dispatch\n executor_callback)\n', u' File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 130, in _do_dispatch\n result = func(ctxt, **new_args)\n', u' File "/opt/stack/neutron/neutron/api/rpc/handlers/l3_rpc.py", line 81, in sync_routers\n context, host, router_ids))\n', u' File "/opt/stack/neutron/neutron/db/l3_agentschedulers_db.py", line 290, in list_active_sync_routers_on_active_l3_agent\n context, constants.AGENT_TYPE_L3, host)\n', u' File "/opt/stack/neutron/neutron/db/agents_db.py", line 197, in _get_agent_by_type_and_host\n host=host)\n', u'AgentNotFoundByTypeHost: Agent with agent_type=L3 agent and host=DVR-Ctrl2 could not be found\n'].^M

Stephen Ma (stephen-ma)
Changed in neutron:
assignee: nobody → Stephen Ma (stephen-ma)
assignee: Stephen Ma (stephen-ma) → nobody
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

That's really interesting.
I guess it might happen when request for routers came in before L3 agent has registered.
It should not cause problems because later it would register and do another attempt.
But still it's interesting how state report could arrive later than the request for routers.

Changed in neutron:
importance: Undecided → Low
status: New → Confirmed
Revision history for this message
Stephen Ma (stephen-ma) wrote :

In the case here, neutron-l3-agent is running L3NATAgentWithStateReport, a subclass of L3NATAgent.

The L3NATAgentWithStateReport's __init__ first calls its parent's __init_ routine. The periodic sync routers tasks starts right after the L3NatAgent has done its __init__ routine. The L3NATAGentWithStateReport's report_state function is not started until its __init__ is done. In effect, the periodic sync routers tasks starts slightly before the report_state.

The handlers for the rpc calls are looking up the agent information in the same table.

Usually there are multiple api servers running and there are multiple rpc workers per api server. So both rpc handlers are doing the same query at the same time. So if the query for the report_state completed first, there is no problem. If the query from sync_routers finish first, then there will the AgentNotFoundByTypeHost error. This may be the reason this error is not always seen.

So we should let the first report-state to be done before starting the sync routers.

Stephen Ma (stephen-ma)
Changed in neutron:
assignee: nobody → Stephen Ma (stephen-ma)
Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/186584
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b649b9c871d0734745da5a201eca83a6b407a1c5
Submitter: Jenkins
Branch: master

commit b649b9c871d0734745da5a201eca83a6b407a1c5
Author: Stephen Ma <email address hidden>
Date: Thu Jun 4 20:09:23 2015 +0000

    L3 agent should do report state before full sync at start

    Sometimes the AgentNotFoundByTypeHost exception is reported during
    L3-agent startup. The exception is generated when the first
    get_routers RPC call is made. When the neutron server gets this
    RPC call, it might not have handled the report state RPC call yet.
    So the L3-agent hasn't been registered in the API server.
    The result is a RPC Error exception. By the time the next
    get_routers RPC call is made, the report state RPC call has already
    been done and agent registered.

    This patch modifies the L3 agent startup behavior to have the report
    state done before the agent do the sync routers RPC call.

    Closes-bug: 1456822
    Change-Id: Id40cfd8466f45e20fea0e9df6fd57bf9c9e59da7

Changed in neutron:
status: In Progress → Fix Committed
Stephen Ma (stephen-ma)
tags: added: juno-backport-potential kilo-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/kilo)

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/189852

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/189852
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e801eb29d79a8d1f689394756c7870b11ae75c1b
Submitter: Jenkins
Branch: stable/kilo

commit e801eb29d79a8d1f689394756c7870b11ae75c1b
Author: Stephen Ma <email address hidden>
Date: Thu Jun 4 20:09:23 2015 +0000

    L3 agent should do report state before full sync at start

    Sometimes the AgentNotFoundByTypeHost exception is reported during
    L3-agent startup. The exception is generated when the first
    get_routers RPC call is made. When the neutron server gets this
    RPC call, it might not have handled the report state RPC call yet.
    So the L3-agent hasn't been registered in the API server.
    The result is a RPC Error exception. By the time the next
    get_routers RPC call is made, the report state RPC call has already
    been done and agent registered.

    This patch modifies the L3 agent startup behavior to have the report
    state done before the agent do the sync routers RPC call.

    Closes-bug: 1456822
    Change-Id: Id40cfd8466f45e20fea0e9df6fd57bf9c9e59da7
    (cherry picked from commit b649b9c871d0734745da5a201eca83a6b407a1c5)

tags: added: in-stable-kilo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/191469

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196701

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (feature/pecan)

Change abandoned by Kyle Mestery (<email address hidden>) on branch: feature/pecan
Review: https://review.openstack.org/196701
Reason: This is lacking the functional fix [1], so I'll propose a new merge commit which includes that one.

[1] https://review.openstack.org/#/c/196711/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (feature/pecan)

Fix proposed to branch: feature/pecan
Review: https://review.openstack.org/196920

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (feature/pecan)
Download full text (171.5 KiB)

Reviewed: https://review.openstack.org/196920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7f759c077f8f860c13db92d2ea6b353ef6b70900
Submitter: Jenkins
Branch: feature/pecan

commit 8123144fadd7c5d5e6e56a76ea860512619a2cf6
Author: Moshe Levi <email address hidden>
Date: Sun Jun 28 14:37:14 2015 +0300

    Fix Consolidate sriov agent and driver code

    This patch add mising __init to mech_sriov/mech_driver/
    and update the setup.cfg to the new agent entrypoint

    Trivial Fix

    Change-Id: I53a527081feb78472f496675bbb3c5121d38a14a

commit 8942fccf02e6e179d47582fdb2792a1ca972da21
Author: Assaf Muller <email address hidden>
Date: Mon Jun 29 11:38:51 2015 -0400

    Remove failing SafeFixture tests

    The fixtures 1.3 release attempted to fix the fixtures resource
    leak issue, but failed to do so completely. Our own SafeFixture
    is still needed: The 1.3 release broke our SafeFixture tests,
    but not the usage of SafeFixture itself. This patch removes
    those failing tests for now to unbreak the gate. Jakub reported
    a bug on fixtures 1.3:
    https://bugs.launchpad.net/python-fixtures/+bug/1469759

    We will continue to use SafeFixture until that bug is fixed
    in fixtures, at which point we will be able to require
    fixtures > 1.3.

    Change-Id: I59457c3bb198ff86d5ad55a1e623d008f0034b8f
    Closes-Bug: #1469734

commit 71dffb0a2c1720cd8233a329d32958a0160dd6f5
Author: Kevin Benton <email address hidden>
Date: Mon Jun 29 08:27:41 2015 +0000

    Revert "Removed test_lib module"

    This reverts commit 9a6536de6e1a7fe9b2552adc142e254426b82b6f.

    We pulled all of the plugins out of the tree, many of which still inherit
    from neutron test classes. This change then stated that we no longer
    support testing other plugins. I think this is a bit premature and should
    have been discussed under the subject
    "Neutron plugins can't use neutron plugin unit tests" or something
    similar.

    Change-Id: I68318589f010b731574ea3bfa8df98492bab31fc

commit b20fd81dbd497e058384a0af065dd0f1fdc4c728
Author: Jakub Libosvar <email address hidden>
Date: Fri Jun 5 14:32:51 2015 +0000

    Refactor NetcatTester class

    Following capabilities were added:
       - used transport protocol is passed as a constant instead of bool
       - src port for testing was added
       - connection can be established explicitly
       - change constructor parameters of NetcatTester

    As a part of removing bool for protocol definition
    get_free_namespace_port() was also modified to match the behavior.

    Change-Id: Id2ec322e7f731c05a3754a65411c9a5d8b258126

commit 83e37980dcd0b2bad6d64dd2cb23bcd2891cafca
Author: jingliuqing <email address hidden>
Date: Sat Jun 27 13:41:54 2015 +0800

    Use REST rather than ReST

    Change-Id: I06c9deaab58c5ec13bfeec39fb8fd4b1fe21f42d

commit 1b60df85ba3ad442c2e4e7e52538e1b9a1bf9378
Author: Kevin Benton <email address hidden>
Date: Thu Jun 25 18:34:38 2015 -0700

    Add a double-mock guard to the base test case

    Use mock to patch mock with a check to prevent multiple active
    patches to the...

tags: added: in-feature-pecan
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/191469
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f11aa037f6c3a4a0ed2c563e4c0a090a469b48c3
Submitter: Jenkins
Branch: stable/juno

commit f11aa037f6c3a4a0ed2c563e4c0a090a469b48c3
Author: Stephen Ma <email address hidden>
Date: Thu Jun 4 20:09:23 2015 +0000

    L3 agent should do report state before full sync at start

    Sometimes the AgentNotFoundByTypeHost exception is reported during
    L3-agent startup. The exception is generated when the first
    get_routers RPC call is made. When the neutron server gets this
    RPC call, it might not have handled the report state RPC call yet.
    So the L3-agent hasn't been registered in the API server.
    The result is a RPC Error exception. By the time the next
    get_routers RPC call is made, the report state RPC call has already
    been done and agent registered.

    This patch modifies the L3 agent startup behavior to have the report
    state done before the agent do the sync routers RPC call.

    Note: files in the merge conflict list are not in the stable/juno
    branch. They were moved during the kilo development cycle:
     * neutron/agent/l3_agent.py is moved to neutron/agent/l3/agent.py
    in kilo.
     * neutron/tests/unit/test_l3_agent.py is moved to
    neutron/tests/unit/agent/l3/test_agent.py in kilo.

    Conflicts:
            neutron/agent/l3/agent.py
            neutron/tests/unit/agent/l3/test_agent.py

    (cherry picked from commit e801eb29d79a8d1f689394756c7870b11ae75c1b)
    Closes-bug: 1456822
    Change-Id: Id40cfd8466f45e20fea0e9df6fd57bf9c9e59da7

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in neutron:
milestone: liberty-1 → 7.0.0
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.