No routers created after l3-agent start - error during L3NATAgentWithStateReport.periodic_sync_routers_task

Bug #1641879 reported by Tore Anderson
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
High
Pepijn Oomen

Bug Description

After (re)starting the L3 agent, a failure during L3NATAgentWithStateReport.periodic_sync_routers_task was encountered. This resulted in a Python traceback being dumped to the log ending with "TypeError: 'NoneType' object is not iterable". The error repeats every minute or so. (See below.)

While the error persists, no routers are provisioned by the L3 agent. It just keeps failing and failing and failing in an seemingly infinite loop.

The fix/workaround was to remove a random router from the L3 agent having problems, like so:

$ neutron l3-agent-router-remove <uuid-of-l3-agent-having-problems> <uuid-of-some-router>

It did not seem to matter exactly which of the many routers scheduled to run on the problematic L3 agent that was removed in this way. The removal itself seemed to get rid of the blockage, allowing the L3 agent to start normal operations shortly after.

Excerpts from l3-agent.log:

2016-11-15 03:14:26.201 2988 INFO neutron.common.config [-] Logging enabled!
2016-11-15 03:14:26.201 2988 INFO neutron.common.config [-] /usr/bin/neutron-l3-agent version 8.3.0
2016-11-15 03:14:26.504 2988 INFO eventlet.wsgi.server [-] (2988) wsgi starting up on http:/var/lib/neutron/keepalived-state-change
2016-11-15 03:14:26.548 2988 INFO neutron.agent.l3.agent [-] Agent has just been revived. Doing a full sync.
2016-11-15 03:14:26.781 2988 INFO neutron.agent.l3.agent [-] L3 agent started
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task [req-97b76ec8-9eb6-4e23-8074-63c889c3e199 - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task Traceback (most recent call last):
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task task(self, context)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 545, in periodic_sync_routers_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task self.fetch_and_sync_all_routers(context, ns_manager)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 556, in fetch_and_sync_all_routers
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task self.plugin_rpc.get_router_ids(context))
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 108, in get_router_ids
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task return cctxt.call(context, 'get_router_ids', host=self.host)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/common/rpc.py", line 136, in call
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task return self._original_context.call(ctxt, method, **kwargs)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task retry=self.retry)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task timeout=timeout, retry=retry)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task retry=retry)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 461, in _send
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task raise result
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task TypeError: 'NoneType' object is not iterable
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task Traceback (most recent call last):
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 138, in _dispatch_and_reply
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task incoming.message))
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 185, in _dispatch
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task return self._do_dispatch(endpoint, method, ctxt, args)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 127, in _do_dispatch
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task result = func(ctxt, **new_args)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/api/rpc/handlers/l3_rpc.py", line 75, in get_router_ids
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task router_ids=None)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/db/l3_agentschedulers_db.py", line 525, in auto_schedule_routers
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task self, context, host, router_ids)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 148, in auto_schedule_routers
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task plugin, context, l3_agent)
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/neutron/scheduler/l3_agent_scheduler.py", line 314, in _schedule_ha_routers_to_addition
al_agent
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task for router in schedulable_routers:
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task TypeError: 'NoneType' object is not iterable
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task
2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task

[ some time pass ]

2016-11-15 03:15:08.152 2988 ERROR oslo_service.periodic_task [req-97b76ec8-9eb6-4e23-8074-63c889c3e199 - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task
2016-11-15 03:15:08.152 2988 ERROR oslo_service.periodic_task Traceback (most recent call last):
2016-11-15 03:15:08.152 2988 ERROR oslo_service.periodic_task File "/usr/lib/python2.7/site-packages/oslo_service/periodic_task.py", line 220, in run_periodic_tasks
2016-11-15 03:15:08.152 2988 ERROR oslo_service.periodic_task task(self, context)
[ ... ]

[ "neutron l3-agent-router-remove <l3-agent-uuid> d95d9198-9960-4f2c-b33f-d6211ef1a8a3" is run]

2016-11-15 03:15:50.441 2988 WARNING neutron.agent.l3.agent [-] Info for router d95d9198-9960-4f2c-b33f-d6211ef1a8a3 was not found. Performing router cleanup
2016-11-15 03:17:23.102 2988 WARNING oslo.service.loopingcall [req-97b76ec8-9eb6-4e23-8074-63c889c3e199 - - - - -] Function 'neutron.service.Service.periodic_tasks' run outlasted interval by
 15.31 sec
2016-11-15 03:17:23.174 2988 INFO oslo_rootwrap.client [-] Spawned new rootwrap daemon process with pid=9015
2016-11-15 03:17:23.634 2988 INFO neutron.agent.l3.ha [-] Router 19258eff-2755-41b8-aefb-1c9b2af991f4 transitioned to backup
[ ... ]

At this point, the L3 agent starts operating normally, creating all the routers it's scheduled to run and so on.

Revision history for this message
Victor Morales (electrocucaracha) wrote :

Hey Tore,

I have a question regarding this issue, before you restarted the L3 agent didn't you change the hostname or change the host value of the configuration file. I asked you this because it seems like this call is returning an empty list from the database.

2016-11-15 03:14:28.297 2988 ERROR oslo_service.periodic_task return cctxt.call(context, 'get_router_ids', host=self.host)

By other hand, in order to reproduce this can you provide the release version and the configuration that you used it.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Tore Anderson (toreanderson) wrote :

Hi Victor. This was on Mitaka (as the log shows it's neutron-l3-agent version 8.3.0). However a colleague has also managed to reproduce it in our test environment which runs Newton.

His preliminary conclusion is that the bug occurs when /var/lib/neutron/ha_confs does not exist when neutron-l3-agent starts. This will always the case after a reboot of our network nodes, since they boot off the network and make a fresh install into a RAM-backed filesystem during every boot.

Maybe that is already enough info for you to pinpoint the bug? If not my colleague will supply more details in the coming days as he completes his analysis.

Tore

Revision history for this message
Pepijn Oomen (pjoomen) wrote :

I am still working on analysis of this bug, but it seems to be easily reproducable and does not seem to be (directly) related to persistent storage being wiped. I currently have the following scenario:

1. Boot the network node (failover node)
2. Notice the following in /var/log/neutron/l3-agent.log

ERROR oslo_service.periodic_task [req-8b6c1048-1a4a-484c-bdf6-e069f478a144 - - - - -] Error during L3NATAgentWithStateReport.periodic_sync_routers_task

3. Fix the problem by removing the router from the l3-agent

$ neutron l3-agent-router-remove <agent id> <router id>

4. Notice the following lines in /var/log/neutron/l3-agent.log

WARNING neutron.agent.l3.agent [-] Info for router a94b7f00-36ba-4944-bc48-025473758224 was not found. Performing router cleanup
INFO neutron.agent.l3.ha [-] Router a94b7f00-36ba-4944-bc48-025473758224 transitioned to backup

6. The system now performs as expected
5. Now, stop the l3-agent

# systemctl stop neutron-l3-agent

6. Start the l3-agent

# systemctl start neutron-l3-agent

7. Notice the same error as (2) re-appears

Revision history for this message
Pepijn Oomen (pjoomen) wrote :

It turns out this problem is caused by using:

router_scheduler_driver = neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler

and an unexpected NoneType returned by _get_routers_can_schedule called from within _schedule_ha_routers_to_additional_agent.

Pepijn Oomen (pjoomen)
Changed in neutron:
assignee: nobody → Pepijn Oomen (pjoomen)
status: Incomplete → Fix Committed
Changed in neutron:
status: Fix Committed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/400233

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/newton)

Change abandoned by Pepijn Oomen (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/400192
Reason: Replaced with https://review.openstack.org/400233

Changed in neutron:
importance: Undecided → High
milestone: none → ocata-2
tags: added: l3-bgp
tags: added: l3-ipam-dhcp
removed: l3-bgp
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/400233
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62176a9b40619327aacee9ed4162252d1245d019
Submitter: Jenkins
Branch: master

commit 62176a9b40619327aacee9ed4162252d1245d019
Author: Pepijn Oomen <email address hidden>
Date: Mon Nov 21 12:07:45 2016 +0100

    Solve unexpected NoneType returned by _get_routers_can_schedule.

    Solve a problem with an unexpected NoneType returned by
    _get_routers_can_schedule called from within
    _schedule_ha_routers_to_additional_agent when using:

    router_scheduler_driver =
        neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler

    This was leading to problems with starting neutron-l3-agent on network
    nodes, causing HA routers to fail to start.

    Closes-Bug: #1641879
    Change-Id: I33c5a6214670f0ada9c2293b0eb2ff243f6f7b1b

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b2

This issue was fixed in the openstack/neutron 10.0.0.0b2 development milestone.

Revision history for this message
LIU Yulong (dragon889) wrote :

I've reported a new bug for mitaka and liberty:
https://bugs.launchpad.net/neutron/+bug/1654128

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/418644

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/418644
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dd6fd625391d310040118ae5c3575ff6cff6e833
Submitter: Jenkins
Branch: stable/mitaka

commit dd6fd625391d310040118ae5c3575ff6cff6e833
Author: Pepijn Oomen <email address hidden>
Date: Mon Nov 21 12:07:45 2016 +0100

    Solve unexpected NoneType returned by _get_routers_can_schedule.

    Solve a problem with an unexpected NoneType returned by
    _get_routers_can_schedule called from within
    _schedule_ha_routers_to_additional_agent when using:

    router_scheduler_driver =
        neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler

    This was leading to problems with starting neutron-l3-agent on network
    nodes, causing HA routers to fail to start.

    Closes-Bug: #1641879
    Related-Bug: #1654128
    Change-Id: I33c5a6214670f0ada9c2293b0eb2ff243f6f7b1b
    (cherry picked from commit 62176a9b40619327aacee9ed4162252d1245d019)

tags: added: in-stable-mitaka
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/400192
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc91bcef8403eefaea012658a7ac44c5cbbef0a1
Submitter: Jenkins
Branch: stable/newton

commit bc91bcef8403eefaea012658a7ac44c5cbbef0a1
Author: Pepijn Oomen <email address hidden>
Date: Mon Nov 21 12:07:45 2016 +0100

    Solve unexpected NoneType returned by _get_routers_can_schedule.

    Solve a problem with an unexpected NoneType returned by
    _get_routers_can_schedule called from within
    _schedule_ha_routers_to_additional_agent when using:

    router_scheduler_driver =
        neutron.scheduler.l3_agent_scheduler.AZLeastRoutersScheduler

    This was leading to problems with starting neutron-l3-agent on network
    nodes, causing HA routers to fail to start.

    Closes-Bug: #1641879
    Change-Id: I33c5a6214670f0ada9c2293b0eb2ff243f6f7b1b
    (cherry picked from commit 62176a9b40619327aacee9ed4162252d1245d019)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.2.0

This issue was fixed in the openstack/neutron 9.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 8.4.0

This issue was fixed in the openstack/neutron 8.4.0 release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.