Scheduler Failing for Multiple L3 Agents in Grizzly

Bug #1154622 reported by Daneyon Hansen
This bug affects 6 people
Affects: neutron
Status: Fix Released
Importance: Low
Assigned to: Jaume Devesa
Milestone: 2013.2

Bug Description

I am following these documents to test multiple DHCP and L3 Agents:

openstack-manuals/doc/src/docbkx/openstack-network-connectivity-admin/app_demo_multi_dhcp_agents.xml

openstack-manuals/doc/src/docbkx/openstack-network-connectivity-admin/app_demo_multi_l3_agents.xml

Everything went well with testing DHCP, but I am having problems with L3.
I have replicated the setup in the documentation. Everything is fine
until I re-enable the L3 agent on HostA and attempt the following command:

quantum l3-agent-router-add <HostA_L3_Agent_ID> router1

Here is the error:

root@control03:~# quantum l3-agent-router-add 0d979401-8113-4e6e-9970-8ba46a7343e5 router1
Failed scheduling router b6146778-1a07-4518-82a5-480f203bc3fe to the L3 Agent 0d979401-8113-4e6e-9970-8ba46a7343e5.

Here is a more detailed workflow:

http://pastebin.com/HHuZjE5Q

Even though logging is set to verbose, I am seeing nothing in the logs. Attached are configuration files of the environment. Let me know if you have any troubleshooting suggestions. Thanks!

Tags: l3-ipam-dhcp
Daneyon Hansen (danehans) wrote :
tags: added: l3-ipam-dhcp
yong sheng gong (gongysh) wrote :

l3-agent-router-add only works when no other L3 agent is hosting the router. Can you run l3-agent-list-hosting-router router1 and show us the result?
Also, pastebin.com deletes pastes after a certain time, so could you paste the related server-side log here as well?

Gary Kotton (garyk)
Changed in quantum:
status: New → Incomplete
Daneyon Hansen (danehans) wrote :

That sounds like the issue then:

root@control03:~# quantum l3-agent-list-hosting-router router1
+--------------------------------------+-----------+----------------+-------+
| id | host | admin_state_up | alive |
+--------------------------------------+-----------+----------------+-------+
| 32d1be16-0ab5-4ae2-a026-a4b70009af08 | control02 | True | :-) |
+--------------------------------------+-----------+----------------+-------+

The documentation says to add a 2nd L3 agent for a router with the l3-agent-router-add command. If this is not the case, what is the proper workflow for having multiple L3 agents service a router? Do I need to set the admin state for both L3 agents to false, create the router, and manually add both L3 agents to the router? Thanks for your help.

Robert van Leeuwen (rovanleeuwen) wrote :

Daneyon, did you get this to work?

It seems I'm hitting the same issue.
Adding a second L3 agent always results in a "Failed scheduling router" message.
I've also tried disabling the existing agent first (this results in an "is not a L3 Agent or has been disabled" error).

Daneyon Hansen (danehans) wrote :

I did not. I found that running concurrent L3 agents for a tenant network was not implemented. In Grizzly, only a single L3 agent can be connected to a network at a time.
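
For what it's worth, the only workflow that seems possible in Grizzly is moving the router between agents rather than sharing it. A sketch, reusing the agent IDs from earlier in this thread (the workflow is assumed, not taken from the docs):

# move router1 from the agent on control02 to the agent on control03
quantum l3-agent-router-remove 32d1be16-0ab5-4ae2-a026-a4b70009af08 router1
quantum l3-agent-router-add 0d979401-8113-4e6e-9970-8ba46a7343e5 router1
quantum l3-agent-list-hosting-router router1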

Jian Wen (wenjianhn)
Changed in quantum:
status: Incomplete → Confirmed
assignee: nobody → Jian Wen (wenjianhn)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to quantum (master)

Fix proposed to branch: master
Review: https://review.openstack.org/31009

Changed in quantum:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/31422

Jian Wen (wenjianhn)
Changed in quantum:
status: In Progress → Confirmed
assignee: Jian Wen (wenjianhn) → nobody
yong sheng gong (gongysh) wrote :

I think we can just fix it by updating the message to something like 'Failed scheduling router b6146778-1a07-4518-82a5-480f203bc3fe to the L3 Agent 0d979401-8113-4e6e-9970-8ba46a7343e5 since it is hosted on another L3 agent.'

Changed in quantum:
importance: Undecided → Low
milestone: none → havana-2
Changed in neutron:
milestone: havana-2 → havana-3
Jaume Devesa (devvesa)
Changed in neutron:
assignee: nobody → Jaume Devesa (devvesa)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/38599

Changed in neutron:
status: Confirmed → In Progress
ZhiQiang Fan (aji-zqfan) wrote :

@Daneyon Hansen (danehans) #3:

In neutron/scheduler/l3_agent_scheduler.py, class ChanceScheduler has the method auto_schedule_routers(), which is called by class L3AgentSchedulerDbMixin.add_router_to_l3_agent() in neutron/db/agentschedulers_db.py.

auto_schedule_routers() has the following code at the beginning:
        with context.session.begin(subtransactions=True):
            # query if we have valid l3 agent on the host
            query = context.session.query(agents_db.Agent)
            query = query.filter(agents_db.Agent.agent_type ==
                                 constants.AGENT_TYPE_L3,
                                 agents_db.Agent.host == host,
                                 agents_db.Agent.admin_state_up == True)
            try:
                l3_agent = query.one()
            except (exc.MultipleResultsFound, exc.NoResultFound):
                LOG.debug(_('No enabled L3 agent on host %s'),
                          host)
                return False

If I am right, when you try to do this with multiple L3 agents on the same host, query.one() raises exc.MultipleResultsFound, which is caught, False is returned, and that in turn causes RouterSchedulingFailed to be raised.

So there may be some problems (if I'm wrong, please correct me) with the auto_schedule_routers() method and add_router_to_l3_agent():
1) this code doesn't separate the two exceptions, which may be confusing (see the sketch after this list)
2) this code doesn't allow multiple L3 agents on the same host
3) RouterSchedulingFailed cannot tell which `False` was returned by auto_schedule_routers()
4) RouterSchedulingFailed may be raised in the wrong place
5) RouterSchedulingFailed could be more precise for this case
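
If that is right, a minimal sketch of what point 1 could look like (illustration only, not a proposed patch), reusing the names from the snippet above:

            try:
                l3_agent = query.one()
            except exc.MultipleResultsFound:
                # more than one enabled L3 agent on this host; a different
                # situation than having none, so report it separately
                LOG.debug(_('Multiple enabled L3 agents on host %s'), host)
                return False
            except exc.NoResultFound:
                LOG.debug(_('No enabled L3 agent on host %s'), host)
                return False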

Daneyon Hansen (danehans) wrote :

@Zhiqiang Fan (aji-zqfan)

I am trying to use multiple L3 agents across multiple hosts to provide HA for Quantum networks. I cannot comment on the code details, as I am an OpenStack operator and not a developer.

Philip Smith (philip-smith-r) wrote :

I would also like to request this feature, since without it any instance inside OpenStack is always exposed to this single point of failure. Neutron should be able to be highly available, just as a real network would be, with multiple devices in an active/active or active/standby configuration.

From a network engineer's perspective, and while I completely understand that 'we shouldn't consider our applications precious', I still think it would aid adoption of the platform if it could actually provide highly available networking. So with all due respect gongysh, please don't fix this by changing the error message; please make this a feature ;o)

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/38599
Committed: http://github.com/openstack/neutron/commit/5db010a9a737356e426e6f2e10c8fcafb36c533c
Submitter: Jenkins
Branch: master

commit 5db010a9a737356e426e6f2e10c8fcafb36c533c
Author: Jaume Devesa <email address hidden>
Date: Wed Jul 24 16:54:09 2013 +0200

    Unify exception thrown in l3-agent-scheduler fails

    Since you can only attach a single l3 agent to a router, when you try
    to add another l3 agent to a router that already have one, the l3
    agent scheduler raises an exception.

    This fix removes the discrimination by id: either it is the same agent
    or another one, the router can not be hosted and the same exception is
    raised.

    Change-Id: If832bbd4bf17e4e0c4720172aded4c9fffedc6fc
    Fixes: bug #1154622
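
In other words, the scheduler-side check no longer cares whether the router is already hosted by this agent or by a different one; a rough sketch of the idea (not the merged patch, and the binding lookup helper below is hypothetical):

    # illustration only: the same RouterSchedulingFailed (the exception named
    # earlier in this thread) is raised regardless of which agent holds the binding
    binding = self._get_l3_binding(context, router_id)  # hypothetical helper
    if binding is not None:
        raise RouterSchedulingFailed(router_id=router_id, agent_id=agent_id)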

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Daneyon Hansen (danehans) wrote :

So, does this fix allow Neutron routers to be hosted by multiple L3 agents or does it only change the error message?

Jian Wen (wenjianhn) wrote :

Only the error message.

Thierry Carrez (ttx)
Changed in neutron:
milestone: havana-3 → 2013.2