L3 and DHCP agents should be corosync clones

Bug #1359082 reported by Kevin Benton
Affects            | Status       | Importance | Assigned to      | Milestone
Fuel for OpenStack | Fix Released | High       | Sergey Vasilenko |
5.1.x              | Won't Fix    | High       | Sergey Vasilenko |
6.0.x              | Fix Released | High       | Sergey Vasilenko |

Bug Description

The current corosync configuration for the Neutron L3 and DHCP agents only allows one of each to run at a time. This is not how DHCP and L3 agents are intended to be deployed: there should be many of both types active to share the load.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/115529

Changed in fuel:
assignee: nobody → Kevin Benton (kevinbenton)
status: New → In Progress
Mike Scherbakov (mihgen)
Changed in fuel:
milestone: none → 5.1
Changed in fuel:
importance: Undecided → Medium
milestone: 5.1 → 6.0
Revision history for this message
Sergey Vasilenko (xenolog) wrote :

To run multiple L3 and DHCP agents we would have to fully refactor the network rescheduling mechanism.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Kevin, what you are requesting is an essential change to the Neutron architecture. It looks very much like a feature. Unfortunately, we cannot accept it for 5.1, as 5.1 is a month past the Feature Freeze stage. We would appreciate it if you created a blueprint for the 6.0 release at https://blueprints.launchpad.net/fuel, where we could discuss all the details of the implementation.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

There is nothing to refactor in this case. Fuel just needs to deploy the agents correctly and then the current HA mechanisms and the ones coming in Juno will work correctly.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

This is NOT a change to the Neutron architecture. Fuel is broken in the way it deploys the agents. It will be even more broken once the HA work in Juno lands, since that work expects L3 agents running on every compute node.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I am not going to file a blueprint for this. Fuel is broken for any large-scale Neutron deployment. This is a major bug that should at least be fixed in master, and a backport should be strongly considered.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Kevin, what we are concerned about is the L3 agent running the same IP address on two hosts, which would ultimately break communication between the VMs and the L3 agent. Neutron DVR is something we are considering supporting in the 6.x release, but it has nothing to do with Icehouse right now, as Icehouse does not have DVR.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

That's not how the L3 agent works, so that won't be a problem. Neutron schedules a given router to only one L3 agent. Two agents will not use the same IP.

DVR is orthogonal to this. Fuel is currently broken even without it, because it prevents different routers from being spread across L3 agents. The same applies to DHCP.

See the link in my patch to the scheduling configuration options in Neutron. None of those options would exist if Neutron only worked with one of each agent type.
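
For illustration, these are the kinds of scheduler options I mean, as a sketch of an Icehouse-era neutron.conf (the values shown are placeholders, not any particular deployment's settings):

network_scheduler_driver = neutron.scheduler.dhcp_agent_scheduler.ChanceScheduler
router_scheduler_driver = neutron.scheduler.l3_agent_scheduler.ChanceScheduler
dhcp_agents_per_network = 2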

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

What Kevin is trying to achieve here is simply the use of a feature that has been in Neutron since Grizzly: agent scheduling.

It just distributes the DHCP/router load between different hosts. It's not DVR or VRRP; as was said, it's totally unrelated to those features.

Fuel should have had this for 1.5 years already, and I honestly don't see any reason why we should not add it.
It might be too late for 5.1, for sure, but we should definitely add it in a future release.

I'm not advocating the exact solution proposed in the patch (since I don't have enough expertise in Fuel), but this really is how we need to deploy Neutron.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I think I need to clarify more here. The way Fuel deploys L3 agents is not an active/active scenario where multiple agents host the same router. Each agent shows up independently in neutron agent-list.

| id                                   | agent_type         | host               | alive | admin_state_up |
| ba6e4fc1-7a08-47ed-b974-1283b3da3a8c | Open vSwitch agent | node-29.domain.tld | :-)   | True           |
| bc5bf388-7c7e-4456-aeac-85360193bbfd | L3 agent           | node-25.domain.tld | :-)   | True           |
| bcad7d6b-42b1-43bb-a719-d3de1462373a | Open vSwitch agent | node-25.domain.tld | :-)   | True           |
| cb4e5008-21fd-47c2-843d-4642d63a8ed4 | L3 agent           | node-28.domain.tld | xxx   | True           |

Fuel is configuring the agents the way the default operation is intended: there are many active agents, and a router is scheduled to one of them on creation. However, it is blocking the agents from actually starting on all but one node.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Here are two commands you can run to see what my change will do.

crm configure clone clone_p_neutron-dhcp-agent p_neutron-dhcp-agent meta interleave="true" is-managed="true" target-role="Started"
crm configure clone clone_p_neutron-l3-agent p_neutron-l3-agent meta interleave="true" is-managed="true" target-role="Started"
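
If the clones are in place, something like the following should confirm it (a sketch; the exact output varies by deployment):

# the Pacemaker clone sets should show the agents started on every controller
crm status
# every controller should now report a live (:-)) L3 and DHCP agent
neutron agent-list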

Changed in fuel:
importance: Medium → High
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Kevin, anyway, thank you for your contribution. I do understand perfectly what you are trying to convince us to do, and I think we should support the multiple-agents feature. What we will additionally need to do is test your solution and ensure that Neutron really does handle agent scheduling well, including failover test cases, e.g. when you kill one of the controllers. We saw that in some cases Neutron did not even try to send messages to, or have them consumed by, the newer agent after a failover, and we implemented a tool that does this manually along with deleting the dead agent from the database. We will take your patch as a base and test it. If it works, we will certainly add it to the 5.1.1 release, along with documentation on how to manually apply the patch to existing 5.1 environments.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

The current Fuel solution is worse than the default Neutron deployment. The custom approach it takes to restrict routers to one agent makes Neutron unusable at any real deployment scale. One tenant with heavy East-West traffic between networks could clog the entire OpenStack deployment, since one node is turned into a bottleneck for what could easily be thousands of VMs and networks.

You can query the Neutron database for dead agents with routers assigned to them and reschedule those routers to live agents. There is no need to keep only one agent online instead of all of them.
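
A minimal sketch of such a manual reschedule with the stock neutron CLI (the agent and router IDs below are placeholders):

# find agents whose alive column shows xxx
neutron agent-list
# list the routers still bound to the dead L3 agent
neutron router-list-on-l3-agent <dead-agent-id>
# move each router to a live agent
neutron l3-agent-router-remove <dead-agent-id> <router-id>
neutron l3-agent-router-add <live-agent-id> <router-id>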

Revision history for this message
Kevin Benton (kevinbenton) wrote :

It's worth noting that Neutron doesn't fail over L3 agents by default, because doing so can cause IP conflicts if the agent fails but leaves its namespaces configured. This is why the feature was only just added to Neutron, and why it is disabled by default.

https://launchpad.net/bugs/1174591

https://review.openstack.org/#/c/110893/
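
For reference, my understanding is that the rescheduling added by that review is gated by a neutron.conf option along these lines (option name as I recall it from the Juno patch; treat it as an assumption):

# Juno and later: automatic rescheduling of routers away from dead L3 agents
allow_automatic_l3agent_failover = False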

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Kevin, thank you for pointing us at this fix. This is what we have always been concerned about: network conflicts. And there was no way to handle this in Neutron for a long time. I agree with you that having only one L3 agent hurts network performance and the usability of Neutron. We will test your fix and, if it works, start using it in either 5.1/5.0.2 or 5.1.1. Thank you for your contribution and patience.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

It seems there is no interest in fixing this for 5.1.x. My patch has had a -2 for 45 days with no activity. Can you mark this bug as "Won't Fix" if that's the intention?

This is a critical issue with Fuel for us, and we have to run a script after installation that does what my patch does so the deployment can scale. It would be helpful to know if this is something we will have to keep doing until Fuel 6.

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Kevin, we are now going to migrate to the upstream Neutron manifests.

We have not forgotten your idea, and we have blueprints to implement multiple L3 agents, multiple L3 agents with HA (VRRP), and multiple L3 agents with DVR. We plan to return to the work on "multiple l3 agents" in about 2-3 weeks, after we finish adapting and stabilizing the upstream code.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

L3 agents with HA (VRRP or DVR) are features of Juno. I was hoping to at least allow multiple L3 agents for Icehouse deployments. It sounds like this is something you will not be supporting. Am I understanding correctly?

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

Yes, we plan to support multiple L3 agents in Fuel 6.x.
We do not plan to change the behavior of the Neutron agents in 5.0.x and 5.1.x.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

A fix for this in the 5 series is not being considered at this time.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Kevin Benton (<email address hidden>) on branch: master
Review: https://review.openstack.org/115529
Reason: This fix is not being accepted for 5.x.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Thanks for letting me know. Please mark it as "won't fix" for the 5 series then.

I still have no idea why the Fuel team is insisting that one L3 agent or DHCP agent is optimal under any conditions. Everything is being concentrated into a single point of failure and a performance choke point. Failover is still not handled by Fuel; it only starts the agent process on another node, where it should have been running all along.

The change I posted works. There are no other changes required. We have tested this in our lab many times.

The current behavior is exactly the same as restricting Nova to running one compute node at a time. It intentionally breaks Neutron, and there doesn't seem to be a reason for it.

IP conflicts will never happen without something performing a failover, and no such mechanism exists in Icehouse. It would have to be done manually by an external process, which Fuel is not doing.

Can you please explain why this decision was made in the first place and why it's being kept this way? It's very frustrating to see a deployment tool actively stop Neutron from being deployed the way it was intended.

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Marked as Won't Fix for 5.1.x since that release series is in maintenance mode and this is a functional change that is not suitable for backport. I still think that it should be considered for 6.0.

Kevin: Can you confirm that the patch linked above eliminates the possibility of IP conflict in combination with the patch you proposed for Fuel here? Do I understand correctly that it was merged to Juno but that the problem still exists in Icehouse (further confirming that your patch belongs in Juno based 6.0 and not in Icehouse based 5.1.x)?

Vladimir and Sergey: Can you reconsider this patch for 6.0, respond to Kevin's comment #23, and, if you agree with his reasoning, un-abandon and un-reject his patch? It is already targeted at the master branch (6.0), but it is old enough that it may need a rebase.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Dmitry, there is no possibility of an IP conflict to begin with if the L3 agents are deployed normally. In the current Neutron code, a router is scheduled to only one L3 agent. If that L3 agent goes down, there is no automatic move to a new agent, so there is no risk of an IP conflict. The only way a conflict could occur is if two agents were configured with the same hostname so that they both received the same RPC messages, which is not what they do by default, and Fuel has not configured them that way either.

The patch I referenced adds the ability to automatically reschedule routers to a new agent if an agent goes down, but that is only available in Juno and is disabled by default.
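
A quick way to check the one-router-to-one-agent binding on a running deployment (a sketch; the IDs are placeholders):

# shows exactly one L3 agent hosting the router
neutron l3-agent-list-hosting-router <router-id>
# the DHCP equivalent for a network
neutron dhcp-agent-list-hosting-net <network-id>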

Revision history for this message
Sergey Vasilenko (xenolog) wrote :

The 6.0 release contains the "multiple l3 agents" feature as a must-have. Work on implementing this feature is in progress. We are also migrating to the upstream Neutron manifests. The patch from #22 is incompatible with this implementation.

However, the "multiple l3 agents" feature provides the functionality requested by Kevin.

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

Verified that the L3 agents are now clones on:
{

    "build_id": "2014-12-26_14-25-46",
    "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
    "build_number": "58",
    "auth_required": true,
    "api": "1.0",
    "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
    "production": "docker",
    "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
    "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
    "feature_groups": [
        "mirantis"
    ],
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "build_id": "2014-12-26_14-25-46",
                "ostf_sha": "a9afb68710d809570460c29d6c3293219d3624d4",
                "build_number": "58",
                "api": "1.0",
                "nailgun_sha": "5f91157daa6798ff522ca9f6d34e7e135f150a90",
                "production": "docker",
                "fuelmain_sha": "81d38d6f2903b5a8b4bee79ca45a54b76c1361b8",
                "astute_sha": "16b252d93be6aaa73030b8100cf8c5ca6a970a91",
                "feature_groups": [
                    "mirantis"
                ],
                "release": "6.0",
                "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"
            }
        }
    },
    "fuellib_sha": "fde8ba5e11a1acaf819d402c645c731af450aff0"

}

Changed in fuel:
status: Fix Committed → Fix Released