[RFE] Allow admin to mark agents down

Bug #1513144 reported by Carlos Goncalves on 2015-11-04
This bug affects 2 people

Affects: neutron | Importance: Wishlist | Assigned to: Unassigned

Bug Description

Cloud administrators have monitoring systems placed externally, watching different types of resources in their cloud infrastructures. A cloud infrastructure comprises not only an OpenStack deployment but also components not managed by, and possibly not visible to, OpenStack, such as SDN controllers and physical network elements.

External systems may detect a fault in one or more infrastructure resources that subsequently affects services provided by OpenStack. From a network perspective, an example of such a fault is Open vSwitch crashing on a compute node.

When using the reference implementation (OVS + neutron-l2-agent), the neutron-l2-agent will keep reporting its state to the Neutron server as alive (the heartbeat is sent; the service is up), even though there is an internal error caused by the virtual bridge (br-int) being unreachable. Through tools external to OpenStack that monitor Open vSwitch, the administrator knows something is wrong and, as a fault-management action, may want to explicitly set the agent state to down.

Such an action requires a new API exposed by Neutron allowing admins to set (true/false) the aliveness state of Neutron agents.
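
A minimal sketch of what such an admin API request might look like. The writable "alive" attribute is exactly what this RFE proposes and does not exist in Neutron; the URL path follows the existing v2.0 agents resource, but the schema here is an assumption for illustration only:

```python
import json

# Hypothetical request builder for the proposed API. The "alive" attribute
# is NOT a real writable field in Neutron; this only illustrates the shape
# such a call could take against the v2.0 agents resource.
def build_mark_agent_down_request(agent_id, alive=False):
    """Return (path, body) for a hypothetical PUT marking an agent down."""
    path = "/v2.0/agents/%s" % agent_id
    body = json.dumps({"agent": {"alive": alive}})
    return path, body

path, body = build_mark_agent_down_request("abc123", alive=False)
print(path)   # /v2.0/agents/abc123
print(body)   # {"agent": {"alive": false}}
```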

This feature request is in line with the work proposed for Nova [1] and implemented in Liberty. The same is also currently being proposed for Cinder [2].

[1] https://blueprints.launchpad.net/nova/+spec/mark-host-down
[2] https://blueprints.launchpad.net/cinder/+spec/mark-services-down

Tags: rfe
Changed in neutron:
assignee: nobody → Carlos Goncalves (cgoncalves)
Kyle Mestery (mestery) wrote :

Will review at drivers meeting next week.

Changed in neutron:
status: New → Triaged
Akihiro Motoki (amotoki) wrote :

I think it is a valid feature request.

We can use this information ("force down", for example) combined with the existing heartbeat timestamp,
and possibly reschedule agents based on it. Speaking as an OpenStack operator, this feature would be desirable.

The implementation would be simple.
I am not sure we need a spec proposal for this.

Miguel Angel Ajo (mangelajo) wrote :

Wouldn't "neutron agent-update --admin-state-down" do the same thing, or be usable for that?

Kevin Benton (kevinbenton) wrote :

No, setting --admin-state-down does not affect things already scheduled to the agent. It only prevents future things from being scheduled.
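
Kevin's distinction can be sketched with a toy scheduling filter (hypothetical data structures, not Neutron's actual scheduler code): admin_state_up only removes an agent from the candidate pool for new resources, while anything already hosted stays put.

```python
# Toy sketch, not Neutron code: admin_state_up == False excludes an agent
# from NEW scheduling, but resources already bound to it are untouched.
def schedulable_agents(agents):
    """Agents eligible to receive *new* resources."""
    return [a for a in agents if a["admin_state_up"] and a["alive"]]

agents = [
    {"host": "node-1", "admin_state_up": True,  "alive": True, "hosted": ["net-a"]},
    {"host": "node-2", "admin_state_up": False, "alive": True, "hosted": ["net-b"]},
]

# node-2 is excluded from new scheduling...
print([a["host"] for a in schedulable_agents(agents)])  # ['node-1']
# ...but net-b stays where it is; nothing reschedules it.
print(agents[1]["hosted"])  # ['net-b']
```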

If I understand this correctly, this boils down to the fact that today, if something goes wrong on the host the agent (L2 or otherwise) runs on, we can only assume a failure from the lack of heartbeat. However, the two should not be confused: the lack of heartbeat indicates a control-plane failure rather than a data-plane failure.

I think Carlos is saying that we really need more than what we have today. As of today, we can mark an agent's admin status down, and that takes the agent out of the scheduling fabric. The OVS agent is somewhat peculiar, because Neutron is not in charge of scheduling; Nova is. If we want to take the host out of the system, though, nova-manage service down does the trick. The host will no longer be used to schedule VMs.

So the question IMO is two-fold:

1) Do we enhance the agent ([1] is the schema of the agent table) to report a host-related health status? Not necessary, but nice.
2) Do we provide the ability to disable an agent based on a broken health status or any other criterion? We already have that with the admin-status-up flag. However, that doesn't really work for OVS, because the scheduling is beyond what Neutron does today.

I don't believe that 1) is really required, but it would be nice to clean up 2), and we could do that in the form of a simple bug report. Ultimately we'd make the system honor ADMIN_STATUS_UP=False for OVS, but then again, today when we create a port nothing happens, and the binding is initiated by the host only at a later time (once the VM is scheduled to the host).

[1] http://paste.openstack.org/show/478474/

So at this point, I am unclear on what this RFE asks for, because we *technically* already have a flag to mark agents down; it may just not work as you'd expect for certain types of agents, like L2.

@Carlos: could you please elaborate a bit more?

@Akihiro: I wouldn't go into rescheduling resources for L2; it will be incredibly messy if not coordinated with Nova.

Kevin Benton (kevinbenton) wrote :

@Armando, we can cause the binding to fail if the agent is in the admin_state_up == False condition. This will force Nova to reschedule if it tried to put a VM there. The following line would just check the admin state as well: https://github.com/openstack/neutron/blob/31f4c878cde38b240e76489ca2a2202ed53c495b/neutron/plugins/ml2/drivers/mech_agent.py#L69
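
Kevin's suggestion, sketched as loose pseudo-logic (the function and field names only mirror the linked mech_agent.py check informally; this is not the actual ML2 driver code): if binding fails because the agent is administratively down, Nova retries the VM elsewhere.

```python
# Loose sketch of Kevin's suggestion, NOT the real ML2 code: make port
# binding fail when the candidate agent has admin_state_up == False,
# so Nova is forced to reschedule the VM to another host.
def agent_usable(agent, heartbeat_expired):
    alive = not heartbeat_expired                  # today's check: heartbeat-based
    admin_up = agent.get("admin_state_up", True)   # proposed additional check
    return alive and admin_up

def try_bind_port(port, agents_on_host, expired_flags):
    for agent, expired in zip(agents_on_host, expired_flags):
        if agent_usable(agent, expired):
            return "bound:%s" % agent["host"]
    return "binding-failed"   # Nova would then reschedule the VM

print(try_bind_port({}, [{"host": "node-1", "admin_state_up": False}], [False]))
# binding-failed
```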

@Carlos, would that meet your use case?

Changed in neutron:
importance: Undecided → Wishlist

Retarget if you care.

Changed in neutron:
assignee: Carlos Goncalves (cgoncalves) → nobody
status: Triaged → Incomplete
tags: removed: rfe

It's been a week since we asked for feedback.

Akihiro Motoki (amotoki) wrote :

Sorry for the late reply. Let me share my thoughts. A bit long :-(

I am not sure the below is what the bug author really wants. Carlos, could you confirm?

--

First of all, I am afraid that the L2 agent is not a good example of a use case for the requested feature, because we don't have a rescheduling mechanism for the L2 agent that doesn't involve coordinating with Nova. That makes things complicated.

On the other hand, from the bug description, the requested feature is not limited to the L2 agent; it can be applied to other agents, including the L3 agent, the DHCP agent, and the LBaaS haproxy agent.

--

I have one useful use case in this context, something I want as an operator. Let me explain.

For these agents, an external monitoring system can detect a failure faster, or detect a failure of an agent that Neutron cannot detect at all. In such a case, operators want to reschedule the resources on the agent, and it would be nice if they could leverage the Neutron scheduler for that. To do so, we need a way to notify neutron-server of the failure of an agent.
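
Akihiro's workflow could look roughly like this (the notify callback is hypothetical; Neutron had no such API at the time of this thread): an external monitor detects a data-plane fault and reports the affected hosts so neutron-server can react.

```python
# Sketch of the external-monitor workflow. The notify_down callback stands
# in for the proposed (non-existent) Neutron API call; everything here is
# illustrative, not real Neutron code.
def monitor_step(agent_health, notify_down):
    """Call notify_down(host) for each agent the monitor sees as faulty."""
    notified = []
    for host, healthy in sorted(agent_health.items()):
        if not healthy:
            notify_down(host)
            notified.append(host)
    return notified

calls = []
print(monitor_step({"node-1": True, "node-2": False}, notify_down=calls.append))
# ['node-2']
```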

--

To notify Neutron of the failure of an agent, I think we need a new attribute, as proposed in the bug report.
Setting admin_state_up to False is not enough for this purpose.

In the context of agents with schedulers, admin_state_up=False indicates that an operator does not want resources to be scheduled to the agent AUTOMATICALLY anymore. Manual scheduling is still allowed (mainly for testing by operators).
(Optionally, existing resources on an agent with admin_state_up=False stay on the agent.)

If setting admin_state_up to False triggered agent rescheduling, we could no longer do manual scheduling on agents with admin_state_up=False, so overloading admin_state_up for this does not make sense to me.
A straightforward way is to introduce a new attribute ("force_down", for example) which forces "alive" to down.
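
Akihiro's proposal, sketched in a few lines (the "force_down" name is his example, not an implemented field; the 75-second window is Neutron's default agent_down_time): the reported aliveness combines the heartbeat check with the operator-set override.

```python
import datetime

# Sketch of the proposed "force_down" attribute (illustrative only):
# an agent is reported alive when its heartbeat is recent AND no operator
# has forced it down.
def is_agent_alive(heartbeat_ts, now, agent_down_time, force_down):
    heartbeat_ok = (now - heartbeat_ts) <= agent_down_time
    return heartbeat_ok and not force_down

now = datetime.datetime(2015, 11, 4, 12, 0, 0)
fresh = now - datetime.timedelta(seconds=10)
window = datetime.timedelta(seconds=75)   # Neutron's default agent_down_time

print(is_agent_alive(fresh, now, window, force_down=False))  # True
print(is_agent_alive(fresh, now, window, force_down=True))   # False: operator override wins
```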

--

I think what I discussed above is similar to the current Nova model.
The use cases and related operations may differ, but I think a similar model would be nice.

neutron admin_state_up <-> nova-compute enabled/disabled
neutron alive <-> nova-compute state (up/down)
neutron "force_down" <-> nova-compute "forced_down"

--

It may be better to file a separate RFE request. If so, I will file another RFE bug, but I think we can cover this in this RFE.
Thoughts?

Akihiro Motoki (amotoki) wrote :

I still think this is a valid RFE. Let me add the "rfe" tag again while keeping the status Incomplete.

tags: added: rfe
Carlos Goncalves (cgoncalves) wrote :

Motoki, I believe your understanding is in line with my feature proposal. I'm not specifically and exclusively targeting an l2-agent use case; it was rather an example of a failure that may happen. My intention with this RFE is to focus on Neutron agents in general, as I wrote in the original description ("allowing admins to set (true/false) the aliveness state of Neutron agents").

From an operator perspective, the 'alive' field reports the state of each Neutron agent. That state is expected to cover both data-plane and control-plane health; for example, control actions on a broken data plane will eventually fail or, worse, may leave the cloud in an inconsistent state. In production deployments, operators externally monitor the data plane of multiple Neutron agents, and being able to help Neutron detect faults by changing an agent's alive state would be ideal.

Henry Gessau (gessau) on 2015-11-24
summary: - Allow admin to mark agents down
+ [RFE] Allow admin to mark agents down
Carl Baldwin (carl-baldwin) wrote :

Putting agents in a state that prevents automatic scheduling but allows manual scheduling is very important for operators. At one point, it wasn't this way and I wanted it to be. Since then, I lost track of any changes to this and I'm honestly not sure what admin_state_up does today. If I understand the comments correctly, we have this behavior today. Is it correct that admin_state_up disables automatic scheduling but allows manual scheduling and the agent is otherwise healthy?

If the answer is yes then I think there may be a valid RFE here. Is the gap just that there is no way to kick something back in to the scheduler? If we could mark the agent admin_state_up=False and then kick all of its routers/networks/etc. back in to the automatic scheduler, would that solve it?

In the case of the L2 agent, shouldn't it be the higher level resources that get rescheduled (router, network, vm, etc)?

Just to make sure we are all on the same page.

* Today Neutron provides automatic failover for both L3 and DHCP agents whose admin status is up [1, 2].
* As an operator, I can disable the automatic failover if I want to implement some out-of-band failover mechanism.
* If I set admin_status to DOWN, the agent is ignored, regardless of the failover mechanism
* We have an 'alive' flag associated with agents, but it is dynamically set based on the last heartbeat.

Carlos, you said and I quote you:

[..Such action requires a new API exposed by Neutron allowing admins to set (true/false) the aliveness state of Neutron agents..]

The 'alive' flag cannot be overridden by the admin, because it means: the Neutron agent (whichever it may be) is alive and kicking. If you want to mark an agent as down, that's what admin_status is for.

[1] https://github.com/openstack/neutron/blob/master/etc/neutron.conf#L243
[2] https://github.com/openstack/neutron/blob/master/etc/neutron.conf#L238

I tried to provide more context, and I would like to understand why admin_status doesn't address the author's need.

If the aim of this RFE is simply to add a new field describing a user-driven health status of a host to drive some out-of-band orchestration, I think it would lead to confusion, because it overlaps with the 'admin_status' and 'alive' attributes.

One crazy thought would be to extend tags [1] to encompass agents too. Then you can tag your agents any way you see fit.

[1] https://review.openstack.org/#/c/216021/

Kevin Benton (kevinbenton) wrote :

Would an 'evacuate' action for an agent achieve the use case you are looking for? With that call you can have your monitoring script call 'evacuate' whenever it deems the data plane (or whatever you are monitoring) unhealthy.

I talked to Carlos offline (hoping he would bring the conversation here), but that's exactly what I suggested to him: an evacuate action that marks the agent down and triggers rescheduling to other available agents. Overloading the meaning of 'alive' or introducing a new field is not the right way to go about this.

I think Carl is also onboard with this idea.
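
The evacuate idea discussed here could be sketched as follows (entirely hypothetical; no such action existed in Neutron at the time of this thread): mark the agent down and kick each of its hosted resources back through the scheduler.

```python
# Hypothetical sketch of an agent "evacuate" action, not real Neutron code:
# take the agent out of scheduling, then move each hosted resource to a
# target chosen by the supplied scheduler callback.
def evacuate_agent(agent, agents, schedule):
    """Mark `agent` down and reschedule its resources to other up agents."""
    agent["admin_state_up"] = False
    moved = {}
    for resource in list(agent["hosted"]):       # copy: we mutate while iterating
        agent["hosted"].remove(resource)
        candidates = [a for a in agents
                      if a is not agent and a["admin_state_up"]]
        target = schedule(resource, candidates)
        target["hosted"].append(resource)
        moved[resource] = target["host"]
    return moved

a1 = {"host": "net-1", "admin_state_up": True, "hosted": ["router-a"]}
a2 = {"host": "net-2", "admin_state_up": True, "hosted": []}
print(evacuate_agent(a1, [a1, a2], lambda r, cands: cands[0]))
# {'router-a': 'net-2'}
```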

Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired