[RFE] neutron resource health check

Bug #1817872 reported by LIU Yulong
This bug affects 1 person
Affects   Status    Importance  Assigned to  Milestone
neutron   Opinion   Wishlist    Unassigned

Bug Description

Problem Description
===================
How do we troubleshoot when a VM loses connectivity? How do we find out why a floating IP is unreachable?
There is no easy way. Cloud operators have to dump the flows or iptables rules and then work out which parts were not set up properly. When there is a huge number of flows or rules, the output is not human-readable, so how do we find out what happened to a given port? When there are many iptables rules, how do we find out why a floating IP is not reachable? When many routers are hosted on the same agent node, how do we find out why one router is not up?
Each of these tasks is unfriendly to humans, and people make mistakes. But every resource is set up through a known processing procedure, so we can follow that workflow and let the machine do the status checking, troubleshooting, and recovery for us.

Proposed Change
===============
This work targets the community goal "Service-side health checks":
http://lists.openstack.org/pipermail/openstack-discuss/2018-December/000558.html

We also already have a troubleshooting blueprint:
https://blueprints.launchpad.net/neutron/+spec/troubleshooting
but it does not seem to have made much progress.

Overview
--------
Add APIs, CLI tools, and agent-side functions to check resource status.

Basic plan:
1. On the agent side, add functions to detect the status of a single resource.
For instance, for routers, check the iptables rules and the route rules; for ports, check the basic flow status, the OpenFlow security group, l2 pop, ARP, etc.
2. Bulk checks: the ports of a tenant, the ports of one subnet, or the routers of a tenant.
3. Check the resources of an entire agent.
4. API extensions for the related resources, such as router_check and port_check.
In some automated scenarios, cloud operators may not want to log in to the neutron-server host; the API is then a good way to call these check methods.
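
To make items 2 and 3 of the basic plan more concrete, here is a minimal sketch of how a single-port check might be reused for bulk checks scoped by tenant, subnet, or agent host. Everything in it is an assumption made for illustration: the `Port` record, `select_ports`, `bulk_check`, and `fake_check` are hypothetical names, not existing Neutron code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Port:
    """Hypothetical, simplified stand-in for a Neutron port record."""
    id: str
    tenant_id: str
    subnet_id: str
    host: str


def select_ports(ports: Iterable[Port],
                 tenant_id: Optional[str] = None,
                 subnet_id: Optional[str] = None,
                 host: Optional[str] = None) -> List[Port]:
    """Keep the ports that match every scope filter that was given."""
    return [
        p for p in ports
        if (tenant_id is None or p.tenant_id == tenant_id)
        and (subnet_id is None or p.subnet_id == subnet_id)
        and (host is None or p.host == host)
    ]


def bulk_check(ports: Iterable[Port],
               check_port: Callable[[Port], bool]) -> Dict[str, bool]:
    """Run one single-port check over many ports; map port id to healthy?"""
    return {p.id: check_port(p) for p in ports}


PORTS = [
    Port("p1", "tenant-a", "subnet-1", "node-1"),
    Port("p2", "tenant-a", "subnet-2", "node-2"),
    Port("p3", "tenant-b", "subnet-1", "node-1"),
]


def fake_check(port: Port) -> bool:
    """Placeholder for a real agent-side check: pretend only p2 is broken."""
    return port.id != "p2"


# Bulk check for one tenant, then for everything hosted on one agent node.
tenant_report = bulk_check(select_ports(PORTS, tenant_id="tenant-a"), fake_check)
agent_report = bulk_check(select_ports(PORTS, host="node-1"), fake_check)
print(tenant_report)
print(agent_report)
```

The point of the sketch is only that the bulk and per-agent checks need no logic of their own: they select a scope and delegate to the single-resource check from item 1.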

Implementation plan:
1. Add functions to detect the status of a single resource.
For instance, following the router processing procedure, add a check method for each step: check_router_gateway, check_nat_rules, check_route_rules, check_qos_rules, check_meta_proxy, and so on.
2. A CLI tool (cloud admin only; it needs to run on the neutron-server host with direct access to the DB) to check the resources of an entire agent.
For instance, check the routers of one L3 agent.
3. API extensions for the related resources: check_router, check_port.
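
As a rough illustration of step 1, the per-step router checks could be plain functions that each return a pass/fail flag plus a short message, collected into one report. The function names check_router_gateway, check_nat_rules, and check_route_rules come from the plan above; their bodies, the dict-based router snapshot, and its keys (`gateway_ip`, `floating_ips`, `nat_rules`, `expected_routes`, `installed_routes`) are assumptions for this sketch only. A real check would inspect iptables and the routing table inside the router namespace.

```python
from typing import Callable, Dict, List, Tuple

# (healthy?, human-readable detail)
CheckResult = Tuple[bool, str]


def check_router_gateway(router: dict) -> CheckResult:
    ok = bool(router.get("gateway_ip"))
    return ok, ("gateway set" if ok else "no gateway IP configured")


def check_nat_rules(router: dict) -> CheckResult:
    # Every floating IP is expected to have a NAT rule (assumed data model).
    expected = set(router.get("floating_ips", []))
    missing = sorted(expected - set(router.get("nat_rules", [])))
    if missing:
        return False, "missing NAT rules for: " + ", ".join(missing)
    return True, "all floating IPs have NAT rules"


def check_route_rules(router: dict) -> CheckResult:
    expected = set(router.get("expected_routes", []))
    missing = sorted(expected - set(router.get("installed_routes", [])))
    if missing:
        return False, "missing routes: " + ", ".join(missing)
    return True, "all expected routes installed"


STEPS: List[Callable[[dict], CheckResult]] = [
    check_router_gateway,
    check_nat_rules,
    check_route_rules,
]


def check_router(router: dict, steps=STEPS) -> Dict[str, CheckResult]:
    """Run every step and report each result under the step's name."""
    return {step.__name__: step(router) for step in steps}


def is_healthy(report: Dict[str, CheckResult]) -> bool:
    return all(ok for ok, _ in report.values())


# Hypothetical snapshot of what an agent observed for one router.
router = {
    "gateway_ip": "203.0.113.1",
    "floating_ips": ["203.0.113.10", "203.0.113.11"],
    "nat_rules": ["203.0.113.10"],
    "expected_routes": ["10.0.0.0/24"],
    "installed_routes": ["10.0.0.0/24"],
}

report = check_router(router)
for name, (ok, detail) in report.items():
    print(f"{name}: {'OK' if ok else 'FAIL'} ({detail})")
print("healthy:", is_healthy(report))
```

Reporting each step separately, rather than returning a single boolean, is what would let an operator see which part of the router setup failed without reading the raw rules.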

---------------
to be continued...

Miguel Lavalle (minsel)
tags: added: rfe
Changed in neutron:
importance: Undecided → Wishlist
Slawek Kaplonski (slaweq) wrote :

Do we need this new RFE? As you mentioned, there is already a blueprint related to this topic. The spec for it, https://review.opendev.org/#/c/308973/, was merged a long time ago.
So IMO, if you are interested in implementing this, you can continue the work on that blueprint according to the merged spec.

Slawek Kaplonski (slaweq) wrote :

Do you still plan to work on this? Or should we close it due to lack of activity?

LIU Yulong (dragon889) wrote :

The 'diagnostics' work is basically the same as this, so I will add "router" diagnostics as a first step.

Slawek Kaplonski (slaweq) wrote :

Can you explain how this differs from https://bugs.launchpad.net/neutron/+bug/1830014 ?

Slawek Kaplonski (slaweq) wrote :

Ping @LIU. If you are still interested in doing this, please reply to my last question in c#4.
If there is no update within the next few weeks, I'm going to close this RFE.

Slawek Kaplonski (slaweq) wrote :

There have been no updates for a long time. I'm closing this RFE for now.

Changed in neutron:
status: New → Opinion
tags: added: rfe-postponed
removed: rfe