Comment 9 for bug 1507499

Hynek Mlnarik (hmlnarik-s) wrote : Re: Centralized Management System for testing the environment

My 2 cents.

TL;DR version: I support the approach in which diagnostics are provided by the individual agents, exposed via RPC/API, and processed by separate CLI/GUI tools. For the CLI, I offer a proof-of-concept implementation that extends the existing neutron-debug tool.

Elaborate version:

To enable operators to quickly pinpoint the cause of a failure, there needs to be a tool that can query individual agents and provide the operator with an aggregated response. Ideally, the tool should already be in place when neutron starts: when a failure occurs, there is often no time to install a new diagnostic tool onto the nodes. Hence neutron itself should offer detailed diagnostics, along the lines of the suggestion in bug 1519537.

On the other hand, the diagnostics in neutron should only describe the actual state, not attempt to diagnose the root cause or even repair anything. That is a task for a separate tool that would request diagnostic input from various agents and, based on it, dig deeper and obtain further information until it can at least narrow down the potential causes. For example, when diagnosing a ping of a floating IP, the tool would first try to ping the FIP; if that works, it reports so; otherwise it attempts to ping the floating IP from the router namespace, ping the corresponding fixed IP from there, check security groups, etc. The result is a kind of hierarchical (bi- or multisect) search.
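
The hierarchical search could look roughly like the sketch below. This is only an illustration of the idea, not code from neutron or from any patch: the Step class, the diagnose function and the stubbed check results are all hypothetical, and real checks would call ping, the API or an agent RPC instead of returning fixed values.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Step:
    """One diagnostic check plus the finer-grained checks to run if it fails."""
    name: str
    check: Callable[[], bool]  # returns True when the check passes
    on_failure: List["Step"] = field(default_factory=list)


def diagnose(step: Step, depth: int = 0) -> List[str]:
    """Run a step and descend into its follow-up steps only when it fails."""
    ok = step.check()
    report = ["%s%s: %s" % ("  " * depth, step.name, "OK" if ok else "FAILED")]
    if not ok:
        for sub in step.on_failure:
            report.extend(diagnose(sub, depth + 1))
    return report


# Hypothetical, hard-coded results standing in for real probes.
fip_tree = Step("ping floating IP", lambda: False, [
    Step("ping floating IP from router namespace", lambda: True),
    Step("ping fixed IP from router namespace", lambda: False, [
        Step("check security groups", lambda: True),
    ]),
])

if __name__ == "__main__":
    print("\n".join(diagnose(fip_tree)))
```

A step that succeeds ends that branch of the search, so the operator sees only the checks that were needed to narrow down the failure.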

The separation of the two diagnostic layers - diagnostic information retrieval and aggregation - has benefits for all:
* Agents and developers of agents know best which diagnostic
  information they can offer
* Implementing these RPC/API calls in agents adds little
  complexity
* When it is part of the agents, the diagnostic information stays
  maintained - unlike the situation where a separate agent would
  provide all diagnostic information for all agents
* The aggregation tool can combine diagnostic information from
  several sources (agents) to better pinpoint the cause
* The complexity of searching for the cause is kept out of the
  agents and stays where it belongs - in the diagnostic
  aggregation tool
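
As a minimal sketch of this split, the agents below only report facts about their own state, and all interpretation happens in the aggregator. Everything here is hypothetical: neutron agents do not expose a get_diagnostics call today, and the agent names and fact keys are made up for illustration.

```python
from typing import Dict, List


class FakeL3Agent:
    """Stand-in for an agent; it reports state and draws no conclusions."""
    name = "l3-agent"

    def get_diagnostics(self) -> Dict[str, bool]:
        return {"router_namespace_exists": True, "fip_configured": False}


class FakeOvsAgent:
    name = "ovs-agent"

    def get_diagnostics(self) -> Dict[str, bool]:
        return {"port_bound": True}


def aggregate(sources) -> Dict[str, Dict[str, bool]]:
    """Retrieval layer: collect the raw state from every agent."""
    return {source.name: source.get_diagnostics() for source in sources}


def suspects(state: Dict[str, Dict[str, bool]]) -> List[str]:
    """Aggregation layer: decide which reported facts look suspicious."""
    return sorted(
        "%s.%s" % (agent, key)
        for agent, facts in state.items()
        for key, value in facts.items()
        if value is False
    )


if __name__ == "__main__":
    print(suspects(aggregate([FakeL3Agent(), FakeOvsAgent()])))
```

The agent side stays trivial, while the rules in suspects can grow without touching any agent.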

To illustrate the idea, I have submitted a work-in-progress patch of a CLI tool that works as a diagnostics aggregator [1]. The CLI enhances the existing neutron-debug tool with a diagnose-router command. Given a router-id and an IP address, it attempts to verify that it is possible to ping that IP from the router namespace and to connect to its port 22. It does so in a sequence of diagnostic steps which obtain information from system tools (ping, netcat) and the API. At the moment, the CLI has to be run on a networking node where the router namespace is defined. If the agents implemented a diagnostics RPC, the use of system tools could be replaced with agent RPC calls. Note that the complexity of the checks on the agent side is actually very small: all the logic of the form "if the current step fails, also check this and that step" is handled entirely in the aggregation tool.

Sample usage: neutron-debug --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini <<< 'diagnose-router --include-success <router-id> 172.24.4.5'
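
For orientation, the two checks described above (ping from the router namespace, then a connection to port 22) could be approximated with commands built as in the sketch below. This is not the code from the patch; the helper names are hypothetical, and it assumes that the router namespace is named qrouter-<router-id> and that the installed netcat supports the -z and -w options.

```python
from typing import List


def namespace_cmd(router_id: str, cmd: List[str]) -> List[str]:
    """Wrap a command so that it runs inside the router's network namespace."""
    # Assumption: the namespace is named "qrouter-<router-id>".
    return ["ip", "netns", "exec", "qrouter-%s" % router_id] + cmd


def ping_cmd(router_id: str, ip: str) -> List[str]:
    """Send a single ICMP echo request to ip from the router namespace."""
    return namespace_cmd(router_id, ["ping", "-c", "1", ip])


def ssh_port_cmd(router_id: str, ip: str, port: int = 22) -> List[str]:
    """Probe a TCP port from the router namespace without sending data."""
    # Assumption: a netcat variant that understands -z (probe only) and -w (timeout).
    return namespace_cmd(router_id, ["nc", "-z", "-w", "2", ip, str(port)])


if __name__ == "__main__":
    print(" ".join(ping_cmd("<router-id>", "172.24.4.5")))
    print(" ".join(ssh_port_cmd("<router-id>", "172.24.4.5")))
```

Each list can be passed to subprocess.run as root on the networking node; an exit status of 0 means the check passed.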

[1] https://review.openstack.org/#/q/topic:neutron-debug-diagnostics-poc