[RFE] Centralized Management System for testing the environment

Bug #1507499 reported by Kanchan Gupta
This bug affects 4 people
Affects: neutron
Status: Triaged
Importance: Wishlist
Assigned to: Boden R
Milestone: (none listed)

Bug Description

To let operators reduce manual work when they hit a networking
issue, and to quickly pinpoint the cause of a failure, neutron needs
to provide real-time diagnostics of its resources. This way, the
current need for manual checks, which often require root access, would
gradually be replaced by API queries. Providing diagnostics in the
neutron API would also open space for the development of specialized
tools that solve particular types of issues, e.g. the inability to ping
a VM's interface.

Note: The description of this RFE was changed to cover previous RFEs
related to diagnostics (namely bug 1563538, bug 1537686, bug 1519537
and the original of this bug).

Problem Description
===================

One of the common questions seen at ask.openstack.org and on mailing
lists is "Why can't I ping my floating IP address?". Answering it
usually involves a common set of diagnostic steps: determining the
relevant namespaces, pinging the instance from those namespaces, etc.
Currently, these steps must be performed manually, often by crawling
the relevant hosts and running tools that require root access.

Neutron currently provides data on how resources *should* be
configured. However, it provides very little diagnostic information
reflecting the *actual* resource state. Hence, if an issue occurs, the
user is often left with few details about what works and what does
not, and has to manually crawl the affected hosts to troubleshoot the
issue.
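
The gap between configured and actual state can be illustrated with a
minimal sketch. It assumes both views of a resource are available as
plain dictionaries; the field names and values are hypothetical and are
not taken from the neutron API.

```python
def diff_state(expected, actual):
    """Return {field: (expected, actual)} for every field that differs."""
    mismatches = {}
    for key in sorted(set(expected) | set(actual)):
        if expected.get(key) != actual.get(key):
            mismatches[key] = (expected.get(key), actual.get(key))
    return mismatches


# Hypothetical data: what the API says the port should look like,
# versus what a diagnostic call might observe on the host.
expected_port = {"status": "ACTIVE", "mtu": 1450, "admin_state_up": True}
observed_port = {"status": "DOWN", "mtu": 1450, "admin_state_up": True}

print(diff_state(expected_port, observed_port))
```

Only the differing fields are reported, which is the kind of detail an
operator currently has to collect by hand on the affected hosts.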

Proposed Change
===============

This RFE requests an extension of the current API that exposes
diagnostics for neutron resources so that they are accessible via API
calls, reducing the amount of manual work needed. It further describes
the additions to the Neutron CLI necessary to call the newly added API.

Spec
====
https://review.openstack.org/#/c/308973/

Tags: rfe-approved
Mark McClain (markmcclain) wrote :

This work has been attempted before (neutron-debug). I think any work in this area should include a comprehensive overhaul of that code which has been lying around since Folsom. Additionally, security has traditionally been a concern when this has been brought up in the past.

Armando Migliaccio (armando-migliaccio) wrote :

This RFE is too big in scope and spans multiple projects. Consider reducing the scope (with iterative enhancements that can come at a later date) so that it can be tackled within a single cycle and within the boundaries of a single project.

As Mark pointed out, neutron-debug is a limited tool that provides the ability to test connectivity. Can you elaborate a bit more on what you have in mind?

Changed in neutron:
status: New → Confirmed
Armando Migliaccio (armando-migliaccio) wrote :

To be discussed at the drivers meeting.

Changed in neutron:
importance: Undecided → Wishlist
Armando Migliaccio (armando-migliaccio) wrote :

I guess I never made the state transition. We should look at this, keeping in mind the comments made on duplicate bug 1519537.

Changed in neutron:
status: Confirmed → Triaged
Kyle Mestery (mestery) wrote :

We're considering putting things into tenant VMs? I'm a hard no on this, I can't imagine anyone allowing this.

Armando Migliaccio (armando-migliaccio) wrote :

Duly noted

Armando Migliaccio (armando-migliaccio) wrote :

The Neutron mid-cycle [1] is in a few weeks. We'll have to put together a plan by then and consolidate all the proposals.

[1] https://wiki.openstack.org/wiki/Sprints

Hynek Mlnarik (hmlnarik-s) wrote :

My 2 cents.

TL;DR version: I support a version of diagnostics provided by individual agents, exposed via RPC/API, and processed by separate CLI/GUI tools. For the CLI, I offer a proof-of-concept implementation that extends the existing neutron-debug tool.

Elaborate version:

To enable operators to quickly pinpoint the cause of a failure, there needs to be a tool that can query individual agents and provide the operator with an aggregated response. Ideally, the tool should already be in place when neutron starts: when a failure occurs, there is often no time to install a new diagnostic tool onto the nodes. Hence neutron itself should offer detailed diagnostics, along the lines of the suggestion in bug 1519537.

On the other hand, the diagnostics in neutron should only describe the actual state, not attempt to diagnose the root cause or even repair anything. That is a task for a separate tool that would request diagnostic input from various agents and, based on it, dig deeper and obtain further information until it can at least narrow down the potential causes. For example, when diagnosing a ping of a floating IP, the tool would first try to ping the FIP; if that works, it reports so; otherwise it attempts to ping the floating IP from the router namespace, ping the corresponding fixed IP from there, check security groups, etc. Hence the search is hierarchical (a bisection or multisection).
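
The hierarchical search described above could look roughly like the following sketch. The probes are passed in as plain callables returning True or False, so no real networking is performed here; the function name and the probe labels are hypothetical, and running every deeper probe after a top-level failure is just one possible policy.

```python
from typing import Callable, List, Tuple

Probe = Callable[[], bool]


def diagnose_floating_ip(ping_fip: Probe,
                         deeper_probes: List[Tuple[str, Probe]]) -> List[str]:
    """Run the cheap top-level check first; dig deeper only if it fails."""
    if ping_fip():
        return ["ping floating IP: OK"]
    report = ["ping floating IP: FAILED"]
    for name, probe in deeper_probes:
        report.append(f"{name}: {'OK' if probe() else 'FAILED'}")
    return report


# Stubbed probes standing in for real checks (hypothetical outcomes).
example_report = diagnose_floating_ip(
    ping_fip=lambda: False,
    deeper_probes=[
        ("ping floating IP from router namespace", lambda: True),
        ("ping fixed IP from router namespace", lambda: True),
        ("security groups allow ICMP", lambda: False),
    ],
)
print("\n".join(example_report))
```

In this stubbed run the report narrows the problem down to the security group step, without the tool attempting any repair.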

The separation of the two diagnostic layers - diagnostic information retrieval and aggregation - has benefits for all:
* Agents and developers of agents know best which diagnostic
  information they can offer
* Implementing these RPC/API calls in agents adds little
  complexity
* When part of the agents, the diagnostic information stays
  maintained - contrary to a situation with a separate
  agent providing all diagnostic information for all agents
* The aggregation tool can handle diagnostic information from
  several sources (agents) to better pinpoint the cause
* The complexity of searching for the cause is kept away
  from the agents and placed where it belongs - in the
  diagnostic aggregation tool
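
A minimal sketch of the two layers, assuming each agent exposes a single call that returns facts about the actual state. The class names, the get_diagnostics method and the fact keys are hypothetical and are not part of any existing neutron agent.

```python
class FakeL3Agent:
    """Retrieval layer: reports facts only, no root-cause logic."""

    def get_diagnostics(self, resource_id):
        return {"namespace_exists": True, "gateway_port_up": False}


class FakeDhcpAgent:
    def get_diagnostics(self, resource_id):
        return {"dnsmasq_running": True}


def aggregate(agents, resource_id):
    """Aggregation layer: collect the facts reported by every agent."""
    return {name: agent.get_diagnostics(resource_id)
            for name, agent in agents.items()}


def failed_items(aggregated):
    """List the facts that are not healthy, as 'agent.fact' strings."""
    return sorted(f"{agent}.{key}"
                  for agent, facts in aggregated.items()
                  for key, ok in facts.items() if not ok)


agents = {"l3": FakeL3Agent(), "dhcp": FakeDhcpAgent()}
print(failed_items(aggregate(agents, "router-1")))
```

The agents stay trivial, while any decision about what to check next would live in the aggregation side.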

To illustrate the idea, I have submitted a work-in-progress patch of a CLI tool that works as a diagnostics aggregator [1]. The CLI enhances the existing neutron-debug tool with a diagnose-router command. Given a router-id and an IP address, it attempts to verify that it is possible to ping that IP from the router namespace and to connect to port 22. It does so in a sequence of diagnostic steps which obtain information from system tools (ping, netcat) and the API. At this moment, it is necessary to run the CLI on a networking node where the router namespace is defined. If the agents implemented diagnostics RPC, the use of system tools could be replaced with agent RPC calls. Note that the complexity of the checks on the agent side is actually very small; all the logic that handles "in case of failure of the current step, check also this and that step" is fully handled in the aggregation tool.

Sample usage: neutron-debug --config-file /etc/neutron/neutron.conf --config-file /etc/neutron/l3_agent.ini <<< 'diagnose-router --include-success <router-id> 172.24.4.5'
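
For illustration only, the two system-tool checks mentioned above (ping and a TCP connection to port 22 from the router namespace) might be assembled as in the sketch below. This is not the code of the patch in [1]. It assumes the usual qrouter-<router-id> namespace name, the -c/-W options of iputils ping and the -z/-w options of common netcat variants; the function names are hypothetical.

```python
import subprocess


def router_check_commands(router_id, ip, port=22):
    """Build (label, argv) pairs to run inside the router namespace."""
    prefix = ["ip", "netns", "exec", f"qrouter-{router_id}"]
    return [
        ("ping", prefix + ["ping", "-c", "1", "-W", "2", ip]),
        (f"tcp/{port}", prefix + ["nc", "-z", "-w", "2", ip, str(port)]),
    ]


def run_checks(commands, runner=subprocess.call):
    """Run each command; exit status 0 counts as success.

    The default runner needs root on a node that hosts the namespace;
    a stub can be injected for a dry run.
    """
    return {label: runner(argv) == 0 for label, argv in commands}


commands = router_check_commands("<router-id>", "172.24.4.5")
for label, argv in commands:
    print(label, "->", " ".join(argv))
```

Passing a stub as runner allows the command construction to be exercised without root access or an actual namespace.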

[1] https://review.openstack.org/#/q/topic:neutron-debug-diagno...


summary: - Centralized Management System for testing the environment
+ [RFE] Centralized Management System for testing the environment
Armando Migliaccio (armando-migliaccio) wrote :

We'll have to iterate on this a tad longer.

Boden R (boden) wrote :

[1] is also related to this RFE.
I've added some additional notes to [2] as well w/r/t how these *might* fit together.

[1] https://bugs.launchpad.net/neutron/+bug/1563538
[2] https://etherpad.openstack.org/p/neutron-troubleshooting

Armando Migliaccio (armando-migliaccio) wrote :

We're most likely going to have a fist fight at the summit, hoping that only one will be left standing.

Boden R (boden) wrote :

A recent RFE [1] was marked as a duplicate of this one. While I agree these RFEs are related (as is [2]), I don't necessarily agree that [1] is a duplicate of this RFE; therefore I'm hoping we can consolidate all these RFEs into one workstream.

In particular, what's missing in this RFE that's covered in [1] is a means for admins to have more visibility into neutron resource <-> backend resource (realization) mappings, including identification and remediation of lost / orphaned backend resource objects. Our customers are asking for it, and google-fu suggests others are as well.
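
As a rough illustration of the mapping idea (not the approach implemented in [1]), orphaned backend objects can be found by comparing the known neutron resource IDs with a backend-to-neutron mapping. The function name, data shapes and IDs below are hypothetical.

```python
def find_mapping_gaps(neutron_ids, backend_map):
    """Compare neutron resources with their backend realizations.

    backend_map maps a backend object ID to the neutron resource ID it
    realizes, or to None when the backend object carries no reference.
    """
    neutron_ids = set(neutron_ids)
    realized = {nid for nid in backend_map.values() if nid is not None}
    orphaned = sorted(bid for bid, nid in backend_map.items()
                      if nid not in neutron_ids)
    unrealized = sorted(neutron_ids - realized)
    return {"orphaned_backend": orphaned, "unrealized_neutron": unrealized}


# Hypothetical IDs for illustration.
gaps = find_mapping_gaps(
    neutron_ids={"net-1", "net-2"},
    backend_map={"lswitch-a": "net-1", "lswitch-b": "net-9", "lswitch-c": None},
)
print(gaps)
```

Here lswitch-b and lswitch-c have no matching neutron resource, and net-2 has no backend realization.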

My current thought is that the resource mapping approach [1] (or some variation of it) can be used as a foundation for additional admin / ops types of functionality such as those outlined in this RFE. I've outlined an approach in [3] that uses [1] as a basis for the functionality contained in this RFE, and I believe it can help us model more complex relationships than what's exposed via neutron top-level resources today.

Note: if you want to see how [1] works in action, you can watch the video demo [4]... Popcorn recommended, but not required.

Does it make sense to consolidate our discussions on this one to the etherpad [3]?? And perhaps on IRC.

[1] https://bugs.launchpad.net/neutron/+bug/1563538
[2] https://bugs.launchpad.net/neutron/+bug/1519537
[3] https://etherpad.openstack.org/p/neutron-troubleshooting
[4] https://www.youtube.com/watch?v=qAY8U8vgbzc

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-specs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/308973

Assaf Muller (amuller)
description: updated
description: updated
Armando Migliaccio (armando-migliaccio) wrote :

Looks like we have a plan forward; we need to move the discussion to the spec:

https://review.openstack.org/#/c/308973/

Changed in neutron:
assignee: nobody → Hynek Mlnarik (hmlnarik-s)
milestone: none → newton-1
tags: added: rfe-approved
removed: rfe
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/322158

Changed in neutron:
milestone: newton-1 → newton-2
Changed in neutron:
milestone: newton-2 → newton-3
Changed in neutron:
milestone: newton-3 → newton-rc1
Changed in neutron:
milestone: newton-rc1 → ocata-1
Hynek Mlnarik (hmlnarik-s) wrote :

I am removing myself from this task because I will be moving on from neutron (and openstack in general) to another project.

Changed in neutron:
assignee: Hynek Mlnarik (hmlnarik-s) → nobody
Boden R (boden)
Changed in neutron:
assignee: nobody → Boden R (boden)
Changed in neutron:
milestone: ocata-1 → ocata-2
Changed in neutron:
milestone: ocata-2 → ocata-3
Changed in neutron:
milestone: ocata-3 → ocata-rc1
milestone: ocata-rc1 → pike-1
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-specs (master)

Reviewed: https://review.openstack.org/308973
Committed: https://git.openstack.org/cgit/openstack/neutron-specs/commit/?id=dc11da5109759d13636aaaef35420fa4ac1d88d6
Submitter: Jenkins
Branch: master

commit dc11da5109759d13636aaaef35420fa4ac1d88d6
Author: Boden R <email address hidden>
Date: Wed Feb 15 15:47:07 2017 -0700

    Neutron resource diagnostics

    This spec proposes the introduction of a neutron diagnostics framework
    and API extension capable collecting resource diagnostics across
    neutron API and agent nodes. To keep the spec containable, the proposal
    suggests only providing a sample diagnostic check and reiterating on
    concrete diagnostics once we get the plumbing in place.

    While this spec has some inspiration from nova diagnostics [1],
    the approach herein is more generic and extensible supporting a
    broader set of use cases longer term.

    Finally it seeks to pave the way for supporting use case / features
    proposed in the related bugs.

    [1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

    Related-Bug: #1507499
    Related-Bug: #1519537
    Related-Bug: #1537686
    Related-Bug: #1563538

    Change-Id: Id534acb1593f1fe210c561b1451656dce69514db

Changed in neutron:
milestone: pike-1 → pike-2
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: master
Review: https://review.openstack.org/322158
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
