[RFE] - Diagnostics Extension for Neutron

Bug #1519537 reported by Ramu Ramamurthy
This bug affects 1 person
Affects: neutron
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Problem
----------

Debugging common networking/neutron problems (1. cannot ping a VM, 2. cannot ping a FIP)
tends to be manual and requires root access to inspect the state of the agents or the datapath
on different hosts.

Neutron needs to provide a "diagnostics" extension API that can be used for debugging networking problems.
Each agent/driver exposes its own state in a structured (JSON) format via the diagnostics extension.

Nova "diagnostics" serves as an example here.
https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

Fix
----

A "diagnostics" extension is added to neutron.

Each agent and its corresponding drivers support a get_diagnostics() API, invoked from neutron-server by
the following GET APIs, which are limited by policy to admin-only.

GET: /agent/:id/diagnostics
    example output from neutron-ovs agent: OVS bridges, ports and flows

GET: /agent/:id/diagnostics/network/:id
    example output from dhcp-agent (dnsmasq driver): contents of host and lease files

GET: /agent/:id/diagnostics/port/:id
    example output from dhcp-agent: dhcp transactions for that port (from dnsmasq logs)
    example output from ovs-agent: stats on qvo, qbr, tap interfaces

GET: /agent/:id/diagnostics/port/:id/security-groups
    example output from l2-agent (iptables firewall driver): iptables rules programmed (ingress/egress/spoofing) for that port

GET: /agent/:id/diagnostics/port/:id/ping
    This is an "operational" command: ping the port from the agent (dhcp/l3) network/router namespace
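
The GET paths above could be matched on the server side roughly as follows. This is only a sketch: the route table, the scope names and match_route() are hypothetical and are not part of Neutron or of the attached diff. It shows how each path yields a scope plus the agent/network/port IDs, with the admin-only policy applied first.

```python
import re

# Hypothetical route table for the proposed GET paths (illustration only).
ROUTES = [
    (r"^/agent/(?P<agent_id>[^/]+)/diagnostics$", "agent"),
    (r"^/agent/(?P<agent_id>[^/]+)/diagnostics/network/(?P<network_id>[^/]+)$",
     "network"),
    (r"^/agent/(?P<agent_id>[^/]+)/diagnostics/port/(?P<port_id>[^/]+)$",
     "port"),
    (r"^/agent/(?P<agent_id>[^/]+)/diagnostics/port/(?P<port_id>[^/]+)"
     r"/security-groups$", "security-groups"),
    (r"^/agent/(?P<agent_id>[^/]+)/diagnostics/port/(?P<port_id>[^/]+)/ping$",
     "ping"),
]


def match_route(path, is_admin):
    """Return (scope, ids) for a diagnostics path, or None if nothing matches.

    The admin check stands in for the policy rule; a real extension would
    use the neutron policy engine instead of a boolean.
    """
    if not is_admin:
        raise PermissionError("diagnostics are admin-only")
    for pattern, scope in ROUTES:
        found = re.match(pattern, path)
        if found:
            return scope, found.groupdict()
    return None
```

For example, match_route("/agent/a1/diagnostics/port/p1/ping", True) gives the scope "ping" with agent_id "a1" and port_id "p1"; the server would then forward that to the agent's get_diagnostics().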

Neutron Command-line Client supports the following new commands
----------------------------------------------------------------

neutron l2-diagnostics --network-id <> --port-id <> agent

neutron dhcp-diagnostics --network-id <> --port-id <> --ping agent

neutron l3-diagnostics --network-id <> --port-id <> --ping agent
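
As a rough illustration of how these commands might map onto the GET paths, here is an argparse sketch. It is not the neutron client code; build_parser() and to_path() are hypothetical names, and the precedence of --port-id over --network-id when both are given is an assumption the RFE does not spell out.

```python
import argparse


def build_parser(command):
    """Build a parser for one of the proposed commands (illustration only)."""
    parser = argparse.ArgumentParser(prog="neutron " + command)
    parser.add_argument("--network-id")
    parser.add_argument("--port-id")
    # Only the dhcp and l3 variants list --ping in the proposal.
    if command in ("dhcp-diagnostics", "l3-diagnostics"):
        parser.add_argument("--ping", action="store_true")
    parser.add_argument("agent")
    return parser


def to_path(args):
    """Translate parsed arguments into a diagnostics path.

    Assumption: a port ID takes precedence over a network ID.
    """
    base = "/agent/%s/diagnostics" % args.agent
    if args.port_id:
        path = base + "/port/" + args.port_id
        if getattr(args, "ping", False):
            path += "/ping"
        return path
    if args.network_id:
        return base + "/network/" + args.network_id
    return base
```

With this sketch, "neutron dhcp-diagnostics --port-id p1 --ping agent-1" would resolve to /agent/agent-1/diagnostics/port/p1/ping.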

Sample Diagnostics Extension Code
------------------------------------------

See Attached Code Diff

Tags: rfe
Ramu Ramamurthy (ramu-ramamurthy) wrote :
description: updated
Assaf Muller (amuller) wrote :

This is very interesting. Making Neutron more debuggable should be a priority. I think this RFE should go through the full spec process (doubly so, as it introduces a rich new API).

Henry Gessau (gessau) wrote :

Something with similar goals has come up before:
https://blueprints.launchpad.net/neutron/+spec/neutron-introspector

Armando Migliaccio (armando-migliaccio) wrote :

This is another one:

https://blueprints.launchpad.net/neutron/+spec/cms-to-test-environment

But I have seen it elsewhere too.

Miguel Angel Ajo (mangelajo) wrote :

This seems interesting for our debuggability goal, but doesn't it only target agent-based implementations? Could we come up with something abstracted from agents?

Miguel Angel Ajo (mangelajo) wrote :

What if we switch agents for hosts, so we can issue a diagnostic from a certain host? That would be agent-agnostic and implementable by all plugins.

Ramu Ramamurthy (ramu-ramamurthy) wrote :

The following diagnostics API may be more widely applicable to implementations that are agentless.

/<object>/:id/diagnostics

<object> could refer to port, network, agent, etc.

In addition, for implementations with agents, the agent ID could be a parameter.

/<object>/:id/diagnostics/<agent>/:id/<verb>
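
One way to picture the object-centric form is a registry of diagnostics providers keyed by object type, with the agent ID as an optional argument. This is a minimal sketch under that assumption; PROVIDERS, provider(), diagnose() and the placeholder check are hypothetical names, not an existing Neutron interface.

```python
# Hypothetical registry: object type -> list of diagnostic callables.
PROVIDERS = {}


def provider(object_type):
    """Decorator that registers a diagnostic for an object type."""
    def register(func):
        PROVIDERS.setdefault(object_type, []).append(func)
        return func
    return register


@provider("port")
def port_binding_check(object_id, agent_id=None):
    # Placeholder result; a real plugin or agent would inspect its own state.
    return {"check": "port_binding", "object": object_id,
            "agent": agent_id, "status": "unknown"}


def diagnose(object_type, object_id, agent_id=None):
    """Run every diagnostic registered for the object type."""
    if object_type not in PROVIDERS:
        raise KeyError("no diagnostics registered for %r" % object_type)
    return [check(object_id, agent_id=agent_id)
            for check in PROVIDERS[object_type]]
```

An agentless plugin would register providers and ignore agent_id, while an agent-based one could use it to pick which agent to query.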

Ramu Ramamurthy (ramu-ramamurthy) wrote :

The diagnostics APIs can be used by automated tools to narrow problems down to specific components, or to aid manual debugging and shorten debug time.

Sean M. Collins (scollins) wrote :

We also have another RFE that touches on this as well: they want to log security group rules to see whether their traffic is actually reaching the instance.

https://bugs.launchpad.net/neutron/+bug/1468366

Ihar Hrachyshka (ihar-hrachyshka) wrote :

I don't believe we should expose that kind of data through the API. First, because it may expose some internal data, and also because it's just overkill. I agree we need more debug info, and services should be able to provide it, but pushing it through RPC and the API is not needed. Instead, we should adopt the oslo.reports library in our services and allow core plugins/drivers/services to extend the output (oslo.reports already supports this).

Ralf Trezeciak (r-uone) wrote :

Adding support tools for a cloud operations team is necessary and still missing in OpenStack.
Everybody deploys scripts to automate troubleshooting. And it can take quite a long time to nail down a problem, e.g. "why does my VM have no connectivity to the network".

One goal of the OpenStack project is the automation of application deployments. Deployment uses an optimistic "fire and forget" strategy. This mostly works, but the real world is not built with Devstack, where a deployment lives for minutes or hours. In the real world, deployments are far more complex. And OpenStack does not offer anything for troubleshooting; ceilometer is not a troubleshooting tool.

So where is the automation of diagnostics? Currently the documentation has troubleshooting chapters, which means manual work on a fully automated platform, even for tasks that can be automated.

OpenStack is missing a detailed "target-actual comparison", e.g.
  * target: the neutron database has an entry saying that VM x should be connected to vlan 2
  * actual: is VM x really connected to vlan 2 on OVS (if used)? Or is the VM connected to the wrong vlan?
Where are those automated checks in OpenStack?
The argument "our CI/CD tests show that the configuration order is correct and devstack works" does not help. High system load and long-running systems expose error scenarios that are not covered by CI/CD.
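
A target-actual comparison of this kind can be sketched in a few lines. The function name and the dictionary inputs below are illustrative assumptions: "target" stands for what the neutron database says, "actual" for what was read back from the switch, and how either is collected is left out.

```python
def compare_vlans(target, actual):
    """Compare expected VLAN per port (target) with the observed one (actual).

    Both arguments are dicts of port name -> VLAN ID. Collecting them from
    the neutron database and from OVS is outside this sketch.
    """
    report = {}
    for port, want in target.items():
        got = actual.get(port)
        if got is None:
            status = "missing"      # port not found on the switch
        elif got == want:
            status = "ok"
        else:
            status = "mismatch"     # connected, but to the wrong VLAN
        report[port] = {"target": want, "actual": got, "status": status}
    return report
```

For instance, a target of vlan 2 against an actual of vlan 3 for the same port is reported as "mismatch", which is exactly the wrong-vlan case described above.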

Diagnostics (operations tools) must be automated as much as possible. And yes - this is a huge and ugly topic.

I do not see a problem with internal data. A non-admin tenant running a diagnostic might get the response "NOT OK", and could take this information to the cloud provider.
The cloud provider, running the same diagnostic with admin privileges, will get the full information and may fix the problem.
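
The split between what a tenant sees and what an admin sees could look like the following sketch. The result layout (a "checks" list with "status" fields) and the render() function are assumptions made for illustration only.

```python
def render(result, is_admin):
    """Return the view of a diagnostics result for the caller.

    Non-admin callers only learn "OK" or "NOT OK"; admins also receive the
    individual checks. The result layout is an assumption of this sketch.
    """
    all_passed = all(check["status"] == "pass" for check in result["checks"])
    view = {"status": "OK" if all_passed else "NOT OK"}
    if is_admin:
        view["checks"] = result["checks"]
    return view
```

A tenant calling render(result, False) on a failing run gets only {"status": "NOT OK"}, while the provider calling render(result, True) also gets the per-check detail.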

Purely passive reporting is not enough. Network troubleshooting in particular requires tools (ping, arping, ...) to be used actively to check connectivity: does the router see the MAC/IP of the VM, or does the software switch have a MAC entry for the VM?

Ramu Ramamurthy (ramu-ramamurthy) wrote :

We are developing a diagnostics tool that can debug common problems (why can't I ping the VM?), modeled along the lines of Rally. Checks are expressed in JSON format, and each check compares the expected state against the current state.
An example set of checks follows.

The tool requires ssh-root access to do its work, but a far better approach would be for it to talk to neutron/nova only via APIs.
Hence the need for a diagnostics extension API.

python main.py vmchecks.json d4b558c6-74c7-4c30-a262-0b4d40e36f9d
+---------------------------------------+------------+
| Check                                 | Status     |
+---------------------------------------+------------+
| Summary                               | fail       |
| basic checks                          | fail       |
| getvminfo                             | fail       |
| getvmdiagnostics                      | fail       |
| checkvmdiagnostics                    | fail       |
| port stats                            | fail       |
| getvmport                             | fail       |
| getlbinfo                             | fail       |
| security groups                       | fail       |
| getvmiptables                         | fail       |
| checkvmiptables                       | pass       |
| ovs                                   | fail       |
| getovsports                           | fail       |
| getovsflows                           | fail       |
| checkovsflows                         | fail       |
| dhcp                                  | fail       |
| getdhcpagents                         | fail       |
| checkdhcpleases                       | fail       |
| checkdhcptransactions                 | fail       |
| pingfromdhcpns                        | fail       |
| floating ip                           | fail       |
| getrouters                            | fail       |
| routercheck                           | fail       |
| getrouterifs                          | fail       |
| checkrouterifs                        | fail       |
| checkrouterha                         | fail       |
| checkrouterfip                        | fail       |
| pingrouterns                          | fail       |
+---------------------------------------+------------+
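
To illustrate the "expected vs current" idea, here is a small sketch of JSON-expressed checks rolled up into group and summary statuses. The JSON layout, the keys ("key", "expected", "children") and evaluate() are assumptions made for this example; they are not the actual format of vmchecks.json.

```python
import json

# Hypothetical check definition; NOT the real vmchecks.json format.
CHECKS_JSON = """
{"name": "Summary", "children": [
  {"name": "security groups", "children": [
    {"name": "checkvmiptables", "key": "iptables_rules_present",
     "expected": true}
  ]},
  {"name": "dhcp", "children": [
    {"name": "checkdhcpleases", "key": "lease_count", "expected": 1}
  ]}
]}
"""


def evaluate(check, current):
    """Evaluate a check tree against the current state.

    A leaf passes when current[key] equals the expected value; a group
    passes only when all of its children pass.
    """
    if "children" in check:
        children = [evaluate(child, current) for child in check["children"]]
        passed = all(child["status"] == "pass" for child in children)
        return {"name": check["name"],
                "status": "pass" if passed else "fail",
                "children": children}
    passed = current.get(check["key"]) == check["expected"]
    return {"name": check["name"], "status": "pass" if passed else "fail"}


# Current state as it might be gathered through a diagnostics API.
result = evaluate(json.loads(CHECKS_JSON),
                  {"iptables_rules_present": True, "lease_count": 0})
```

Here the missing DHCP lease fails the "dhcp" group and therefore the summary, while "security groups" passes. With a diagnostics API, the current-state dictionary could come from neutron rather than from ssh-root access.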

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Don't get me wrong. I am all for diagnostics options, both passive and active. I am just not buying the need to expose them through the neutron REST API.

Speaking of active diagnostics, we have a limited tool for that called neutron-debug: http://docs.openstack.org/cli-reference/content/neutron-debug_commands.html

To validate configuration correctness, we also have the neutron-sanity-check tool, but it's not designed for ongoing operational monitoring.

I believe we can base new diagnostics tools and reports on neutron-debug (active) and oslo.reports (passive).

Note that I sent an oslo.reports integration patch for review, and we will be able to build more reports on top of that feature: https://review.openstack.org/#/c/250487/

So, let's work on defining the list of needed tools/diagnostics info/reports, and let's see which tooling we use for that.

Ramu Ramamurthy (ramu-ramamurthy) wrote :

We looked at the neutron-debug tool. As part of its operation, it creates
new entities on a host (neutron port, namespace, etc.) from which debugging can proceed.
Creating new entities in a tenant network just for debugging is extreme, because
the dhcp port and router port(s) are already part of the tenant's neutron infrastructure anyway, so
debugging (pinging) from those ports is more reasonable. Further, during debugging, we often ping to/from
the dhcp/router namespaces to answer the question: does the datapath to the VM appear fine
to/from the dhcp/router?

Doug Wiegley (dougwig)
Changed in neutron:
status: New → Confirmed
Changed in neutron:
importance: Undecided → Wishlist
Changed in neutron:
assignee: nobody → Ramu Ramamurthy (ramu-ramamurthy)
Armando Migliaccio (armando-migliaccio) wrote :

I am linking the two bugs. The use case is the same, the proposed implementation may be different though.

Changed in neutron:
assignee: Ramu Ramamurthy (ramu-ramamurthy) → nobody
Armando Migliaccio (armando-migliaccio) wrote :

Let's keep the discussion in one place.

Boden R (boden) wrote :

[1] is also related to this RFE.
I've added some additional notes to [2] as well w/r/t how these *might* fit together.

[1] https://bugs.launchpad.net/neutron/+bug/1563538
[2] https://etherpad.openstack.org/p/neutron-troubleshooting

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-specs (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/308973

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-specs (master)

Reviewed: https://review.openstack.org/308973
Committed: https://git.openstack.org/cgit/openstack/neutron-specs/commit/?id=dc11da5109759d13636aaaef35420fa4ac1d88d6
Submitter: Jenkins
Branch: master

commit dc11da5109759d13636aaaef35420fa4ac1d88d6
Author: Boden R <email address hidden>
Date: Wed Feb 15 15:47:07 2017 -0700

    Neutron resource diagnostics

    This spec proposes the introduction of a neutron diagnostics framework
    and API extension capable of collecting resource diagnostics across
    neutron API and agent nodes. To keep the spec containable, the proposal
    suggests only providing a sample diagnostic check and reiterating on
    concrete diagnostics once we get the plumbing in place.

    While this spec has some inspiration from nova diagnostics [1],
    the approach herein is more generic and extensible supporting a
    broader set of use cases longer term.

    Finally it seeks to pave the way for supporting use case / features
    proposed in the related bugs.

    [1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

    Related-Bug: #1507499
    Related-Bug: #1519537
    Related-Bug: #1537686
    Related-Bug: #1563538

    Change-Id: Id534acb1593f1fe210c561b1451656dce69514db
