Bug #1519537 “[RFE] - Diagnostics Extension for Neutron” : Bugs : neutron

Ramu Ramamurthy (ramu-ramamurthy) wrote on 2015-11-24:

#1

sample code diff for diagnostics extension (5.9 KiB, text/plain)

description:

updated

Ramu Ramamurthy (ramu-ramamurthy) on 2015-11-24

description:

updated

Ramu Ramamurthy (ramu-ramamurthy) on 2015-11-24

description:

updated

Ramu Ramamurthy (ramu-ramamurthy) on 2015-11-25

description:

updated

Assaf Muller (amuller) wrote on 2015-11-25:

#2

This is very interesting. Making Neutron more debuggable should be a priority. I think this RFE should go through the full spec process (doubly so, as it introduces a rich new API).

Henry Gessau (gessau) wrote on 2015-11-25:

#3

Something with similar goals has come up before:
https://blueprints.launchpad.net/neutron/+spec/neutron-introspector

Armando Migliaccio (armando-migliaccio) wrote on 2015-11-25:

#4

This is another one:

https://blueprints.launchpad.net/neutron/+spec/cms-to-test-environment

But I have seen it elsewhere too.

Miguel Angel Ajo (mangelajo) wrote on 2015-11-25:

#5

This seems interesting for our debuggability goal, but doesn't it only target agent-based implementations? Could we come up with something abstracted from agents?

Miguel Angel Ajo (mangelajo) wrote on 2015-11-25:

#6

What if we switch agents for hosts, so that we can issue a diagnostic from a certain host? That would be agent-agnostic and implementable by all plugins.

Ramu Ramamurthy (ramu-ramamurthy) wrote on 2015-11-25:

#7

The following diagnostics API may be more widely applicable to implementations that are agentless.

/<object>/:id/diagnostics

In addition, for implementations with agents, the agent ID could be a parameter.

/<object>/:id/diagnostics/<agent>/:id/<verb>
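
The two URL shapes above can be sketched as a small path builder. This is only an illustration of the proposal: the helper name `diagnostics_path`, its parameters, and the sample resource and agent names are hypothetical, and none of these routes exist in Neutron today.

```python
from typing import Optional


def diagnostics_path(resource: str, resource_id: str,
                     agent: Optional[str] = None,
                     agent_id: Optional[str] = None,
                     verb: Optional[str] = None) -> str:
    """Build a path following the URL shapes proposed in this comment.

    Hypothetical helper: the routes are a proposal, not an existing API.
    """
    # Agentless form: /<object>/:id/diagnostics
    path = f"/{resource}/{resource_id}/diagnostics"
    if agent is None:
        return path
    # Agent-based form: /<object>/:id/diagnostics/<agent>/:id/<verb>
    if agent_id is None or verb is None:
        raise ValueError("agent_id and verb are required when agent is given")
    return f"{path}/{agent}/{agent_id}/{verb}"


# Made-up identifiers, for illustration only.
print(diagnostics_path("ports", "PORT_ID"))
print(diagnostics_path("ports", "PORT_ID", "dhcp-agents", "AGENT_ID", "ping"))
```

Keeping the agent segment optional means an agentless plugin only has to serve the shorter form.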

Ramu Ramamurthy (ramu-ramamurthy) wrote on 2015-11-25:

#8

The diagnostics APIs can be used by automated tools to narrow problems down to specific components, or to aid manual debugging and shorten debugging time.

Sean M. Collins (scollins) wrote on 2015-11-25:

#9

We also have another RFE that sort of touches this as well - they want to log security group rules to see if their traffic is actually going to the instance.

https://bugs.launchpad.net/neutron/+bug/1468366

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2015-11-26:

#10

I don't believe we should expose that kind of data through the API. First, because it may expose some internal data, and also because it's just overkill. I agree we need more debug info, and services should be able to provide it, but pushing it through RPC and the API is not needed. Instead, we should adopt the oslo.reports library in our services and allow core plugins/drivers/services to extend the output (oslo.reports already supports this).

Ralf Trezeciak (r-uone) wrote on 2015-11-26:

#11

Support tools for cloud operations teams are necessary and still missing in OpenStack.
Everybody deploys scripts to automate troubleshooting. And it can take quite a long time to nail down a problem, e.g. "why does my VM have no connectivity to the network".

One goal of the OpenStack project is the automation of application deployments. The deployment is done using an optimistic "fire and forget" strategy. This mostly works - but the real world is not built with DevStack, where a deployment lives for minutes or hours. In the real world, deployments are far more complex. And OpenStack does not offer anything for troubleshooting - ceilometer is not a troubleshooting tool.

So - where is the automation of diagnostics? Currently there are troubleshooting chapters available in the documentation --> manual work on a fully automated platform??? Even for tasks that can be automated...

OpenStack is missing a detailed "target-actual comparison" - e.g.
* target: the neutron database has an entry saying that VM x should be connected to vlan 2
* actual: is VM x really connected to vlan 2 on OVS (if used)? Or is the VM connected to the wrong vlan?
Where are those automated checks in OpenStack?
The argument "our CI/CD tests show that the configuration order is correct and DevStack works" does not help. High system load, long-running systems, ... produce error scenarios that are not covered by CI/CD.
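
A rough sketch of such a target-actual comparison for the VLAN example. It assumes both sides have already been collected into plain dictionaries; the function name `compare_vlans` and the sample data are made up for illustration, and the sketch does not query the Neutron database or OVS.

```python
from typing import Dict, List, Optional


def compare_vlans(target: Dict[str, int],
                  actual: Dict[str, Optional[int]]) -> List[str]:
    """Return one finding per port whose actual VLAN differs from the target."""
    findings = []
    for port, want in sorted(target.items()):
        have = actual.get(port)
        if have is None:
            findings.append(
                f"{port}: expected vlan {want}, port not found on switch")
        elif have != want:
            findings.append(f"{port}: expected vlan {want}, found vlan {have}")
    return findings


# Made-up sample data: "target" stands in for what the neutron database
# says, "actual" for what was read back from the software switch.
target = {"vm-x": 2, "vm-y": 3}
actual = {"vm-x": 5, "vm-y": 3}
for finding in compare_vlans(target, actual):
    print(finding)
```

An empty result means the two views agree; anything else points at the component to look at first.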

Diagnostics (operations tools) must be automated as much as possible. And yes - this is a huge and ugly topic.

I do not see a problem with internal data. A non-admin tenant running a diagnostic might get the response "NOT OK", but could take this information and contact the cloud provider.
The cloud provider running the same diagnostic with admin privileges will get the full information and may fix the problem.
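
The tenant/admin split described here could look roughly like the following. The result layout (`status`, `details`) and the function name `render_result` are assumptions made for this sketch, not part of any existing Neutron API.

```python
from typing import Any, Dict


def render_result(result: Dict[str, Any], is_admin: bool) -> Dict[str, Any]:
    """Return the full result for admins and only the verdict for tenants."""
    if is_admin:
        return dict(result)
    # Non-admin callers see the verdict but none of the internal details.
    return {"status": result["status"]}


# Made-up diagnostic result, for illustration only.
full = {"status": "NOT OK",
        "details": {"host": "compute-1", "expected_vlan": 2, "found_vlan": 5}}
print(render_result(full, is_admin=False))
print(render_result(full, is_admin=True))
```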

Purely passive reporting is not enough. Network troubleshooting in particular requires tools (ping, arping, ...) to be used actively to check connectivity: does the router see the MAC/IP of the VM, and does the software switch have a MAC entry for the VM?

Ramu Ramamurthy (ramu-ramamurthy) wrote on 2015-11-27:

#12

We are developing a diagnostics tool that can debug common problems (e.g. "why can't I ping the VM?"), modeled along the lines of Rally. Checks are expressed in JSON, and each check compares the expected state against the current state.
An example set of checks follows.

The tool requires root SSH access to do its work, but a far better approach would be for it to talk to neutron/nova only via APIs.
Hence the need for a diagnostics extension API.
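
The tool's actual check format is not reproduced in this thread. Purely as an illustration of "checks expressed in JSON, expected vs current", a sketch with an invented format might look like this; the field names (`name`, `expected`, `current`) and the sample checks are assumptions, not the tool's real schema.

```python
import json
from typing import List, Tuple

# Invented check format: each entry carries the expected value and the
# value currently observed. A real tool would collect "current" itself.
CHECKS_JSON = """
[
  {"name": "port status", "expected": "ACTIVE", "current": "ACTIVE"},
  {"name": "dhcp ping to vm", "expected": "reachable", "current": "unreachable"}
]
"""


def run_checks(checks_json: str) -> List[Tuple[str, bool]]:
    """Return (name, passed) per check; passed means expected == current."""
    return [(check["name"], check["expected"] == check["current"])
            for check in json.loads(checks_json)]


for name, passed in run_checks(CHECKS_JSON):
    print(f"{'PASS' if passed else 'FAIL'}: {name}")
```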

Ihar Hrachyshka (ihar-hrachyshka) wrote on 2015-11-27:

#13

Don't get me wrong. I am all for diagnostics options, both passive and active. I am just not buying the need to have it exposed through the Neutron REST API.

Speaking of active diagnostics, we have a limited tool for that called neutron-debug: http://docs.openstack.org/cli-reference/content/neutron-debug_commands.html

To validate configuration correctness, we also have the neutron-sanity-check tool, but it's not designed for ongoing operational monitoring.

I believe we can base new diagnostics tools and reports on neutron-debug (active) and oslo.reports (passive).

Note that I sent an oslo.reports integration patch for review, and we will be able to build more reports on top of that feature: https://review.openstack.org/#/c/250487/

So, let's work on defining the list of needed tools/diagnostics info/reports, and let's see which tooling we use for that.

Ramu Ramamurthy (ramu-ramamurthy) wrote on 2015-11-30:

#14

We looked at the neutron-debug tool. As part of its operation, it creates
new entities on a host (neutron port, namespace, etc.) from which debugging can proceed.
Creating new entities in a tenant network just for debugging is extreme, because
the dhcp port and router port(s) are already part of the tenant neutron infrastructure anyway,
so debugging (pinging) from those ports is more reasonable. Further, during debugging, we often ping to/from
the dhcp/router namespaces to answer the question: does the datapath to the VM appear fine
to/from the dhcp/router?
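
A minimal sketch of pinging from the existing namespaces rather than from newly created debug ports. It assumes the `qdhcp-<network-id>` and `qrouter-<router-id>` namespace naming used by the reference DHCP and L3 agents, and it only builds the command; actually running it needs root on the host that holds the namespace. The function name `netns_ping_cmd` and the sample IDs are hypothetical.

```python
from typing import List

# Namespace prefixes as used by the reference DHCP and L3 agents
# (assumed here; other backends may not use namespaces at all).
NAMESPACE_PREFIXES = {"dhcp": "qdhcp-", "router": "qrouter-"}


def netns_ping_cmd(kind: str, resource_id: str, target_ip: str,
                   count: int = 3) -> List[str]:
    """Build an `ip netns exec ... ping` command for a dhcp/router namespace."""
    if kind not in NAMESPACE_PREFIXES:
        raise ValueError(f"unknown namespace kind: {kind!r}")
    namespace = NAMESPACE_PREFIXES[kind] + resource_id
    return ["ip", "netns", "exec", namespace,
            "ping", "-c", str(count), target_ip]


# Made-up IDs and address; print the command instead of executing it.
print(" ".join(netns_ping_cmd("dhcp", "NETWORK_ID", "10.0.0.5")))
```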

Doug Wiegley (dougwig) on 2015-12-02

Changed in neutron:
status:	New → Confirmed

Armando Migliaccio (armando-migliaccio) on 2015-12-04

Changed in neutron:
importance:	Undecided → Wishlist

Ramu Ramamurthy (ramu-ramamurthy) on 2015-12-04

Changed in neutron:
assignee:	nobody → Ramu Ramamurthy (ramu-ramamurthy)

Armando Migliaccio (armando-migliaccio) wrote on 2015-12-22:

#15

I am linking the two bugs. The use case is the same, the proposed implementation may be different though.

Changed in neutron:
assignee:	Ramu Ramamurthy (ramu-ramamurthy) → nobody

Armando Migliaccio (armando-migliaccio) wrote on 2015-12-22:

#16

Let's keep the discussion in one place.

Boden R (boden) wrote on 2016-03-31:

#17

[1] is also related to this RFE.
I've added some additional notes to [2] as well w/r/t how these *might* fit together.

[1] https://bugs.launchpad.net/neutron/+bug/1563538
[2] https://etherpad.openstack.org/p/neutron-troubleshooting

OpenStack Infra (hudson-openstack) wrote on 2016-04-21: Related fix proposed to neutron-specs (master)

#18

Related fix proposed to branch: master
Review: https://review.openstack.org/308973

OpenStack Infra (hudson-openstack) wrote on 2017-04-24: Related fix merged to neutron-specs (master)

#19

Reviewed: https://review.openstack.org/308973
Committed: https://git.openstack.org/cgit/openstack/neutron-specs/commit/?id=dc11da5109759d13636aaaef35420fa4ac1d88d6
Submitter: Jenkins
Branch: master

commit dc11da5109759d13636aaaef35420fa4ac1d88d6
Author: Boden R <email address hidden>
Date: Wed Feb 15 15:47:07 2017 -0700

    Neutron resource diagnostics

    This spec proposes the introduction of a neutron diagnostics framework
    and API extension capable collecting resource diagnostics across
    neutron API and agent nodes. To keep the spec containable, the proposal
    suggests only providing a sample diagnostic check and reiterating on
    concrete diagnostics once we get the plumbing in place.

    While this spec has some inspiration from nova diagnostics [1],
    the approach herein is more generic and extensible supporting a
    broader set of use cases longer term.

    Finally it seeks to pave the way for supporting use case / features
    proposed in the related bugs.

    [1] https://wiki.openstack.org/wiki/Nova_VM_Diagnostics

    Related-Bug: #1507499
    Related-Bug: #1519537
    Related-Bug: #1537686
    Related-Bug: #1563538

    Change-Id: Id534acb1593f1fe210c561b1451656dce69514db
