log segmentation_id over threshold for monitoring

Bug #1559978 reported by Miguel Angel Ajo
This bug affects 1 person
Affects: neutron
Status: Expired
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Use case
========
Monitoring of the "segmentation resources".

Logging the status of these resources as they are consumed (or logging
when usage passes a certain threshold) would allow monitoring solutions
to detect when certain levels are crossed and warn the administrator to
take action: cleaning up unused tenant networks, changing configuration,
switching segmentation technologies, etc.

Description
===========
Depending on configuration and the underlying technologies, the segmentation
ids can be exhausted (vlan/vni/tunnel keys, etc.), making them a consumable
resource.

External monitoring solutions have no easy way to determine the amount of
"segmentation resources" still available on the underlying technology.

Alternatives
============
One alternative could be providing a generic API to retrieve resource
usage. That would require the monitoring solution to make API calls and
therefore hold credentials, making it harder to leverage standard
deployments and monitoring tools. This could also be considered as a
second step of this RFE.

description: updated
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Logging to syslog? Or emitting a message to notifications.*?

Changed in neutron:
status: New → Confirmed
importance: Undecided → Wishlist
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Waiting on submitter feedback. The use case is not clear.

Changed in neutron:
status: Confirmed → Incomplete
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

I was proposing the normal neutron-server log, or plain syslog if that could also be an option.

Changed in neutron:
status: Incomplete → Confirmed
Revision history for this message
Akihiro Motoki (amotoki) wrote :

A warning log message sounds good when the usage is over the threshold.
I agree this is a light and easy way to do it.

On the other hand, the usage API also looks attractive. I think operators (using VLAN or another technology whose ID space is not so big) already monitor the usage of the ID space, by looking at the database directly or by monitoring the number of networks.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@amotoki, you're right, an API could also be beneficial, we could consider that as a second RFE if we find somebody with resources to work on it.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Exposing information about the segmentation range is plugin specific; on the other hand, emitting a trace is a loosely coupled contract. We already emit network creation events etc., so why would those not suffice to infer whether you're at capacity or not?

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Hmm, it's plugin specific, that's right.

Armando, what do you mean by network creation events? Notifications? Logs?

If it's logs, it would suffice to make them non-debug. But I agree that's quite loosely coupled;
I will think about it and see if I can come up with something better.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

@Ajo: I meant the oslo.messaging notifications that are consumed by something like Ceilometer.

Any update on your thinking process?

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Sorry, I forgot about this one.

Let me explain where the request came from. I'm starting to lean towards Ceilometer integration, though I may need to read up on it first, because I know very little about Ceilometer myself.

This request came from a team we have in Red Hat integrating common logging services with OpenStack (fluentd, logstash, kibana), and they wanted to build filters to identify common situations like this.

I suspect a warning in something like Nagios would be more appropriate.

I'll look into our Ceilometer integration to see if this could fit there.

The question is: how do we make sure the administrator is alerted when this resource limit is being approached or hit?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

If the only issue you want to solve here is to be alerted that you're not reaching the limit of the segmentation ids you have at your disposal, you could easily do this as a post-processing thing: you know the size of your id range, you watch how many networks are created/destroyed (via Ceilometer or an equivalent mechanism), and you can tell whether you're running out or not.
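
For illustration, a minimal sketch of that post-processing approach,
consuming Neutron's network.create.end / network.delete.end notifications
with oslo.messaging; the transport URL, range size, and warning threshold
below are assumptions to adjust per deployment::

    import logging

    from oslo_config import cfg
    import oslo_messaging

    LOG = logging.getLogger(__name__)

    VLAN_RANGE_SIZE = 4000  # assumed size of the configured id range
    WARN_THRESHOLD = 0.8    # assumed alerting ratio

    class SegmentationUsageEndpoint(object):
        """Count live networks from Neutron's create/delete notifications."""

        def __init__(self):
            self.active_networks = 0

        def info(self, ctxt, publisher_id, event_type, payload, metadata):
            if event_type == 'network.create.end':
                self.active_networks += 1
            elif event_type == 'network.delete.end':
                self.active_networks -= 1
            usage = self.active_networks / float(VLAN_RANGE_SIZE)
            if usage >= WARN_THRESHOLD:
                LOG.warning('segmentation id usage at %.0f%% (%d of %d)',
                            usage * 100, self.active_networks,
                            VLAN_RANGE_SIZE)

    transport = oslo_messaging.get_notification_transport(
        cfg.CONF, url='rabbit://guest:guest@localhost:5672/')
    targets = [oslo_messaging.Target(topic='notifications')]
    # A listener pool keeps this consumer from stealing messages from
    # Ceilometer or other subscribers on the same topic.
    listener = oslo_messaging.get_notification_listener(
        transport, targets, [SegmentationUsageEndpoint()],
        pool='segmentation-monitor')
    listener.start()
    listener.wait()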

Now, we could instrument Neutron to emit alarms and agree on a specific format for consumption by third parties, etc., but I am not sure whether this is getting into Monasca territory (though I feel like it is). How are these types of issues solved across the wider OpenStack deployment? I'd think twice before we go and deep-dive into a Neutron-specific solution.

Let's see if someone else wants to chime in.

Changed in neutron:
status: Confirmed → Triaged
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Indeed, it seems like implementing alarm policy is not a job for neutron. Neutron should merely provide basic data points on resource usage, or even, as suggested above, trigger events on resource creation/deletion (that's already there for Ceilometer) and leave the analysis up to external solutions consuming those events.

The only problem I can see with just having create/delete events available is that the underlying resource pool is not known to consumers through any API or notifications, so monitoring tools would need access to the configuration to be able to tell whether the pool is full.

Also, I am not sure whether we expose the network type as part of notifications. That said, I can't find a use case that would require knowing the distinct usage level *per network type*. In the end, neither admins nor users should care whether e.g. VLANs are exhausted if there are still plenty of VNIs available for new tenant networks (?).
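
For what it's worth, reading the pool size from configuration is not a
big lift for a monitoring tool; a sketch that derives the VLAN pool size
from the standard ml2_type_vlan option (the file path and helper are
illustrative)::

    import configparser

    def vlan_pool_size(path='/etc/neutron/plugins/ml2/ml2_conf.ini'):
        """Total number of VLAN ids configured across all physnets."""
        parser = configparser.ConfigParser()
        parser.read(path)
        ranges = parser.get('ml2_type_vlan', 'network_vlan_ranges',
                            fallback='')
        total = 0
        for entry in ranges.split(','):
            parts = entry.strip().split(':')
            if len(parts) == 3:  # physnet:min:max
                total += int(parts[2]) - int(parts[1]) + 1
        return total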

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

If the use case is: as an admin, I want to know how many Neutron networks I have left at my disposal before my tenants start complaining that their neutron net-create command failed, then there's no obvious mechanism to address this need.

Depending on the plugin backend, you can extrapolate this information more or less easily. We could come up with an effort similar to [1], and it would have to be backend dependent.

[1] http://docs.openstack.org/developer/neutron/devref/network_ip_availability.html
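
By analogy with [1], a purely hypothetical payload for such an
availability API could look like this (no such extension exists;
resource and field names are invented)::

    # Invented payload, loosely modeled on network-ip-availability [1].
    segment_availability = {
        'network_segment_availabilities': [
            {'network_type': 'vlan',
             'physical_network': 'physnet1',
             'total_segments': 4000,
             'used_segments': 3200},
            {'network_type': 'vxlan',
             'physical_network': None,
             'total_segments': 16777215,
             'used_segments': 1024},
        ]
    }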

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Thanks armax for using better words to explain what I was trying to suggest;
comment #13 matches perfectly the need I was trying to express based on the
admin feedback I had.

In the context of ML2 we could have an API that returns availability details
per "physical network". I'm using quotes because tunnel networks are not
strictly physnets, so we probably need a better term.

If this makes sense, I will refresh the description.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

An API seems overkill though; as I said in #13, there's no easy mechanism to address this need, but putting something together is not wildly complicated either. The cost involved in designing, agreeing on, and maintaining an API for the benefit of this use case makes me wonder whether it is worth the effort.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Yes, this approach involves a tricky cost/benefit balance.

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Discussed in the Drivers' meeting: http://eavesdrop.openstack.org/meetings/neutron_drivers/2016/neutron_drivers.2016-06-02-22.00.log.html#l-36

Honestly, we spun our wheels a lot in the meeting and didn't really get anywhere. We do think that this is really only an issue for VLANs. Is that a good assumption? Other tunnel types probably don't have limitations that anyone is concerned with. Is that right? If so, it seems like overkill to design a backend-dependent mechanism for this.

There was a very weak consensus suggesting that we could start by just logging segmentation usage, which is where this RFE started in the first place.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Removing RFE since this will just be a simple log message to start with.

tags: removed: rfe
Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

We're going to go with logging for now. I don't think this needs to be an RFE.

summary: - [RFE] log segmentation_id over threshold for monitoring
+ log segmentation_id over threshold for monitoring
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

This bug is > 180 days without activity. We are unsetting assignee and milestone and setting status to Incomplete in order to allow its expiry in 60 days.

If the bug is still valid, then update the bug status.

Changed in neutron:
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired