Bug because some pysnmp version implements GETBULK instead of BULKWALK

Bug #1659400 reported by Victor Eduardo Bazterra
This bug affects 1 person
Affects     Status   Importance  Assigned to  Milestone
Ceilometer  Invalid  Undecided   Unassigned

Bug Description

Dear all,

BLOT:

pysnmp version 4.2.5 (allowed by the ceilometer requirements) implements bulkCmd using GETBULK. In pysnmp version 4.3.2, bulkCmd is implemented using BULKWALK. I believe the snmp inspector's OID caching assumes that bulkCmd is implemented using BULKWALK. Therefore, the requirement should be updated to pysnmp<5.0.0,>=4.3.2

DETAILS:

I found an issue with the pysnmp version that is set in the requirements for ceilometer:

https://git.openstack.org/cgit/openstack/ceilometer/tree/requirements.txt#n32

This bug mostly affects me because I am prototyping some changes to the snmp inspector. However, I think it is important enough to cause serious issues if not addressed now. It is also, as I found out, easy to fix.

I am using snmp to collect information about hardware. The heart of the code is the snmp inspector, which makes snmp calls and collects the results of previous calls in a cache.

https://git.openstack.org/cgit/openstack/ceilometer/tree/ceilometer/hardware/inspector/snmp.py#n242

When running the snmp inspector for type_prefix, the inspector uses bulkCmd from pysnmp to collect all the OIDs for a given OID prefix.

https://git.openstack.org/cgit/openstack/ceilometer/tree/ceilometer/hardware/inspector/snmp.py#n159

Each time inspect_generic is called for a new type_prefix, the inspector checks whether it is in the cache by looking for at least one metric sharing the same prefix. If this is the case, my understanding is that the inspector assumes that all the OID values of the subtrees with the common prefix (and only with the common prefix) are cached.

This can only be true if bulkCmd collects all the OIDs within the subtrees with the same prefix, i.e. it does not collect metrics with different prefixes. This is equivalent to saying that bulkCmd implements the equivalent of the BULKWALK command. However, I found out that this depends on the pysnmp version that is used.
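The cache check described above can be sketched roughly as follows. This is a simplified illustration and not the actual ceilometer code: the helper name `prefix_in_cache` and the cache layout (a dict keyed by dotted OID strings) are assumptions made for the example.

```python
def prefix_in_cache(oid_cache, prefix):
    """Return True if at least one cached OID lies under `prefix`.

    Hypothetical helper: it mirrors the "one hit means the whole
    subtree is cached" assumption described in the report.
    """
    subtree = prefix + '.'
    return any(oid.startswith(subtree) for oid in oid_cache)


# A cache that holds only ONE of the entries under ...1.1.10, as would
# happen if an earlier bulk request returned OIDs past its own prefix.
oid_cache = {
    '1.3.6.1.4.1.2021.13.15.1.1.9.1': 100,
    '1.3.6.1.4.1.2021.13.15.1.1.9.2': 200,
    '1.3.6.1.4.1.2021.13.15.1.1.10.1': 300,
}

# The check passes even though the ...1.1.10 subtree is incomplete,
# so no further SNMP request would be issued for it.
print(prefix_in_cache(oid_cache, '1.3.6.1.4.1.2021.13.15.1.1.10'))
```

The check can only tell that something under the prefix is cached, not that the subtree is complete, which is why it relies on every bulk request staying inside its own prefix.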

The pysnmp requirement in the master branch is pysnmp<5.0.0,>=4.2.3, so on my system I was initially using pysnmp==4.2.5. For this version, it seems bulkCmd == GETBULK. To verify this I wrote this small piece of python code:

---
from pprint import pprint as pp
from pysnmp.entity.rfc3413.oneliner import cmdgen

# Community string and target agent ('hostname' is a placeholder)
auth = cmdgen.CommunityData('public')
transport = cmdgen.UdpTransportTarget(('hostname', 161))
cmdrun = cmdgen.CommandGenerator()

# Positional arguments after the transport: nonRepeaters=0,
# maxRepetitions=100, then the OID prefix to start from
errorIndication, errorStatus, errorIndex, varBindTable = cmdrun.bulkCmd(
    auth, transport, 0, 100, '1.3.6.1.4.1.2021.13.15.1.1.9', lookupValues=True
)

pp(varBindTable)
---

When run with pysnmp==4.2.5, this code returns exactly 100 records, including OIDs that are not under the initial prefix, like 1.3.6.1.4.1.2021.13.15.1.1.10.X. This is how a GETBULK is expected to behave, unless the starting OID of the request is fewer than 100 GETNEXT steps away from the End-of-MIB.

If I run the same test code using pysnmp==4.3.2, I get back ~32 records (depending on the number of disks in the host), all of them with the same prefix as the requested OID, i.e. 1.3.6.1.4.1.2021.13.15.1.1.9.X. This is the expected behavior for BULKWALK!
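The difference between the two behaviors can be illustrated with a toy model over an in-memory list of OIDs. This is only a sketch under stated assumptions: `get_bulk` and `bulk_walk` are hypothetical names, OIDs are shortened tuples of integers, and the model ignores non-repeaters and the other details of the real protocol and of pysnmp's internals.

```python
def get_bulk(mib, start, max_repetitions):
    """Model one GETBULK: the next `max_repetitions` OIDs after `start`
    in lexicographic order, regardless of which subtree they are in."""
    following = [oid for oid in sorted(mib) if oid > start]
    return following[:max_repetitions]


def bulk_walk(mib, prefix, max_repetitions):
    """Model a bulk walk: repeat GETBULK, keep only OIDs under `prefix`,
    and stop as soon as a response leaves that subtree."""
    collected = []
    cursor = prefix
    while True:
        batch = get_bulk(mib, cursor, max_repetitions)
        if not batch:
            return collected  # reached the end of the toy MIB
        for oid in batch:
            if oid[:len(prefix)] != prefix:
                return collected  # left the requested subtree
            collected.append(oid)
        cursor = batch[-1]


# Toy MIB: three rows under column 9 and three under column 10.
mib = [(1, 9, i) for i in (1, 2, 3)] + [(1, 10, i) for i in (1, 2, 3)]

print(get_bulk(mib, (1, 9), 4))   # spills into (1, 10, ...)
print(bulk_walk(mib, (1, 9), 4))  # stays inside (1, 9, ...)
```

In this model a single GETBULK always returns up to `max_repetitions` entries, whichever subtree they fall in, while the walk trims the result to the requested prefix.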

In my case I have the following problem if bulkCmd == GETBULK. I have cases where a call for the prefix 1.3.6.1.4.1.2021.13.15.1.1.9 added to the cache some entries (but not all of them) with prefix 1.3.6.1.4.1.2021.13.15.1.1.10. If after that I try to collect all the metrics with prefix 1.3.6.1.4.1.2021.13.15.1.1.10, the inspector detects that there is at least one entry with that prefix. Therefore, it does not schedule any more snmp calls for OIDs with prefix 1.3.6.1.4.1.2021.13.15.1.1.10. As a result, the inspector misses a large number of records, resulting in disappearing metrics. Even worse, the resulting behavior depends on the order in which the prefix metrics are called, so metrics disappear sporadically!

Sorry for the long explanation, it was a subtle issue that really consumed a lot of my time to understand. I am new to ceilometer and also pysnmp.
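A minimal, self-contained sketch of this order dependence is shown below. Everything in it is hypothetical: `RESPONSES` stands in for canned GETBULK-style replies from an agent, the OIDs are shortened, and `inspect` is a stripped-down stand-in for the inspector, not the ceilometer implementation.

```python
# Canned responses of a hypothetical GETBULK-style fetch: the request
# for column 9 overshoots and also returns the first row of column 10.
RESPONSES = {
    '1.9': {'1.9.1': 'a', '1.9.2': 'b', '1.10.1': 'c'},
    '1.10': {'1.10.1': 'c', '1.10.2': 'd', '1.10.3': 'e'},
}


def inspect(prefix, cache):
    """Return the cached OIDs under `prefix`, fetching only when the
    cache holds nothing at all for that prefix."""
    subtree = prefix + '.'
    if not any(oid.startswith(subtree) for oid in cache):
        cache.update(RESPONSES[prefix])
    return sorted(oid for oid in cache if oid.startswith(subtree))


def run(order):
    """Inspect the prefixes in the given order with one shared cache."""
    cache = {}
    return {prefix: inspect(prefix, cache) for prefix in order}


print(run(['1.9', '1.10'])['1.10'])  # only the row that leaked in
print(run(['1.10', '1.9'])['1.10'])  # all three rows
```

With the same canned data, the rows reported for column 10 differ purely because of the order in which the two prefixes were inspected.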

I hope you find this useful!

Victor

Revision history for this message
Julien Danjou (jdanjou) wrote :

While this is a very interesting and detailed report, Ceilometer works correctly with 4.2.5 so the requirements are not wrong. There's nothing to change here.

It's pretty obvious that, yes, installing the latest version of any dependency might fix bugs. Updating requirements each time a lib releases a new version that fixes some bugs it had is not the job of Ceilometer developers. The requirements list the compatibility, not which versions have bugs or not.

I hope I made things clearer!

Changed in ceilometer:
status: New → Invalid
Revision history for this message
Victor Eduardo Bazterra (baites) wrote :

Dear Julien, let me tell you first that I am looking forward to the book release :-) and I also enjoy the blog.

In any case, I think this is not a bug in the dependency; it looks to me more like a new implementation.

I do not think using the OID prefix to check the cache content will work for all cases when using 4.2.5. In that sense, ceilometer does not work properly for every possible prefix entry in snmp.yaml.

However, it is true that ceilometer works for all the OIDs that are currently supported in the snmp.yaml file. In this sense, this change is currently not needed, as I tried to imply in my bug report. I just wanted to give a heads up.

I am currently monitoring disk I/O using 1.3.6.1.4.1.2021.13.15.1.1. I added a filter option to the type_prefix to avoid streaming a lot of metrics in the subtree (I am only interested in hard disks). I may be able to upstream these changes to be considered by the community. In case this happens, I will also bundle a change to the pysnmp dependency version.

Thanks,
Victor

Revision history for this message
Julien Danjou (jdanjou) wrote :

Hi Victor,

Thanks :)

So if I understand correctly what you're saying, there's a bug with the current dependency in how Ceilometer would work with SNMP? So my question would be: is it possible to write a unit/functional test that shows that bug with pysnmp < 4.3?

Sorry if I misunderstood the original bug report – I'm changing the status to Incomplete until I have a proper understanding of if/how we should fix it. :)

Changed in ceilometer:
status: Invalid → Incomplete
Revision history for this message
Victor Eduardo Bazterra (baites) wrote :

It would be easy for me to make a test for this. I will work on it and prepare a branch for review.

Thx,
Victor

Revision history for this message
gordon chung (chungg) wrote :

closing as not enough details exist and no test was provided. please reopen if you have more details or the test. thanks.

Changed in ceilometer:
status: Incomplete → Invalid