radosgw does not capture all usage information when rgw.none exists

Bug #1824360 reported by Eric Miller
This bug affects 1 person
Affects: Ceilometer
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

For some reason, RGW includes an "rgw.none" section in the bucket stats for only some buckets.

When this section exists, the following meters in Ceilometer do not poll the usage data and a resource of type "ceph_account" is not created in Gnocchi for this bucket:

        - radosgw.objects
        - radosgw.objects.size
        - radosgw.objects.containers
        - radosgw.api.request
        - radosgw.containers.objects
        - radosgw.containers.objects.size

An example of the usage section in the "radosgw-admin bucket stats" output:

        "usage": {
            "rgw.none": {
                "size": 0,
                "size_actual": 0,
                "size_utilized": 0,
                "size_kb": 0,
                "size_kb_actual": 0,
                "size_kb_utilized": 0,
                "num_objects": 4
            },
            "rgw.main": {
                "size": 503004188538,
                "size_actual": 504763817984,
                "size_utilized": 503004188538,
                "size_kb": 491215028,
                "size_kb_actual": 492933416,
                "size_kb_utilized": 491215028,
                "num_objects": 708993
            }
        },

Some projects include multiple buckets, where some buckets have rgw.none, and others don't. Those that have rgw.none result in nothing being stored in Gnocchi.
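
A quick way to check which buckets carry an rgw.none section is to dump the stats to JSON and scan them (a minimal sketch; the file name bucket-stats.json is just an assumed dump of "radosgw-admin bucket stats --format=json"):

    import json

    # bucket-stats.json: a saved dump of "radosgw-admin bucket stats --format=json"
    with open("bucket-stats.json") as f:
        buckets = json.load(f)

    for b in buckets:
        usage = b.get("usage", {})
        if "rgw.none" in usage:
            print("%s: rgw.none num_objects=%s" % (
                b["bucket"], usage["rgw.none"].get("num_objects")))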

Any idea why rgw.none exists? And why does the Ceilometer RGW code end up ignoring the rgw.main usage information?

Thanks!

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

While looking for a pattern, I noticed that "swift_account" resources exist in Gnocchi for the missing "ceph_account" resources, and that those "swift_account" resources carry the "radosgw.containers.objects" and "radosgw.containers.objects.size" metrics.

We had recently switched from native Swift to Ceph+RGW with Swift API compatibility, and it looks like the Swift resources that had been stored in Gnocchi remained.

This is an example project that had an issue (project id = 5f6e2a6fcdbc427c81cb462f45b147d7)

gnocchi resource list --detail | grep 5f6e2a6fcdbc427c81cb462f45b147d7

| 5f6e2a6f-cdbc-427c-81cb-462f45b147d7 | swift_account | 5f6e2a6fcdbc427c81cb462f45b147d7 | None | 5f6e2a6fcdbc427c81cb462f45b147d7 | 2019-03-25T04:21:39.139673+00:00 | None | 2019-03-25T04:21:39.139694+00:00 | None | 39ef552637e74e32abdf25dd7454231f:f0a94de0d6184e10814660d470721c31 |
| 00a842f1-646d-5595-8350-670ec2f41c13 | swift_account | 5f6e2a6fcdbc427c81cb462f45b147d7 | None | 5f6e2a6fcdbc427c81cb462f45b147d7_ABC Backup | 2019-03-25T04:21:42.227916+00:00 | None | 2019-03-25T04:21:42.227935+00:00 | None | 39ef552637e74e32abdf25dd7454231f:f0a94de0d6184e10814660d470721c31 |
| 6858266a-0086-5c9b-8627-48ed48afb3db | swift_account | 5f6e2a6fcdbc427c81cb462f45b147d7 | 84a7b6e6bad442278f4c1630f5b4e0c7 | swift_v1_AUTH_5f6e2a6fcdbc427c81cb462f45b147d7 | 2019-03-31T16:36:35.613956+00:00 | None | 2019-03-31T16:36:35.613975+00:00 | None | 39ef552637e74e32abdf25dd7454231f:f0a94de0d6184e10814660d470721c31 |
| cd7c570d-98c7-5ea8-9c91-54d49961e056 | ceph_account | 5f6e2a6fcdbc427c81cb462f45b147d7 | None | 5f6e2a6fcdbc427c81cb462f45b147d7_ABC Backup_segments | 2019-04-01T10:17:18.168959+00:00 | None | 2019-04-01T10:17:18.168980+00:00 | None | 39ef552637e74e32abdf25dd7454231f:f0a94de0d6184e10814660d470721c31 |

So, this likely isn't a bug, but rather a misunderstanding of how resources were handled when switching from Swift to Ceph.

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I thought I would try deleting the above resources from Gnocchi, starting with the swift_account resources, thinking that they would be re-created by Ceilometer. However, they were not.

I also deleted the ceph_account resource (using "gnocchi resource delete <resource id>"), and waited a bit, and Ceilometer did not re-create the resources.

"radosgw-admin bucket stats" produces correct usage stats for the missing values, and Ceilometer is polling the RGW on a regular basis to get these stats.

For example:

tail -f client.radosgw.gateway.log | grep 5f6e2a6fcdbc427c81cb462f45b147d7

2019-04-11 21:52:17.953072 7fd189098700 1 civetweb: 0x5572279bb000: 172.17.0.2 - - [11/Apr/2019:21:52:17 -0500] "GET /admin/bucket?stats=true&uid=5f6e2a6fcdbc427c81cb462f45b147d7%245f6e2a6fcdbc427c81cb462f45b147d7 HTTP/1.1" 200 0 - python-requests/2.20.1
2019-04-11 21:52:48.690492 7fd279a79700 1 civetweb: 0x557226fee000: 172.17.0.2 - - [11/Apr/2019:21:52:48 -0500] "GET /admin/bucket?stats=true&uid=5f6e2a6fcdbc427c81cb462f45b147d7%245f6e2a6fcdbc427c81cb462f45b147d7 HTTP/1.1" 200 0 - python-requests/2.20.1
2019-04-11 21:53:18.018313 7fd266252700 1 civetweb: 0x55722709a000: 172.17.0.2 - - [11/Apr/2019:21:53:17 -0500] "GET /admin/bucket?stats=true&uid=5f6e2a6fcdbc427c81cb462f45b147d7%245f6e2a6fcdbc427c81cb462f45b147d7 HTTP/1.1" 200 0 - python-requests/2.20.1
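
For reference, the same admin request can be reproduced outside Ceilometer with a few lines of Python. This is only a sketch; the endpoint, admin credentials, and the use of the requests-aws S3Auth helper are assumptions here:

    import requests
    from awsauth import S3Auth  # pip install requests-aws

    # Assumed values; substitute the real RGW endpoint and admin credentials
    host = "rgw.example.com:8080"
    access_key = "ADMIN_ACCESS_KEY"
    secret_key = "ADMIN_SECRET_KEY"
    uid = "5f6e2a6fcdbc427c81cb462f45b147d7$5f6e2a6fcdbc427c81cb462f45b147d7"

    resp = requests.get("http://%s/admin/bucket" % host,
                        params={"stats": "true", "uid": uid},
                        auth=S3Auth(access_key, secret_key, host))
    resp.raise_for_status()

    # Each bucket entry has a "usage" dict that may contain rgw.none,
    # rgw.main, or both.
    for bucket in resp.json():
        print(bucket["bucket"], list(bucket.get("usage", {}).keys()))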

Running this, to see if there are any gnocchi resources, produces nothing:

gnocchi resource list --detail | grep 5f6e2a6fcdbc427c81cb462f45b147d7

Any idea why Ceilometer isn't re-creating these resources?

Thanks!

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I thought I'd try something a bit more drastic:
1) stopping Ceilometer
2) dropping the gnocchi MySQL table
3) purging all objects in the Ceph gnocchi pool
4) re-deploying (I'm using Kolla Ansible) gnocchi and ceilometer

After this, almost everything is working; however, the issue still exists: incorrect values are stored in Gnocchi when a container has an "rgw.none" section under "usage" in the "radosgw-admin bucket stats" output.

For example, this is one of the containers with an rgw.none section (same project as previously described):

[root@kollaansibledeploy000 ~]# gnocchi resource show 00a842f1-646d-5595-8350-670ec2f41c13
+-----------------------+-----------------------------------------------------------------------+
| Field | Value |
+-----------------------+-----------------------------------------------------------------------+
| created_by_project_id | f0a94de0d6184e10814660d470721c31 |
| created_by_user_id | 39ef552637e74e32abdf25dd7454231f |
| creator | 39ef552637e74e32abdf25dd7454231f:f0a94de0d6184e10814660d470721c31 |
| ended_at | None |
| id | 00a842f1-646d-5595-8350-670ec2f41c13 |
| metrics | radosgw.containers.objects.size: 1340f57d-92c1-41c8-88e5-1ce8eeceef6a |
| | radosgw.containers.objects: 0a6ef3a1-f390-4145-b681-4e70d3bfa305 |
| original_resource_id | 5f6e2a6fcdbc427c81cb462f45b147d7_ABC Backup |
| project_id | 5f6e2a6fcdbc427c81cb462f45b147d7 |
| revision_end | None |
| revision_start | 2019-04-12T04:33:45.388028+00:00 |
| started_at | 2019-04-12T04:33:45.388011+00:00 |
| type | ceph_account |
| user_id | None |
+-----------------------+-----------------------------------------------------------------------+

Looking at the radosgw.containers.objects measures:

gnocchi measures show 0a6ef3a1-f390-4145-b681-4e70d3bfa305 | tail -n 4

| 2019-04-11T23:48:00-05:00 | 30.0 | 4.0 |
| 2019-04-11T23:48:30-05:00 | 30.0 | 4.0 |
| 2019-04-11T23:49:00-05:00 | 30.0 | 4.0 |
+---------------------------+-------------+-------+

You can see that it shows 4 objects. But this container actually has 710,471 objects. The "rgw.none" usage information shows 4, which matches the measure output above. So it appears that only the rgw.none usage information is being recorded.

The radosgw-admin bucket stats show this (cropped output to show only this container):

    {
        "bucket": "ABC Backup",
        "zonegroup": "8b24c8a7-611d-4da7-b405-d04bcb7be304",
        "placement_rule": "de...


Revision history for this message
Eric Miller (erickmiller) wrote :

I think I found the bug.

The pollsters ContainersObjectsPollster and ContainersSizePollster loop through the bucket_info['buckets'] array, as shown here:
https://github.com/openstack/ceilometer/blob/2b69cb40923c994beecdd85c863880ab96b6662e/ceilometer/objectstore/rgw.py#L136

The bucket_info['buckets'] array is created here:
https://github.com/openstack/ceilometer/blob/2b69cb40923c994beecdd85c863880ab96b6662e/ceilometer/objectstore/rgw_client.py#L64

I created a version of this code that can be run in a standard Python command prompt, included at the end of this message.
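
(That standalone version was cut off from this comment, so the snippet below is a sketch of that kind of test harness rather than the original script: it assumes the stats were saved to bucket-stats.json with "radosgw-admin bucket stats --format=json" and repeats the same loop that get_bucket() runs.)

    import collections
    import json

    Bucket = collections.namedtuple('Bucket', 'name num_objects size')

    # bucket-stats.json: a saved dump of "radosgw-admin bucket stats --format=json"
    with open("bucket-stats.json") as f:
        json_data = json.load(f)

    stats = {'num_buckets': len(json_data), 'num_objects': 0,
             'size': 0, 'buckets': []}

    # Same behaviour as get_bucket(): one Bucket entry ends up appended per
    # usage category, so a bucket that reports both rgw.none and rgw.main
    # shows up twice under the same name.
    for it in json_data:
        for v in it["usage"].values():
            stats['num_objects'] += v["num_objects"]
            stats['size'] += v["size_kb"]
            stats['buckets'].append(Bucket(it["bucket"],
                                           v["num_objects"],
                                           v["size_kb"]))

    print(stats)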

The value of "stats" is:

{'buckets': [Bucket(name=u'ABC Backup', num_objects=4, size=0), Bucket(name=u'ABC Backup', num_objects=709203, size=491296676), Bucket(name=u'ABC Backup_segments', num_objects=33, size=158347544)], 'size': 649644220, 'num_objects': 709248, 'num_buckets': 2}

Two buckets are included with the same name "ABC Backup", the first being the "rgw.none" usage info (what we do NOT want), the second being the "rgw.main" usage info (what we want).

The sample.Sample() method call (for the ContainersSizePollster) here:
https://github.com/openstack/ceilometer/blob/2b69cb40923c994beecdd85c863880ab96b6662e/ceilometer/objectstore/rgw.py#L137

is run once per bucket in the array, so Gnocchi receives two "measures add" commands, one for each of these buckets, and it must only be storing the first one, which is the "rgw.none" bucket.

The same problem occurs in the ContainersObjectsPollster here:
https://github.com/openstack/ceilometer/blob/2b69cb40923c994beecdd85c863880ab96b6662e/ceilometer/objectstore/rgw.py#L118

A simple fix is to adjust the get_bucket() method here:
https://github.com/openstack/ceilometer/blob/2b69cb40923c994beecdd85c863880ab96b6662e/ceilometer/objectstore/rgw_client.py#L64

by adjusting the for loop to only get the values for the rgw.main usage section:

for it in json_data:
    v = it.get("usage").get("rgw.main")
    stats['num_objects'] += v["num_objects"]
    stats['size'] += v["size_kb"]
    stats['buckets'].append(Bucket(it["bucket"],
                                   v["num_objects"],
                                   v["size_kb"]))

This produces the correct stats:
{'buckets': [Bucket(name=u'ABC Backup', num_objects=709203, size=491296676), Bucket(name=u'ABC Backup_segments', num_objects=33, size=158347544)], 'size': 649644220, 'num_objects': 709236, 'num_buckets': 2}

Note that I had removed the "self" reference, so the patched version for rgw_client.py would actually look like this:

        for it in json_data:
            v = it.get("usage").get("rgw.main")
            stats['num_objects'] += v["num_objects"]
            stats['size'] += v["size_kb"]
            stats['buckets'].append(self.Bucket(it["bucket"],
                                                v["num_objects"],
                                                v["size_kb"]))

To test this, I patched the code in the ceilometer_central container (deployed by Kolla Ansible in our environment) here:
/var/lib/kolla/venv/lib/python2.7/site-packages/ceilometer/objectstore/rgw_client.py

and recompiled that file with:
python -m compileall /var/lib/kolla/ve...


Revision history for this message
Eric Miller (erickmiller) wrote :

I forgot to include the latest patch for rgw_client.py in my message above. It should have a try/except to deal with buckets that have no rgw.main, otherwise an error occurs that breaks everything. :)

        for it in json_data:
            try:
                v = it.get("usage").get("rgw.main")
                stats['num_objects'] += v["num_objects"]
                stats['size'] += v["size_kb"]
                stats['buckets'].append(self.Bucket(it["bucket"],
                                                    v["num_objects"],
                                                    v["size_kb"]))
            except Exception:
                # Buckets that only report rgw.none have no rgw.main section;
                # skip them rather than failing the whole poll.
                pass
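
An equivalent variant without the bare except (just a sketch, not what I actually deployed) would be to skip buckets that lack an rgw.main section explicitly:

        for it in json_data:
            # Buckets that only report rgw.none have no rgw.main section;
            # skip them instead of catching every exception.
            v = it.get("usage", {}).get("rgw.main")
            if v is None:
                continue
            stats['num_objects'] += v["num_objects"]
            stats['size'] += v["size_kb"]
            stats['buckets'].append(self.Bucket(it["bucket"],
                                                v["num_objects"],
                                                v["size_kb"]))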

Eric

Revision history for this message
Jegor van Opdorp (jopdorp) wrote :

I also hit this bug, and the fix from ericmiller makes sense to me, will try it in our deployment
