Facet doc count reports more docs than actual

Bug #1503080 reported by Travis Tripp
This bug affects 1 person
Affects: OpenStack Searchlight
Status: Fix Released
Importance: Medium
Assigned to: Steve McLellan

Bug Description

Some facet doc counts report inaccurate numbers. For example, the OS::Nova::Server facets include the following:

    {
      "type": "string",
      "name": "networks.name",
      "options": [
        {
          "key": "private",
          "doc_count": 6
        }
      ]
    },
    {
      "type": "string",
      "name": "networks.OS-EXT-IPS:type",
      "options": [
        {
          "key": "fixed",
          "doc_count": 6
        }
      ]
    }

The facet reports a doc_count of 6, but only 3 servers are actually connected to the "private" network. Each server has multiple listings for the private network (in the data below, one IPv4 and one IPv6 address), and each listing is counted separately. This makes the doc_count look inaccurate to users unless it comes with a lot of explanation. See the snippet of data below. Note that networks are one of the things we double index to allow for proper searching.

      {
        "_score": 1,
        "_type": "OS::Nova::Server",
          "addresses": {
            "private": [
              {
                "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:74:6c:2d",
                "version": 4,
                "addr": "10.0.0.4",
                "OS-EXT-IPS:type": "fixed"
              },
              {
                "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:74:6c:2d",
                "version": 6,
                "addr": "fde1:40c0:201f:0:f816:3eff:fe74:6c2d",
                "OS-EXT-IPS:type": "fixed"
              }
            ]
          },
          "networks": [
            {
              "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:74:6c:2d",
              "version": 4,
              "ipv4_addr": "10.0.0.4",
              "name": "private",
              "OS-EXT-IPS:type": "fixed"
            },
            {
              "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:74:6c:2d",
              "ipv6_addr": "fde1:40c0:201f:0:f816:3eff:fe74:6c2d",
              "version": 6,
              "name": "private",
              "OS-EXT-IPS:type": "fixed"
            }
          ],
          "security_groups": [
            {
              "name": "default"
            }
          ],
        },
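To illustrate the mismatch, here is a minimal sketch in plain Python (no Elasticsearch involved). The `servers` list is hypothetical data modeled on the snippet above, and `nested_count` and `parent_count` are illustrative helper names, not Searchlight functions. It contrasts counting nested network entries with counting the servers that contain them.

```python
# Hypothetical data modeled on the report: three servers, each with an
# IPv4 and an IPv6 entry on the "private" network.
servers = [
    {
        "id": f"server-{i}",
        "networks": [
            {"name": "private", "version": 4, "OS-EXT-IPS:type": "fixed"},
            {"name": "private", "version": 6, "OS-EXT-IPS:type": "fixed"},
        ],
    }
    for i in range(1, 4)
]


def nested_count(docs, field, value):
    """Count every nested network entry whose field equals value."""
    return sum(
        1 for doc in docs for net in doc["networks"] if net.get(field) == value
    )


def parent_count(docs, field, value):
    """Count each server once if any of its network entries matches."""
    return sum(
        1
        for doc in docs
        if any(net.get(field) == value for net in doc["networks"])
    )


print(nested_count(servers, "name", "private"))  # 6
print(parent_count(servers, "name", "private"))  # 3
```

The first number corresponds to what the facet currently reports; the second is what a user would expect to see.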

Steve McLellan (sjmc7) wrote :

The solution to this is a 'reverse nested' aggregation: https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html. For instance, in the example above (I have two servers), with the current code each server is counted twice in the networks buckets, once per nested network entry, even though it's just one document:

    {
      "type": "string",
      "name": "networks.OS-EXT-IPS:type",
      "options": [
        {
          "key": "fixed",
          "doc_count": 4
        }
      ]
    }

Adding a reverse_nested aggregation gives the following (notice the extra _unique_docs entry):

         {
            "name": "networks.OS-EXT-IPS:type",
            "options": [
                {
                    "doc_count": 4,
                    "key": "fixed",
                    "networks__OS-EXT-IPS:type_unique_docs": {
                        "doc_count": 2
                    }
                }
            ],
            "type": "string"
        },
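A request body that yields a result shaped like this could be built roughly as follows. This is a sketch, not the query Searchlight actually sends: the helper name `build_nested_facet_agg` and the outer aggregation names are assumptions, and only the `_unique_docs` sub-aggregation name is taken from the output above. The `nested`, `terms` and `reverse_nested` aggregation types are standard Elasticsearch; which characters are allowed in aggregation names depends on the Elasticsearch version.

```python
import json


def build_nested_facet_agg(field, path):
    """Build a terms aggregation on a nested field with a reverse_nested
    sub-aggregation that counts parent documents per bucket.

    Hypothetical helper for illustration only.
    """
    # Assumed naming scheme: dots replaced with double underscores, which
    # matches the "_unique_docs" key shown in the output above.
    safe_name = field.replace(".", "__")
    return {
        safe_name: {
            # Step into the nested documents under `path`.
            "nested": {"path": path},
            "aggs": {
                safe_name: {
                    # One bucket per distinct value of the nested field.
                    "terms": {"field": field},
                    "aggs": {
                        # Step back out to the parent documents so each
                        # server is counted once per bucket.
                        safe_name + "_unique_docs": {"reverse_nested": {}}
                    },
                }
            },
        }
    }


query = {
    "size": 0,
    "aggs": build_nested_facet_agg("networks.OS-EXT-IPS:type", "networks"),
}
print(json.dumps(query, indent=2))
```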

We'd then need to transform the results slightly: replace each option's doc_count with the count from its _unique_docs entry, then delete that entry. I've not yet found a way to make this the default (i.e. aggregate on the nested field but return the 'reverse' counts by default), which would be better since it avoids meddling with the Elasticsearch response format overmuch.
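A minimal sketch of that transformation, assuming the facet has the shape shown in the reverse_nested output above and that the extra key always ends in `_unique_docs`. The function name `collapse_unique_docs` is hypothetical and is not the name used in the eventual patch.

```python
def collapse_unique_docs(facet, suffix="_unique_docs"):
    """Return a copy of a facet in which each option's doc_count is replaced
    by the reverse_nested (parent document) count and the helper key is
    removed. Options without a helper key are left unchanged.
    """
    options = []
    for option in facet["options"]:
        cleaned = {}
        unique_count = None
        for key, value in option.items():
            if key.endswith(suffix) and isinstance(value, dict) and "doc_count" in value:
                # Remember the parent-document count and drop the key.
                unique_count = value["doc_count"]
            else:
                cleaned[key] = value
        if unique_count is not None:
            cleaned["doc_count"] = unique_count
        options.append(cleaned)
    return dict(facet, options=options)


facet = {
    "name": "networks.OS-EXT-IPS:type",
    "options": [
        {
            "doc_count": 4,
            "key": "fixed",
            "networks__OS-EXT-IPS:type_unique_docs": {"doc_count": 2},
        }
    ],
    "type": "string",
}

print(collapse_unique_docs(facet))
```

With the sample facet from the comment above, the single option comes back with a doc_count of 2 and no `_unique_docs` key, and the input dictionary is not modified.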

Travis Tripp (travis-tripp) wrote :
Changed in searchlight:
milestone: none → mitaka-1
assignee: nobody → Steve McLellan (sjmc7)
importance: Undecided → Medium
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to searchlight (master)

Reviewed: https://review.openstack.org/236043
Committed: https://git.openstack.org/cgit/openstack/searchlight/commit/?id=5bf32c6b1988a1e58cf3ab187f67c80d5784462b
Submitter: Jenkins
Branch: master

commit 5bf32c6b1988a1e58cf3ab187f67c80d5784462b
Author: Steve McLellan <email address hidden>
Date: Fri Oct 16 13:13:50 2015 -0500

    Fix doc_count for nested fields

    Aggregations for nested fields return a doc_count of matching
    *nested* documents, which leads to results as described in the bug. This
    patch adds a reverse_nested clause to the aggregation query, and
    overwrites the doc_count with that of the reverse count (i.e. how
    many parent documents match).

    The patch also includes a refactoring of the faceting code and tests.
    The previous implementation in plugins/base.py looked messy and didn't
    lend itself to testing; the refactoring allows testing of the faceting
    code separate from plugins, and means the plugin tests can focus on what
    fields get faceted. There's a small refactoring of the functional tests
    to tidy up some helper functions.

    Change-Id: I258ae8a194a689cc5912d4aeaaf253c6c0b9db9a
    Closes-Bug: #1503080

Changed in searchlight:
status: In Progress → Fix Released
Changed in searchlight:
milestone: mitaka-1 → mitaka-3
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/searchlight 0.2.0.0b3

This issue was fixed in the openstack/searchlight 0.2.0.0b3 development milestone.
