Elasticsearch Series: Aggregations

Introduction

Elasticsearch Aggregations provide you with the ability to group and perform calculations and statistics (such as sums and averages) on your data by using a simple search query. An aggregation can be viewed as a working unit that builds analytical information across a set of documents.

Let’s look at the different types of aggregations.

Metrics Aggregations

Metrics aggregations compute values over a set of documents, such as an average, a median, or the smallest or highest value.

Aggregations are part of the search API. Here is the general syntax:

GET my_index/_search
{
   "aggs": {
      "my_aggregation": {
         "AGG_TYPE": {
            ...
         }
      }
   }
}

Example:

What is the total number of movies across all actors?

GET actors/_search
{
   "size": 0,
   "aggs": {
      "total_movies": {
         "sum": {
            "field": "number_of_movies"
         }
      }
   }
}
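
To make the semantics concrete, here is a minimal Python sketch of what this request computes. The sample documents and the aggregation name total_movies are made up for illustration; only the index name and the field number_of_movies come from the example. The dictionary mirrors the JSON request body, and the sum is reproduced locally instead of calling Elasticsearch.

```python
import json

# Hypothetical sample documents standing in for the "actors" index.
actors = [
    {"name": "Actor A", "number_of_movies": 42},
    {"name": "Actor B", "number_of_movies": 17},
    {"name": "Actor C", "number_of_movies": 88},
]

# Request body for a sum aggregation; "total_movies" is an illustrative name.
request_body = {
    "size": 0,  # return no hits, only the aggregation results
    "aggs": {
        "total_movies": {
            "sum": {"field": "number_of_movies"}
        }
    },
}

# Local equivalent of what the sum aggregation computes.
total_movies = sum(doc["number_of_movies"] for doc in actors)

print(json.dumps(request_body, indent=2))
print(total_movies)  # 147
```

Note that every aggregation needs a name of its own (here total_movies) wrapping the aggregation type; the name is what the results are keyed by in the response.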

Setting “size” to 0 leaves the search hits out of the response, so only the aggregation results are returned. This speeds up your request.

Here are some of the metrics aggregations you can use:

avg : A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents.
max : A single-value metrics aggregation that keeps track of and returns the maximum value among the numeric values extracted from the aggregated documents.
min : A single-value metrics aggregation that keeps track of and returns the minimum value among the numeric values extracted from the aggregated documents.
percentiles : A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents (also used to calculate the median, as the 50th percentile).
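
As a rough illustration of what these metrics return, the following Python sketch computes them locally on a handful of made-up values. It is not how Elasticsearch computes them internally: in particular, Elasticsearch approximates percentiles on large data sets, while this version is exact.

```python
import statistics

# Hypothetical values of a numeric field across the matching documents.
values = [3, 12, 25, 40, 70]

metrics = {
    "avg": statistics.mean(values),
    "max": max(values),
    "min": min(values),
    # The 50th percentile is the median.
    "median": statistics.median(values),
}
print(metrics)
```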

Bucket Aggregations

Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do, but instead, they create buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context “falls” into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that “fell into” each bucket.

For example:

Who are the most popular actors?
Which actors have appeared in between 30 and 100 movies?

Here are some of the bucket aggregations you can use:

histogram : builds a histogram on a given field using a specified interval. The number of buckets is dynamic and depends on the data and the interval.
range : lets you define your own intervals, each with an optional lower bound (from) and upper bound (to).
terms : A multi-bucket value source based aggregation where buckets are dynamically built – one per unique value.
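
To show how a histogram assigns documents to buckets, here is a small Python sketch under simple assumptions (no offset, made-up values). Each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. The function name histogram and the sample data are illustrative only.

```python
import math
from collections import Counter

def histogram(values, interval):
    """Count values per bucket; each key is the lower bound of its bucket."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    return dict(sorted(counts.items()))

# Hypothetical movie counts per actor, bucketed in steps of 50.
movie_counts = [5, 32, 48, 51, 99, 120]
buckets = histogram(movie_counts, 50)
print(buckets)  # {0: 3, 50: 2, 100: 1}
```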

Examples:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 100.0 },
                    { "from" : 100.0, "to" : 200.0 },
                    { "from" : 200.0 }
                ]
            }
        }
    }
}
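
The boundary behaviour of the range aggregation is easy to get wrong: each range includes its from value and excludes its to value. The Python sketch below mimics that rule locally. The ranges are the same as in the request; the prices and the helper name range_buckets are made up for illustration.

```python
def range_buckets(values, ranges):
    """Count values per range; "from" is inclusive, "to" is exclusive."""
    result = []
    for r in ranges:
        low = r.get("from", float("-inf"))
        high = r.get("to", float("inf"))
        count = sum(1 for v in values if low <= v < high)
        result.append({**r, "doc_count": count})
    return result

# Same ranges as the request; the prices are made-up sample data.
ranges = [{"to": 100.0}, {"from": 100.0, "to": 200.0}, {"from": 200.0}]
prices = [20.0, 99.99, 100.0, 150.0, 200.0, 350.0]

for bucket in range_buckets(prices, ranges):
    print(bucket)
```

With this sample data a price of exactly 100.0 lands in the second bucket and a price of exactly 200.0 in the third, so no document is counted twice.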

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : { "field" : "genre" } 
        }
    }
}
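
A terms aggregation is essentially a "count per distinct value". As a sketch, the same idea in Python with made-up genre values looks like this. By default Elasticsearch returns the ten most frequent terms, ordered by descending document count, which the call to most_common(10) imitates.

```python
from collections import Counter

# Hypothetical "genre" values, one per document.
genres = ["rock", "jazz", "rock", "electronic", "rock", "jazz"]

# most_common() orders by descending count, like the default terms ordering.
buckets = [
    {"key": genre, "doc_count": count}
    for genre, count in Counter(genres).most_common(10)
]
print(buckets)
```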

GET logs_server*/_search
{
   "size": 0,
   "aggs": {
      "logs_by_day": {
         "date_histogram": {
            "field": "@timestamp",
            "interval": "day"
         }
      }
   }
}

Combining Aggregations

An aggregation can combine bucket and metrics aggregations. For example, what is the sum of some value per day? Or how many events are there per month, for each host? This is why aggregations are so powerful: they can be nested and combined freely.

In this example, we compute a metric inside each bucket:

GET logs_server*/_search
{
   "size": 0,
   "aggs": {
      "requests_per_day": {
         "date_histogram": {
            "field": "@timestamp",
            "interval": "day"
         },
         "aggs": {
            "daily_number_of_bytes": {
               "sum": {
                  "field": "response_size"
               }
            }
         }
      }
   }
}
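
The nesting reads as two steps: first bucket the documents, then compute the metric inside each bucket. Here is a minimal Python sketch of those two steps on made-up log events. Only the field names @timestamp and response_size come from the request; the values are invented.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log events; only the two field names come from the request.
logs = [
    {"@timestamp": "2017-03-01T08:15:00", "response_size": 512},
    {"@timestamp": "2017-03-01T21:40:00", "response_size": 2048},
    {"@timestamp": "2017-03-02T10:05:00", "response_size": 1024},
]

daily_bytes = defaultdict(int)
for event in logs:
    # Bucket step: truncate the timestamp to its calendar day.
    day = datetime.fromisoformat(event["@timestamp"]).date().isoformat()
    # Metric step: accumulate the sum inside that bucket.
    daily_bytes[day] += event["response_size"]

print(dict(daily_bytes))
```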

Finally, here is an example of nesting buckets inside buckets, called sub-buckets.

The log events are bucketed by month, then within each month the events are further bucketed by host:

"aggregations": {
   "logs_by_month": {
      "buckets": [
         {
            "key_as_string" : "2017-03-01T00:00:00.000Z",
            "key" : 1488326400000,
            "doc_count" : 252,
            "host_name" : {
               "doc_count_error_upper_bound" : 0,
               "sum_other_doc_count" : 0,
               "buckets" : [
                  {
                     "key" : "server2",
                     "doc_count" : 147
                  },
                  {
                     "key" : "server3",
                     "doc_count" : 91
                  },
                  {
                     "key" : "server1",
                     "doc_count" : 14
                  }
               ]
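
The request that produced this response is not shown. As a sketch, a request body along the following lines would yield a response of that shape. The aggregation names logs_by_month and host_name are taken from the response, but the field names (@timestamp, host) and the interval setting are assumptions made for illustration.

```python
import json

# Assumed request: the aggregation names match the response shown, but the
# field names and the "interval" value are guesses for illustration.
request_body = {
    "size": 0,
    "aggs": {
        "logs_by_month": {
            "date_histogram": {"field": "@timestamp", "interval": "month"},
            "aggs": {
                "host_name": {
                    "terms": {"field": "host"}  # hypothetical field name
                }
            },
        }
    },
}
print(json.dumps(request_body, indent=2))
```

In the response, the three host counts (147, 91 and 14) add up to the month's doc_count of 252, which is consistent with sum_other_doc_count being 0: no documents fell outside the returned host buckets.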

Sources :

https://www.elastic.co