Elasticsearch Series: Queries

Recall and Precision

As we saw previously, we can send queries to Elasticsearch in order to get results.

Three concepts matter when judging those results:

Recall is the proportion of relevant documents that are returned in the results (did we miss any relevant document?)
Precision is the proportion of returned documents that are relevant (did we get irrelevant results?)
Ranking is the ordering of the results by relevance (are the results ordered from most relevant to least relevant?)
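
To make the first two definitions concrete, here is a minimal Python sketch that computes precision and recall for a single query. The document ids are made up for illustration, and in practice the set of relevant documents has to come from human judgment.

```python
def precision_recall(returned_ids, relevant_ids):
    """Return (precision, recall) for one query."""
    returned = set(returned_ids)
    relevant = set(relevant_ids)
    # Documents that were both returned and relevant.
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 of them relevant,
# while 6 documents in the index are relevant in total.
precision, recall = precision_recall(
    returned_ids=["d1", "d2", "d3", "d9"],
    relevant_ids=["d1", "d2", "d3", "d4", "d5", "d6"],
)
print(precision, recall)  # 0.75 0.5
```

Returning more documents usually raises recall and lowers precision, which is the trade-off the queries below let you tune.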

Basic search examples

In this example, we search the “blogs” index for all documents with the word “morgan” or “freeman” in the content field:

GET blogs/_search
{
   "query": {
      "match": {
        "content": "morgan freeman"
      }
   }
}

In this example, we search for documents that contain both words (and):

GET blogs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "morgan freeman",
            "operator": "and"
         }
      }
   }
}

Finally, in this example, at least 2 of the 3 terms need to match:

GET blogs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "morgan freeman actor",
            "minimum_should_match": 2
         }
      }
   }
}
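
The three variants above only differ in how many query terms a document has to contain. The Python sketch below mimics that logic on plain strings. It is only an illustration: the matches function is hypothetical, and it splits on whitespace and lowercases, whereas Elasticsearch runs the field's analyzer.

```python
def matches(text, query, operator="or", minimum_should_match=None):
    """Tell whether a text would be a hit for a simplified match query."""
    doc_terms = set(text.lower().split())
    query_terms = query.lower().split()
    # Count how many query terms appear in the document.
    matched = sum(1 for term in query_terms if term in doc_terms)
    if minimum_should_match is not None:
        return matched >= minimum_should_match
    if operator == "and":
        return matched == len(query_terms)
    return matched >= 1  # default "or": one term is enough


doc = "Morgan Stanley quarterly report"
print(matches(doc, "morgan freeman"))                  # True
print(matches(doc, "morgan freeman", operator="and"))  # False
print(matches("Morgan Freeman is an actor", "morgan freeman actor",
              minimum_should_match=2))                 # True
```

The default “or” favors recall, while “and” favors precision, and minimum_should_match sits in between.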

The results are returned sorted by score (most relevant first), like so:

"hits" : {
   "total" : {
      "value" : 10,
      "relation" : "eq"
   },
   "max_score" : 10.1612155,
   "hits" : [
   {
      ...
      "_score" : 10.1603195,
      "_source" : {
         "title" : "Best actors of all time"
      }
   },
   {
      ...
      "_score" : 10.147091,
      "_source" : {
         "title" : "American actors"
      }
   },
   ...

The score (how relevant a result is) is calculated with the BM25 algorithm, based on 3 criteria:

TF (term frequency): the more often a term appears in a field, the more relevant the document is.
IDF (inverse document frequency): the more documents contain the term, the less important the term is.
Field length: a match in a short field is more likely to be relevant than a match in a long field.
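
The three criteria above can be sketched in Python using the textbook BM25 formula for a single term. This is a simplified illustration, not the exact computation Elasticsearch performs: the function name and its arguments are hypothetical, and k1 = 1.2 and b = 0.75 are the commonly documented default values.

```python
import math


def bm25_term_score(freq, doc_count, docs_with_term, field_length,
                    avg_field_length, k1=1.2, b=0.75):
    """Score one term in one field with the textbook BM25 formula."""
    # IDF: terms found in fewer documents get a higher weight.
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Field length: fields longer than average are penalized.
    length_norm = 1 - b + b * field_length / avg_field_length
    # TF: grows with the term frequency, but saturates.
    tf = freq * (k1 + 1) / (freq + k1 * length_norm)
    return idf * tf


# Hypothetical index: 1000 documents, 50 contain the term,
# and the field holds 100 terms on average.
base = bm25_term_score(1, 1000, 50, 100, 100)
print(bm25_term_score(3, 1000, 50, 100, 100) > base)  # more occurrences
print(bm25_term_score(1, 1000, 5, 100, 100) > base)   # rarer term
print(bm25_term_score(1, 1000, 50, 50, 100) > base)   # shorter field
```

All three comparisons print True, which matches the three rules listed above.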

Full-Text Queries examples

The match_phrase query is useful when you want to search for terms that appear next to each other, in the same order as in the query:

GET blogs/_search
{
   "query": {
      "match_phrase": {
         "FIELD": "A PHRASE"
      }
   }
}

With match_phrase, precision improves but recall gets worse.

We can increase recall with the “slop” parameter, which tells how far apart the terms are allowed to be:

GET blogs/_search
{
   "query": {
      "match_phrase": {
         "content": {
            "query": "great actor",
            "slop": 1
         }
      }
   }
}

With this example, “great american actor” will be matched.
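
To get a feel for what slop measures, here is a simplified Python sketch. It assumes whitespace tokenization and a phrase without repeated terms, and it approximates the idea of counting how many position moves are needed to line the terms up. The phrase_matches function is hypothetical and is not how Elasticsearch is implemented internally.

```python
from itertools import product


def phrase_matches(text, phrase, slop=0):
    """Tell whether the phrase terms occur within `slop` position moves."""
    tokens = text.lower().split()
    terms = phrase.lower().split()
    positions = []
    for term in terms:
        term_positions = [i for i, tok in enumerate(tokens) if tok == term]
        if not term_positions:
            return False  # every term of the phrase must be present
        positions.append(term_positions)
    for combo in product(*positions):
        # Shift each position back by the term's offset in the phrase:
        # an exact phrase match gives identical shifted positions.
        adjusted = [pos - offset for offset, pos in enumerate(combo)]
        if max(adjusted) - min(adjusted) <= slop:
            return True
    return False


print(phrase_matches("a great american actor", "great actor"))          # False
print(phrase_matches("a great american actor", "great actor", slop=1))  # True
print(phrase_matches("actor great", "great actor", slop=1))             # False
print(phrase_matches("actor great", "great actor", slop=2))             # True
```

Note that swapping two adjacent terms costs 2 moves in this model, so a higher slop also lets reordered phrases through.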

Multi match queries

If you want to search across several fields, you can use multi_match:

GET blogs/_search
{
   "query": {
      "multi_match": {
         "query": "morgan freeman",
         "fields": [
            "title",
            "content",
            "author"
         ]
      }
   }
}

By default, Elasticsearch uses the best-scoring field to calculate the score, but you can boost the score of a field like so:

GET blogs/_search
{
   "query": {
      "multi_match": {
         "query": "morgan freeman",
         "fields": [
            "title^2",
            "content",
            "author"
         ]
      }
   }
}
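
The effect of a boost on the “best field wins” rule can be sketched in Python. The per-field scores below are made up, and best_fields_score is a hypothetical helper: it only illustrates the default behavior, where each field score is multiplied by its boost and the highest one is kept (tie_breaker, a real multi_match parameter, adds a fraction of the other fields' scores).

```python
def best_fields_score(field_scores, boosts=None, tie_breaker=0.0):
    """Combine per-field scores the way a best-field strategy would."""
    boosts = boosts or {}
    # Apply each field's boost (1.0 when the field has none).
    boosted = [score * boosts.get(field, 1.0)
               for field, score in field_scores.items()]
    if not boosted:
        return 0.0
    best = max(boosted)
    # The other fields only count through the tie breaker.
    return best + tie_breaker * (sum(boosted) - best)


# Hypothetical scores of one document for the query "morgan freeman".
scores = {"title": 3.0, "content": 4.0, "author": 0.0}
print(best_fields_score(scores))                         # 4.0
print(best_fields_score(scores, boosts={"title": 2.0}))  # 6.0
```

Without a boost, the content field decides the score. With title^2, the title match becomes the best field.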

Fuzziness

Sometimes you will miss results because a word in the query or in the documents is misspelled. You can use the fuzziness parameter to make your search less strict:

GET blogs/_search
{
   "query": {
      "match": {
         "content": {
            "query": "freeman",
            "fuzziness": 1
         }
      }
   }
}

In this example, a word that differs from the query term by only one character (like “freemen”) is treated as a match. Fuzziness is an easy answer to misspellings, but it has a high CPU overhead and very low precision.
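
The “one character which differs” rule is an edit distance: the number of single-character insertions, deletions, substitutions or swaps of two adjacent characters needed to turn one word into the other. Here is a minimal Python sketch of that distance. The edit_distance function is written for illustration only, and Elasticsearch does not compare words one by one like this.

```python
def edit_distance(a, b):
    """Edit distance where swapping two adjacent characters costs 1."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every character of a
    for j in range(cols):
        dist[0][j] = j  # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution
            )
            # Two adjacent characters swapped count as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance("freeman", "freemen"))   # 1: matches with fuzziness 1
print(edit_distance("freeman", "freeamn"))   # 1: adjacent swap
print(edit_distance("freeman", "friedman"))  # 2: too far for fuzziness 1
```

A distance of 1 already lets many unrelated words through for short terms, which is where the loss of precision comes from.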

Combining Queries

Finally, let’s see how to combine queries with an example:

GET blogs/_search
{
   "query": {
      "bool": {
         "must": [
            {
               "match": {
                  "content": "morgan freeman"
                }
            },
            {
               "match": {
                  "category": "actors"
                }
            }
          ]
       }
   }
}

In this example, we search for “morgan freeman” only in documents whose category matches “actors”.

Instead of “must”, we can also use the following clauses:

must_not: for example, if we want to find results in all categories except “directors”:

GET blogs/_search
{
   "query": {
      "bool": {
         "must": 
            {
               "match": {
                  "content": "morgan freeman"
                }
            },
         "must_not":
            {
               "match": {
                  "category": "directors"
                }
            }
       }
   }
}

should: documents that match the category “actors” will score higher:

GET blogs/_search
{
   "query": {
      "bool": {
         "must": 
            {
               "match": {
                  "content": "morgan freeman"
                }
            },
         "should":
            {
               "match": {
                  "category": "actors"
                }
            }
       }
   }
}

filter: the clause must match, but it doesn’t affect the score:

GET blogs/_search
{
   "query": {
      "bool": {
         "must": 
            {
               "match": {
                  "content": "morgan freeman"
                }
            },
         "filter":
            {
               "match": {
                  "category": "actors"
                }
            }
       }
   }
}

We can conclude by saying that:

must: affects hits and score
must_not: affects hits only
should: affects score only (when used alongside a must or filter clause, as in the example above)
filter: affects hits only
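
This summary can be checked with a toy Python model of a bool query. Everything here is hypothetical: term_clause and bool_score are illustration helpers, clauses use exact equality on a field instead of real full-text matching, and the scores are made up. A score of None means the document is not a hit.

```python
def term_clause(field, value, score=1.0):
    """Build a clause that returns `score` on a match and 0.0 otherwise."""
    def clause(doc):
        return score if doc.get(field) == value else 0.0
    return clause


def bool_score(doc, must=(), must_not=(), should=(), filter_=()):
    """Return the score of doc, or None when doc is not a hit."""
    if any(clause(doc) > 0 for clause in must_not):
        return None  # must_not: excludes hits, never scores
    if any(clause(doc) == 0 for clause in filter_):
        return None  # filter: required, never scores
    must_scores = [clause(doc) for clause in must]
    if any(score == 0 for score in must_scores):
        return None  # must: required and scored
    should_scores = [clause(doc) for clause in should]
    # With only should clauses, at least one of them has to match.
    if should and not must and not filter_ and not any(should_scores):
        return None
    return sum(must_scores) + sum(should_scores)


actor = {"person": "freeman", "category": "actors"}
director = {"person": "freeman", "category": "directors"}
person = term_clause("person", "freeman", score=2.0)
is_actor = term_clause("category", "actors")

print(bool_score(actor, must=[person], should=[is_actor]))      # 3.0
print(bool_score(director, must=[person], should=[is_actor]))   # 2.0
print(bool_score(actor, must=[person], filter_=[is_actor]))     # 2.0
print(bool_score(director, must=[person], filter_=[is_actor]))  # None
```

With should, both documents are hits and only the score changes. With filter, the score stays the same and only the hits change.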

Sources :

https://www.elastic.co

https://searchitoperations.techtarget.com

Wikipedia