Elastic Search Series : Nodes and Shards

Communication through ElasticSearch

There are 2 important network communication mechanisms in Elasticsearch to understand :

HTTP : which is how the Elasticsearch REST APIs are exposed.

The REST APIs of Elasticsearch are exposed over HTTP. The HTTP module binds to localhost by default and the default port is the first available between 9200-9299.

transport : used for internal communication between nodes within the cluster.

Each call that goes from one node to another uses the transport module. Transport binds to localhost by default and default port is the first available between 9300-9399.

The discovery and cluster formation module is responsible for discovering nodes, electing a master, forming a cluster, and publishing the cluster state each time it changes.

This process runs when you start an Elasticsearch node or when a node believes the master node failed and continues until the master node is found or a new master node is elected.

How Master Nodes work ?

One host to rule them all

Every cluster has one node designated as the master and can have multiple nodes eligibles.

The master node is in charge of cluster-wide settings and changes like :

creating, updating or deleting indices
adding or removing nodes
allocating shards to nodes

It is recommended to have 3 master-eligible nodes so that if the master nodes fails, there are 2 backups.

Availability zones

With Elasticsearch Service, your deployment can be spread across as many as three separate availability zones, each hosted in its own, separate data center. This matters, because data centers can and do encounter issues with availability.

Bootstrapping a Cluster

Starting an Elasticsearch cluster for the very first time requires the initial set of master-eligible nodes to be explicitly defined on one or more of the master-eligible nodes in the cluster. This is known as cluster bootstrapping. This is only required the first time a cluster starts up: nodes that have already joined a cluster store this information in their data folder for use in a full cluster restart, and freshly-started nodes that are joining a running cluster obtain this information from the cluster’s elected master. After the cluster has formed, this setting is no longer required and is ignored.

What roles can have a node ?

There are several roles a node can have :

master-eligible
data
ingest
machine learning (Gold/Platinium License only)

Each node can take on or several roles. By default, a node has all of them.

Data Nodes

They hold the shards that contain the documents you have indexed.
They execute data related operations like CRUD, search and aggregations.

Ingest Nodes

They provide the ability to pre-process a document right before it gets indexed.

Machine Learning Nodes

Machine Learning nodes provide the ability to :

run machine learning jobs
handle machine learning API requests

What are Shards ?

Because Elasticsearch is a distributed search engine, an index is usually split into elements known as shards that are distributed across multiple nodes. Elasticsearch automatically manages the arrangement of these shards. It also rebalances the shards as necessary, so users need not worry about the details. The default number of primary shards for an index is 1. You specify the number of primary shards when you create the index and you cannot change it afterwards. Be carefull not to overshard your cluster. Indeed, too many small shards consume resources for no reason !

What is a replica ?

By default, Elasticsearch creates five primary shards and one replica for each index. This means that each index will consist of five primary shards, and each shard will have one copy.

How to check my Cluster’s health ?

If you run the following command, you can have several information about your Cluster’s health :

GET _cluster/health

{
"cluster_name": "elasticsearch",
"status": "yellow",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 15,
"active_shards": 15,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 15,
"delayed_unassigned_shards": 0,
"number_of_pending_tasks": 0,
"number_of_in_flight_fetch": 0,
"task_max_waiting_in_queue_millis": 0,
"active_shards_percent_as_number": 50
}

The cluster health status is: green, yellow or red. On the shard level, a red status indicates that the specific shard is not allocated in the cluster, yellow means that the primary shard is allocated but replicas are not, and green means that all shards are allocated. The index level status is controlled by the worst shard status. The cluster status is controlled by the worst index status.

How Search in Elasticsearch works with shards ?

Search in Elasticsearch happens in two phases :

Query phase – In the query phase, Elasticsearch collects document ids of the relevant results from each shard. Upon the completion of this phase, only the ids of the documents which are matched against the search are returned, and there will be no other information like the fields or their values, etc.
Fetch phase – In the fetch phase, the documents ids from the query phase are used to fetch the real documents, and with this the search request can be said to be complete.

Sources :

https://www.elastic.co

https://qbox.io/