Brief intro to Elasticsearch.

Ife Ayelabola
6 min read · Oct 11, 2020


Preface.

I have been working with Elasticsearch for some time now, and since my last post on here was about logging and its best practices, I figured I’d continue that thread and write about one of my favourite tools for achieving it. Elasticsearch is not just for logging, though; that is only one of its many use cases. It is much more than that.

What is Elasticsearch.

To be concise, Elasticsearch is an open-source full-text search engine with a rich RESTful HTTP API for querying and manipulating data. Developed by Elastic, it is written in Java and built on the Apache Lucene engine.

Why do we use it.

Fast data retrieval

Elasticsearch stores data as JSON documents (hence why it is also regarded as a NoSQL document database). Documents are grouped into an index, the index is split into shards, and the shards are replicated across a cluster. Elasticsearch uses routing to decide which shard a document belongs to, using the formula below.

shard = hash(_routing) % number_of_primary_shards
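As a minimal sketch (assuming a hypothetical movies index): by default the _routing value is simply the document’s _id, but it can be overridden, in which case the same routing value must be supplied at read time:

PUT /movies/_doc/1?routing=comedy
{
  "Title": "Friday"
}

GET /movies/_doc/1?routing=comedy

Both requests hash to the same shard, so the lookup never has to broadcast to the whole cluster.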

Under the hood, Elasticsearch builds an inverted index: the text of each document is tokenised, the terms are ordered, and each term is mapped back to the specific documents it appears in. Because the location of each document matters, Elasticsearch also stores the document’s ID alongside the terms.
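The tokenisation step can be inspected directly with the _analyze API. A minimal sketch using the standard analyser:

GET /_analyze
{
  "analyzer": "standard",
  "text": "Wedding Crashers"
}

This returns the lowercased tokens “wedding” and “crashers”, which is what ends up in the inverted index.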

When a GET request is made, Elasticsearch looks the search term up in the inverted index and uses the stored document IDs, together with the same routing technique, to return the documents containing it. Knowing exactly where a term lives is what makes Elasticsearch fast, unlike a traditional relational database, which may have to scan through tables and columns to fetch the data it needs.
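For example, fetching a single document by its ID (assuming the hypothetical movies index again) is a direct, routed lookup:

GET /movies/_doc/1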

Rich data querying and analysis

Elasticsearch provides extremely rich full-text search capabilities, such as the ones below.

ID Query

GET /_search
{
  "query": {
    "ids": {
      "values": ["1", "2", "3"]
    }
  }
}

This will return the documents with the IDs 1, 2 and 3.

Fuzzy Search/Query

This will return terms similar to the query term, based on how many single-character edits apart they are. For instance, if the user enters the phrase “Ball”, the “B” may be swapped for a “C” to also match “Call”. A letter can also be removed, so “Cloud” can match “loud”, or a letter can be added to the search term. This provides the user with more possible search results.
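As a sketch, a fuzzy query against a hypothetical title field might look like this, with “fuzziness” controlling how many single-character edits are allowed:

GET /_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "Ball",
        "fuzziness": "AUTO"
      }
    }
  }
}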

There are many other query types, which can be found in the Elasticsearch documentation. Feel free to have a look.

Aggregation

Aggregation is a way to collect a bunch of data and carry out some sort of meaningful calculation with it. For instance, if we want to monitor web traffic data stored in Elasticsearch, a calculation can be carried out on the amount of traffic coming from a particular location. This may be valuable to a company that can use that data to create some targeted ads.

For the purpose of a simple demonstration, assume we own a movie rental website and have an index in Elasticsearch storing movies and their fields, such as Title, Price and Availability. An example of an aggregation could simply be calculating the average movie price of the current stock.

Image 1 (3 Documents: Friday, Coach Carter & Wedding Crashers)

In the image above, I have an index for movies with 3 documents containing 3 different movies (Friday, Coach Carter and Wedding Crashers). The rental prices for these movies are £5, £8 and £2 respectively.
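As a sketch of what indexing those documents might look like (field names as in the article; the Availability values are assumptions):

PUT /movies/_doc/1
{ "Title": "Friday", "Price": 5, "Availability": true }

PUT /movies/_doc/2
{ "Title": "Coach Carter", "Price": 8, "Availability": true }

PUT /movies/_doc/3
{ "Title": "Wedding Crashers", "Price": 2, "Availability": true }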

The below image shows how aggregation can easily be used to fetch the average movie price.

Image 2 (Aggregation request)
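A reconstruction of the sort of request shown above (assuming the movies index; “size”: 0 simply suppresses the matching documents so that only the aggregation result is returned):

GET /movies/_search
{
  "size": 0,
  "aggs": {
    "avg_price": {
      "avg": { "field": "Price" }
    }
  }
}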

aggs: States that an aggregation is being requested.

avg_price: A user-defined name for the aggregation; it can be called anything.

avg: The type of aggregation being performed, in this case “avg” because we are calculating an average. It could also be another type, such as sum, min or max; see the Elasticsearch documentation for more options.

field: Declares the field the aggregation is calculated on, hence the value “Price”, as that is the field holding the data the average will be performed on.

Below are the types of Aggregations defined by Elasticsearch.

Bucketing

A family of aggregations that build buckets, where each bucket is associated with a key and a document criterion. When the aggregation is executed, all the buckets criteria are evaluated on every document in the context and when a criterion matches, the document is considered to “fall in” the relevant bucket. By the end of the aggregation process, we’ll end up with a list of buckets — each one with a set of documents that “belong” to it.
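A minimal bucketing sketch against the movie example: a terms aggregation on the Availability field would produce one bucket per distinct value, each with a document count:

GET /movies/_search
{
  "size": 0,
  "aggs": {
    "by_availability": {
      "terms": { "field": "Availability" }
    }
  }
}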

Metric

Aggregations that keep track and compute metrics over a set of documents.

Matrix

A family of aggregations that operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.

Pipeline

Aggregations that aggregate the output of other aggregations and their associated metrics.
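As a sketch (assuming a hypothetical rental_date field on each document), a pipeline aggregation such as avg_bucket can average the monthly averages produced by a date_histogram bucketing aggregation:

GET /movies/_search
{
  "size": 0,
  "aggs": {
    "rentals_per_month": {
      "date_histogram": {
        "field": "rental_date",
        "calendar_interval": "month"
      },
      "aggs": {
        "monthly_avg_price": {
          "avg": { "field": "Price" }
        }
      }
    },
    "overall_monthly_avg": {
      "avg_bucket": {
        "buckets_path": "rentals_per_month>monthly_avg_price"
      }
    }
  }
}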

Scalability / Availability / High throughput

We cannot speak about this without speaking about clustering. A cluster is a collection of related nodes, a node being a machine that Elasticsearch is installed on. Elasticsearch is designed to scale horizontally, as new nodes can be added to a cluster whenever they are needed.

The image below shows an Elasticsearch cluster of 3 nodes, one of which acts as the master node. An index is split into 3 shards; you can think of this as partitioning the data in 3 (S1, S2 and S3 in the image). Each shard is then replicated (R-S1, R-S2 and R-S3 in the image). The original shards are known as primary shards and their copies as replica shards. This is how Elasticsearch supports high availability: if a node goes down, the data is still available through the remaining nodes’ shards and their replicas. The multi-node cluster architecture is also why Elasticsearch can handle many concurrent searches.

Image 3
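The number of shards and replicas is configured when an index is created. A minimal sketch matching the layout in the image (3 primary shards, each with one replica):

PUT /movies
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}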

ELK Stack Ecosystem

The ELK Stack is a cross-platform, scalable, centralised logging framework comprising Elasticsearch, Logstash and Kibana.

Reliable Data Shippers

There are also a number of data shippers available to automatically fetch and push data to Elasticsearch. These fall under what is known as Beats: a collection of lightweight shippers serving various purposes.

  • Filebeat: For shipping file data (e.g. logs and other data).
  • Metricbeat: For shipping machine metrics (e.g. disk space, memory, CPU usage and network statistics).
  • Packetbeat: For shipping network data.
  • Winlogbeat: For shipping Windows event logs.
  • Heartbeat: Given a list of URLs, it monitors their uptime.
  • Auditbeat: For shipping Linux audit framework data and checking the integrity of files.
  • Functionbeat: A serverless shipper for the cloud.

Logstash

Logstash is used to ingest data, then parse and transform it before shipping it to Elasticsearch. It makes use of grok patterns, which match data against named regular expressions; for example, the pattern %{IP:client_ip} extracts an IP address from a log line into a field named client_ip.

Kibana

Kibana is a powerful, rich and extensible user interface for visualising data held in Elasticsearch. It serves multiple purposes, such as connecting to LDAP for authentication and authorisation, adding meaning to data through charts and graphs, and making requests to an Elasticsearch endpoint.

Data Analytics & Data Science

Elasticsearch is also a powerful tool for analysing financial data, monitoring performance and acting on the results.
