Elasticsearch

By Nimmy Alice on May 17, 2019

Introduction

Elasticsearch is a search engine based on Lucene, a free and open source information retrieval software library. It provides a distributed, multi-tenant-capable, full text search engine with an HTTP web interface and schema-free JSON documents. It can be used as a replacement for document stores like MongoDB. Elasticsearch uses denormalization to improve search performance and is one of the most popular enterprise search engines, currently used by many big organizations like Wikipedia, The Guardian, Stack Overflow and GitHub.

Elasticsearch is developed in Java and released under the terms of the Apache License. Official clients are available in Java, .NET (C#), PHP, Python, Apache Groovy, Ruby and many other languages. According to the DB-Engines ranking (https://en.wikipedia.org/wiki/DB-Engines_ranking), Elasticsearch is the most popular enterprise search engine, followed by Apache Solr.

Elastic Stack, formerly the “ELK stack”, comprises:

  • Logstash, a data-collection and log-parsing engine
  • Kibana, an analytics and visualization platform
  • Elasticsearch

These products are designed to work as an integrated solution.

Timeline

  • Shay Banon created Compass, which is the precursor to Elasticsearch, in 2004
  • The first version of Elasticsearch was released in February 2010
  • In 2012, Elasticsearch BV was founded to provide commercial services and products around Elasticsearch
  • In June 2014, the company announced raising $70 million in a Series C funding round, which brought its total funding to $104M
  • The company Elasticsearch changed its name to Elastic in March 2015
  • In June 2018, Elastic filed for an initial public offering with an estimated valuation of between $1.5 and $3 billion

Advantages

  • Developed in Java, which makes it compatible with almost every platform.
  • Near real time: an added document becomes searchable in this engine after about one second.
  • Distributed, which makes it easy to scale and integrate in any big organization.
  • Creating full backups is easy using the concept of a gateway, which is present in Elasticsearch.
  • Handling multi-tenancy is very easy in Elasticsearch when compared to Apache Solr.
  • Uses JSON objects as responses, which makes it possible to invoke the Elasticsearch server from a large number of different programming languages (see the example after this list).
  • Supports almost every document type except those that do not support text rendering.
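
Since the interface is plain HTTP with JSON responses, any HTTP client can talk to Elasticsearch. As a quick illustration, assuming a local node running on the default port 9200 (the response below is abbreviated and its field values are placeholders):

GET /

{
  "name" : "node-1",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "7.0.1"
  },
  "tagline" : "You Know, for Search"
}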

Disadvantages

  • Does not have multi-language support in terms of handling request and response data (only possible in JSON), unlike Apache Solr, where it is possible in CSV, XML and JSON formats.

Basic concepts

Elasticsearch is a near real-time search platform: there is a slight latency between the time a document is indexed and the time it becomes available for search.

Data is stored in nodes. A node can be considered a server. A cluster consists of one or more nodes that together hold your entire data. Clusters provide indexing and search capabilities across all nodes.

You can group documents of similar characteristics into an index. Any number of indexes can exist in a single cluster.

One or more types can be defined on an index. A type is a logical category/partition of the index which can have any developer-defined semantics. The basic unit of information in Elasticsearch is a document. These documents are expressed in JSON.

In an index/type, you can store as many documents as you want. Though a document physically resides in an index, a document must be assigned to a type inside an index.
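
As a small sketch of these concepts (my_index, blog and the document body are illustrative names, not anything prescribed by Elasticsearch), storing a document in a type looks like this:

PUT /my_index/blog/1
{
  "title": "Getting started with Elasticsearch",
  "author": "Jane Doe"
}

Here my_index is the index, blog is the type, and 1 is the ID under which the JSON document is stored.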

Sharding and replication

For efficient handling of hardware usage, Elasticsearch provides the ability to subdivide your index into multiple pieces called shards. Sharding is important for two main reasons:

→ It allows you to horizontally split or scale your content volume.

→ It allows you to distribute and parallelize operations across shards.

In a network/cloud environment, failures can be expected at any time. So, it is useful and highly recommended to have a fail-over mechanism in case a shard/node goes offline or simply fails. To handle this, Elasticsearch allows you to make one or more copies of your index’s shards, called replica shards or replicas.

Replication is important mainly for two reasons:

→ It provides high availability in case a shard/node fails. It is important to note that a replica shard is never allocated on the same node as its original/primary shard.

→ It allows you to scale out your search volume/throughput, since searches can be executed on all replicas in parallel.
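
The number of shards and replicas can be defined per index at creation time. A minimal sketch, assuming a hypothetical index named my_index with three primary shards and one replica of each:

PUT /my_index
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

Elasticsearch then distributes the three primary shards and their replica copies across the nodes of the cluster, never placing a replica on the same node as its primary.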

Elasticsearch and RDBMS

In Elasticsearch, an index is a collection of types, just as a database is a collection of tables in an RDBMS (Relational Database Management System). Every table is a collection of rows, just as every mapping is a collection of JSON objects in Elasticsearch.

Elasticsearch    RDBMS
Index            Database
Shard            Shard
Mapping          Table
Field            Field
JSON Object      Tuple

How does full text search work in Elasticsearch?

Analysis

Analysis is the process of converting text, like the body of any email, into tokens or terms which are added to the inverted index for searching. Analysis is performed by an analyzer which can be either a built-in analyzer or a custom analyzer defined per index.

Index time analysis

Let’s consider an example in which the built-in ‘english’ analyzer converts the following sentence at the time of indexing.

"The QUICK brown foxes jumped over the lazy dog!"

The ‘english’ analyzer converts this sentence into distinct tokens. It will then lowercase each token, remove frequent stopwords (“the”) and reduce the terms to their word stems (foxes → fox, jumped → jump). In the end, the following terms will be added to the inverted index:

[ quick, brown, fox, jump, over, lazy, dog ]
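
You can try this yourself with the _analyze API, which shows the tokens an analyzer produces for a given piece of text:

GET /_analyze
{
  "analyzer": "english",
  "text": "The QUICK brown foxes jumped over the lazy dog!"
}

The response lists each token produced by the analyzer, together with its position and character offsets in the original text.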

Search time analysis

This same analysis process is applied to the query string at search time in full text queries like the match query to convert the text in the query string into terms of the same form as those that are stored in the inverted index.

For instance, a user might search for:

"a quick fox"

which would be analyzed by the same ‘english’ analyzer into the following terms:

[ quick, fox ]

As we have applied the same analyzer to both the text and the query string, the terms from the query string exactly match the terms from the text in the inverted index, which means that this query would match our example document.

Usually the same analyzer should be used both at index time and at search time, and full text queries like the match query will use the mapping to look up the analyzer to use for each field.

Analyzer

An analyzer, whether built-in or custom, is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to strip HTML elements like <b> from the stream.

An analyzer may have zero or more character filters, which are applied in order.
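
As a sketch of a character filter in action, the built-in html_strip character filter can be combined with a tokenizer in the _analyze API (the sample text here is made up):

GET /_analyze
{
  "tokenizer": "standard",
  "char_filter": [ "html_strip" ],
  "text": "<b>Quick</b> brown fox"
}

The HTML tags are stripped from the character stream before tokenization, so the resulting tokens are Quick, brown and fox.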

Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens, and outputs a stream of tokens. For example, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.

An analyzer must have exactly one tokenizer.
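
The whitespace example above can be reproduced with the _analyze API:

GET /_analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}

This returns the tokens Quick, brown and fox! (note that the exclamation mark is kept, since only whitespace splits the text).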

Token filters

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase.

An analyzer may have zero or more token filters, which are applied in order.
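
A sketch of a token filter layered on top of a tokenizer, using the built-in lowercase filter:

GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "lowercase" ],
  "text": "Quick BROWN Fox"
}

The standard tokenizer emits Quick, BROWN and Fox, which the lowercase filter then turns into quick, brown and fox.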

Built-in analyzers

Elasticsearch ships with a wide range of built-in analyzers, which can be used in any index without further configuration:

Standard Analyzer

The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.

Simple Analyzer

The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.

Whitespace Analyzer

The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.

Stop Analyzer

The stop analyzer is like the simple analyzer, but also supports removal of stop words.

Keyword Analyzer

The keyword analyzer is a “noop” analyzer that accepts whatever text it is given and outputs the exact same text as a single term.

Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.

Language Analyzers

Elasticsearch provides many language-specific analyzers like english or french.

Fingerprint Analyzer

The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

Custom analyzers

If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.
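
A minimal sketch of such a custom analyzer (the names my_index and my_custom_analyzer are arbitrary), combining an html_strip character filter, the standard tokenizer, and lowercase and asciifolding token filters:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

The custom analyzer can then be referenced by name in a field mapping, just like a built-in one.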

Full text queries

The high-level full text queries are usually used for running full text queries on full text fields like the body of an email. They understand how the field being queried is analyzed and will apply each field’s analyzer to the query string before executing.

The queries in this group are:

match query

The standard query for performing full text queries, including fuzzy matching and phrase or proximity queries.
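
For instance, a basic match query against a hypothetical title field looks like this:

GET /_search
{
  "query": {
    "match": {
      "title": "quick fox"
    }
  }
}

The query string is analyzed with the field’s analyzer before matching, as described in the analysis section above.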

match_phrase query

Like the match query but used for matching exact phrases or word proximity matches.

match_phrase_prefix query

The poor man’s search-as-you-type. Like the match_phrase query, but does a wildcard search on the final word.

multi_match query

The multi-field version of the match query.

common terms query

A more specialized query which gives more preference to uncommon words.

query_string query

Supports the compact Lucene query string syntax, allowing you to specify AND|OR|NOT conditions and multi-field search within a single query string. For expert users only.

simple_query_string query

A simpler, more robust version of the query_string syntax suitable for exposing directly to users.
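
A sketch of the simple_query_string syntax, again against a hypothetical title field:

GET /_search
{
  "query": {
    "simple_query_string": {
      "query": "\"quick fox\" +happy -problem",
      "fields": [ "title" ]
    }
  }
}

Here +happy requires a term, -problem excludes one, and the quoted phrase must match as a phrase; parts of the query with invalid syntax are simply ignored instead of raising an error.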

Example:

Configuring an index with the ‘english’ analyzer:

PUT /my_index
{
  "mappings": {
    "blog": {
      "properties": {
        "title": {
          "type": "string",
          "analyzer": "english"
        }
      }
    }
  }
}

Then, we can add some sample documents to this index.

PUT /my_index/blog/1
{ "title": "I'm happy for this fox" }

PUT /my_index/blog/2
{ "title": "I'm not happy about my fox problem" }

Now, search the documents with the ‘english’ analyzer.

GET /my_index/blog/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "happy",
        "analyzer": "english"
      }
    }
  }
}

Now, both documents will be returned in the search result, since the query term and both titles are reduced to the same stemmed term.

References and courtesy:

https://en.wikipedia.org

http://www.elasticsearchtutorial.com

https://www.elastic.co

https://www.baeldung.com

https://www.tutorialspoint.com
