ElasticSearch Guide

The Elastic Stack (also called the ELK Stack, and often referred to here simply as ElasticSearch) is a framework that hosts, monitors, and populates data following a horizontally scalable NoSQL model. It is used in BT and VT to manage large data volumes, and this page serves as an overall guide.

This guide is not intended to be a course on the technology itself, but rather to document how it is used at COSMOS, the processes involved, and so on. To understand the Elastic Stack, there are plenty of resources, such as the video below or the slides I presented in the 08/24/2021 meeting.

Server Setup Instructions

For VT, full setup instructions can be found in the repo README here. You will need access to the repo.

Everyday Monitoring

Monitoring is done through the Kibana dashboard (the K component of the ELK stack). It is accessible locally at http://vtracker-elastic.host.ualr.edu:5601/app/home, using the elastic user (login info in the PW sheet).

Monitoring mostly consists of checking the following pages regularly. For any page, if there are no results or the page displays an error, make sure that 1. you are in the correct dashboard, as indicated by the breadcrumbs at the top of the pages below, and 2. the date filter captures enough data.

Metricbeat Hardware Dashboard

Metricbeat Dashboard

This displays general activity; mostly, keep an eye on the disk used. When it gets too high, go into the VM settings and increase the storage size, then allocate the new space on the server.
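As a rough complement to the dashboard, disk usage can also be checked from a script. The sketch below assumes its input rows come from Elasticsearch's `_cat/allocation?format=json` endpoint, which reports a `disk.percent` value per node. The 80% threshold, the function name `nodes_over_threshold`, and the sample rows are illustrative assumptions, not values used at COSMOS.

```python
from typing import Dict, List, Optional

# Hypothetical alert threshold (percent of disk used); adjust as needed.
DISK_ALERT_PERCENT = 80


def nodes_over_threshold(
    rows: List[Dict[str, Optional[str]]], limit: int = DISK_ALERT_PERCENT
) -> List[str]:
    """Return the names of nodes whose disk usage is at or above `limit` percent.

    `rows` is expected to look like the JSON list returned by
    GET _cat/allocation?format=json, where numeric values arrive as strings.
    """
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        # Rows without disk figures (e.g. unassigned shards) are skipped.
        if percent is None:
            continue
        if int(percent) >= limit:
            flagged.append(row["node"])
    return flagged


# Made-up sample rows, for illustration only.
sample_rows = [
    {"node": "node-1", "disk.percent": "91"},
    {"node": "node-2", "disk.percent": "42"},
    {"node": "UNASSIGNED", "disk.percent": None},
]

print(nodes_over_threshold(sample_rows))
```

In practice the rows would be fetched over HTTP with the elastic user's credentials; that part is left out here so the example stays self-contained.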

Discover Dashboard

This is the “MySQL Workbench” view of Elasticsearch. This is where rows (called documents) can be viewed and queried.
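For readers coming from SQL, a Discover search corresponds roughly to a Query DSL request sent to an index's `_search` endpoint. The sketch below only builds such a request body: a full-text match combined with a date-range filter, similar to Discover's search bar and date picker. The field names `message` and `@timestamp` and the helper name `build_search_body` are assumptions for illustration; the actual fields depend on the index being queried.

```python
import json
from typing import Any, Dict


def build_search_body(text: str, since: str = "now-7d", size: int = 10) -> Dict[str, Any]:
    """Build a Query DSL body: a full-text match plus a time-range filter."""
    return {
        "size": size,  # number of documents (rows) to return
        "query": {
            "bool": {
                # Scored full-text condition on a hypothetical `message` field.
                "must": [{"match": {"message": text}}],
                # Unscored filter: only documents newer than `since`.
                "filter": [{"range": {"@timestamp": {"gte": since}}}],
            }
        },
        # Newest documents first.
        "sort": [{"@timestamp": {"order": "desc"}}],
    }


body = build_search_body("error")
print(json.dumps(body, indent=2))
```

Widening `since` (for example to `now-30d`) plays the same role as widening the date filter in Kibana when a page shows no results.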

Cluster Dashboard

Elastic Stack Cluster Dashboard

This page shows the dashboard for the entire cluster. It gives an overview of hardware, as well as key components such as nodes (which will multiply with a larger cluster) and the Logstash pipeline, which handles data ingestion.
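Much of what this dashboard summarizes is also exposed by the `_cluster/health` API, whose response includes `status`, `number_of_nodes`, and `unassigned_shards`. The sketch below turns such a response into a list of warnings; the function name `health_warnings`, the expected node count, and the sample response are hypothetical and only illustrate the idea.

```python
from typing import Any, Dict, List


def health_warnings(health: Dict[str, Any], expected_nodes: int) -> List[str]:
    """Turn a _cluster/health style response into human-readable warnings."""
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append(f"cluster status is {status}")
    nodes = health.get("number_of_nodes", 0)
    if nodes < expected_nodes:
        warnings.append(f"only {nodes} of {expected_nodes} expected nodes are present")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        warnings.append(f"{unassigned} shards are unassigned")
    return warnings


# Made-up response, for illustration only.
sample_health = {"status": "yellow", "number_of_nodes": 1, "unassigned_shards": 4}

for warning in health_warnings(sample_health, expected_nodes=3):
    print(warning)
```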

Logstash Data Pipeline

Logstash Pipeline Overview

This dashboard shows which indices (tables) data is being fed into. It is useful for keeping track of what is being kept up to date.

Logstash Monitoring

Logstash Pipeline Monitoring

A more in-depth look into the Logstash pipeline.

Setting up a Cluster

See Architecture Best Practices. Below is a summary of key hardware components and recommended architecture. See the very comprehensive Benchmarking and sizing your Elasticsearch cluster for logs and metrics for more information.

In ElasticSearch, a cluster contains nodes (machines, physical or virtual), which in turn contain virtual shards. Only one cluster is needed at this stage of production [1]. While machines can be virtual, it is best for them to be kept separate to ensure redundancy.

  1. What architecture to use?

Hot-Warm (Elasticsearch Hot Warm Architecture | Elastic Blog)

“When using elasticsearch for larger time data analytics use cases, we recommend using time-based indices and a tiered architecture with 3 different types of nodes (Master, Hot-Node and Warm-Node), which we refer to as the “Hot-Warm” architecture.”

  2. How many nodes (machines)?
    1. Three (How many nodes should an ElasticSearch cluster have?)
  3. What are the key hardware requirements?
    1. RAM (How many shards should I have in my Elasticsearch cluster? | Elastic Blog)

“A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better.”

(Creating an Elasticsearch Cluster: Getting Started)

 “As a rule of the thumb, the maximum heap size should be set up to 50% of your RAM, but no more than 32GB (due to Java pointer inefficiency in larger heaps).”

    2. CPU (Fig.: Current ELK Stack CPU usage)
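The two RAM rules of thumb quoted above (heap at up to 50% of RAM but no more than 32GB, and fewer than 20 shards per GB of heap) can be turned into simple arithmetic. This is a minimal sketch with hypothetical function names; it gives upper bounds only, and the quoted advice is to stay well below them.

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Upper bound on heap size: 50% of RAM, capped at 32 GB."""
    return min(ram_gb / 2, 32.0)


def max_shards(heap_gb: float, shards_per_gb: int = 20) -> int:
    """Upper bound on shards per node: 20 shards per GB of configured heap."""
    return int(heap_gb * shards_per_gb)


# Example machine sizes chosen for illustration only.
for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM -> {heap:g} GB heap -> at most {max_shards(heap)} shards")
```

With a 30GB heap, `max_shards(30)` gives 600, matching the figure in the quote.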

See also the following points, from How to build an elastic search cluster for production? | Cloud Native Computing Foundation

Disks

Disks are probably the most essential aspect of a cluster and especially so for indexing-heavy clusters such as those that ingest log data.

Disks are by far the slowest subsystem in a server. This means that if you have write-heavy loads such as log retention, you are doing a lot of writing, and the disks can quickly become saturated, which in turn makes them the bottleneck of the cluster.

I highly recommend, if you can afford it, using SSDs. Their far superior writing and reading speed significantly increases your overall performance. SSD-backed nodes see an increase in both query and indexing performance.

CPU

Let’s talk about the last aspect of hardware performance. CPUs are not so crucial with ElasticSearch, as deployments tend to be relatively light on CPU requirements.

The recommended option is to use a modern processor with multiple cores. Common production-grade ElasticSearch clusters tend to use machines with two to eight cores.

If you need to choose between faster CPUs or more cores, choose more cores. The extra concurrency that multiple cores offer will far outweigh a slightly faster clock speed.

Setting up Nodes and Shards

As seen above, a cluster is divided into nodes and shards. To evaluate which architecture to use, use this calculator. See its GitHub page for more details, and the official documentation for execution.

© 2024 Collaboratorium for Social Media and Online Behavioral Studies