
How many shards should I have in my Elasticsearch cluster?

Jul 6, 2022 by iHash


Editor’s Note: The rule of thumb on “Aim for 20 shards or fewer per GB of heap memory” has been deprecated in version 8.3. This blog has been updated to reflect the new recommendation.

Elasticsearch is a very versatile platform that supports a variety of use cases and provides great flexibility around data organisation and replication strategies. This flexibility can, however, sometimes make it hard to determine up-front how to best organize your data into indices and shards, especially if you are new to the Elastic Stack. While suboptimal choices will not necessarily cause problems when first starting out, they have the potential to cause performance problems as data volumes grow over time. The more data the cluster holds, the more difficult it also becomes to correct the problem, as reindexing of large amounts of data can sometimes be required.

When we come across users who are experiencing performance problems, the cause can often be traced back to how data is indexed and to the number of shards in the cluster. This is especially true for use-cases involving multi-tenancy and/or use of time-based indices. When discussing this with users, either in person at events or meetings or via our forum, some of the most common questions are “How many shards should I have?” and “How large should my shards be?”

This blog post aims to help you answer these questions and to gather, in a single place, practical guidelines for use cases that involve time-based indices (e.g., logging or security analytics).

Table of Contents

  • What is a shard?
  • Index by retention period
  • Are indices and shards not free?
  • How does shard size affect performance?
  • How do I manage shard size?
  • Conclusions

What is a shard?

Before we start, we need to establish some facts and terminology that we will need in later sections.

Data in Elasticsearch is organized into indices. Each index is made up of one or more shards. Each shard is an instance of a Lucene index, which you can think of as a self-contained search engine that indexes and handles queries for a subset of the data in an Elasticsearch cluster.

As data is written to a shard, it is periodically published into new immutable Lucene segments on disk, and only at that point does it become available for querying. This is referred to as a refresh. How this works is described in greater detail in Elasticsearch: the Definitive Guide.

As the number of segments grows, these are periodically consolidated into larger segments. This process is referred to as merging. As all segments are immutable, the disk space used will typically fluctuate during indexing, because new, merged segments need to be created before the ones they replace can be deleted. Merging can be quite resource intensive, especially with respect to disk I/O.

The shard is the unit at which Elasticsearch distributes data around the cluster. The speed at which Elasticsearch can move shards around when rebalancing data, e.g. following a failure, will depend on the size and number of shards as well as network and disk performance.

TIP: Avoid having very large shards as this can negatively affect the cluster’s ability to recover from failure. There is no fixed limit on how large shards can be, but a shard size of 50GB is often quoted as a limit that has been seen to work for a variety of use-cases.

Index by retention period

As segments are immutable, updating a document requires Elasticsearch to first find the existing document, then mark it as deleted and add the updated version. Deleting a document also requires the document to be found and marked as deleted. For this reason, deleted documents continue to tie up disk space and some system resources until they are merged out, and that merging can itself consume a lot of system resources.

Elasticsearch allows complete indices to be deleted very efficiently directly from the file system, without explicitly having to delete all records individually. This is by far the most efficient way to delete data from Elasticsearch.


TIP: Try to use time-based indices for managing data retention whenever possible. Group data into indices based on the retention period. Time-based indices also make it easy to vary the number of primary shards and replicas over time, as this can be changed for the next index to be generated. This simplifies adapting to changing data volumes and requirements.
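As a minimal sketch of retention by index rather than by document, the snippet below works out which daily indices have aged out and could be deleted whole. The `prefix-YYYY.MM.DD` naming scheme and the function names are assumptions made for illustration; the code only computes names and does not talk to a cluster.

```python
from datetime import date, datetime, timedelta


def daily_index_name(prefix: str, day: date) -> str:
    """Build a daily index name such as 'logs-2022.07.06' (assumed naming scheme)."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(prefix: str, today: date, retention_days: int, existing: list) -> list:
    """Return the existing daily indices whose date is older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        if not name.startswith(prefix + "-"):
            continue  # belongs to another data set
        try:
            day = datetime.strptime(name[len(prefix) + 1:], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index in the assumed format
        if day < cutoff:
            expired.append(name)
    return sorted(expired)


existing = ["logs-2022.07.01", "logs-2022.07.05", "logs-2022.07.06", "metrics-2022.06.01"]
to_delete = expired_indices("logs", date(2022, 7, 6), retention_days=3, existing=existing)
print(to_delete)
```

Each name returned here corresponds to one complete index that can be dropped in a single operation, instead of many individual document deletes.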


Are indices and shards not free?

For each Elasticsearch index, information about mappings and state is stored in the cluster state. This is kept in memory for fast access. Having a large number of indices and shards in a cluster can therefore result in a large cluster state, especially if mappings are large. This can become slow to update as all updates need to be done through a single thread in order to guarantee consistency before the changes are distributed across the cluster.


TIP: In order to reduce the number of indices and avoid large and sprawling mappings, consider storing data with similar structure in the same index rather than splitting into separate indices based on where the data comes from. It is important to find a good balance between the number of indices and shards, and the mapping size for each individual index. Because the cluster state is loaded into the heap on every node (including the masters), and the amount of heap it uses is directly proportional to the number of indices, fields per index and shards, it is also important to monitor the heap usage on master nodes and make sure they are sized appropriately.


Each shard has data that needs to be kept in memory and therefore uses heap space. This includes data structures holding information at the shard level, but also at the segment level in order to define where data resides on disk. The size of these data structures is not fixed and will vary depending on the use-case.

One important characteristic of the segment-related overhead, however, is that it is not strictly proportional to the size of the segment. This means that larger segments have less overhead per data volume compared to smaller segments. The difference can be substantial.

In order to be able to store as much data as possible per node, it becomes important to manage heap usage and reduce the amount of overhead as much as possible. The more heap space a node has, the more data and shards it can handle.

Indices and shards are therefore not free from a cluster perspective, as there is some level of resource overhead for each index and shard.


TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.
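One way to apply this tip is to derive the number of primary shards from the expected index size. The sketch below is only a starting point: the function name and the 30GB default target are assumptions picked from within the range quoted above, not a recommendation from Elasticsearch itself.

```python
import math


def primary_shards_for(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Number of primary shards needed so that no shard exceeds the target size."""
    if index_size_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    # Round up so shards stay at or below the target; always keep at least one shard.
    return max(1, math.ceil(index_size_gb / target_shard_gb))


print(primary_shards_for(200, target_shard_gb=40))  # a 200GB index with 40GB shards
print(primary_shards_for(10))                       # a small index still needs one shard
```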

TIP: As the overhead per shard depends on the segment count and size, forcing smaller segments to merge into larger ones through a forcemerge operation can reduce overhead and improve query performance. This should ideally be done once no more data is written to the index. Be aware that this is an expensive operation that should ideally be performed during off-peak hours.
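For illustration, a force merge is a `POST` to the index's `_forcemerge` endpoint, optionally with a `max_num_segments` parameter. The sketch below only builds the request with the standard library and does not send it; the cluster address and index name are placeholders.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_forcemerge_request(base_url: str, index: str, max_num_segments: int = 1) -> Request:
    """Build (but do not send) a force merge request for an index that is no longer written to."""
    query = urlencode({"max_num_segments": max_num_segments})
    url = f"{base_url.rstrip('/')}/{quote(index, safe='')}/_forcemerge?{query}"
    return Request(url, method="POST")


# Placeholder host and index name, for illustration only.
req = build_forcemerge_request("http://localhost:9200", "logs-2022.07.05")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would execute it, which, as noted above, is best left for off-peak hours.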

TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health. (Editor’s note: As of 8.3, we have drastically reduced the heap usage per shard, thus updating the rule of thumb in this blog. Please follow TIP below for 8.3+ versions of Elasticsearch.)

NEW TIP: Allow 1kB of heap per field per index on data nodes, plus overheads
The exact resource usage of each mapped field depends on its type, but a rule of thumb is to allow for approximately 1kB of heap overhead per mapped field per index held by each data node. You must also allow enough heap for Elasticsearch’s baseline usage as well as your workload, such as indexing, searches, and aggregations. Extra heap of 0.5GB will suffice for many reasonable workloads, and you may need even less if your workload is very light while heavy workloads may require more.

For example, if a data node holds shards from 1000 indices, each containing 4000 mapped fields, then you should allow approximately 1000 × 4000 × 1kB = 4GB of heap for the fields and another 0.5GB of heap for its workload and other overheads, and therefore this node will need a heap size of at least 4.5GB.
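The arithmetic behind both rules of thumb can be sketched in a few lines. The function names are made up for this illustration, and the conversion uses decimal units (1GB = 1,000,000kB) to match the worked example above.

```python
def legacy_max_shards(heap_gb: float, shards_per_gb_heap: int = 20) -> int:
    """Pre-8.3 rule of thumb: at most 20 shards per GB of heap on a node."""
    return int(heap_gb * shards_per_gb_heap)


def estimate_data_node_heap_gb(index_count: int, fields_per_index: int,
                               kb_per_field: float = 1.0,
                               workload_gb: float = 0.5) -> float:
    """8.3+ rule of thumb: about 1kB of heap per mapped field per index, plus workload overhead."""
    field_heap_gb = index_count * fields_per_index * kb_per_field / 1_000_000
    return field_heap_gb + workload_gb


print(legacy_max_shards(30))                   # the 30GB heap example
print(estimate_data_node_heap_gb(1000, 4000))  # the 1000 indices x 4000 fields example
```

The 0.5GB workload allowance is the figure quoted above for many reasonable workloads; heavier indexing, search, or aggregation loads would need a larger value.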


How does shard size affect performance?

In Elasticsearch, each query is executed in a single thread per shard. Multiple shards can however be processed in parallel, as can multiple queries and aggregations against the same shard.

This means that the minimum query latency, when no caching is involved, will depend on the data, the type of query, as well as the size of the shard. Querying lots of small shards will make the processing per shard faster, but as many more tasks need to be queued up and processed in sequence, it is not necessarily going to be faster than querying a smaller number of larger shards. Having lots of small shards can also reduce the query throughput if there are multiple concurrent queries.
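The trade-off can be illustrated with a deliberately simple toy model: each shard is searched by one thread, and shard-level tasks beyond the number of available threads wait their turn. The per-shard overhead and per-GB scan cost below are invented numbers chosen only to show the shape of the curve; they are not measurements and say nothing about any real cluster.

```python
import math


def estimated_latency_s(total_data_gb: float, shard_count: int, search_threads: int,
                        per_shard_overhead_s: float = 0.005,
                        scan_s_per_gb: float = 0.01) -> float:
    """Toy latency model: one thread per shard, shards processed in waves of `search_threads`."""
    shard_size_gb = total_data_gb / shard_count
    per_shard_s = per_shard_overhead_s + shard_size_gb * scan_s_per_gb
    waves = math.ceil(shard_count / search_threads)  # shard tasks queue up once threads are busy
    return waves * per_shard_s


one_big = estimated_latency_s(100, shard_count=1, search_threads=4)
balanced = estimated_latency_s(100, shard_count=4, search_threads=4)
many_small = estimated_latency_s(100, shard_count=400, search_threads=4)
print(one_big, balanced, many_small)
```

In this model both extremes lose: a single large shard cannot be searched in parallel, while hundreds of tiny shards spend most of their time queued and paying fixed per-shard overhead.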


TIP: The best way to determine the maximum shard size from a query performance perspective is to benchmark using realistic data and queries. Always benchmark with a query and indexing load representative of what the node would need to handle in production, as optimizing for a single query might give misleading results.


How do I manage shard size?

When using time-based indices, each index has traditionally been associated with a fixed time period. Daily indices are very common, and are often used for holding data with a short retention period or large daily volumes. These allow the retention period to be managed with good granularity and make it easy to adjust for changing volumes on a daily basis. Data with a longer retention period, especially if the daily volumes do not warrant the use of daily indices, often uses weekly or monthly indices in order to keep the shard size up. This reduces the number of indices and shards that need to be stored in the cluster over time.


TIP: If using time-based indices covering a fixed period, adjust the period each index covers based on the retention period and expected data volumes in order to reach the target shard size.
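As a rough sketch of that adjustment, the helper below picks the shortest fixed period whose resulting shard size falls inside a target range. The 20GB to 40GB defaults come from the range mentioned earlier for time-based data; the function names and the 30-day approximation of a month are assumptions for illustration.

```python
def shard_size_gb(daily_volume_gb: float, days_per_index: int, primary_shards: int) -> float:
    """Expected primary shard size for an index covering `days_per_index` days."""
    return daily_volume_gb * days_per_index / primary_shards


def pick_index_period(daily_volume_gb: float, primary_shards: int = 1,
                      min_gb: float = 20.0, max_gb: float = 40.0):
    """Return the shortest period that lands in the target shard size range, or None."""
    periods = {"daily": 1, "weekly": 7, "monthly": 30}  # a month is approximated as 30 days
    for name, days in periods.items():
        if min_gb <= shard_size_gb(daily_volume_gb, days, primary_shards) <= max_gb:
            return name
    return None  # no fixed period fits; a size-based approach may suit better


print(pick_index_period(30))   # 30GB per day
print(pick_index_period(5))    # 5GB per day
print(pick_index_period(1))    # 1GB per day
```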


Time-based indices with a fixed time interval work well when data volumes are reasonably predictable and change slowly. If the indexing rate can vary quickly, it is very difficult to maintain a uniform target shard size.

To better handle this type of scenario, the Rollover and Shrink APIs were introduced. These add a lot of flexibility to how indices and shards are managed, specifically for time-based indices.

The rollover index API makes it possible to specify the number of documents an index should contain and/or the maximum period documents should be written to it. Once one of these criteria has been exceeded, Elasticsearch can trigger a new index to be created for writing without downtime. Instead of having each index cover a specific time-period, it is now possible to switch to a new index at a specific size, which makes it possible to more easily achieve an even shard size for all indices.

In cases where data might be updated, there is no longer a distinct link between the timestamp of the event and the index it resides in when using this API, which may make updates significantly less efficient as each update may need to be preceded by a search.


TIP: If you have time-based, immutable data where volumes can vary significantly over time, consider using the rollover index API to achieve an optimal target shard size by dynamically varying the time-period each index covers. This gives great flexibility and can help avoid having too large or too small shards when volumes are unpredictable.
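To make this concrete, a rollover request is a `POST` to `/<alias>/_rollover` with a `conditions` object, where `max_docs` and `max_age` correspond to the document count and time period criteria described above. The sketch below only assembles the request; the alias name and threshold values are placeholders, and requiring at least one condition is a choice made for this example.

```python
import json


def build_rollover_request(write_alias: str, max_age: str = None, max_docs: int = None):
    """Return (method, path, JSON body) for a conditional rollover of the given alias."""
    conditions = {}
    if max_age is not None:
        conditions["max_age"] = max_age    # e.g. "7d"
    if max_docs is not None:
        conditions["max_docs"] = max_docs
    if not conditions:
        # This sketch insists on at least one condition so the intent is explicit.
        raise ValueError("specify max_age and/or max_docs")
    body = json.dumps({"conditions": conditions}, sort_keys=True)
    return "POST", f"/{write_alias}/_rollover", body


# Placeholder alias and thresholds, for illustration only.
method, path, body = build_rollover_request("logs-write", max_age="7d", max_docs=50_000_000)
print(method, path, body)
```

The new index is created as soon as either condition is met when the request is evaluated, so the period each index covers shrinks automatically when volumes spike.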


The shrink index API allows you to shrink an existing index into a new index with fewer primary shards. If an even spread of shards across nodes is desired during indexing, but this will result in too small shards, this API can be used to reduce the number of primary shards once the index is no longer indexed into. This will result in larger shards, better suited for longer term storage of data.


TIP: If you need to have each index cover a specific time period but still want to be able to spread indexing out across a large number of nodes, consider using the shrink API to reduce the number of primary shards once the index is no longer indexed into. This API can also be used to reduce the number of shards in case you have initially configured too many shards.
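One constraint worth knowing when planning a shrink is that the target number of primary shards must be a factor of the source number. The sketch below lists the valid targets and assembles a shrink request (`POST /<source>/_shrink/<target>`) without sending it; the index names are placeholders, and other prerequisites of the shrink API are not covered here.

```python
import json


def valid_shrink_targets(source_shards: int) -> list:
    """Primary shard counts an index can be shrunk to: factors of the source count below it."""
    return [n for n in range(1, source_shards) if source_shards % n == 0]


def build_shrink_request(source: str, target: str, source_shards: int, target_shards: int):
    """Return (method, path, JSON body) for shrinking `source` into the new index `target`."""
    if target_shards not in valid_shrink_targets(source_shards):
        raise ValueError(f"{target_shards} is not a factor of {source_shards}")
    body = json.dumps({"settings": {"index.number_of_shards": target_shards}})
    return "POST", f"/{source}/_shrink/{target}", body


print(valid_shrink_targets(8))
# Placeholder index names, for illustration only.
method, path, body = build_shrink_request("logs-2022.07.05", "logs-2022.07.05-shrunk", 8, 2)
print(method, path, body)
```

Choosing an initial primary shard count with several factors, such as 8 or 12, leaves more options open for shrinking later.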


Conclusions

This blog post has provided tips and practical guidelines around how to best manage data in Elasticsearch. If you are interested in learning more, “Elasticsearch: the definitive guide” contains a section about designing for scale, which is well worth reading even though it is a bit old.

A lot of the decisions around how to best distribute your data across indices and shards will however depend on the use-case specifics, and it can sometimes be hard to determine how to best apply the advice available. For more in-depth and personal advice you can engage with us commercially through a subscription and let our Support and Consulting teams help accelerate your project. If you are happy to discuss your use-case in the open, you can also get help from our community and through our public forum.

This post was originally published on September 18, 2017. It was updated on July 6, 2022.


