Our journey began in the 7.x series, when we noticed that ad-hoc aggregations on raw transaction data put Elasticsearch® under a lot of pressure in large-scale environments. Since then, we’ve pre-aggregated transactions into transaction metrics during ingestion. This has helped keep the performance of the UI relatively stable: regardless of how busy the monitored application is and how many transaction events it creates, we only query pre-aggregated metrics that are stored at a constant rate. We enabled the metrics-powered UI by default in 7.15.
However, when showing an inventory of a large number of services over large time ranges, the number of metric data points that need to be aggregated can still be large enough to cause performance issues. We also create a time series for each distinct set of dimensions. The dimensions include metadata, such as the transaction name and the host name. Our documentation includes a full list of all available dimensions. If there’s a very high number of unique transaction names, which could be a result of improper instrumentation (see docs for more details), this will create a lot of individual time series that will need to be aggregated when requesting a summary of the service’s overall performance. Global labels that are added to the APM Agent configuration are also added as dimensions to these metrics, and therefore they can also impact the number of time series. Refer to the FAQs section below for more details.
In the 8.7 and 8.8 releases, we’ve addressed these challenges with the following architectural enhancements. They aim to reduce the number of documents Elasticsearch needs to search and aggregate on the fly, resulting in faster response times:
- Pre-aggregation of transaction metrics into service metrics. Instead of aggregating, on the fly and for every user request, all the distinct time series created for each individual transaction name, we now pre-aggregate a summary time series for each service during data ingestion. Depending on how many unique transaction names the services have, this typically reduces the number of documents Elasticsearch needs to look up and aggregate by a factor of 10–100. This is particularly useful for the service inventory and the service overview pages.
- Pre-aggregation of all metrics into different levels of granularity. The APM UI chooses the most appropriate level of granularity, depending on the selected time range. In addition to the metrics that are stored at a 1-minute granularity, we’re also summarizing and storing metrics at a 10-minute and 60-minute granularity level. For example, when looking at a 7-day period, the 60-minute data stream is queried instead of the 1-minute one, resulting in 60x fewer documents for Elasticsearch to examine. This makes sure that all graphs are rendered quickly, even when looking at larger time ranges.
- Safeguards on the number of unique transactions per service for which we are aggregating metrics. Our agents are designed to keep the cardinality of the transaction name low. But in the wild, we’ve seen some services that have a huge number of unique transaction names. This used to cause performance problems in the UI because APM Server would create many time series that the UI needed to aggregate at query time. Previously, to protect APM Server from running out of memory when aggregating a large number of time series for each unique transaction name, metrics were published without aggregation once the limits for the number of time series were reached. This resulted in a lot of individual metric documents that needed to be aggregated at query time. To address the problem, we’ve introduced a system that aggregates metrics into a dedicated overflow bucket for each service when limits are reached. Refer to our documentation for more details.
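To make the overflow bucket idea more concrete, here is a minimal Python sketch of the general technique. It is an illustration only, not APM Server's actual implementation: the class name `ServiceAggregator`, the bucket name `_overflow`, and the `max_names_per_service` limit are all hypothetical, and the real limits are described in the documentation.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical placeholder name for the per-service overflow bucket.
OVERFLOW_NAME = "_overflow"


@dataclass
class Stats:
    """Running aggregate for one time series."""
    count: int = 0
    total_duration_ms: float = 0.0

    def add(self, duration_ms: float) -> None:
        self.count += 1
        self.total_duration_ms += duration_ms


class ServiceAggregator:
    """Aggregates transaction durations per (service, transaction name)."""

    def __init__(self, max_names_per_service: int) -> None:
        # Hypothetical limit on distinct transaction names tracked per service.
        self.max_names_per_service = max_names_per_service
        # service name -> {transaction name -> Stats}
        self.buckets = defaultdict(dict)

    def record(self, service: str, transaction_name: str, duration_ms: float) -> None:
        names = self.buckets[service]
        if transaction_name not in names:
            tracked = sum(1 for name in names if name != OVERFLOW_NAME)
            if tracked >= self.max_names_per_service:
                # Limit reached: fold the event into the overflow bucket
                # instead of creating yet another time series.
                transaction_name = OVERFLOW_NAME
        names.setdefault(transaction_name, Stats()).add(duration_ms)


if __name__ == "__main__":
    agg = ServiceAggregator(max_names_per_service=2)
    for name in ["GET /a", "GET /b", "GET /c", "GET /d", "GET /a"]:
        agg.record("checkout", name, 10.0)
    for name, stats in agg.buckets["checkout"].items():
        print(name, stats.count)
```

In this sketch, memory stays bounded because each service holds at most `max_names_per_service + 1` aggregates, and every event is still counted somewhere, so service-level totals remain accurate even when individual transaction names are no longer distinguishable.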
The exact factor of the document count reduction depends on various conditions. But to get a feeling for a typical scenario: if your services, on average, have 10 instances, no instance-specific global labels, and 100 unique transaction names each, and you’re looking at time ranges that can leverage the 60m granularity, you’d see the number of documents that Elasticsearch needs to aggregate reduced by a factor of 180,000 (10 instances x 100 transaction names x 60 for the 60m granularity x 3 because we’re also collapsing the event.outcome dimension). While the response time of Elasticsearch aggregations doesn’t scale exactly linearly with the number of documents, there is a strong correlation.
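The arithmetic behind that estimate can be written out as a small Python helper. This is a back-of-the-envelope sketch only: the function name `reduction_factor` and its parameters are made up for illustration, and it simply multiplies the dimensions that get collapsed in the scenario described above.

```python
def reduction_factor(
    instances: int,
    transaction_names: int,
    granularity_minutes: int,
    outcomes: int = 3,
) -> int:
    """Estimate how many fewer documents need to be aggregated.

    Each argument is a dimension that is collapsed by pre-aggregation:
    service instances, unique transaction names, the coarser time
    granularity (relative to 1-minute metrics), and the event.outcome
    dimension (a factor of 3 in the scenario above).
    """
    return instances * transaction_names * granularity_minutes * outcomes


if __name__ == "__main__":
    # The scenario from the text: 10 instances, 100 transaction names,
    # 60m granularity.
    print(reduction_factor(10, 100, 60))  # 180000
    # The same services over a time range served by the 10m granularity.
    print(reduction_factor(10, 100, 10))  # 30000
```

Plugging in your own instance counts and transaction name cardinality gives a rough sense of how much the pre-aggregated data streams shrink the query workload in your environment.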