It’s been a huge team effort to bring this tool to life and finally to general availability. Several features available in Elasticsearch®’s search and analytics engine were essential to make this possible:
The p_value scoring heuristic for the significant_terms aggregation identifies statistically significant field/value pairs within logs by comparing how often they occur during a deviation against the baseline log rate. It pinpoints which fields have the most impact on the deviation, providing an initial explanation and a valuable starting point for digging into root causes. When analyzing web logs, for instance, it can identify the source IPs or URLs contributing to a spike in your logs.
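As a minimal sketch, here is what such a query can look like. The index name, field names, and time ranges are illustrative: the foreground query covers the deviation window, while the background_filter covers the baseline period (with background_is_superset set to false because the baseline does not contain the foreground documents):

```
POST web-logs/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-15m" } }
  },
  "aggs": {
    "significant_urls": {
      "significant_terms": {
        "field": "url.keyword",
        "background_filter": {
          "range": { "@timestamp": { "gte": "now-24h", "lt": "now-15m" } }
        },
        "p_value": { "background_is_superset": false, "normalize_above": 1000 }
      }
    }
  }
}
```

Each returned bucket carries a significance score, so terms whose frequency shift is most unlikely to be chance surface first.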
The frequent_item_sets aggregation employs a data mining technique that finds frequent, relevant patterns in large data sets. Its implementation as an Elasticsearch aggregation makes it available as a building block for many use cases like recommender systems, behavioral analytics, or fraud detection. For log rate analysis, we use the aggregation to identify groups of correlated, statistically significant field/value pairs. Again with web logs, for example, this can reveal which types of users accessing certain URLs are causing an increase or decrease in log activity.
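A sketch of the aggregation on its own, assuming hypothetical web log fields; minimum_support is the fraction of documents a set must appear in, and minimum_set_size keeps single values out so only co-occurring groups are returned:

```
POST web-logs/_search
{
  "size": 0,
  "aggs": {
    "correlated_pairs": {
      "frequent_item_sets": {
        "minimum_set_size": 2,
        "minimum_support": 0.01,
        "size": 10,
        "fields": [
          { "field": "url.keyword" },
          { "field": "user_agent.name" },
          { "field": "source.ip" }
        ]
      }
    }
  }
}
```

In the log rate analysis workflow, an aggregation along these lines runs over the field/value pairs already flagged as significant, grouping them into explanatory sets rather than presenting them one by one.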
Finally, the random_sampler aggregation allows us to scale the feature effectively for today’s Observability workloads. It randomly samples documents in a statistically robust manner, letting us balance speed and accuracy at query time, as opposed to approaches that require sampling decisions upfront as part of ingesting or rolling up data.
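The sampler simply wraps other aggregations. A sketch, again with illustrative names, sampling roughly 1% of matching documents before computing a per-minute log rate histogram:

```
POST web-logs/_search
{
  "size": 0,
  "aggs": {
    "sampled": {
      "random_sampler": { "probability": 0.01 },
      "aggs": {
        "log_rate": {
          "date_histogram": {
            "field": "@timestamp",
            "fixed_interval": "1m"
          }
        }
      }
    }
  }
}
```

Because the probability is a query-time parameter, it can be tuned per request: a small probability for a fast first pass over billions of documents, a larger one when more precision is needed.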