
Phantom Metrics: Why Your Monitoring Dashboard May Be Lying to You

Jan 4, 2023 by iHash


Whether you’re a DevOps engineer, an SRE, or just a data-driven individual, you’re probably addicted to dashboards and metrics. We look at our metrics to see how our system is doing, whether at the infrastructure, application, or business level. We trust our metrics to show us the status of our system and where it misbehaves. But do our metrics show us what really happened? You’d be surprised how often they don’t.

In this post I will look into the math and mechanics behind metrics, some common misconceptions, what it takes to have accurate metrics, and whether there even is such a thing.

Table of Contents

  • Metrics essentials
  • The mechanics of metrics
  • The math of metrics in a nutshell
  • Determine your questions, design your metrics accordingly
  • The measurement problem
  • Mean time to detection
  • Varying resolution and downscaling
  • Summary

Metrics essentials

Metrics are essentially rollups of raw events. During the rollup process, the events are translated into numerical data points. A simple example is errors occurring in the system, with a simple metric that counts them. Metrics can also involve multiple variables, such as a count of requests with a response time higher than 1 second. When measured over time, these data points form a time series.
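As a rough illustration of such a rollup, here is a minimal Python sketch. The event tuples, the 60-second window, and the `roll_up` helper are all hypothetical, made up for this example rather than taken from any particular collector.

```python
from collections import Counter

# Hypothetical raw events: (timestamp in seconds, HTTP status, response time in seconds)
events = [
    (0, 200, 0.12),
    (15, 500, 0.30),
    (42, 200, 1.40),
    (61, 200, 0.25),
    (75, 503, 2.10),
    (119, 200, 0.08),
    (130, 200, 1.05),
]

def roll_up(events, window_seconds=60):
    """Roll raw events up into two count metrics per time window."""
    errors = Counter()
    slow_requests = Counter()
    for timestamp, status, response_time in events:
        window_start = timestamp - timestamp % window_seconds
        if status >= 500:            # metric 1: count of server errors
            errors[window_start] += 1
        if response_time > 1.0:      # metric 2: count of requests slower than 1 second
            slow_requests[window_start] += 1
    windows = sorted({t - t % window_seconds for t, _, _ in events})
    # Each entry is one data point of the resulting time series.
    return [(w, errors[w], slow_requests[w]) for w in windows]

series = roll_up(events)
for window_start, error_count, slow_count in series:
    print(window_start, error_count, slow_count)
```

Seven events become three data points per metric. The individual status codes and response times are gone once the rollup is done.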

Metrics come in various types, such as Counters, Gauges and Histograms. Counters are used for the cumulative counting of events, as in the examples above. Gauges typically represent the latest value of a measurement. And then there are more elaborate types such as Histograms, which sample the distribution of metric values by counting events in configurable “buckets” or “bins”. For example, you may want to understand how memory usage percentage is distributed across the pods in your cluster at given points in time.
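To make the three types concrete, the sketch below models each one as a tiny Python class. The class names, the bucket bounds, and the sample readings are illustrative assumptions and do not mirror the API of any specific metrics library.

```python
import bisect

class CounterMetric:
    """Cumulative count of events; it only ever goes up."""
    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        self.value += amount

class GaugeMetric:
    """Latest value of a measurement; it can go up or down."""
    def __init__(self):
        self.value = None

    def set(self, value):
        self.value = value

class HistogramMetric:
    """Counts observations in buckets defined by their upper bounds."""
    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra bucket catches values above the highest bound.
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value):
        index = bisect.bisect_left(self.upper_bounds, value)
        self.counts[index] += 1

errors = CounterMetric()
for _ in range(3):
    errors.inc()

memory_gauge = GaugeMetric()
for reading in (41.0, 63.5, 58.2):
    memory_gauge.set(reading)      # only the last reading survives

memory_histogram = HistogramMetric([25, 50, 75, 100])
for pod_memory_percent in (12, 30, 47, 51, 68, 74, 90):
    memory_histogram.observe(pod_memory_percent)

print(errors.value)
print(memory_gauge.value)
print(memory_histogram.counts)
```

The histogram keeps one count per bucket, so it describes the shape of the distribution across pods without retaining each pod's reading.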

The mechanics of metrics

In an ideal world, we would ingest and store all the raw events, and then calculate the metrics at query time. This would allow us to slice and dice the events in any way we need, and ask any ad-hoc question we desire.

In the real world, however, keeping all the raw events for extended periods of time can be prohibitively expensive, due to the high volumes of data. To overcome this, events are often rolled up into metrics in the collection pipeline, while the raw events are discarded or retained for short periods only. This is often a matter of simple configuration in your metrics collector agent.

In addition to reducing cost, aggregating upon collection can improve the performance of real-time analytics: the smaller data volume allows metrics to be transmitted and ingested at a higher frequency, and it avoids heavy aggregations and calculations at query time.

The math of metrics in a nutshell

This rollup process involves some math. We might want to calculate the mean or median of the response times, or maybe a percentile, or an aggregation over a time window. We might also want to roll up multiple events into one composite metric. For example, I may want to calculate the 95th percentile (commonly known as P95) across all the pods of a specific service in my cluster.

Even if you don’t like math, you cannot avoid it with metrics. You need to understand the different aggregation functions, and the relation between the question you wish to ask and the metric and aggregate you need in order to answer it. Let’s look at the Average function as an example, as many tend to start there. Averages, by definition, smooth things out, and are less suitable for flushing out anomalous behavior and outliers. When investigating latency problems, for example, average values are of little use, and you’d be better off looking at percentiles.
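A small numeric sketch shows how an average can hide a latency problem. The latency values are invented for illustration, and the `percentile` helper uses the simple nearest-rank definition, which is only one of several percentile definitions in use.

```python
import math
import statistics

# Hypothetical response times in milliseconds: 90 fast requests and 10 very slow ones.
latencies_ms = [100] * 90 + [3000] * 10

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = percentile(latencies_ms, 95)

print(mean_ms)    # 390: a value that no single request actually experienced
print(median_ms)  # 100: the typical request is fast
print(p95_ms)     # 3000: the slow tail that the mean blurs away
```

The mean of 390 ms suggests mild slowness everywhere. The percentiles tell the real story: most requests are fast, and one in ten takes three seconds.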


Determine your questions, design your metrics accordingly

In a way, you can think of these metrics as lossy compression, during which we lose data and context from the raw events. If we don’t keep the raw events, then we need to determine upfront what’s important to us. For example, if I only calculate the average value over the data, I will not be able to ask for the P95 (95th percentile) later from the pre-aggregated data.
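The loss can be demonstrated with two made-up sets of response times that produce an identical pre-aggregated record yet have very different P95 values. The `pre_aggregate` helper and the data are hypothetical, and the percentile uses the nearest-rank definition.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def pre_aggregate(values):
    """Keep only what an average needs: the count and the sum."""
    return {"count": len(values), "sum": sum(values)}

# Two hypothetical sets of response times (ms) with the same count and sum.
steady = [200] * 20
spiky = [100] * 18 + [1100] * 2

print(pre_aggregate(steady))   # {'count': 20, 'sum': 4000}
print(pre_aggregate(spiky))    # {'count': 20, 'sum': 4000}
print(percentile(steady, 95))  # 200
print(percentile(spiky, 95))   # 1100
```

Once only the count and sum are stored, the two cases are indistinguishable, so no later query can recover which P95 was the real one.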

You need to determine what questions you want to answer and what’s important to you, and design your metrics and aggregations accordingly. A common mistake is to skip this design phase and just use the preset metrics and default values provided out of the box with the metrics collector of choice. While you may think these defaults represent some industry standard, they are often outdated legacy choices, and in most cases won’t be in tune with your specific needs.

The measurement problem

Just like in physics, the measurement problem occurs when we measure a (seemingly) continuous property at discrete intervals, often called the sampling interval, which determines the sampling rate. This creates a distorted representation, whereby the metric may not actually reflect the original measured property. For example, if we measure the CPU utilization every 60 seconds, then any CPU outlier happening between these sampling points will be invisible to us. Moreover, in order to draw a continuous line, visualization tools often average over consecutive data points, which gives the misleading appearance of a smooth line.

On some occasions the opposite can occur, and you get artifacts in your metrics that aren’t real, like peaks that don’t really exist. This can happen when running aggregations within the storage backend, due to the way in which the calculation is being made.
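The invisible-outlier case can be sketched with a synthetic signal. The per-second CPU values below are invented: a flat 20% baseline with a single 20-second burst that falls between two 60-second sampling points.

```python
# Hypothetical per-second CPU utilization (%) over five minutes:
# a steady 20% baseline with one 20-second burst at 95%.
cpu_per_second = [20.0] * 300
for second in range(70, 90):
    cpu_per_second[second] = 95.0

def sample(signal, interval_seconds):
    """Keep only the instantaneous value at each sampling point."""
    return signal[::interval_seconds]

sampled_60s = sample(cpu_per_second, 60)   # readings at 0, 60, 120, 180, 240 seconds
sampled_10s = sample(cpu_per_second, 10)   # readings at 0, 10, 20, ... seconds

print(max(cpu_per_second))  # 95.0: the true peak
print(max(sampled_60s))     # 20.0: the burst is invisible
print(max(sampled_10s))     # 95.0: a shorter interval catches it
```

At a 60-second interval the dashboard would show a flat line at 20%, even though the CPU was nearly saturated for 20 seconds.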

Mean time to detection

The sampling interval also influences how fast a change in the system becomes visible in the metrics. Many detection algorithms need several data points, say five, to detect a trend. If the sampling interval is 60 seconds, then simple math shows that it will take five minutes (that is, 60 sec × 5 data points) before we see that something is wrong. Can you afford to wait five minutes to learn that your system crashed? Using shorter sampling intervals (i.e. higher sampling rates) will shorten this period and enable us to detect and react faster. Of course, higher sampling rates incur overhead in CPU and storage, so we need to find the configuration that strikes the right balance for our needs.
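The trade-off is easy to tabulate. The sketch below assumes five data points are needed to detect a trend, as in the example above, and uses the number of samples per day as a crude stand-in for collection and storage overhead. Both helper functions are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def detection_delay_seconds(sampling_interval_seconds, points_needed=5):
    """Time that passes before enough samples exist to flag a trend."""
    return sampling_interval_seconds * points_needed

def samples_per_day(sampling_interval_seconds):
    """Data points collected per time series per day."""
    return SECONDS_PER_DAY // sampling_interval_seconds

report = [
    (interval, detection_delay_seconds(interval), samples_per_day(interval))
    for interval in (60, 30, 10)
]
for interval, delay, samples in report:
    print(f"{interval:>2}s interval: {delay:>3}s to detect, {samples} samples/day")
```

Going from a 60-second to a 10-second interval cuts the detection delay from 300 to 50 seconds, at the price of six times as many data points per series.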

Varying resolution and downscaling

A common practice is to save metrics at different resolutions in a tiered approach, to reduce cost. For example, you may want to save the metric every 10 seconds for the first day, then every 5 minutes for the next week, and perhaps every hour for the following month or longer. This practice assumes that we need the finest granularity only for the near-real-time period, when there may be a live issue in the system to investigate, while investigations over longer periods look at larger-scale trends.
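To get a feel for the savings, the following sketch counts the data points stored per time series under the tiers from the example. Treating the last tier as exactly 30 days is an assumption made here for the arithmetic.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# (resolution in seconds, retention in days); the 30-day last tier is an assumption.
tiers = [
    (10, 1),        # every 10 seconds for the first day
    (5 * 60, 7),    # every 5 minutes for the next week
    (60 * 60, 30),  # every hour for the following month
]

points_per_tier = [days * SECONDS_PER_DAY // resolution for resolution, days in tiers]
tiered_total = sum(points_per_tier)

# Baseline: keep the finest resolution for the whole retention period.
total_days = sum(days for _, days in tiers)
finest_resolution = tiers[0][0]
flat_total = total_days * SECONDS_PER_DAY // finest_resolution

print(points_per_tier)  # [8640, 2016, 720]
print(tiered_total)     # 11376
print(flat_total)       # 328320
```

Under these assumptions the tiered scheme stores nearly 29 times fewer data points per series than keeping 10-second resolution throughout.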

The different granularities can be achieved by downscaling the metrics, namely calculating the less granular metric from the more granular one. While this sounds perfectly reasonable, math can interfere here, as some aggregation functions cannot be re-aggregated after the fact. For example, percentiles are not additive: they cannot be summed or averaged into a correct percentile. So, following the above example, if you have a P99 sampled at 10-second resolution, you can’t roll it up to a 5-minute resolution. It’s important to be cognizant of the compatibility of the aggregation functions and, when using non-compatible functions such as percentiles, to decide at design time which resolutions you require and calculate those time series upfront.
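A deliberately lopsided example makes the problem visible. The two windows below are invented: one busy window of fast requests and one quiet window containing a single slow request. The percentile is computed with the nearest-rank definition.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical response times (ms) in two consecutive 10-second windows.
busy_window = [100] * 100           # 100 fast requests
quiet_window = [100] * 9 + [5000]   # 10 requests, one of them very slow

p99_per_window = [percentile(busy_window, 99), percentile(quiet_window, 99)]
averaged_p99 = statistics.mean(p99_per_window)
true_p99 = percentile(busy_window + quiet_window, 99)

print(p99_per_window)  # [100, 5000]
print(averaged_p99)    # 2550: the result of rolling up the stored percentiles
print(true_p99)        # 100: the P99 computed over all 110 raw requests
```

Averaging the per-window values ignores how many requests each window contained, so the rolled-up number is off by more than an order of magnitude here. Taking the maximum instead would be just as wrong in this case.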

Varying resolution is not limited to the time dimension. Another example is saving per-pod data, and then wishing to “group by” nodes or clusters. The same constraint applies here: if we expect to slice and dice a percentile-based metric per node, per region, per namespace, or across the entire cluster, we need to pre-aggregate accordingly.

Another approach is to give up some accuracy of measurement in order to gain compatibility in computation, by using histograms. You can take the histograms of a few servers and sum them up, or the histograms of several time windows and sum them up, and then downscale. The problem is that in this case percentiles will be estimates rather than exact values. It’s also important to note that histograms consume more storage and throughput, as every sample is not a single number but several (one per bucket).
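The sketch below illustrates both properties: histograms with shared bucket bounds can be summed, and a percentile read from the merged histogram is only an estimate. The bucket bounds, the sample data, and the choice to report the bucket's upper bound as the estimate are assumptions made for this example; real systems often interpolate within the bucket instead.

```python
import bisect
import math

BOUNDS_MS = [100, 250, 500, 1000, 2500]  # hypothetical bucket upper bounds

def to_histogram(values):
    """Count values per bucket; the last bucket holds everything above the top bound."""
    counts = [0] * (len(BOUNDS_MS) + 1)
    for value in values:
        counts[bisect.bisect_left(BOUNDS_MS, value)] += 1
    return counts

def merge(*histograms):
    """Histograms with identical bounds are additive: sum them bucket by bucket."""
    return [sum(bucket) for bucket in zip(*histograms)]

def estimate_percentile(counts, p):
    """Estimate a percentile as the upper bound of the bucket that contains it."""
    rank = max(1, math.ceil(p * sum(counts) / 100))
    cumulative = 0
    for upper_bound, count in zip(BOUNDS_MS + [math.inf], counts):
        cumulative += count
        if cumulative >= rank:
            return upper_bound
    return math.inf  # only reached for an empty histogram

def exact_percentile(values, p):
    """Nearest-rank percentile over the raw values, for comparison."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(p * len(ordered) / 100)) - 1]

# Hypothetical response times (ms) observed on two servers.
server_a = [80, 90, 120, 130, 140, 200, 220, 300, 320, 900]
server_b = [60, 70, 95, 110, 150, 180, 260, 400, 450, 700]

merged = merge(to_histogram(server_a), to_histogram(server_b))

print(merged)                                     # [5, 8, 5, 2, 0, 0]
print(estimate_percentile(merged, 90))            # 500 (estimate from buckets)
print(exact_percentile(server_a + server_b, 90))  # 450 (from raw values)
```

The merged histogram is identical to the one built from the combined raw data, which is what makes the rollup valid. The estimate can only be as precise as the bucket boundaries, and each server ships six numbers per sample instead of one.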

Summary

Metrics are a powerful way to monitor our applications, but they do not necessarily represent the actual state of the system. It takes an understanding of the math and nature of metrics, as well as careful design, to make sure our metrics are indeed useful for answering the questions we need answered. Having access to the raw data in addition to the metrics is always good, as this is ultimately the source of truth.

Want to learn more? Check out the OpenObservability Talks episode: All Metrics Are Wrong, Some Are Useful.


