Identifying performance bottlenecks and wasteful computations can be a complex and challenging task, particularly in modern cloud-native environments.
As the complexity of cloud-native environments increases, so does the need for effective observability solutions. Organizations typically rely on the established three pillars of observability (metrics, logs, and traces) to improve the reliability and performance of their applications and infrastructure. While these pillars are valuable, they are not enough on their own.
In this blog post, we will discuss why continuous profiling signals are a must-have in your observability toolbox and how they complement metrics, logs, and traces. We will also discuss why it is crucial to consolidate all four observability signals in a unified platform rather than in dispersed, siloed tools.
What are the different types of observability signals?
Before we go into more detail, let’s briefly describe the four types of observability signals and their common use cases.
Profiles: Profiles (also referred to as profiling signals in this blog post) are stack traces that provide a detailed view of where code spends resources, typically CPU cycles or memory. They provide an overview of the most expensive areas of a system, down to a single line of code.
Metrics: Metrics are numerical values that represent the state or performance of a system at a particular point in time. They are used to monitor health, identify trends, and trigger alerts.
Logs: Logs are records of events and messages emitted by a system. They provide insights into system behavior and help identify issues.
Traces: Traces are detailed records of a request’s path through a system. They are used to understand dependencies and interactions between system components.
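To make the profile definition above more concrete, here is a minimal Python sketch of what a single stack-trace sample can look like. The function names (handle_request, parse_payload, capture_sample) are hypothetical and exist only for illustration. Production profilers collect samples very differently, so treat this as a picture of the data shape rather than of any particular implementation.

```python
import traceback


def capture_sample():
    """Return the current call stack as function names, outermost frame first."""
    # extract_stack() lists frames oldest first and ends with this function's
    # own frame, so drop the last entry to keep only the callers.
    frames = traceback.extract_stack()[:-1]
    return [frame.name for frame in frames]


def parse_payload():
    # Hypothetical leaf function: the code "on CPU" when the sample is taken.
    return capture_sample()


def handle_request():
    # Hypothetical caller of the leaf function.
    return parse_payload()


sample = handle_request()
# Show the two innermost frames of the sample.
print(" -> ".join(sample[-2:]))
```

A profile is, roughly, many such samples collected over time and counted, which is what lets it point at the most expensive code paths.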
When are metrics, logs, and traces not enough?
Observability is more than just monitoring your system; it is about gaining a comprehensive understanding of it, which is measured by how well practitioners can answer the “why” questions. To effectively understand a system, however, developers, SREs, and CloudOps engineers need granular visibility of where compute resources are spent across their entire fleet, including the unknown-unknowns that may be lurking beneath the surface. This is where profiling (in production) becomes a crucial signal in your observability stack.
Metrics, logs, and traces all have their own unique strengths in providing insight into the performance of a system, but profiling offers a deeper level of visibility that goes beyond what these other signals can provide. Profiles make it possible to identify even the most obscure issues, such as those related to data structures and memory allocation, and they provide code-level visibility in both the kernel and userspace.
Put another way, metrics, logs, and traces are analogous to measuring and monitoring the vital signs of the human body: they provide general information about health and performance, such as body temperature, weight, and heart rate, along with records of events leading to symptoms. Profiling, by contrast, is like taking an X-ray: it allows you to see the inner workings of the body and understand how different systems interact, giving more detailed information and potentially identifying issues that would not be visible just by looking at macro-level indicators.
Further, profiling provides a breadth and depth of visibility that can surface the unknown-unknowns of your system. This deeper level of system-wide visibility enables users to ditch the guesswork and quickly get to the heart of the “why” questions: Why are we spending x% of our CPU budget on function y? Why is z happening? What is the most expensive function across our entire fleet?
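As a rough illustration of how questions like these can be answered from profiling data, the Python sketch below aggregates a handful of made-up stack samples. The host names, function names, and sample data are all hypothetical, and the helper functions (self_cost, cpu_share) are our own illustrative names, not part of any product API. It assumes each sample represents an equal slice of CPU time.

```python
from collections import Counter

# Hypothetical samples: (host, stack) pairs, each stack listed outermost frame first.
# The host is kept only to show that samples come from across the fleet.
SAMPLES = [
    ("host-a", ("main", "serve", "json_encode")),
    ("host-a", ("main", "serve", "json_encode")),
    ("host-a", ("main", "serve", "db_query")),
    ("host-b", ("main", "serve", "json_encode")),
    ("host-b", ("main", "gc_collect")),
]


def self_cost(samples):
    """Count samples per leaf function, i.e. where the CPU was actually executing."""
    return Counter(stack[-1] for _, stack in samples)


def cpu_share(samples, function):
    """Percentage of all samples in which `function` appears anywhere on the stack."""
    hits = sum(1 for _, stack in samples if function in stack)
    return 100.0 * hits / len(samples)


# "What is the most expensive function across our entire fleet?"
most_expensive, count = self_cost(SAMPLES).most_common(1)[0]
print(most_expensive, count)        # json_encode 3

# "Why are we spending x% of our CPU budget on function y?" starts with knowing x.
print(cpu_share(SAMPLES, "serve"))  # 80.0
```

Counting leaf frames gives a function's own cost, while counting any appearance on the stack gives its cost including everything it calls. Both views are useful when deciding where to optimize.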
Elastic Universal Profiling™ extends the benefits of profiling to DevFinOps practitioners by providing a better understanding of how specific lines of code affect their cloud costs and carbon footprint. They can identify specific areas where resources are being wasted and take action to optimize and reduce costs, as well as reduce the environmental impact of their applications. This ultimately results in cost savings and a reduction in the carbon footprint of their organization.
To summarize, in most scenarios, metrics and traces provide visibility into the known-unknowns of a system. Logs, on the other hand, provide visibility into the known-knowns of a system. Together, the three pillars of observability provide macro-level visibility into the system. Observability without profiling leaves a significant gap in visibility, as there are always unknown-unknowns in any system. Profiling signals close that gap by providing micro-level visibility into a system, and this level of visibility is a must-have in modern cloud-native environments.
The next section dives deeper into the unknown-unknown concept using the Johari Window framework.
What is the relationship between the Johari Window framework and observability?
The Johari Window framework was developed by Joseph Luft and Harry Ingham, and it is widely adopted by professionals in national defense and risk management to assess and evaluate threats and risks [1]. According to the framework, the knowledge of a system can be categorized into known-knowns, known-unknowns, and unknown-unknowns.
Observability is anchored on the collection and analysis of data to gain knowledge of a system, so we can utilize the Johari Window framework to classify observability signals as follows: