Machine Learning In Observability — Fact or Fiction?

Krish · Published in AI Sutra · 3 min read · Apr 7, 2018


With Cloud Native becoming the hot topic in the modern enterprise space, the focus is shifting from the nature of the underlying infrastructure for cloud native applications to topics like monitoring, log analytics, and tracing. The shift is not just about tools but about what one should be doing with those tools in modern IT stacks: moving from symptoms to debuggability. With modern IT stacks becoming more distributed, and the delineation between the application and the underlying infrastructure going away thanks to DevOps becoming mainstream, it is critical that we go beyond knowing what is happening in the stack to understanding why it is happening and what can be done to mitigate it. Enter observability, the hottest term in the industry today.

At Rishidot Research, we prefer Twitter’s definition of observability, and we consider it to be still evolving as the industry works out its best practices. According to Twitter, the four pillars of observability are:

  • Monitoring
  • Alerting/Visualization
  • Distributed systems tracing infrastructure
  • Log aggregation/analytics

This post is not about observability per se but about the means to achieving it.

Traditionally, monitoring depended on understanding failure modes in order to decide what needed to be monitored. Even in the SRE world, the decision on what to monitor was based on common failure modes, wrapped with knowledge of the systems and the needs of a particular organization. With the mandate of observability being debuggability, and with both the underlying infrastructure and the applications becoming more distributed and loosely coupled, such a human-centric approach to observability is quite limited. One needs to think beyond known failure modes and prepare for unpredictable behaviors. Preparing for unpredictable behaviors and unknown failure modes requires the ability to predict, taking into account the large swaths of data available across monitoring systems, traces, and logs.

This is where, in our strong belief, Machine Learning and AI have a critical role to play. The traditional approach to observability will be very limited, and will become irrelevant as scale increases dramatically with the digitization of business across multiple verticals. Organizations are already employing machine learning models and AI to tackle observability data, but we are still scratching the surface. With a declarative approach to operations becoming more and more fashionable, and approaches like GitOps becoming the norm for orchestrating infrastructure, the role of debuggability becomes ever more critical. Not only is it important from the point of view of blameless postmortems, it is also important for being proactive and fixing issues well before they become a headache. Thinking further ahead, ML/AI in observability is the first step towards an increased role of AI in operations.
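To make the idea concrete, here is a minimal sketch of the simplest form of this: statistical anomaly detection over a metric stream, flagging samples that deviate sharply from recent history without any pre-declared failure mode. The function name, window size, and threshold are illustrative assumptions, not any particular vendor's implementation; production systems would use far richer models over metrics, traces, and logs.

```python
# Illustrative sketch: flag anomalous metric samples with a rolling z-score.
# All names and parameters here are assumptions for illustration only.
from collections import deque
from statistics import mean, stdev

def detect_anomalies(samples, window=10, threshold=3.0):
    """Return indices of samples deviating more than `threshold`
    standard deviations from the trailing window's mean."""
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
        history.append(value)
    return anomalies

# Steady request latencies (ms) with one spike the model was never told about
latencies = [100, 102, 99, 101, 100, 98, 103, 100, 101, 99, 500, 100]
print(detect_anomalies(latencies))  # the spike at index 10 is flagged
```

The point of the sketch is the inversion it represents: nothing in the code enumerates failure modes, yet the unknown spike still surfaces. ML-driven observability extends this same idea to higher-dimensional signals across the four pillars.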

Read the full post, and find out about the Virtual Panel on the topic, at StackSense.io, where it was originally published.
