Intuit recently described how it used generative AI (GenAI) to tackle the challenges of monitoring and debugging Kubernetes clusters. The experiments aimed to improve three processes: detection, debugging, and remediation.
Lili Wan, a senior staff software engineer, and Anusha Ragunathan, a principal software engineer, both at Intuit, explained the experiments and gave background on Intuit’s Kubernetes Service platform.
Running more than 325 Kubernetes clusters that support over 7,000 applications and services, Intuit faced significant challenges in keeping the clusters healthy while reducing alert fatigue among on-call engineers.
Intuit’s Kubernetes Service platform is large and intricate, which makes it difficult to observe and debug. The rapid growth in applications and the frequent changes within the clusters added further complexity. Engineers often suffered alert fatigue from the sheer volume of data sources and alerts, which slowed the identification and resolution of issues.
Intuit’s team pinpointed three primary areas needing enhancement: detection, debugging, and remediation.
To improve detection, Intuit introduced a system named “Cluster Golden Signals,” modeled on the golden signals used for service monitoring. It gives an overview of a cluster’s health by filtering out irrelevant data and concentrating on the signals that matter for alerting.
The core components of each Kubernetes cluster are monitored through dashboards that use Prometheus expressions to roll metrics up into a single health status: Healthy, Degraded, or Critical. This lets engineers quickly identify problematic clusters and determine whether an issue stems from the service or the platform, lowering the mean time to detect (MTTD).
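The roll-up idea can be illustrated with a minimal sketch. It is not Intuit's implementation: the component names, the error-ratio input, and the thresholds are all hypothetical, and the values stand in for whatever the Prometheus expressions would return.

```python
from enum import IntEnum
from typing import Dict, Tuple


class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    CRITICAL = 2


# Hypothetical thresholds per component: (degraded_at, critical_at),
# applied to an error ratio between 0.0 and 1.0.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "apiserver": (0.01, 0.05),
    "etcd": (0.01, 0.05),
    "scheduler": (0.02, 0.10),
}


def classify(component: str, error_ratio: float) -> Health:
    """Map one component's error ratio to a health status."""
    degraded_at, critical_at = THRESHOLDS[component]
    if error_ratio >= critical_at:
        return Health.CRITICAL
    if error_ratio >= degraded_at:
        return Health.DEGRADED
    return Health.HEALTHY


def cluster_health(signals: Dict[str, float]) -> Health:
    """Roll up per-component signals; the worst component decides the status."""
    statuses = (classify(name, value) for name, value in signals.items())
    return max(statuses, default=Health.HEALTHY)
```

Calling `cluster_health({"apiserver": 0.002, "etcd": 0.03, "scheduler": 0.0})` yields `Health.DEGRADED`, because the assumed etcd threshold is crossed while no component reaches its critical level. Taking the worst status is one simple design choice; a real system might weight components differently.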
For deeper debugging, Intuit adopted K8sGPT, an open-source tool that scans Kubernetes clusters to identify and triage issues using knowledge codified from site reliability engineers. K8sGPT uses resource-specific analyzers to pull relevant error messages from a cluster and then enriches them with AI-generated insights. With Prometheus metrics integrated alongside the Golden Signals, K8sGPT can query public models for further information about errors.
This integration offers additional insights to help pinpoint potential root causes of alerts.
Source: GenAI Experiments: Monitoring and Debugging Kubernetes Cluster Health
K8sGPT was also recognized as one of the top 10 CNCF projects by contributions, with its first commit made in March 2023. The project has garnered 5.6K stars and has 88 contributors. When deployed in a Kubernetes cluster, K8sGPT works with AI backends such as OpenAI, Azure, Cohere, Amazon Bedrock, Google Gemini, and various local models. It was showcased alongside other projects, including kube-burner, Kuasar, KRKN, and Easegress, at the KubeCon EU 2024 conference.
K8sGPT runs on Windows, macOS, and Linux and can be installed using brew, RPM, DEB, or APK.
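The resource-specific analyzer pattern described earlier can be sketched in a few lines. This is an illustration only: K8sGPT itself is written in Go, and the `Finding`, `analyze_pods`, and `build_prompt` names here are hypothetical. The input dictionaries follow the shape of the Kubernetes Pod status (`status.containerStatuses[].state.waiting`), and no model is actually called.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Finding:
    kind: str
    name: str
    errors: List[str]


def analyze_pods(pods: List[Dict]) -> List[Finding]:
    """Toy pod analyzer: collect waiting-state reasons from container statuses."""
    findings = []
    for pod in pods:
        errors = []
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting")
            if not waiting:
                continue  # running or terminated containers are ignored here
            reason = waiting.get("reason", "Unknown")
            message = waiting.get("message")
            detail = f"{reason}: {message}" if message else reason
            errors.append(f"container {status['name']} is waiting ({detail})")
        if errors:
            findings.append(
                Finding(kind="Pod", name=pod["metadata"]["name"], errors=errors)
            )
    return findings


def build_prompt(finding: Finding) -> str:
    """Turn one finding into a question that could be sent to a language model."""
    lines = "\n".join(f"- {error}" for error in finding.errors)
    return (
        f"Explain the likely root cause of these errors for "
        f"{finding.kind} {finding.name} and suggest a fix:\n{lines}"
    )
```

Separating the analyzer from the prompt builder mirrors the two stages in the description: extraction of error messages is deterministic, and only the extracted text is handed to a model for explanation.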
After debugging comes remediation. K8sGPT works with public large language models (LLMs) from providers such as OpenAI, Google, and Microsoft to suggest fixes for Kubernetes-related errors. However, these public LLMs lack the specific context of Intuit’s platform configuration.
To fill this gap, Intuit built its own GenAI operating system (GenOS), which incorporates local models enriched with Intuit-specific data via retrieval-augmented generation (RAG).
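As a rough illustration of retrieval-augmented generation in general, and not of how GenOS works internally, the sketch below retrieves the most relevant internal notes for an error and prepends them to the prompt. The function names are hypothetical, and plain keyword overlap stands in for the embedding-based similarity search a production system would more likely use.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents sharing at least one token with the query."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in documents]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in ranked[:k] if score > 0]


def augment(query: str, documents: List[str], k: int = 2) -> str:
    """Build a prompt that places the retrieved context ahead of the question."""
    context = retrieve(query, documents, k)
    context_block = "\n".join(f"- {doc}" for doc in context) or "- (no context found)"
    return f"Platform context:\n{context_block}\n\nQuestion: {query}"
```

The point of the pattern is that the model itself is unchanged; platform-specific knowledge reaches it only through the retrieved context in the prompt.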
Intuit plans to track its progress in reducing mean time to detect (MTTD) and mean time to resolution (MTTR). The company also plans to explore GenAI in other areas, such as traffic management and JVM debugging.