KubeRCA: AI-Driven Kubernetes Root Cause Analysis
When an alarm fires, AI analyzes the cause.
[KubeRCA] Official Wiki
This page is the comprehensive documentation defining the vision, technical direction, and collaboration methods for KubeRCA. It is the official guideline that helps the team and external contributors collaborate smoothly.
1. Project Overview
- Purpose: Use AI agents to analyze incident alarms raised in Kubernetes environments, interpret the state of the applications and nodes in the live cluster as a whole, and deliver analysis results based on a standardized RCA template.
- Background / Introduction:
  - Rather than pursuing fully automated incident remediation, this project uses AI agents to analyze the diverse alerts generated during incident response, so that engineers can identify root causes more quickly and clearly and systematically establish post-incident prevention measures.
  - When an incident occurs, the system automatically collects observability data, including Prometheus alerts, logs, and metrics. Using an LLM and a vector database, it compares the current incident against the Top 3 most similar historical incidents to provide a fast, consistent response guide suited to the current situation.
  - This reduces reliance on individual experience and aims for an operational environment in which junior engineers, too, can make stable, well-informed decisions.
- Core Values:
  - Open Source, Human-in-the-Loop by Design, AI-Forward Architecture
2. Problem Statement
- When incidents occur in Kubernetes MSA environments, engineers must gather context manually, repeatedly switching between kubectl, Grafana, and Slack.
- Historical incident records are scattered across Slack threads, wikis, and Notion, making systematic search impossible.
- Incident response quality depends heavily on the responder's individual experience, so junior engineers spend several times longer analyzing the same incident.
- MTTR (Mean Time To Resolve) grows unnecessarily, and even recurring incident types are investigated from scratch each time.
3. Target Organizations
- Organizations of all sizes running Kubernetes-based MSA (startups to enterprises)
- Teams struggling with incident analysis: Organizations with small SRE/DevOps teams, or with a high ratio of junior engineers and wide variance in incident response quality
- High-availability domains: Fintech, e-commerce, SaaS, healthcare, and other environments where service disruption directly impacts the business
- MSP/SI providers: Organizations managing multiple customer clusters simultaneously
- Organizations with existing observability stacks (Prometheus, Alertmanager): Can be deployed as an add-on with minimal setup
4. The Team
Roles and responsibilities of the team members.
| Name | ID | Role | SNS | Responsibilities |
|---|---|---|---|---|
| 김태지 | @Taeji_Kim | Team Leader | Link | Roadmap & final decision-making |
| ๊นํ์ | @user116 | DevOps | Link | Infrastructure & CI/CD management |
| 황우빈 | @Binoo | BE/FE | Link | Core logic & API implementation |
| ์ต๋ณดํ | @brilly | BE/FE | Link | Core logic & API implementation |
5. Tech Stack
- Backend: Go 1.24 + Gin
- Agent: Python 3.10+ (FastAPI, Strands Agents)
- Frontend: React 18 + TypeScript, Vite, Tailwind CSS
- Database: PostgreSQL + pgvector
- AI/LLM: Gemini, OpenAI, Anthropic (Multi-Provider)
- Infra: Kubernetes, Terraform (AWS), Helm 3, GitHub Actions
- Chaos Engineering: Chaos Mesh (+ Istio Fault Injection)
- Observability: Prometheus + Alertmanager, Loki, Tempo, Grafana, Alloy
- Communication: Discord, Slack
6. Roadmap
- Phase 1 (2025.11 ~ 2025.12): Project Bootstrap & Cloud Infrastructure
- Phase 2 (2025.12 ~ 2026.01): Core Alert Analysis Pipeline
- Phase 3 (2026.01): Incident Management & Chaos Engineering
- Phase 4 (2026.02): Multi-LLM & Security Hardening
- Phase 5 (2026.03 ~): Real-time Sync & UX Enhancement
- Phase 6 (2026.04.04): Global Community Launch
7. How It Works
KubeRCA provides an end-to-end automated pipeline from alert reception to analysis delivery:
- Alert Reception: Alertmanager → Backend (Go/Gin) via Webhook
- Automatic Context Collection: Backend requests analysis from Agent (Python/FastAPI) → Agent collects Pod status, events, and metrics from Kubernetes API and Prometheus
- AI Analysis: Root Cause Analysis via Strands Agents framework with Multi-LLM support (Gemini/OpenAI/Anthropic)
- Result Delivery: Analysis stored in PostgreSQL, Slack thread notifications, real-time SSE updates to React web dashboard
- Knowledge Accumulation: On incident closure, comprehensive analysis + pgvector embeddings → enables similar incident search
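The first step of the pipeline above can be sketched as follows. This is a minimal illustration in Python, not the project's actual Go/Gin handler: it flattens an Alertmanager webhook body into simple records that a later context-collection step could use. The `Alert` dataclass, the `parse_alertmanager_webhook` function, and every value in `SAMPLE` are hypothetical. Only the top-level payload keys (`status`, `alerts`, `labels`, `annotations`, `fingerprint`) follow Alertmanager's webhook format.

```python
import json
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical flattened view of one alert from a webhook payload."""
    name: str
    status: str
    namespace: str
    pod: str
    summary: str
    fingerprint: str


def parse_alertmanager_webhook(body: str) -> list[Alert]:
    """Flatten an Alertmanager webhook JSON body into Alert records."""
    payload = json.loads(body)
    alerts = []
    for item in payload.get("alerts", []):
        labels = item.get("labels", {})
        annotations = item.get("annotations", {})
        alerts.append(
            Alert(
                name=labels.get("alertname", "unknown"),
                # Fall back to the group-level status if the alert has none.
                status=item.get("status", payload.get("status", "unknown")),
                namespace=labels.get("namespace", ""),
                pod=labels.get("pod", ""),
                summary=annotations.get("summary", ""),
                fingerprint=item.get("fingerprint", ""),
            )
        )
    return alerts


# Illustrative payload; all values are made up for this example.
SAMPLE = json.dumps({
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "demo",
                "pod": "checkout-7d9f",
                "severity": "warning",
            },
            "annotations": {"summary": "Pod is restarting repeatedly"},
            "startsAt": "2026-01-01T00:00:00Z",
            "fingerprint": "abc123",
        }
    ],
})

if __name__ == "__main__":
    for alert in parse_alertmanager_webhook(SAMPLE):
        print(alert)
```

The `namespace` and `pod` labels are the natural keys for the follow-up queries against the Kubernetes API and Prometheus, which is why the sketch pulls them out explicitly.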
8. Completion Status & PoC Scope
Completion: 90%
Current Status
| Area | Status | Notes |
|---|---|---|
| Backend (Go + Gin) | Production Ready | 40+ APIs, JWT + OIDC authentication, SSE, Slack integration |
| Agent (Python + FastAPI) | Production Ready | Multi-LLM (Gemini / OpenAI / Anthropic), context collection from K8s, Prometheus, and Tempo |
| Frontend (React + TypeScript) | Feature Complete | Incident/Alert dashboard, AI Chat, dark mode, real-time SSE updates |
| Helm Charts | Deployable | One-line deployment, automatic PostgreSQL + pgvector initialization, RBAC |
| Chaos Testing | Fully Operational | 8 scenarios (Chaos Mesh 4 + Istio Fault Injection 4) |
| Observability | Selectively Deployable | Prometheus, Loki, Tempo, Grafana Alloy |
PoC Scope
A single Helm install deploys the full pipeline. Given a ready Kubernetes cluster, an end-to-end demonstration is possible within 2 to 3 hours.
helm install kube-rca ./charts/kube-rca -n kube-rca --create-namespace
Demonstrable Features:
- Alert webhook reception → automatic AI Root Cause Analysis → results posted to a Slack thread
- Web dashboard (Incident/Alert views, RCA results, real-time SSE updates)
- AI Chat (Q&A grounded in incident context)
- Similar-incident vector search (pgvector)
- OIDC authentication (Google SSO)
- Chaos Engineering demo (inject faults such as OOMKilled and CrashLoopBackOff → automatic analysis)
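To illustrate the similar-incident vector search listed above, here is a small pure-Python sketch that ranks stored incident embeddings by cosine similarity and returns the top three. It is an illustration of the idea only. In a PostgreSQL deployment the ranking would more likely run inside the database, for example with pgvector's cosine-distance operator `<=>`. The function names, incident IDs, and three-dimensional toy vectors are all hypothetical, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: list[float],
    incidents: dict[str, list[float]],
    k: int = 3,
) -> list[tuple[str, float]]:
    """Return the k incident IDs most similar to the query, best first."""
    scored = [
        (incident_id, cosine_similarity(query, embedding))
        for incident_id, embedding in incidents.items()
    ]
    # Sort by descending similarity score.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings, invented for this example.
PAST_INCIDENTS = {
    "INC-1": [1.0, 0.0, 0.0],
    "INC-2": [0.9, 0.1, 0.0],
    "INC-3": [0.0, 1.0, 0.0],
    "INC-4": [0.5, 0.5, 0.0],
}

if __name__ == "__main__":
    for incident_id, score in top_k_similar([1.0, 0.0, 0.0], PAST_INCIDENTS):
        print(incident_id, round(score, 3))
```

Returning the score along with the ID lets a caller drop weak matches rather than always presenting three results.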
Required Environment:
- Kubernetes 1.28+ (minimum 2 CPU / 2 GB RAM, recommended 4 CPU / 4 GB RAM)
- PostgreSQL 14+ (pgvector extension)
- LLM API key (the Gemini free tier works)
- Slack Bot Token (`chat:write`, `channels:manage`)
- Optional: Chaos Mesh, Istio, Ingress + external domain
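The `chat:write` scope is what allows a bot to post analysis results into a Slack thread. As a rough sketch of what such a call could look like, the snippet below builds (but does not send) a request to Slack's `chat.postMessage` Web API method, using `thread_ts` to reply inside an existing thread. The function name, token, channel ID, timestamp, and message text are placeholders, and this is not taken from the KubeRCA codebase.

```python
import json
import urllib.request

SLACK_POST_MESSAGE_URL = "https://slack.com/api/chat.postMessage"


def build_thread_reply_request(
    bot_token: str,
    channel_id: str,
    thread_ts: str,
    text: str,
) -> urllib.request.Request:
    """Build a chat.postMessage request that replies inside a thread.

    The request is only constructed here; pass it to
    urllib.request.urlopen() to actually send it.
    """
    body = {
        "channel": channel_id,
        "thread_ts": thread_ts,  # timestamp of the parent message
        "text": text,
    }
    return urllib.request.Request(
        SLACK_POST_MESSAGE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {bot_token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )


if __name__ == "__main__":
    # All argument values below are placeholders.
    request = build_thread_reply_request(
        bot_token="xoxb-placeholder",
        channel_id="C0123456789",
        thread_ts="1700000000.000100",
        text="RCA summary: container exceeded its memory limit (OOMKilled).",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect in a test without touching the network.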
9. How to Contribute
- Issues: Please use GitHub Issues for bug reports or feature requests.
- PRs: All PRs are merged after review by the Tech Lead.
- Guide: Please refer to the [CONTRIBUTING.md] file.
- Discord (Official): [Invite Link]
  - Official channel for real-time communication and technical support.
10. Resources & Links
- GitHub Repository: [Link]
- Docs: [Architecture / API Specs]
| This is a space where knowledge is not merely consumed, but respected, sovereign, and connected, shared together with cloud industry professionals (Bros). |
