📝 [Wiki] KubeRCA: AI-Driven Kubernetes RCA and Automated Remediation

Taeji_Kim · January 10, 2026, 3:56am

KubeRCA — AI-Driven Kubernetes Root Cause Analysis

When an alarm fires, AI analyzes the cause.

[KubeRCA] Official Wiki

This page is the consolidated documentation defining KubeRCA's vision, technical direction, and collaboration practices. It is the official guideline that helps team members and external contributors collaborate smoothly.

1. Project Overview

Purpose: KubeRCA uses an AI agent to analyze incident alarms raised in Kubernetes environments, interprets the actual state of in-cluster applications and nodes as a whole, and delivers analysis results based on a standardized RCA template.
Background / Introduction:
- Rather than pursuing fully automated incident remediation, the project uses an AI agent to analyze the alerts raised during incident response, so that engineers can identify root causes faster and more clearly and can systematically plan measures to prevent recurrence.
- When an incident occurs, the system automatically collects observability data such as Prometheus alerts, logs, and metrics. Using an LLM and a vector database, it compares the current incident against the Top 3 most similar past incidents to provide a fast, consistent response guide suited to the current situation.
- The aim is to reduce reliance on individual experience and to create an operating environment in which junior engineers, in particular, can make sound decisions.
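The Top 3 comparison against past incidents can be pictured as a nearest-neighbor lookup over incident embeddings. The following is a minimal sketch under stated assumptions, not KubeRCA's actual implementation: the function names and the toy two-dimensional vectors are hypothetical, and in a real deployment the ranking would more likely be done in SQL by pgvector (which offers a cosine distance operator, `<=>`) than in application code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar_incidents(query_embedding, history, k=3):
    """Return the k (incident_id, score) pairs most similar to the query.

    `history` is a hypothetical mapping of incident ID to stored embedding.
    """
    scored = [
        (incident_id, cosine_similarity(query_embedding, embedding))
        for incident_id, embedding in history.items()
    ]
    # Highest similarity first, then keep only the top k.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Called with the embedding of the current incident and a dictionary of past incident embeddings, it returns at most three candidates that could be handed to the LLM as comparison context.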
Core Values:
- Open Source, Human-in-the-Loop by Design, AI-Forward Architecture

2. Problem Statement

When an incident occurs in a Kubernetes MSA environment, engineers must gather context by hand, switching repeatedly between kubectl, Grafana, and Slack.
Past incident records are scattered across Slack threads, wikis, and Notion, which makes systematic search impossible.
Incident response quality depends heavily on the responder's individual experience, so junior engineers can take several times longer to analyze the same incident.
MTTR (Mean Time To Resolve) grows unnecessarily long, and even recurring incident types are investigated from scratch every time.

3. Target Organizations

Organizations of any size running Kubernetes-based MSA (startups to large enterprises)
Teams struggling with incident analysis: organizations with small SRE/DevOps teams, or with a high share of junior engineers and therefore wide variation in incident response quality
High-availability domains: fintech, e-commerce, SaaS, healthcare, and other environments where service disruption directly affects the business
MSP/SI providers: organizations that operate many customer clusters at once and need efficient incident response
Organizations with an existing observability stack (Prometheus, Alertmanager): usable immediately by deploying KubeRCA on top of the existing infrastructure

4. The Team

Roles and responsibilities of the team members.

Name	ID	Role	SNS	Responsibilities
김태지	@Taeji_Kim	Team Leader	Link	Roadmap & final decision-making
김회정	@user116	DevOps	Link	Infrastructure & CI/CD management
황우빈	@Binoo	BE/FE	Link	Core logic & API implementation
최보현	@brilly	BE/FE	Link	Core logic & API implementation

5. Tech Stack

Backend: Go 1.24 + Gin
Agent: Python 3.10+ (FastAPI, Strands Agents)
Frontend: React 18 + TypeScript, Vite, Tailwind CSS
Database: PostgreSQL + pgvector
AI/LLM: Gemini, OpenAI, Anthropic (Multi-Provider)
Infra: Kubernetes, Terraform (AWS), Helm 3, GitHub Actions
Chaos Engineering: Chaos Mesh (+ Istio Fault Injection)
Observability: Prometheus + Alertmanager, Loki, Tempo, Grafana, Alloy
Communication: Discord, Slack

6. Roadmap

Phase 1 (2025.11 ~ 2025.12): Project Bootstrap & Cloud Infrastructure
Phase 2 (2025.12 ~ 2026.01): Core Alert Analysis Pipeline
Phase 3 (2026.01): Incident Management & Chaos Engineering
Phase 4 (2026.02): Multi-LLM & Security Hardening
Phase 5 (2026.03 ~): Real-time Sync & Dashboard UX Enhancement
Phase 6 (2026.04.04): Global Community Launch

7. How It Works

KubeRCA provides an end-to-end automated pipeline from alert reception to delivery of the analysis results:

Alert Reception: Alertmanager → Backend (Go/Gin) via Webhook
Automatic Context Collection: the Backend requests analysis from the Agent (Python/FastAPI) → the Agent automatically collects Pod status, events, and metrics from the Kubernetes API and Prometheus
AI Analysis: Root Cause Analysis through the Strands Agents framework with Multi-LLM support (Gemini/OpenAI/Anthropic)
Result Delivery: the analysis is stored in PostgreSQL, posted as a Slack thread notification, and pushed to the React web dashboard in real time over SSE
Knowledge Accumulation: when an incident is closed, a comprehensive analysis is produced and embedded with pgvector → used for similar-incident search
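To make the first two pipeline steps more concrete, here is a rough sketch of turning an Alertmanager webhook body into per-alert analysis requests. The payload fields used (`alerts`, `status`, `labels`, `annotations`, `fingerprint`) follow Alertmanager's webhook format; everything else is an assumption for illustration, including the `AnalysisRequest` shape, the function name, and the choice to skip resolved alerts. The actual Backend is written in Go, so this only shows the idea.

```python
import json
from dataclasses import dataclass


@dataclass
class AnalysisRequest:
    """Hypothetical request handed from the Backend to the Agent."""
    fingerprint: str
    alertname: str
    namespace: str
    pod: str
    severity: str
    summary: str


def parse_alertmanager_webhook(body: str) -> list:
    """Extract one AnalysisRequest per firing alert in a webhook body."""
    payload = json.loads(body)
    requests = []
    for alert in payload.get("alerts", []):
        # Assumption: only firing alerts trigger a new analysis.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        requests.append(
            AnalysisRequest(
                fingerprint=alert.get("fingerprint", ""),
                alertname=labels.get("alertname", ""),
                namespace=labels.get("namespace", ""),
                pod=labels.get("pod", ""),
                severity=labels.get("severity", ""),
                summary=annotations.get("summary", ""),
            )
        )
    return requests
```

The namespace and pod labels are what an agent would need in order to query the Kubernetes API and Prometheus for the affected workload; alerts without those labels simply produce empty strings here.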

8. Completion Status & PoC Scope

Completion: 90%

Current Status

Area	Status	Notes
Backend (Go + Gin)	Production Ready	40+ APIs, JWT + OIDC authentication, SSE, Slack integration
Agent (Python + FastAPI)	Production Ready	Multi-LLM (Gemini / OpenAI / Anthropic), context collection from K8s, Prometheus, and Tempo
Frontend (React + TypeScript)	Feature Complete	Incident/Alert dashboard, AI Chat, dark mode, real-time SSE updates
Helm Charts	Deployable	One-line deployment, automatic PostgreSQL + pgvector initialization, RBAC
Chaos Testing	Fully Operational	8 scenarios (Chaos Mesh 4 + Istio Fault Injection 4)
Observability	Selectively Deployable	Prometheus, Loki, Tempo, Grafana Alloy

PoC Scope

A single Helm install deploys the full pipeline, so the whole flow can be validated at once. Given a ready Kubernetes cluster, an end-to-end demonstration is possible within 2 to 3 hours.

helm install kube-rca ./charts/kube-rca -n kube-rca --create-namespace

Demonstrable Features:

Alert Webhook reception → automatic AI Root Cause Analysis → results posted to a Slack thread
Web dashboard (Incident/Alert views, RCA results, real-time SSE updates)
AI Chat (question answering grounded in Incident context)
Similar Incident vector search (pgvector)
OIDC authentication (Google SSO)
Chaos Engineering demo (fault injection such as OOMKilled and CrashLoopBackOff → automatic analysis)
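For readers unfamiliar with SSE, the real-time dashboard updates listed above rely on a simple text framing: each message is a set of `field: value` lines terminated by a blank line. The sketch below shows that framing only. The function name, the event name, and the payload keys are hypothetical and are not taken from KubeRCA's API.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events message with a JSON data payload."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(data, ensure_ascii=False)}")
    # A blank line marks the end of the message.
    return "\n".join(lines) + "\n\n"
```

A server would write such strings to a response with the `text/event-stream` content type, and a browser client could subscribe with `EventSource` and listen for the named event.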

Required Environment:

Kubernetes 1.28+ (minimum 2 CPU / 2 GB RAM, recommended 4 CPU / 4 GB RAM)
PostgreSQL 14+ (with the pgvector extension)
LLM API Key (the Gemini free tier works)
Slack Bot Token (chat:write, channels:manage)
Optional: Chaos Mesh, Istio, Ingress + external domain

9. How to Contribute

Issues: Please use GitHub Issues for bug reports or feature requests.
PRs: All Pull Requests are merged after review by the Tech Lead.
Guide: Please refer to the [CONTRIBUTING.md] file.
Discord (Official): [자 드가자 Invite Link]
- Official channel for real-time communication and technical support.

10. Resources & Links

GitHub Repository: [Link]
Docs: [Architecture / API Specs]
