[Wiki] KubeRCA → AI-Driven Kubernetes RCA and Automated Remediation

:bookmark_tabs: [KubeRCA] Official Wiki

This page is the comprehensive document defining the vision, technical direction, and collaboration methods for KubeRCA. It is the official guideline for seamless collaboration between team members and external contributors.

1. Project Overview

  • Purpose: Use an AI agent to analyze incident alarms raised in Kubernetes environments, interpret the actual state of the applications and nodes in the cluster as a whole, and deliver analysis results based on a standardized RCA template.

  • Background / Introduction:

    • Rather than pursuing fully automated incident remediation, this project uses an AI agent to analyze the diverse alerts generated during incident response, so that engineers can identify root causes more quickly and clearly and systematically establish measures to prevent recurrence.

    • When an incident occurs, the system automatically collects observability data, including Prometheus alerts, logs, and metrics. Using an LLM and a vector database, it retrieves the top 3 most similar historical incidents and compares them with the current one to provide a rapid, consistent, and context-aware response guide tailored to the current situation.

    • Through this approach, the project aims to reduce reliance on individual experience and to create an operational environment in which junior engineers can make stable and well-informed decisions.

  • Core Values:

    • Open Source, Human-in-the-Loop by Design, AI-Forward Architecture
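
The "top 3 most similar historical incidents" step described in the Background above can be sketched as a nearest-neighbor lookup over embeddings. This is only a minimal illustration, not the project's implementation: the `Incident` schema, the incident IDs and summaries, and the toy 3-dimensional vectors are all made up, and a real setup would presumably get embeddings from a model and delegate the search to a vector database rather than scan a list in memory.

```python
import math
from dataclasses import dataclass


@dataclass
class Incident:
    """A past incident with a precomputed embedding (hypothetical schema)."""
    incident_id: str
    summary: str
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(
    query: list[float], history: list[Incident], k: int = 3
) -> list[tuple[Incident, float]]:
    """Return the k most similar past incidents, best match first."""
    scored = [(inc, cosine_similarity(query, inc.embedding)) for inc in history]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data, invented for illustration only.
HISTORY = [
    Incident("INC-001", "Pod OOMKilled after memory leak", [0.9, 0.1, 0.0]),
    Incident("INC-002", "Node disk pressure evicting pods", [0.1, 0.9, 0.1]),
    Incident("INC-003", "CrashLoopBackOff from bad config", [0.7, 0.2, 0.3]),
    Incident("INC-004", "Ingress 5xx spike during deploy", [0.0, 0.2, 0.9]),
]

if __name__ == "__main__":
    current_alert_embedding = [0.8, 0.1, 0.1]  # hypothetical embedding of the new alert
    for incident, score in top_k_similar(current_alert_embedding, HISTORY):
        print(f"{incident.incident_id} {score:.3f} {incident.summary}")
```

The three returned incidents, together with the current alert, are what would then be handed to the LLM to draft the response guide; keeping retrieval separate from generation makes it easier for an engineer to see which past cases the guidance was based on.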

2. The Team

Roles and responsibilities of the team members.

| Name | ID | Role | SNS | Responsibilities |
|---|---|---|---|---|
| 김태지 | @Taeji_Kim | Team Leader | Link | Roadmap & final decision-making |
| 김회정 | @user116 | DevOps | Link | Infrastructure & CI/CD management |
| 황우빈 | @Binoo | BE/FE | Link | Core logic & API implementation |
| 최보현 | @brilly | BE/FE | Link | Core logic & API implementation |

3. Tech Stack

  • Infra: Kubernetes, Terraform, AWS
  • Language / Framework: Go, Python, React
  • Database: PostgreSQL
  • Chaos Engineering: k6, Istio, Chaos Mesh
  • Observability: Loki, Grafana, Tempo, Mimir, Kiali
  • Communication: Discord, Slack

4. Roadmap

  • Phase 1: MVP Requirements Definition
  • Phase 2: Core Module Development & Alpha Testing
  • Phase 3: Global Community Launch

5. How to Contribute

  • Issues: Please use GitHub Issues for bug reports and feature requests.
  • PRs: All pull requests are merged after review by the Tech Lead.
  • Guide: Please refer to the [CONTRIBUTING.md] file.
  • Discord (Official): [잠드가자 Invite Link]
    • Official channel for real-time communication and technical support.

6. Resources & Links

| This is a space where knowledge is not merely consumed but respected, sovereign, and connected, shared together with cloud industry professionals (Bros). |
