🚀 Sharing the story of building an Autonomous CloudOps Agent - a new approach for SRE & DevOps engineers!

SRE and DevOps engineers rely on monitoring tools such as Datadog, CloudWatch, Dynatrace, and Prometheus for incident response, yet it is often unclear from the monitoring data which signals are real problems, which are just noise, and what to do first. This post shares the story of building a CloudOps implementation that uses AI and a rules-based approach to automate that triage and speed up remediation.

[Source] https://medium.com/@dheerajganeshn/reimagining-sre-workflows-with-agentic-aiintroduction-ae0867ae8b30

The CloudOps agent is built on Node.js + React. It ingests alerts, groups them into incidents, suggests remediations such as restarting Pods or rolling back a release, and is designed to improve its recommendations over time through operator feedback (RLHF-lite).

Why start this?

  • Alert fatigue: a single root cause still triggers dozens of redundant alerts
  • Manual triage: time is lost deciding which action to take first
  • Slow MTTR: the longer the downtime, the greater the cost and the loss of trust

Proposed solution w/ Autonomous CloudOps Agent

  1. Group alerts into incidents → cut noise by collapsing 20 alerts into 1 card
  2. Suggest automated remediations → Pod restart, release rollback, replica scale-out
  3. Improve recommendations from feedback → an RLHF-lite loop readjusts the confidence score
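The first step, collapsing many alerts into one incident, could be sketched roughly as below. This is only an illustration under stated assumptions: the source does not describe the grouping rule, so the `service` field, the five-minute window, and the `groupAlerts` name are all hypothetical.

```typescript
// Hypothetical alert shape. `service` is an assumed grouping key that is
// not part of the Alert model shown in this post.
interface Alert {
  source: string;
  message: string;
  severity: string;
  service: string;
  ts: number; // epoch milliseconds
}

interface IncidentGroup {
  service: string;
  firstTs: number;
  lastTs: number;
  alerts: Alert[];
}

// Assumed rule: alerts for the same service belong to one incident as long
// as each arrives within `windowMs` of the previous alert in that incident.
function groupAlerts(alerts: Alert[], windowMs = 5 * 60 * 1000): IncidentGroup[] {
  const sorted = [...alerts].sort((a, b) => a.ts - b.ts);
  const latestByService = new Map<string, IncidentGroup>();
  const groups: IncidentGroup[] = [];

  for (const alert of sorted) {
    const current = latestByService.get(alert.service);
    if (current && alert.ts - current.lastTs <= windowMs) {
      // Still inside the window: attach to the open incident.
      current.alerts.push(alert);
      current.lastTs = alert.ts;
    } else {
      // First alert for this service, or the window has expired.
      const group: IncidentGroup = {
        service: alert.service,
        firstTs: alert.ts,
        lastTs: alert.ts,
        alerts: [alert],
      };
      latestByService.set(alert.service, group);
      groups.push(group);
    }
  }
  return groups;
}
```

With this rule, a burst of 20 alerts for one service spaced seconds apart ends up in a single group, which is the "20 alerts into 1 card" behaviour described above. A real implementation would likely need a richer fingerprint than the service name alone.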

Architecture overview

CloudAgent

The main components are as follows.

  • Backend: Node.js + TypeScript + Fastify
  • Persistence: Prisma ORM + SQLite (for development), PostgreSQL later
  • Frontend (planned): React + Vite dashboard
  • Notifications (planned): Slack integration
  • Feedback loop: RLHF-lite based on :+1:/:-1:

Completed so far

:white_check_mark: Backend initialized (Node.js + TypeScript + Prisma)

:white_check_mark: Database schema defined and first migration applied

Database Schema with Prisma

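// An incident groups related alerts and carries the suggested remediations.
model Incident {
  id          String             @id @default(cuid())
  createdAt   DateTime           @default(now())
  status      String             @default("open")
  severity    String             @default("medium")
  title       String
  description String
  alerts      Alert[]
  suggestions ActionSuggestion[]
}

// A raw alert from a monitoring source; it may not be attached to an incident yet.
model Alert {
  id         String    @id @default(cuid())
  source     String
  message    String
  severity   String
  ts         DateTime  @default(now())
  incident   Incident? @relation(fields: [incidentId], references: [id])
  incidentId String?
}

// A proposed remediation. The `incident` relation field is added here because
// Prisma requires both sides of a relation to be declared.
model ActionSuggestion {
  id          String     @id @default(cuid())
  incident    Incident   @relation(fields: [incidentId], references: [id])
  incidentId  String
  actionType  String
  description String
  confidence  Float      @default(0.5)
  feedback    Feedback[]
}

// Operator feedback on a suggestion. The `suggestion` relation field is added
// for the same reason as above.
model Feedback {
  id           String           @id @default(cuid())
  suggestion   ActionSuggestion @relation(fields: [suggestionId], references: [id])
  suggestionId String
  value        Int
  comment      String?
  createdAt    DateTime         @default(now())
}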

:white_check_mark: Fastify API routes (incidents + feedback)

API Routes

  • GET /api/incidents → list incidents with alerts & suggestions
  • POST /api/incidents/ingest → create demo incident from mock alerts
  • POST /api/feedback → record :+1:/:-1: feedback
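As an illustration of what the POST /api/feedback route might check before writing a Feedback row, here is a minimal sketch of payload validation, kept independent of Fastify. The field names follow the Feedback model. Encoding :+1: as `1` and :-1: as `-1`, and the `parseFeedback` helper itself, are assumptions that do not come from the source.

```typescript
interface FeedbackInput {
  suggestionId: string;
  value: 1 | -1; // assumed encoding: 1 for :+1:, -1 for :-1:
  comment?: string;
}

type ParseResult =
  | { ok: true; feedback: FeedbackInput }
  | { ok: false; error: string };

// Hypothetical helper: validates an untrusted request body and returns
// either a typed feedback object or a readable error message.
function parseFeedback(body: unknown): ParseResult {
  if (typeof body !== "object" || body === null) {
    return { ok: false, error: "body must be a JSON object" };
  }
  const fields = body as Record<string, unknown>;
  const suggestionId = fields.suggestionId;
  const value = fields.value;
  const comment = fields.comment;

  if (typeof suggestionId !== "string" || suggestionId.length === 0) {
    return { ok: false, error: "suggestionId is required" };
  }
  if (value !== 1 && value !== -1) {
    return { ok: false, error: "value must be 1 or -1" };
  }
  if (comment !== undefined && typeof comment !== "string") {
    return { ok: false, error: "comment must be a string" };
  }

  const feedback: FeedbackInput = { suggestionId, value: value === 1 ? 1 : -1 };
  if (typeof comment === "string") feedback.comment = comment;
  return { ok: true, feedback };
}
```

A route handler could call `parseFeedback(request.body)` and answer with a 400 status when `ok` is false. Fastify also supports declaring a JSON schema on the route, which may be the more idiomatic choice there.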

Analyzer Service (rules-based)

// Rules-based analyzer: scans alert messages and proposes remediations.
export function analyzeIncidents(alerts: Array<{ message: string }>) {
  // "OOMKilled" in any message means pods were killed for exceeding memory.
  const hasOOM = alerts.some(a => /OOMKilled/.test(a.message));
  // Word boundaries keep the 5xx rule from firing on values such as "1500ms".
  const has5xx = alerts.some(a => /\b5\d{2}\b/.test(a.message));

  const suggestions = [];
  if (hasOOM) suggestions.push({ actionType: 'restart_pods', description: 'Restart pods with memory limits.', confidence: 0.7 });
  if (has5xx) suggestions.push({ actionType: 'rollback', description: 'Rollback to previous stable release.', confidence: 0.65 });

  return {
    // 5xx errors take precedence over OOM when choosing the incident title.
    title: has5xx ? 'Error-rate spike' : hasOOM ? 'Pods OOMKilled' : 'General alert',
    severity: has5xx || hasOOM ? 'high' : 'medium',
    suggestions
  };
}

Still to do

:white_large_square: Analyzer service (basic rules)
:white_large_square: React dashboard
:white_large_square: Slack notifications + CI/CD pipeline
:white_large_square: RLHF-lite feedback loop

Skills gained

  • Backend engineering: Fastify, TypeScript ESM, Prisma ORM
  • Database modeling: designing incidents, alerts, and the feedback loop
  • API-first development: designing testable endpoints
  • DevOps practice: migrations, .env management, building a CI pipeline
  • SRE knowledge: incident modeling, designing response playbooks

Roadmap

  • React dashboard → incident visualization and one-click feedback
  • Slack integration → new-incident notifications and simulated remediation runs
  • Datadog + CloudWatch adapters → real signals in place of mock alerts
  • Feedback reweighting → update confidence scores based on operator feedback
  • Deployment → Docker + Kubernetes manifests for production use
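The feedback reweighting item could take many forms, and the source does not give a formula. As one possible sketch, the function below nudges a suggestion's confidence toward 1 on each :+1: and toward 0 on each :-1:, using an exponential moving average. The `reweightConfidence` name, the learning rate of 0.05, and the clamp to [0.01, 0.99] are all assumptions.

```typescript
// Hypothetical RLHF-lite update: each vote moves confidence a fraction of
// the way toward 1 (positive vote) or 0 (negative vote).
function reweightConfidence(
  current: number,
  votes: number[],
  learningRate = 0.05,
): number {
  let confidence = current;
  for (const vote of votes) {
    const target = vote > 0 ? 1 : 0;
    // The step shrinks as confidence approaches the target, so the score
    // never overshoots the [0, 1] range.
    confidence += learningRate * (target - confidence);
  }
  // Clamp so that no suggestion becomes permanently certain or impossible.
  return Math.min(0.99, Math.max(0.01, confidence));
}
```

Starting from the analyzer's 0.7 for `restart_pods`, a single thumbs-up gives roughly 0.715 and a single thumbs-down roughly 0.665. Keeping the step small means one operator's vote shifts the ranking only slightly, while a consistent trend accumulates.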

:rocket: Future vision: LLMs and Agentic AI

The MVP starts with rules-based analysis, but the long-term goal is more sophisticated incident analysis that combines LLMs with Agentic AI.

How LLMs would be used

  • Natural-language summaries: turn a flood of alerts into a concise, human-readable incident summary
  • Runbook retrieval (RAG): query internal Runbook/Confluence/Jira documents to surface vetted remediations
  • Contextual reasoning: connect logs, error traces, and metrics to explain "probable causes"

How Agentic AI would be used

  • Multi-step decision-making: sequential planning rather than a single suggestion → "rollback → scale Pods → verify error rate → notify Slack"
  • Autonomous execution (with guardrails): safe remediations run automatically, subject to approval
  • Learning loop: RLHF-lite makes each triage decision better than the last
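To make the "multi-step plan with approval guardrails" idea more concrete, here is a small sketch of a plan runner. It is not the author's design: the `Step` shape, the `runPlan` name, and the policy of halting on the first denial or failure are assumptions. Steps are synchronous to keep the example short, whereas real remediation calls would be asynchronous.

```typescript
interface Step {
  name: string;
  requiresApproval: boolean; // guardrail: risky steps wait for an operator
  run: () => boolean; // returns true on success
}

interface StepResult {
  name: string;
  status: "done" | "failed" | "denied" | "skipped";
}

// Runs steps in order. Once a step is denied or fails, every remaining
// step is marked as skipped instead of being executed.
function runPlan(steps: Step[], approve: (step: Step) => boolean): StepResult[] {
  const results: StepResult[] = [];
  let halted = false;

  for (const step of steps) {
    if (halted) {
      results.push({ name: step.name, status: "skipped" });
      continue;
    }
    if (step.requiresApproval && !approve(step)) {
      results.push({ name: step.name, status: "denied" });
      halted = true;
      continue;
    }
    const ok = step.run();
    results.push({ name: step.name, status: ok ? "done" : "failed" });
    if (!ok) halted = true;
  }
  return results;
}
```

For the plan quoted above, rollback and Pod scaling would plausibly be marked `requiresApproval: true`, while the error-rate check and the Slack notification could run unattended. The `approve` callback is where an approval button in Slack or a dashboard would plug in.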

Why this changes SRE & DevOps

  • Reactive → proactive: go beyond responding to alerts and detect risky deployments in advance
  • Human-centered AI: give engineers a clear next step instead of an overwhelming volume of data
  • Scalable knowledge capture: every decision and outcome is retained as knowledge rather than forgotten
  • Faster recovery, less fatigue: shorter MTTR, fewer unnecessary alerts, less on-call stress

In short, today's rules-based triage is expected to evolve into a learning-driven agent copilot that serves as a dependable partner for SRE/DevOps teams.

Value for the business

  • Shorter MTTR → lower downtime cost
  • Less alert fatigue → fewer duplicate and noisy alerts
  • Better decisions → suggestions grounded in rules and past feedback
  • A knowledge loop → response playbooks that evolve automatically

For companies running large-scale distributed systems, this means fewer all-nighters for engineers and a more reliable service for customers.