What is the most appropriate way to integrate OpenTelemetry into Argo Workflows to collect distributed traces?

chris · May 5, 2025, 1:00am

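We currently run our data pipelines and machine learning workflows on Argo Workflows. We want to track the execution flow and latency of the tasks in these workflows end to end, so we are planning to adopt an OpenTelemetry-based distributed tracing system.

We already collect metrics with Prometheus and Grafana, but to find out which step is the bottleneck in a single workflow run, we need trace information beyond what logs provide.

My questions are:

Which components do I need to add to integrate Argo Workflows with OpenTelemetry?
What code changes or annotations are needed to collect span information from each workflow step?
Which backend (e.g., Jaeger, Tempo) is best suited for collecting and visualizing trace data? If you have operated one in production, please share your experience.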

cloudbro_ai · May 5, 2025, 3:07am

[CloudBro AI Bot Response]
Great questions! Integrating OpenTelemetry-based distributed tracing into Argo Workflows is a solid approach to gain deep visibility into workflow execution and latency bottlenecks beyond metrics and logs.

Here’s a detailed breakdown addressing your points:

Components to integrate Argo Workflows with OpenTelemetry:

Instrumentation libraries: Your workflow steps (containers or scripts) need to emit OpenTelemetry spans. This requires adding OpenTelemetry SDKs to your code or wrapping existing commands.
OpenTelemetry Collector: Deploy an OpenTelemetry Collector in your cluster. It receives trace data from instrumented steps and exports it to a backend like Jaeger or Tempo.
Tracing backend: A system like Jaeger, Tempo, or Zipkin to store and visualize traces.
Trace context propagation: Each step normally runs in its own pod, so trace context has to be handed from one step to the next explicitly. A common approach is to inject a W3C traceparent value into each step as an environment variable or workflow parameter.
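
For steps where adding an SDK is impractical, the "wrapping existing commands" idea can be sketched with the standard library alone: time a command, then describe it as one span in OTLP/HTTP JSON form, which a Collector with an OTLP HTTP receiver accepts on `/v1/traces` (port 4318 by default). The function names, the scope name, and the attribute key below are assumptions for illustration; in real use the OpenTelemetry SDK would normally do this work.

```python
import json
import secrets
import subprocess
import time
import urllib.request


def run_as_span(name, command, trace_id=None, parent_span_id=None):
    """Run a command and return (exit_code, span), with the span in OTLP JSON shape."""
    start_ns = time.time_ns()
    exit_code = subprocess.run(command).returncode
    end_ns = time.time_ns()
    span = {
        # OTLP JSON encodes trace and span IDs as hex strings.
        "traceId": trace_id or secrets.token_hex(16),
        "spanId": secrets.token_hex(8),
        "name": name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
        # Attribute key chosen for illustration.
        "attributes": [
            {"key": "process.exit.code", "value": {"intValue": str(exit_code)}}
        ],
        # Status code 1 is OK, 2 is ERROR.
        "status": {"code": 1 if exit_code == 0 else 2},
    }
    if parent_span_id:
        span["parentSpanId"] = parent_span_id
    return exit_code, span


def otlp_payload(span, service_name):
    """Wrap one span in the resourceSpans envelope used by OTLP."""
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}}
            ]},
            "scopeSpans": [{"scope": {"name": "step-wrapper"}, "spans": [span]}],
        }]
    }


def send(payload, endpoint):
    """POST the payload to an OTLP/HTTP traces endpoint (assumed reachable)."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

A step would call `run_as_span("train", ["python", "train.py"])`, then `send(otlp_payload(span, "ml-pipeline"), "http://otel-collector:4318/v1/traces")`, where the command, service name, and Collector host are placeholders.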

Collecting Span information per workflow step:

Code changes: If your steps run custom code (Python, Go, Java, etc.), add OpenTelemetry SDK instrumentation to create spans around logical operations.
Sidecar or init containers: If you cannot modify the code, consider automatic instrumentation agents, which generate spans without code changes. An OpenTelemetry Collector sidecar can receive and forward those spans, but it does not create spans on its own.
Annotations or environment variables: Pass trace context to each step via annotations or environment variables to maintain trace continuity. Argo can attach metadata to pods and substitute workflow parameters into templates, and either mechanism can carry a traceparent value.
Custom templates: Modify your workflow templates to include OpenTelemetry environment variables or commands that start tracing.
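
The environment-variable and custom-template options above can be combined in one template. A minimal sketch, built as a Python dict so it can be dumped to JSON and submitted like any other manifest: the `OTEL_*` names are standard OpenTelemetry SDK environment variables and the `{{...}}` expressions are Argo template variables, while the function name, the `TRACEPARENT` variable name, the image, and the `/app/step.py` entry point are assumptions.

```python
import json


def traced_workflow(image, collector_endpoint, traceparent):
    """Return an Argo Workflow manifest whose step receives tracing env vars."""
    env = [
        # Standard OpenTelemetry SDK environment variables.
        {"name": "OTEL_EXPORTER_OTLP_ENDPOINT", "value": collector_endpoint},
        {"name": "OTEL_SERVICE_NAME", "value": "{{workflow.name}}"},
        {
            "name": "OTEL_RESOURCE_ATTRIBUTES",
            "value": "workflow.name={{workflow.name}},"
                     "k8s.namespace.name={{workflow.namespace}},"
                     "k8s.pod.name={{pod.name}}",
        },
        # Many SDKs do not read this on their own, so the step's code
        # may need to parse it and attach it as the parent context.
        {"name": "TRACEPARENT", "value": "{{workflow.parameters.traceparent}}"},
    ]
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "traced-"},
        "spec": {
            "entrypoint": "main",
            "arguments": {
                "parameters": [{"name": "traceparent", "value": traceparent}]
            },
            "templates": [{
                "name": "main",
                "container": {
                    "image": image,
                    "command": ["python", "/app/step.py"],  # hypothetical entry point
                    "env": env,
                },
            }],
        },
    }
```

Printing `json.dumps(traced_workflow(...), indent=2)` gives a manifest to review before submitting.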

Tracing backend recommendations:

Jaeger: Widely used, mature, and integrates well with Kubernetes and OpenTelemetry Collector. Good UI for trace visualization and analysis.
Tempo (Grafana Tempo): Highly scalable, cost-effective, and integrates seamlessly with Grafana dashboards. It’s optimized for storing large volumes of trace data with minimal operational overhead.
Zipkin: Simpler but less feature-rich than Jaeger or Tempo.

Practical advice from experience:

Start with Jaeger for ease of setup and community support.
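Use the OpenTelemetry Collector to centralize trace ingestion, and export to Jaeger over OTLP: recent Jaeger versions accept OTLP natively, and the Collector's dedicated Jaeger exporter has been deprecated and removed.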
Ensure trace context propagation across workflow steps by injecting traceparent headers as environment variables.
Combine tracing data with your existing Prometheus/Grafana metrics for full observability.
Monitor the overhead of instrumentation and sampling rates to balance data granularity and performance.
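
To make the sampling trade-off concrete, here is a minimal sketch of ratio-based sampling keyed on the trace ID. It follows the same idea as OpenTelemetry's trace-ID ratio sampler but is not its exact implementation, and the function name is an assumption. Because the decision depends only on the trace ID, every step of one workflow run makes the same keep-or-drop choice, so a trace is never kept for some steps and dropped for others.

```python
def keep_trace(trace_id_hex, ratio):
    """Decide deterministically whether to keep a trace, for a ratio in [0, 1]."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the 128-bit trace ID as a uniform random value.
    low_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Keep the trace when that value falls below ratio * 2^64.
    return low_bits < int(ratio * (1 << 64))
```

With `ratio=0.1`, roughly one workflow run in ten is traced in full, assuming trace IDs are randomly generated.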

If you want, I can help you with example Argo Workflow templates or OpenTelemetry Collector configurations tailored to your environment. Just let me know!

james · May 5, 2025, 6:53am

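To attach OpenTelemetry-based distributed tracing to Argo Workflows, the containers in the workflow must be set up to collect trace data and export it.

Applying the OpenTelemetry SDK or Collector

Each step that runs in the workflow can collect spans at the application level through the OpenTelemetry SDK.
If embedding the SDK directly is difficult, attaching an OpenTelemetry Collector as a sidecar is another approach that is used.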

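Implementing automated context propagation

So that each workflow step can pass its SpanContext to the next step, use environment variables, or carry the context through gRPC/HTTP headers.
In a Kubernetes environment, init-container-based automatic injection using otel-instrumentation is also worth considering.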
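
Passing context through an environment variable can be sketched with the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`). The sketch below assumes the variable is named `TRACEPARENT` and that each step reads it at startup and hands a new value to the next step, for example as an Argo output parameter; those choices and the function names are illustrative, not something this thread specifies.

```python
import os
import re
import secrets

# Version "00", 16-byte trace ID, 8-byte parent span ID, 1-byte flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(value):
    """Return (trace_id, parent_span_id, flags), or None if the value is unusable."""
    match = _TRACEPARENT.fullmatch(value or "")
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are invalid in W3C Trace Context.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def start_step(environ=os.environ):
    """Continue the incoming trace if there is one, otherwise start a new trace."""
    parent = parse_traceparent(environ.get("TRACEPARENT"))
    # Without a usable parent, begin a new sampled trace with no parent span.
    trace_id, parent_span_id, flags = parent or (secrets.token_hex(16), None, "01")
    span_id = secrets.token_hex(8)
    # The value handed to the next step names this step's span as the parent.
    outgoing = f"00-{trace_id}-{span_id}-{flags}"
    return {
        "trace_id": trace_id,
        "span_id": span_id,
        "parent_span_id": parent_span_id,
        "traceparent": outgoing,
    }
```

Each step keeps the same `trace_id` and only replaces the span ID, which is what lets the backend stitch the steps into one trace.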

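Configuring the tracing backend

The most commonly used backend is Jaeger; it is open source and a stable CNCF project.
Grafana Tempo integrates naturally with Loki and Prometheus, so it is very useful in Grafana-based monitoring environments.
Collected data can be searched in Grafana through tempoQuery, or you can follow the logs linked by Trace ID.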

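Operational tips

If a full trace does not connect properly, the cause is often a missing TraceID or ParentSpanID, so watch for that.
Setting up automatic tagging per workflow (workflow.name, step.name, namespace, etc.) makes querying and visualization easier.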
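
The missing-ParentSpanID symptom can be checked mechanically. A minimal sketch, assuming spans have already been exported as dicts with `trace_id`, `span_id`, and `parent_span_id` keys (a hypothetical shape, not any backend's actual schema): it reports spans whose parent never arrived and traces that ended up with zero or several roots.

```python
from collections import defaultdict


def find_broken_links(spans):
    """Group spans by trace and report traces with orphans or an unexpected root count."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    report = {}
    for trace_id, members in by_trace.items():
        known_ids = {s["span_id"] for s in members}
        # A root span has no parent; a healthy trace has exactly one.
        roots = [s["span_id"] for s in members if not s.get("parent_span_id")]
        # An orphan names a parent that is absent from the collected spans.
        orphans = [
            s["span_id"] for s in members
            if s.get("parent_span_id") and s["parent_span_id"] not in known_ids
        ]
        if orphans or len(roots) != 1:
            report[trace_id] = {"roots": roots, "orphans": orphans}
    return report
```

Several roots under one workflow run usually mean a step started a fresh span without a parent, while orphans point at a parent span that was never exported or was sampled out.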
