Question: Longhorn volume metrics are not being exposed

The Longhorn Manager /metrics endpoint does not expose any volume metrics.

Environment

  • Longhorn version: 1.1.2 or 1.1.1
  • Kubernetes version: 1.19.9-gke.1900
  • Node configuration:
    • OS type and version: Ubuntu with Docker
    • Disk type: Standard persistent disk, 100 GB
    • Infrastructure: GKE (Google Kubernetes Engine)

Steps to reproduce

  1. Install Longhorn on a GKE cluster:

kubectl apply -f https://raw.githubusercontent.com/longhorn/longhorn/v1.1.1/deploy/longhorn.yaml
  • I also tried v1.1.2 and hit the same problem.
  2. Exec into a longhorn-manager Pod and query the /metrics endpoint:

kubectl -n longhorn-system exec -it longhorn-manager-9d797 -- curl longhorn-manager-9d797:9500/metrics
  • This command returns output in Prometheus format, but it contains no volume-related metrics:
# HELP longhorn_disk_capacity_bytes The storage capacity of this disk
# TYPE longhorn_disk_capacity_bytes gauge
longhorn_disk_capacity_bytes{disk="default-disk-4cd3831f07717134",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1.0388023296e+11
# HELP longhorn_disk_reservation_bytes The reserved storage for other applications and system on this disk
# TYPE longhorn_disk_reservation_bytes gauge
longhorn_disk_reservation_bytes{disk="default-disk-4cd3831f07717134",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 3.1164069888e+10
# HELP longhorn_disk_usage_bytes The used storage of this disk
# TYPE longhorn_disk_usage_bytes gauge
longhorn_disk_usage_bytes{disk="default-disk-4cd3831f07717134",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 5.855387648e+09
# HELP longhorn_instance_manager_cpu_requests_millicpu Requested CPU resources in kubernetes of this Longhorn instance manager
# TYPE longhorn_instance_manager_cpu_requests_millicpu gauge
longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-e-523d6b01",instance_manager_type="engine",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 113
longhorn_instance_manager_cpu_requests_millicpu{instance_manager="instance-manager-r-9d8f7ae9",instance_manager_type="replica",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 113
# HELP longhorn_instance_manager_cpu_usage_millicpu The cpu usage of this longhorn instance manager
# TYPE longhorn_instance_manager_cpu_usage_millicpu gauge
longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-e-523d6b01",instance_manager_type="engine",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 4
longhorn_instance_manager_cpu_usage_millicpu{instance_manager="instance-manager-r-9d8f7ae9",instance_manager_type="replica",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 4
# HELP longhorn_instance_manager_memory_requests_bytes Requested memory in Kubernetes of this longhorn instance manager
# TYPE longhorn_instance_manager_memory_requests_bytes gauge
longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-e-523d6b01",instance_manager_type="engine",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 0
longhorn_instance_manager_memory_requests_bytes{instance_manager="instance-manager-r-9d8f7ae9",instance_manager_type="replica",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 0
# HELP longhorn_instance_manager_memory_usage_bytes The memory usage of this longhorn instance manager
# TYPE longhorn_instance_manager_memory_usage_bytes gauge
longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-e-523d6b01",instance_manager_type="engine",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 7.29088e+06
longhorn_instance_manager_memory_usage_bytes{instance_manager="instance-manager-r-9d8f7ae9",instance_manager_type="replica",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1.480704e+07
# HELP longhorn_manager_cpu_usage_millicpu The cpu usage of this longhorn manager
# TYPE longhorn_manager_cpu_usage_millicpu gauge
longhorn_manager_cpu_usage_millicpu{manager="longhorn-manager-9d797",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 13
# HELP longhorn_manager_memory_usage_bytes The memory usage of this longhorn manager
# TYPE longhorn_manager_memory_usage_bytes gauge
longhorn_manager_memory_usage_bytes{manager="longhorn-manager-9d797",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 2.9876224e+07
# HELP longhorn_node_count_total Total number of nodes
# TYPE longhorn_node_count_total gauge
longhorn_node_count_total 3
# HELP longhorn_node_cpu_capacity_millicpu The maximum allocatable cpu on this node
# TYPE longhorn_node_cpu_capacity_millicpu gauge
longhorn_node_cpu_capacity_millicpu{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 940
# HELP longhorn_node_cpu_usage_millicpu The cpu usage on this node
# TYPE longhorn_node_cpu_usage_millicpu gauge
longhorn_node_cpu_usage_millicpu{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 256
# HELP longhorn_node_memory_capacity_bytes The maximum allocatable memory on this node
# TYPE longhorn_node_memory_capacity_bytes gauge
longhorn_node_memory_capacity_bytes{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 2.950684672e+09
# HELP longhorn_node_memory_usage_bytes The memory usage on this node
# TYPE longhorn_node_memory_usage_bytes gauge
longhorn_node_memory_usage_bytes{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1.22036224e+09
# HELP longhorn_node_status Status of this node
# TYPE longhorn_node_status gauge
longhorn_node_status{condition="allowScheduling",condition_reason="",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1
longhorn_node_status{condition="mountpropagation",condition_reason="",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1
longhorn_node_status{condition="ready",condition_reason="",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1
longhorn_node_status{condition="schedulable",condition_reason="",node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1
# HELP longhorn_node_storage_capacity_bytes The storage capacity of this node
# TYPE longhorn_node_storage_capacity_bytes gauge
longhorn_node_storage_capacity_bytes{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 1.0388023296e+11
# HELP longhorn_node_storage_reservation_bytes The reserved storage for other applications and system on this node
# TYPE longhorn_node_storage_reservation_bytes gauge
longhorn_node_storage_reservation_bytes{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 3.1164069888e+10
# HELP longhorn_node_storage_usage_bytes The used storage of this node
# TYPE longhorn_node_storage_usage_bytes gauge
longhorn_node_storage_usage_bytes{node="gke-longhorn-2-default-pool-277a6687-tjgl"} 5.855387648e+09

I created a sample MySQL Pod and attached a PersistentVolume (PV) to it, and I confirmed that Longhorn provisions and manages that volume. The volume is replicated to all 3 nodes in the cluster. However, the metrics listed in the official documentation below do not appear at the /metrics endpoint:
Longhorn | Documentation

What am I missing?
Any help would be appreciated.

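It is worth checking the following:

  • Whether the longhorn-manager logs contain any errors
  • Whether the volume has actually been created and is in use
  • Whether Prometheus is actually scraping the longhorn-manager metrics
  • Whether older versions (1.1.x) had a known issue that limited volume metrics exposure (if so, consider upgrading)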
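
For the first two checks, something like the following might help. It is only a sketch: the `longhorn-system` namespace comes from the command in the question, while the logrus-style `level=...` log format, the `volumes.longhorn.io` resource, and the `.status` field names are assumptions about this Longhorn version. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only; function names are hypothetical helpers, not part of Longhorn.

# Keep only warning/error/fatal lines from log text read on stdin.
# Assumes logrus-style lines that contain "level=<severity>".
filter_log_errors() {
  grep -E 'level=(warning|error|fatal)' || true
}

# Show recent problem lines from one longhorn-manager Pod.
check_manager_logs() {
  local ns="$1" pod="$2"
  kubectl -n "$ns" logs "$pod" --tail=500 | filter_log_errors
}

# List Longhorn volumes with their state and the node they are attached to.
# The .status field names are assumed and may differ between versions.
list_volumes() {
  local ns="$1"
  kubectl -n "$ns" get volumes.longhorn.io \
    -o custom-columns=NAME:.metadata.name,STATE:.status.state,NODE:.status.currentNodeID
}
```

For example, `check_manager_logs longhorn-system longhorn-manager-9d797` followed by `list_volumes longhorn-system` would show whether that manager logs errors and which node each volume is attached to.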

Could you also share any further curl output or log content?
That would make a more specific diagnosis possible.

Each longhorn-manager Pod only exposes metrics for the volumes running on its own node.
In other words, Prometheus has to scrape every longhorn-manager Pod in order to collect metrics for all volumes.
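
To see this per-node behavior directly, one option is to repeat the curl from the question against every manager Pod and count the volume samples each one returns. This is a rough sketch: the `app=longhorn-manager` label selector and the `longhorn_volume_` metric prefix are assumptions, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only; assumes manager Pods carry the label app=longhorn-manager
# and that volume metric names start with "longhorn_volume_".

# Count sample lines (not HELP/TYPE comments) for volume metrics on stdin.
count_volume_samples() {
  grep -c '^longhorn_volume_' || true
}

# Print "<pod> <count>" for every longhorn-manager Pod in the namespace.
check_all_managers() {
  local ns="${1:-longhorn-system}"
  local pod
  for pod in $(kubectl -n "$ns" get pods -l app=longhorn-manager \
      -o jsonpath='{.items[*].metadata.name}'); do
    # Same request as in the question: curl the Pod's own hostname on port 9500.
    printf '%s %s\n' "$pod" \
      "$(kubectl -n "$ns" exec "$pod" -- curl -s "$pod:9500/metrics" | count_volume_samples)"
  done
}
```

Running `check_all_managers longhorn-system` should print a non-zero count only for the manager on the node where the volume is attached, if the explanation above applies to this cluster.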

If you use prometheus-operator, this is handled for you automatically,
but if you configure Prometheus by hand, you can add an entry like the following under scrape_configs:

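# Scrape every longhorn-manager Pod so that volume metrics from all nodes are collected.
- job_name: 'longhorn'
  kubernetes_sd_configs:
  - role: pod   # discover Pods through the Kubernetes API
  relabel_configs:
  # Keep only targets whose container is longhorn-manager and whose port is 9500.
  - source_labels: [__meta_kubernetes_pod_container_name, __meta_kubernetes_pod_container_port_number]
    action: keep
    regex: 'longhorn-manager;9500'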
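
Once that job is loaded, a quick way to confirm that volume metrics arrive from every node is to ask Prometheus for a per-node series count. This is a sketch under assumptions: the Prometheus address is a placeholder you supply, `longhorn_volume_robustness` is assumed to be one of the exposed volume metrics, and it is assumed to carry a `node` label like the metrics in the output above.

```shell
#!/usr/bin/env bash
# Sketch only; function names and the Prometheus address are hypothetical.

# Build a PromQL expression that counts the series of one metric per node.
per_node_count_query() {
  printf 'count by (node) (%s)' "$1"
}

# Send the query to the Prometheus HTTP API and print the raw JSON response.
query_prometheus() {
  local prom_url="$1" metric="$2"
  curl -sG "${prom_url}/api/v1/query" \
    --data-urlencode "query=$(per_node_count_query "$metric")"
}
```

For example, `query_prometheus http://localhost:9090 longhorn_volume_robustness` (with the address replaced by your own) should return one result per node that currently has a volume attached.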