GPU Compute & Telemetry, Data Layer, NuClide Stack

What it is

The GPU-compute tier is the metrics and scheduling plane beneath every LLM training and inference deployment. NVIDIA’s DCGM-exporter publishes per-GPU Prometheus metrics (utilization, memory, temperature, power), each carrying an operator-set Hostname label that identifies the box. Run:AI (now NVIDIA Run:AI) and NVIDIA Bright Cluster Manager orchestrate fleets of GPUs across clusters. Slurm is the HPC-tier scheduler, and Slurm REST is its HTTP API.

What goes wrong

DCGM-exporter is a Prometheus exporter, and its design assumes the metrics endpoint sits inside a private network. There is no application-level authentication; access control is meant to come from the operator’s network ACL. Operators who expose :9400 to the public internet end up with an unauthenticated endpoint through a deployment-configuration mistake, not a framework default. The leak is rich: GPU model, operator-set hostname, utilization timeline. Together these fingerprint what is being run, because LLM training, CV training, and inference each have a distinct utilization signature. Operators running H100, H200, A100, or RTX PRO 6000 Blackwell-class hardware are exposing six-figure compute fleets at the metrics layer.
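The deployment-side distinction between a private and an exposed listener can be sketched as a small pre-flight check. This is an illustrative sketch, not part of DCGM-exporter: the function name listen_address_risk and the host:port input format are assumptions, and a result of "all-interfaces" only means the network ACL is the sole barrier, not that the port is necessarily reachable from the internet.

```python
import ipaddress


def listen_address_risk(listen):
    """Classify a 'host:port' listen string (hypothetical input format).

    Returns 'all-interfaces', 'public', or 'private'.
    Raises ValueError if the host part is a name rather than an IP literal.
    """
    host, _, _port = listen.rpartition(":")
    host = host.strip("[]")  # tolerate bracketed IPv6 such as [::]:9400
    if host in ("", "0.0.0.0", "::"):
        # Bound on every interface: exposure depends entirely on the ACL.
        return "all-interfaces"
    addr = ipaddress.ip_address(host)
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        return "private"
    return "public"


if __name__ == "__main__":
    for candidate in (":9400", "10.0.0.5:9400", "127.0.0.1:9400", "8.8.8.8:9400"):
        print(candidate, "->", listen_address_risk(candidate))
```

A check like this only covers the bind address; it says nothing about load balancers, port forwards, or cloud security groups that may publish the port anyway.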

How we test

We probe :9400/metrics and parse the Prometheus text for the DCGM_FI_DEV_GPU_UTIL metric and its modelName="..." and Hostname="..." labels. Hostnames are rich in operator attribution: video-gpu007-mojo-mia.vs3.com, for example, discloses a video-AI rental operator with a Miami location. We do not scrape the time series; instantaneous metrics suffice for severity. Run:AI dashboards, Bright Cluster Manager, and Slurm REST get their own fingerprint pathways; for each, we read identity information only and never invoke a job-submission endpoint.
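The parsing step can be sketched in a few lines of Python. This is a minimal sketch, not our production tooling: the function name parse_gpu_identity, the record fields, and the SAMPLE text (including its hostname and model strings) are invented for illustration, and the regexes cover only the simple name{labels} value line shape rather than the full Prometheus exposition format.

```python
import re

# One sample line of the form: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# key="value" pairs inside the braces; values may contain escaped characters.
LABEL_RE = re.compile(r'(?P<key>[A-Za-z_][A-Za-z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')

# Invented sample text; hostname and model strings are placeholders.
SAMPLE = """\
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA H100 80GB HBM3",Hostname="gpu-node-01.example.internal"} 97
DCGM_FI_DEV_GPU_UTIL{gpu="1",modelName="NVIDIA H100 80GB HBM3",Hostname="gpu-node-01.example.internal"} 0
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="NVIDIA H100 80GB HBM3",Hostname="gpu-node-01.example.internal"} 61
"""


def parse_gpu_identity(metrics_text, metric="DCGM_FI_DEV_GPU_UTIL"):
    """Return one record per sample of `metric` found in the exposition text."""
    records = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = {
            m.group("key"): m.group("val")
            for m in LABEL_RE.finditer(match.group("labels"))
        }
        records.append({
            "hostname": labels.get("Hostname"),
            "model": labels.get("modelName"),
            "gpu": labels.get("gpu"),
            "utilization": float(match.group("value")),
        })
    return records


if __name__ == "__main__":
    for record in parse_gpu_identity(SAMPLE):
        print(record)
```

Because only one instantaneous response is parsed, the sketch matches the identity-only posture described above: it extracts model, hostname, and a single utilization reading, and keeps no history.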

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Cross-cloud surveys

1

Survey May 16, 2026

GPU-Compute Population Survey (2026-05-16)

Survey of the GPU-compute orchestration tier: Run:AI (NVIDIA's enterprise GPU scheduler), DCGM-exporter (Prometheus exporter for NVIDIA GPU metrics), NVIDIA Bright Cluster Manager, Slurm REST API. Sma…

Other categories in this layer

Vector Databases

Search Engines

OLAP / Analytics Backends

MLOps Tracking

Agent Memory

Data Labeling

Object Storage

Compute Orchestration

Container Orchestration

Medical / Edge AI

Backup & Snapshots

Fine-tuning Runtimes

Document Parsers

Model Hubs & Registries