Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / DATA LAYER

GPU Compute & Telemetry

Run:AI, DCGM-exporter, NVIDIA Fleet Command. GPU fleet metrics + scheduling

Vector stores, registries, memory, datasets: what the model knows and remembers.

What it is

The GPU-compute tier is the metrics and scheduling plane beneath every LLM training and inference deployment. NVIDIA’s DCGM-exporter publishes Prometheus metrics for each GPU (utilization, memory, temperature, power), along with an operator-set Hostname label that identifies the box. Run:AI (now NVIDIA Run:AI) and NVIDIA Bright Cluster Manager orchestrate GPU fleets across clusters. Slurm, exposed through its REST API, is the HPC-tier scheduler.

What goes wrong

DCGM-exporter is a Prometheus exporter, and its design assumes the metrics endpoint sits inside a private network. There is no application-level authentication; access control is meant to come from the operator’s network ACL. Operators who expose :9400 to the public internet therefore run with no auth because of a deployment-config mistake, not a framework-default mistake. The leak is rich: GPU model, operator-set hostname, utilization timeline. The combination fingerprints what is being run, because LLM training, CV training, and inference each have a distinct utilization signature. Operators running H100, H200, A100, or RTX PRO 6000 Blackwell-class hardware are exposing six-figure compute fleets at the metrics layer.
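Because the only access control is the network boundary, one quick self-check for an operator is whether the address that serves :9400 is publicly routable. The following is a minimal sketch, not part of DCGM-exporter: the function name `is_publicly_routable` is hypothetical, and it only classifies a literal IP address using Python's standard `ipaddress` module. It does not inspect firewalls, NAT, or ACLs, so a "not routable" answer is no proof of safety.

```python
import ipaddress


def is_publicly_routable(address: str) -> bool:
    """Return True if the literal IP address is globally routable.

    Hypothetical helper for illustration only. It says nothing about
    firewall rules or NAT in front of the host.
    """
    return ipaddress.ip_address(address).is_global


# Example addresses (illustrative, not taken from any real fleet).
for addr in ("10.1.2.3", "192.168.1.10", "127.0.0.1", "8.8.8.8"):
    status = "PUBLIC" if is_publicly_routable(addr) else "private/reserved"
    print(f"{addr}:9400 -> {status}")
```

An address flagged as public is the case described above: the exporter's "no auth" design is then reachable by anyone unless an ACL sits in front of it.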

How we test

We probe :9400/metrics and parse the Prometheus text for the DCGM_FI_DEV_GPU_UTIL metric and its modelName="..." and Hostname="..." labels. Operator-set hostnames are often enough to attribute a fleet (video-gpu007-mojo-mia.vs3.com discloses a video-AI rental operator with a Miami location). We do not scrape the time series; instantaneous metrics suffice for severity. Run:AI dashboards, Bright Cluster Manager, and Slurm REST get their own fingerprint pathways; for each, we read identity information only and never invoke a job-submission endpoint.
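The parsing step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the actual test harness: the function name `parse_gpu_identity`, the regular expressions, and the `SAMPLE` text (including its hostname and values) are hypothetical. The network fetch is deliberately left out, so the code operates on metrics text that has already been retrieved.

```python
import re

# One Prometheus text-format sample line: name{labels} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)'
)
# key="value" pairs inside the braces; values may contain escaped characters
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_gpu_identity(metrics_text):
    """Return one record per DCGM_FI_DEV_GPU_UTIL sample in the text."""
    records = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "DCGM_FI_DEV_GPU_UTIL":
            continue  # only the utilization metric is needed here
        labels = dict(LABEL_RE.findall(match.group("labels")))
        records.append({
            "model": labels.get("modelName"),
            "hostname": labels.get("Hostname"),
            "util_percent": float(match.group("value")),
        })
    return records


# Hypothetical sample payload; hostname and values are made up.
SAMPLE = '''# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA H100 80GB HBM3",Hostname="gpu-node-01.example.internal"} 97
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="NVIDIA H100 80GB HBM3",Hostname="gpu-node-01.example.internal"} 61
'''

for record in parse_gpu_identity(SAMPLE):
    print(record)
```

Reading a single instantaneous sample per GPU is enough to recover the model and hostname, which matches the identity-only approach described above.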