GPU-Compute Population Survey (2026-05-16), NuClide Research

NuClide Research · 2026-05-16 (Survey 6 of the day’s 10-category batch) Closes: category 14 (gpu-compute). Run:ai / DCGM-exporter / NVIDIA Fleet / cluster managers

Summary

Survey of the GPU-compute orchestration tier: Run:ai (Nvidia’s enterprise GPU scheduler), DCGM-exporter (Prometheus exporter for NVIDIA GPU metrics), NVIDIA Bright Cluster Manager, Slurm REST API. Smaller surface than image-gen / vector-DB but operator-rich. These are dashboards and exporters that disclose the operator’s full GPU topology.

DCWF KSAT coverage

Auto-derived from DCWF AI work-role rule files (ksat-tag).

672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, S7068, S7070, S7075, T5858, T5904
733 (AI Risk & Ethics Specialist): S7067, T5868, T5882, T5893, T5904
overlap (Common AI KSATs (all 5 roles)): K108, K1158, K22, K6311, K6935, K7003, K942

439 candidates harvested across Run:ai / NVIDIA-SMI / cluster-manager / DCGM dorks
Probed via fast_enum_gpu_compute.py (threads=80, ~3 min)
9 DCGM-exporter unauth: each exposes GPU model + operator hostname via Prometheus /metrics
2 Run:ai consoles confirmed by HTML shell (auth state on data endpoints partial-open)
0 Bright Cluster Manager unauth, 0 Slurmrestd
428 other (dead, unrelated, generic NVIDIA marketing/documentation pages)

Headline finding: DCGM-exporter is Tier-A (by-design auth-free, framework expects operator-set network ACL)* and the 9 confirmed hosts disclose high-value GPU fleet inventory.

What DCGM-exporter leaks unauth

Every confirmed DCGM-exporter host returns Prometheus text on GET /metrics with:

GPU model (modelName="NVIDIA L40S" etc.)
Operator-set hostname (Hostname="video-gpu007-mojo-mia.vs3.com")
GPU utilization, memory utilization, temperature, power draw (per-GPU, time-series)
DCGM_FI_DEV_COUNT, DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_MEM_COPY_UTIL gauges

Per-host findings:

Host	GPU model	Operator hostname	Inferred operator
`122.11.227.212:9400`	NVIDIA H200	`f8bcefe6ae83`	High-end AI training rig — H200 is current data-center flagship (~$40K MSRP); operator hostname is a container ID, suggesting Kubernetes deployment
`195.242.10.201:9400`	NVIDIA H100 80GB HBM3	`computeinstance-e00kd0mz50zwgst389`	$30K+ H100 — operator using a managed compute service (the long uuid-suffixed hostname is template-generated)
`171.226.10.154:9400`	NVIDIA L40S	`8a7881b0fef9`	Same L40S class as the ComfyUI cluster from earlier today
`199.59.95.140:9400`	NVIDIA A16	`video-gpu007-mojo-mia.vs3.com`	vs3.com video-GPU rental cluster, Miami location
`89.187.191.229:9400`	NVIDIA A16	`video-gpu001-dp-prg.vs3.com`	Same vs3.com operator, Prague location — multi-continent GPU fleet
`88.198.23.47:9400`	NVIDIA RTX 4000 SFF Ada Generation	`gpu01.xaas.int`	”GPU-as-a-Service” internal hostname; commercial rental ops
`107.150.106.98:9400`	Tesla T4	`10-11-86-59`	Inference-tier card on a 10.11.86.59 internal IP — operator’s internal range disclosed via hostname
`93.182.29.250:9400`	Tesla T4	`cd-server-gpu-1`	Generic GPU server
`59.47.238.37:9400`	Tesla P4	`123pan-tc11`	123 Pan (Chinese cloud storage) — `tc11` suggests Tencent Cloud Tianjin-11 region

The vs3.com pattern is the standout operator-attribution find. Two GPU hosts on widely separated IPs (199.59.95.140 in Miami, 89.187.191.229 in Prague), both running NVIDIA A16, both with the operator’s hostname template video-gpu<NNN>-<role>-<city>.vs3.com. This is a video-AI / streaming-AI operator with multi-continent infrastructure, fully exposed at the metrics layer.

Methodology placement

DCGM-exporter is a Prometheus exporter. By-design auth-free. Prometheus exporters are designed for scrape-only metrics endpoints assumed to be inside a private network. The framework expects operator-configured network ACLs (firewall rules, kubernetes NetworkPolicy, Tailscale ACL) to gate access. Operators exposing :9400 to the public internet have made a deployment-config mistake, not a framework-default mistake.

Adds DCGM-exporter to the Tier-A* family (auth optional in framework, off-by-design when reachable):

Tier	Definition	Member platforms
A*	Auth-by-network-not-app, off when reachable	LiveKit `/api/connection-details`, Airflow `AUTH_ROLE_PUBLIC=Admin`, Elasticsearch (Docker default), DCGM-exporter

The risk class is intel-disclosure (operator GPU topology) rather than compute-theft (DCGM-exporter doesn’t accept job submissions). For chained risk, an attacker can:

Fingerprint operator’s GPU fleet via DCGM-exporter metrics
Cross-reference with the operator’s likely orchestration layer (Run:ai / SLURM / native K8s). If also exposed, lateral movement
Time-series mining of utilization can fingerprint what’s being trained (utilization patterns differ for LLM training vs CV training vs inference)

Cross-survey colocation

Pair	Overlap
GPU-compute ∩ ComfyUI (548)	Same-day check — TBD diff
GPU-compute ∩ DCGM cluster intersection	N/A

The 9 DCGM hosts are distinct from the 548 ComfyUI hosts. DCGM exposure is the metrics-layer (port 9400) while ComfyUI is the application-layer (port 8188 + randomized high ports). An operator running ComfyUI on the same host as DCGM-exporter would appear on both, but no overlap surfaced in this small set.

Toolchain Provenance

0. shodan search × 7 dorks → 439 unique candidates
1. fast_enum_gpu_compute.py (threads=80) → 9 DCGM + 2 Run:ai + 0 cluster-mgr + 0 slurmrestd
2. operator-hostname extraction from DCGM metrics → vs3.com multi-region pattern + xaas.int + 123pan
3. (queued) visorlog ingest → 9 events

Honest negative space

DCGM-exporter is severely under-surveyed at population scale. Real DCGM-exporter deployments are likely 1000s. Operators running NVIDIA GPU monitoring at scale. The 9 we found are the slice indexed by Shodan; port-first masscan on :9400 across tier-2 cloud (3.55M IPs) would surface the full population.
Run:ai console population not measured. Run:ai is closed-source enterprise software with a small market footprint; only 2 shell-only matches surfaced. The product is mostly deployed inside customer K8s clusters with operator-managed auth.
NVIDIA Fleet Command / Bright Cluster Manager populations were 0 in the Shodan-seeded survey. These are enterprise-tier deployments (closed source, hardware-bundled) and usually deployed behind operator-managed ingress. Port-first masscan on :22425 / :8081 might surface them.
No metrics-history scraping. Each DCGM-exporter host’s /metrics returns instantaneous data; a longitudinal scrape (every-minute pull) would surface utilization patterns and let an attacker fingerprint workloads. Not pursued. Would cross into multi-day data-collection.

Disclosure posture

Per-host outreach for the 9 confirmed DCGM hosts is appropriate (low N, the disclosure is “your :9400 port is publicly reachable; firewall it”).
The vs3.com operator’s multi-region exposure warrants a coordinated note (their Miami + Prague hosts visible).
123 Pan / Chinese-cloud hosts go via the cloud-provider’s abuse channel.