Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / DATA LAYER

Compute Orchestration

RunPod, Ray, Volcano, Kubeflow, SkyPilot

Vector stores, registries, memory, datasets: what the model knows and remembers.

What it is

You can’t fine-tune a 70B model on a laptop. ML compute orchestrators are how teams rent and schedule expensive GPUs. RunPod (managed) lets a researcher spin up an 8×A100 pod from a Jupyter button; Ray (Anyscale) is the Python-native distributed-compute framework; Volcano is a Kubernetes batch scheduler commonly used for GPU workloads; Kubeflow wraps Ray and Volcano into an MLOps workflow on Kubernetes; SkyPilot abstracts cloud GPU provisioning across providers. Each is the layer between “I need 80GB of VRAM” and “the GPU is now running my code.”
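The laptop claim is easy to check with arithmetic. The sketch below assumes the common rule of thumb of 16 bytes per parameter for mixed-precision Adam training (2 for weights, 2 for gradients, 12 for optimizer state) and decimal gigabytes. The function names are ours for illustration and do not come from any of the tools above.

```python
import math

BYTES_WEIGHTS_BF16 = 2     # bf16 weights only
BYTES_FULL_FINETUNE = 16   # assumption: weights + gradients + Adam state


def memory_gb(params_billion, bytes_per_param):
    """Memory in decimal GB: 1e9 params * bytes each / 1e9 bytes per GB."""
    return params_billion * bytes_per_param


def gpus_needed(total_gb, vram_gb_per_gpu=80):
    """Minimum GPU count by capacity alone; ignores activations and overhead."""
    return math.ceil(total_gb / vram_gb_per_gpu)


weights = memory_gb(70, BYTES_WEIGHTS_BF16)
full = memory_gb(70, BYTES_FULL_FINETUNE)
print(weights, gpus_needed(weights))   # 140 2
print(full, gpus_needed(full))         # 1120 14
```

Under these assumptions the weights alone exceed a single 80GB card, and a full fine-tune does not fit on one 8×80GB node without sharding, offload, or a parameter-efficient method.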

What goes wrong

These systems hold very expensive credentials. RunPod API keys map to billable GPU pods; Ray clusters mount the operator’s full SSH agent and kubeconfig; Kubeflow Pipelines runs as a service account with cluster-wide read on most installs. An exposed Ray dashboard doubles as a job-submission endpoint: anyone who can reach it can use ray job submit to run arbitrary Python on the operator’s GPU fleet. An exposed RunPod control plane lets an attacker spin up new pods for arbitrary workloads on the operator’s bill. The cost vector here is real: we have seen disclosures involving five-figure unauthorised GPU rentals.
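To give a sense of how quickly unauthorised rentals reach five figures, here is a rough sketch. The hourly rate is a placeholder we chose for illustration, not a quoted price from RunPod or any other provider, and the function names are hypothetical.

```python
def burn_rate_usd_per_hour(gpus, usd_per_gpu_hour):
    """Spend per hour for a fleet of identical GPUs."""
    return gpus * usd_per_gpu_hour


def hours_to_reach(budget_usd, gpus, usd_per_gpu_hour):
    """Hours of continuous rental until the bill reaches budget_usd."""
    return budget_usd / burn_rate_usd_per_hour(gpus, usd_per_gpu_hour)


RATE = 2.0  # placeholder USD per GPU-hour; substitute the real rate

print(hours_to_reach(10_000, 8, RATE))    # one 8-GPU pod: 625.0
print(hours_to_reach(10_000, 80, RATE))   # ten 8-GPU pods: 62.5
```

At the placeholder rate, a single 8-GPU pod takes about 26 days to reach $10,000, while ten such pods get there in under three days.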

How we test

We probe Ray’s dashboard /api/version, Kubeflow’s /pipeline endpoint, and SkyPilot’s API server for fingerprints. Where reachable, we list jobs (no submit, no cancel) to characterise what the operator runs and how much GPU capacity they have available. Job names typically include the model architecture and training step, which is enough to attribute the operator and to characterise the loss vector for the disclosure.
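A minimal sketch of what such a read-only probe could look like, using only the Python standard library. It is not our production tooling. The Ray and Kubeflow paths are the ones named above; the SkyPilot path, the marker substrings, and the job-name convention are all assumptions for illustration and should be checked against the target versions.

```python
import re
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Fingerprint paths. The SkyPilot path is an assumption.
PROBE_PATHS = {
    "ray": "/api/version",
    "kubeflow": "/pipeline",
    "skypilot": "/api/health",  # assumed, may differ between releases
}

# Substrings treated as a positive match. Illustrative guesses only.
MARKERS = {
    "ray": "ray_version",
    "kubeflow": "Kubeflow Pipelines",
    "skypilot": "api_version",
}

# Hypothetical job-name convention: <arch>-<size>b-...-step-<n>
JOB_NAME = re.compile(r"(?P<arch>[a-z]+-?\d+b)\D*?step-?(?P<step>\d+)", re.I)


def fetch(base_url, path, timeout=5.0):
    """Send one GET and return (status, body). No other HTTP method is used."""
    request = urllib.request.Request(urljoin(base_url, path), method="GET")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read(65536).decode("utf-8", "replace")
            return response.status, body
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except OSError:
        return None, ""  # unreachable, refused, or timed out


def matches(system, status, body):
    """True when a 200 response carries the marker assumed for `system`."""
    return status == 200 and MARKERS[system] in body


def fingerprint(base_url):
    """Return the names of the systems whose probe matched at base_url."""
    found = []
    for system, path in PROBE_PATHS.items():
        status, body = fetch(base_url, path)
        if matches(system, status, body):
            found.append(system)
    return found


def parse_job_name(name):
    """Pull architecture and training step out of a job name, if present."""
    match = JOB_NAME.search(name)
    if match is None:
        return None
    return {"arch": match.group("arch").lower(), "step": int(match.group("step"))}
```

Keeping the network call in fetch and the decision in matches means the matching logic can be exercised offline, and the only HTTP verb anywhere in the sketch is GET, in line with the no-submit, no-cancel rule. For example, parse_job_name("llama-70b-sft-step-12000") yields the architecture llama-70b and step 12000 under the assumed naming convention.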

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.