Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / MODEL LAYER

vLLM

High-throughput batched inference

The runtime that actually executes the model: where the weights run.

What it is

vLLM is the inference engine of choice when you have GPUs and want to serve a model at scale. It implements PagedAttention (a memory-management technique that stores the attention KV cache in fixed-size blocks, letting a single GPU host serve dozens of concurrent requests without running out of memory), along with continuous batching, speculative decoding, and prefix caching. It’s what most commercial fine-tune deployments and university research clusters reach for once “Ollama on a laptop” stops being enough.
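To make the memory-management idea concrete, here is a toy sketch of a paged block table. It is not vLLM's implementation: the class name PagedKVCache, its methods, and the block sizes are all hypothetical, and it tracks only bookkeeping (which blocks belong to which request), not the attention computation itself.

```python
class PagedKVCache:
    """Toy block table: each request owns a list of fixed-size cache blocks.

    Hypothetical illustration only; real engines also handle block sharing,
    preemption, and the GPU tensors themselves.
    """

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free = list(range(num_blocks))   # ids of unallocated blocks
        self.tables: dict[str, list[int]] = {}  # request id -> block ids
        self.lengths: dict[str, int] = {}       # request id -> tokens stored

    def append_token(self, req_id: str) -> bool:
        """Reserve room for one more token; return False if the pool is empty."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:
            # No block yet, or the last block is full: take one from the pool.
            if not self.free:
                return False
            self.tables.setdefault(req_id, []).append(self.free.pop())
        self.lengths[req_id] = n + 1
        return True

    def release(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

Because blocks are claimed only as tokens arrive, a short request never reserves memory for a maximum-length sequence, which is what lets many requests share one GPU.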

What goes wrong

vLLM exposes an OpenAI-compatible API on port 8000 by default. There is an --api-key flag, but most operators don’t set it. An exposed vLLM instance is a free GPU compute pool serving whichever model the operator loaded (often a 70B-parameter fine-tune that costs $5k/month to host on commercial infrastructure), with token throughput high enough to be useful to an attacker running their own quota-heavy workloads. The /v1/models endpoint reveals the model name, which usually implies the architecture and is often enough to identify the operator’s research lab.
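The check for a missing API key can be sketched as a single unauthenticated request to /v1/models. The function names classify and check are hypothetical, and the status handling assumes the server answers 401 or 403 when a key is required and returns an OpenAI-style list object when it is not; behavior may differ across versions and reverse proxies.

```python
import json
import urllib.error
import urllib.request


def classify(status: int, body: str) -> str:
    """Label a /v1/models response as 'open', 'auth-required', or 'unknown'."""
    if status in (401, 403):
        return "auth-required"
    if status == 200:
        try:
            payload = json.loads(body)
        except json.JSONDecodeError:
            return "unknown"
        # Assumption: an OpenAI-compatible model list looks like
        # {"object": "list", "data": [...]}.
        if (
            isinstance(payload, dict)
            and payload.get("object") == "list"
            and isinstance(payload.get("data"), list)
        ):
            return "open"
    return "unknown"


def check(base_url: str, timeout: float = 5.0) -> str:
    """Send one request with no Authorization header and classify the reply."""
    url = base_url.rstrip("/") + "/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status, resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is all we need here.
        return classify(err.code, "")
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return "unreachable"
```

For example, check("http://localhost:8000") against your own test instance should return "auth-required" once the server is started with --api-key.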

How we test

We probe /v1/models for the model inventory and /metrics for the Prometheus exposition. vLLM publishes detailed token statistics there, labeled per model, from which average request size can be derived; this is useful for inferring deployment age and traffic. For research instances we map the model name back to the publishing institution via Hugging Face. Disclosure goes to the lab’s security contact directly, not the cloud abuse desk.
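A minimal sketch of how the two responses might be parsed once fetched. It is not our production tooling: the helper names are hypothetical, and the metric name vllm:request_prompt_tokens and the model_name label are assumptions about the exposition that can vary between vLLM versions. The parser handles only simple Prometheus text lines, not the full format.

```python
import json
import re
from collections import defaultdict

# Matches simple sample lines such as: name{label="v"} 12.0
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def model_ids(models_json: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    payload = json.loads(models_json)
    return [entry["id"] for entry in payload.get("data", [])]


def parse_samples(metrics_text: str):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        try:
            yield name, dict(LABEL_RE.findall(raw_labels or "")), float(value)
        except ValueError:
            continue  # value was not numeric; ignore the line


def mean_prompt_tokens(
    metrics_text: str, metric: str = "vllm:request_prompt_tokens"
) -> dict[str, float]:
    """Average prompt size per model, from a histogram's _sum and _count."""
    sums: dict[str, float] = defaultdict(float)
    counts: dict[str, float] = defaultdict(float)
    for name, labels, value in parse_samples(metrics_text):
        model = labels.get("model_name", "unknown")
        if name == metric + "_sum":
            sums[model] += value
        elif name == metric + "_count":
            counts[model] += value
    return {m: sums[m] / c for m, c in counts.items() if c > 0}
```

The average is the histogram's running sum divided by its running count, so it reflects all traffic since the process started rather than a recent window.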

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.