What it is
vLLM is the inference engine of choice when you actually have GPUs and want to serve a model at scale. It implements PagedAttention (a memory-management technique that stores the attention KV cache in fixed-size blocks, so a single GPU host can serve dozens of concurrent requests without running out of memory), along with continuous batching, speculative decoding, and prefix caching. It’s what most commercial fine-tune deployments and university research clusters reach for once “Ollama on a laptop” stops being enough.
What goes wrong
vLLM exposes an OpenAI-compatible API on port 8000 by default. There is an
--api-key flag, but authentication is off unless it is set, and most operators
don’t set it. An exposed vLLM instance is a free GPU compute pool serving
whichever model the operator loaded (often a 70B-parameter fine-tune that
costs $5k/month to host on commercial infra), with token throughput high
enough to be useful to an attacker running their own quota-heavy workloads.
The /v1/models endpoint returns the served model’s name, which usually gives
away the architecture and is often enough to identify the operator’s
research lab.
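For operators who want to check their own deployment, the exposure described above can be sketched as a single unauthenticated request to /v1/models. This is a minimal sketch, not a definitive check: the function names are hypothetical, the response shape is assumed to follow the OpenAI-style model list ({"object": "list", "data": [{"id": ...}]}), and a 401 or 403 is assumed to mean an API key is being enforced.

```python
import json
import urllib.error
import urllib.request


def classify_models_response(status: int, body: str) -> dict:
    """Interpret a GET /v1/models response from an OpenAI-compatible server.

    "open" is True when the model list was readable without credentials,
    False when the server rejected the request, None when inconclusive.
    """
    if status in (401, 403):  # assumption: these codes mean a key is enforced
        return {"open": False, "models": []}
    if status != 200:
        return {"open": None, "models": []}
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return {"open": None, "models": []}
    # Assumed OpenAI-style list: {"object": "list", "data": [{"id": ...}, ...]}
    data = payload.get("data", []) if isinstance(payload, dict) else []
    models = [m["id"] for m in data if isinstance(m, dict) and "id" in m]
    return {"open": True, "models": models}


def check_own_endpoint(base_url: str, timeout: float = 5.0) -> dict:
    """Send one request with no Authorization header to a server you operate.

    Connection failures (urllib.error.URLError) are left to propagate.
    """
    req = urllib.request.Request(base_url.rstrip("/") + "/v1/models")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return classify_models_response(resp.status, body)
    except urllib.error.HTTPError as err:
        return classify_models_response(err.code, "")
```

Calling check_own_endpoint("http://localhost:8000") against a local instance should report "open": True when no key is configured. Splitting the classification from the network call keeps the logic testable without a running server.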
How we test
We probe /v1/models for the model inventory and /metrics for the Prometheus
exposition. vLLM publishes detailed per-model token statistics there as
cumulative counters and histograms; average request size can be derived from
them, and the running totals are useful for inferring deployment age and
traffic. For research instances we map the model name back to the publishing
institution via Hugging Face. Disclosure goes to the lab’s security contact
directly, not the cloud abuse desk.
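To illustrate how an average request size falls out of the /metrics exposition, the sketch below divides a histogram’s _sum by its _count for each model. It rests on assumptions: the metric name vllm:request_prompt_tokens and the model_name label may differ between vLLM versions, the function name is hypothetical, and the parser only handles simple "name{labels} value" lines without timestamps.

```python
import re
from collections import defaultdict

# Matches 'name{labels} value' or 'name value' in Prometheus text format.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_MODEL = re.compile(r'model_name="([^"]*)"')


def mean_prompt_tokens(exposition: str,
                       metric: str = "vllm:request_prompt_tokens") -> dict:
    """Return {model_name: mean prompt tokens per request}.

    The metric name is an assumption; pass a different one if the
    server's exposition uses another histogram name.
    """
    sums = defaultdict(float)
    counts = defaultdict(float)
    for line in exposition.splitlines():
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = _SAMPLE.match(line)
        if not m:
            continue
        name, labels, value = m.group(1), m.group(2) or "", m.group(3)
        model = _MODEL.search(labels)
        key = model.group(1) if model else ""
        if name == metric + "_sum":
            sums[key] += float(value)
        elif name == metric + "_count":
            counts[key] += float(value)
    # Histogram mean = total observed tokens / number of observations.
    return {k: sums[k] / counts[k] for k in counts if counts[k] > 0}
```

Because these values are cumulative since the server process started, the result is a lifetime average for that process rather than a recent one.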