vLLM, Model Layer, NuClide Stack

What it is

vLLM is the inference engine of choice when you actually have GPUs and want to serve a model at scale. It implements PagedAttention, a memory-management technique for the KV cache that lets a single GPU host serve dozens of concurrent requests without running out of memory, along with continuous batching, speculative decoding, and prefix caching. It’s what most commercial fine-tune deployments and university research clusters reach for once “Ollama on a laptop” stops being enough.

What goes wrong

vLLM exposes an OpenAI-compatible API on port 8000 by default. An --api-key flag exists, but most operators don’t set it. An exposed vLLM instance is a free GPU compute pool serving whichever model the operator loaded (often a 70B-parameter fine-tune that costs $5k/month to host on commercial infra), with token throughput high enough to be useful to an attacker running their own quota-heavy workloads. The /v1/models endpoint reveals the model name, which usually encodes the architecture and is often enough to identify the operator’s research lab.
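As a rough illustration of the exposure described above, the sketch below classifies a single GET /v1/models response that has already been fetched. It assumes the OpenAI-style list shape (an object of type "list" whose data entries are of type "model") and assumes that a server started with --api-key answers unauthenticated requests with 401 or 403. The function name, the posture labels, and the sample model ID are hypothetical, not part of vLLM or of our tooling.

```python
import json


def classify_models_response(status: int, body: str) -> dict:
    """Classify one GET /v1/models response as open, auth-required, or unknown.

    Hypothetical helper: takes an already-fetched status code and body,
    so it performs no network access itself.
    """
    unknown = {"posture": "unknown", "models": []}

    # Assumption: a server launched with --api-key rejects keyless requests.
    if status in (401, 403):
        return {"posture": "auth-required", "models": []}
    if status != 200:
        return unknown

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return unknown  # e.g. an unrelated HTML service on the same port

    if not isinstance(payload, dict) or payload.get("object") != "list":
        return unknown
    data = payload.get("data")
    if not isinstance(data, list):
        return unknown

    # Keep only well-formed OpenAI-style model entries.
    models = [
        entry["id"]
        for entry in data
        if isinstance(entry, dict)
        and entry.get("object") == "model"
        and isinstance(entry.get("id"), str)
    ]
    return {"posture": "open" if models else "unknown", "models": models}


# Illustrative body with a made-up model ID.
sample_body = json.dumps(
    {"object": "list", "data": [{"id": "example-org/example-model", "object": "model"}]}
)
print(classify_models_response(200, sample_body))
```

Operators can run the same check against their own endpoint: an "open" result from a keyless request means the API key is not being enforced.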

How we test

We probe /v1/models for the model inventory and /metrics for the Prometheus exposition (vLLM publishes detailed per-model token statistics here, including average request size, which is useful for inferring deployment age and traffic). For research instances we map the model name back to the publishing institution via Hugging Face. Disclosure goes to the lab’s security contact directly, not the cloud abuse desk.
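The two lookups above can be sketched offline as follows. This is a minimal sketch under stated assumptions, not our production tooling: it treats the part of a model ID before the slash as the Hugging Face namespace, and it assumes the counters vllm:prompt_tokens_total and vllm:request_success_total are present in the /metrics output (metric names can differ between vLLM versions). The helper names and the sample exposition text are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def hf_namespace(model_id: str) -> Optional[str]:
    """Return the Hugging Face org/user part of 'org/name', else None."""
    # Local filesystem paths and bare names carry no namespace.
    if model_id.startswith(("/", ".")) or model_id.count("/") != 1:
        return None
    namespace, name = model_id.split("/")
    return namespace if namespace and name else None


def sum_samples(exposition: str) -> dict:
    """Sum Prometheus text-format samples per metric name, across all labels."""
    totals = defaultdict(float)
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            _, brace, value_part = rest.rpartition("}")
            if not brace:
                continue  # malformed label set
        else:
            name, _, value_part = line.partition(" ")
        fields = value_part.split()
        if not fields:
            continue
        try:
            totals[name.strip()] += float(fields[0])
        except ValueError:
            continue
    return dict(totals)


def average_prompt_tokens(
    exposition: str,
    tokens_metric: str = "vllm:prompt_tokens_total",      # assumed metric name
    requests_metric: str = "vllm:request_success_total",  # assumed metric name
) -> Optional[float]:
    """Estimate mean prompt tokens per finished request, or None if no requests."""
    totals = sum_samples(exposition)
    requests = totals.get(requests_metric, 0.0)
    if requests <= 0:
        return None
    return totals.get(tokens_metric, 0.0) / requests


# Made-up exposition text for illustration only.
SAMPLE = """\
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="example-org/example-model"} 12000.0
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop",model_name="example-org/example-model"} 30.0
vllm:request_success_total{finished_reason="length",model_name="example-org/example-model"} 10.0
"""

print(hf_namespace("nvidia/Llama-3.3-70B-Instruct-FP8"))
print(average_prompt_tokens(SAMPLE))
```

The namespace is only a starting point for attribution; a quantized or re-uploaded checkpoint names its publisher, not necessarily the lab running it, so the mapping still needs manual confirmation before disclosure.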

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Cross-cloud surveys

3

Survey May 29, 2026

Model Serving, management-plane and registry, 2026-05-29

The model-serving category is Shodan-dark. vLLM, Triton, TGI, and TorchServe all serve JSON APIs, and their identifying strings live in JSON bodies, not in the HTML Shodan crawls. The dominant self-ho…


Survey May 28, 2026

Model Serving and Registry Infrastructure Survey

Shodan sweep across 11 model-serving and registry platforms. MLflow is the only platform with a live, indexable population: 10 confirmed unauthenticated instances spanning 6 cloud providers and 6 co…


Survey May 3, 2026

vLLM / OpenAI-Compatible LLM Inference Servers on Public Cloud: Auth Posture Survey

Reused the 22,765 port-8000 hits from the prior ChromaDB sweep and fingerprinted them for OpenAI-compatible LLM inference servers via GET /v1/models body match ({"object":"list","data":[{"object":"mod…


Field cases

3

Case May 22, 2026

ClimateGPT Stack — Unauth vLLM + Opik + Streamlit

Surfaced during the Session 30 (S30) Agenta survey. The /opik/api/v1/projects endpoint returned HTTP 200 unauthenticated, making it a candidate per Insight #16. The candidate was passed to this assessment for dat…


Case May 3, 2026

National Taiwan University: CSIE MVNL Lab, Llama-3.3-70B vLLM (FP8, 2-Engine)

mvnl-nas.csie.ntu.edu.tw (140.112.91.209) in NTU's Computer Science and Information Engineering (CSIE) department exposes vLLM on port 8080 serving nvidia/Llama-3.3-70B-Instruct-FP8, NVIDIA's FP8-quan…


Case May 3, 2026

UC Berkeley: vLLM 4-Node Research Cluster, SecAlign + Nemotron

Four vLLM nodes on UC Berkeley's research computing network (128.32.0.0/16) expose OpenAI-compatible inference APIs without authentication. The most significant node (128.32.112.120) serves facebook/M…
