UC Berkeley: vLLM 5-Node Research Cluster, SecAlign + Nemotron
NuClide Research · 2026-05-03
Summary
Five vLLM nodes on UC Berkeley’s research networks (four in 128.32.0.0/16, one in 169.229.0.0/16) expose OpenAI-compatible inference APIs without authentication. The most significant node (128.32.112.120) serves facebook/Meta-SecAlign-8B, Meta’s security-alignment research model, alongside Llama 3.1 8B; its /metrics endpoint shows 92,769 completed requests and 78.5M prompt tokens processed. A second node (128.32.48.211) leaks the researcher username akshat via its model path and has processed 103K+ requests. All nodes expose Prometheus /metrics. The SecAlign node additionally exposes /pause, /resume, and /scale_elastic_ep: unauthenticated administrative endpoints that can abort in-flight inference requests and clear the KV cache.
Infrastructure
| Node | IP | Version | Models | Notes |
|---|---|---|---|---|
| SecAlign | 128.32.112.120 | vLLM 0.14.0 | Llama-3.1-8B-Instruct + Meta-SecAlign-8B | Admin endpoints exposed, 78.5M prompt tokens |
| Qwen3.5 | 128.32.43.204 | vLLM | qwen3.5-9b | Short context config (2048 max_len) |
| Akshat-Qwen | 128.32.48.211 | vLLM | Qwen2.5-3B-Instruct | Username akshat in path, 103K+ requests |
| Nemotron | 128.32.48.200 | vLLM | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B reasoning model (MoE, 3B active) |
| Millennium | 169.229.48.109 | vLLM 0.1.dev15967 | Qwen2.5-1.5B-Instruct | Berkeley Millennium network, dev build |
All nodes: AS25 (UC Berkeley), 128.32.0.0/16 and 169.229.0.0/16. Port 8000/tcp public.
Node: 128.32.112.120: SecAlign Research Node
Model Inventory
| Model ID | Root Path | max_len | Notes |
|---|---|---|---|
| /scratch/public_models/huggingface/meta-llama/Llama-3.1-8B-Instruct | same | 16384 | Meta Llama 3.1 8B |
| secalign | /storage_slow/models/huggingface/facebook/Meta-SecAlign-8B | - | Meta’s security alignment model |
Prometheus Metrics: Request Volume
| Metric | Value |
|---|---|
| request_success_total[stop] | 92,769 |
| prompt_tokens_total | 78,508,878 (78.5M) |
| generation_tokens_total | 3,710,842 |
| Avg prompt length | ~846 tokens |
| Avg generation length | ~40 tokens |
| time_to_first_token_seconds_sum | 21,294s total |
| Avg TTFT | ~0.23s |
| prefix_cache_queries_total | 78,535,811 |
| Prefix cache hit rate | 89.4% |
The prompt/generation ratio (78.5M input → 3.7M output, roughly 21:1) indicates automated batch evaluation: thousands of security-relevant prompts fed through the model with only short outputs collected. This is most likely the SecAlign evaluation pipeline.
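The derived rows in the table follow directly from the raw counters. A minimal sketch of that arithmetic, using awk for floating-point division; the helper name `ratio` is illustrative, and the values are copied from the table above:

```shell
# Counter values copied from the SecAlign metrics table.
requests=92769
prompt_tokens=78508878
generation_tokens=3710842
ttft_sum_seconds=21294

# ratio NUM DEN FMT: print NUM / DEN using the printf format FMT.
# (Hypothetical helper, not part of vLLM or Prometheus.)
ratio() {
  awk -v n="$1" -v d="$2" -v f="$3" 'BEGIN { printf(f "\n", n / d) }'
}

ratio "$prompt_tokens" "$requests" "%.0f"           # avg prompt tokens per request: 846
ratio "$generation_tokens" "$requests" "%.0f"       # avg generated tokens per request: 40
ratio "$ttft_sum_seconds" "$requests" "%.2f"        # avg time to first token (s): 0.23
ratio "$prompt_tokens" "$generation_tokens" "%.1f"  # input:output token ratio: 21.2
```

These averages treat request_success_total[stop] as the full request count, as the table does; requests that finished for other reasons would lower the per-request figures slightly.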
Admin Endpoints (all unauthenticated)
| Endpoint | Parameters | Impact |
|---|---|---|
| POST /pause | Query params: wait_for_inflight_requests, clear_cache | Aborts in-flight requests, pauses pipeline |
| POST /resume | - | Resumes paused generation |
| POST /scale_elastic_ep | - | Scales the elastic expert-parallel (EP) deployment |
| GET /load | - | Returns {"server_load": 0} |
| GET /version | - | Returns {"version": "0.14.0"} |
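The POST /pause description reads: “Pause generation requests to allow weight updates. When False (default), aborts any in-flight requests immediately.” The “When False” clause appears to describe the wait_for_inflight_requests parameter, so the default behavior is to abort. Setting clear_cache=true additionally discards the KV/prefix cache, throwing away the warm cache behind this node’s 89.4% hit rate.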
Filesystem Paths Disclosed
- /scratch/public_models/huggingface/meta-llama/Llama-3.1-8B-Instruct: shared scratch storage, publicly accessible model directory
- /storage_slow/models/huggingface/facebook/Meta-SecAlign-8B: separate slow-tier storage for the SecAlign model
Node: 128.32.48.211: Akshat Qwen2.5 Node
Model Inventory
| Model ID | Root Path | max_len |
|---|---|---|
| Qwen2.5-3B-Instruct | /data/akshat/models/Qwen2.5-3B-Instruct/ | 20000 |
Username akshat leaked in model path via unauthenticated /v1/models response.
Metrics
| Metric | Value |
|---|---|
| request_success_total[stop] | 95,836 |
| request_success_total[length] | 7,838 |
| Total requests processed | 103,674 |
| prompt_tokens_total | 159,121,718 (159M) |
| generation_tokens_total | 180,242,852 (180M) |
| num_requests_running (at probe time) | 1 (live traffic) |
This node was serving an active request at probe time. The balanced prompt/generation ratio (159M / 180M) indicates interactive or conversational workloads, not batch evaluation.
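The totals and the "balanced" characterization can be checked with a few lines of arithmetic. A minimal sketch using the counter values from the table above; the helper name `pct` is illustrative:

```shell
# Counter values copied from the metrics table for 128.32.48.211.
stop=95836
length=7838
prompt_tokens=159121718
generation_tokens=180242852

# Requests that ended on a stop condition plus those cut off at the token limit.
total=$((stop + length))
echo "total requests: $total"   # total requests: 103674

# pct PART WHOLE: PART as a percentage of WHOLE, one decimal place.
# (Hypothetical helper for this example.)
pct() { awk -v p="$1" -v w="$2" 'BEGIN { printf("%.1f\n", 100 * p / w) }'; }

pct "$length" "$total"   # share of requests truncated by length: 7.6

# Output:input token ratio; a value near 1 is what the text calls balanced.
awk -v g="$generation_tokens" -v p="$prompt_tokens" 'BEGIN { printf("%.2f\n", g / p) }'   # 1.13
```

Compared with the roughly 21:1 input-heavy ratio on the SecAlign node, a ratio near 1:1 is what one would expect from chat-style usage, though the counters alone cannot confirm the workload type.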
Node: 128.32.48.200: NVIDIA Nemotron-3-Nano Node
| Field | Value |
|---|---|
| Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Parameters | 30B total, 3B active (MoE) |
| max_model_len | 32,768 |
| Response type | reasoning field present, chain-of-thought model |
Inference confirmed: chat completions return a reasoning field (extended thinking) before final answer. No system prompt observed.
Node: 169.229.48.109: Millennium Network Qwen2.5 Node
| Field | Value |
|---|---|
| IP | 169.229.48.109 |
| Hostname | brewster.millennium.berkeley.edu |
| Network | UC Berkeley Millennium Computing cluster (169.229.0.0/16) |
| vLLM version | 0.1.dev15967+gf7f52215b, dev build from a git checkout |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| max_model_len | 32,768 |
| Traffic | prefix_cache_queries: 36 total, essentially fresh deployment |
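The Millennium cluster is Berkeley’s high-performance research computing environment. This node runs a vLLM development build rather than a tagged release: the version string 0.1.dev15967+gf7f52215b has the form that git-based version tooling emits when it finds no release tag, so it carries a commit count (15,967) and a commit hash (f7f52215b) but no reliable release number. It is unlikely to be a genuinely pre-0.1 build, since it serves Qwen2.5, a model released well after vLLM’s earliest versions. With a small 1.5B model and almost no usage, this is likely a test or development setup. The hostname brewster likely references Brewster Kahle (of the Internet Archive); Berkeley has historically named Millennium nodes after notable technologists.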
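The version string reported by this node can be split into its parts with plain shell parameter expansion. A minimal sketch, assuming the string follows the PEP 440 layout in which `.devN` is a development-release segment and `+g<hash>` is a local version label (tools such as setuptools_scm fill these with a commit distance and an abbreviated commit hash):

```shell
version='0.1.dev15967+gf7f52215b'

public=${version%%+*}       # everything before '+': 0.1.dev15967
local_label=${version#*+}   # everything after '+':  gf7f52215b

base=${public%%.dev*}       # base version guess:    0.1
distance=${public##*.dev}   # commits counted:       15967
commit=${local_label#g}     # abbreviated git hash:  f7f52215b

echo "base=$base distance=$distance commit=$commit"
```

The base value is only a guess made by the versioning tool when no tag is visible, so the commit hash is the more useful field for identifying what code the node actually runs.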
Node: 128.32.43.204: Qwen3.5-9B Node
| Field | Value |
|---|---|
| Model | qwen3.5-9b (root: Qwen/Qwen3.5-9B) |
| max_model_len | 2,048 (constrained) |
Short context configuration (2048 tokens) suggests controlled experiment settings, limiting context deliberately for research comparison.
Findings
F1: Unauthenticated vLLM Inference on Research Cluster (HIGH)
All five nodes expose OpenAI-compatible inference without authentication. Any internet actor can send inference requests to Berkeley’s GPU allocation using models from active research projects.
curl http://128.32.112.120:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"secalign","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'
F2: Unauthenticated Admin Endpoints on SecAlign Node (HIGH)
POST /pause and POST /resume require no credentials. An adversary can abort all in-flight inference requests and halt the SecAlign evaluation pipeline.
# Would abort in-flight requests (NOT executed):
# POST http://128.32.112.120:8000/pause?wait_for_inflight_requests=false&clear_cache=true
Operational impact: destroys accumulated prefix cache (89.4% hit rate on 78.5M queries) and interrupts any active batch evaluation run.
F3: Prometheus /metrics Exposes Research Telemetry (MEDIUM)
All nodes expose /metrics without authentication. For the SecAlign node, this reveals:
- 92,769 completed inference requests since deployment
- Prompt token distribution showing batch evaluation pattern (avg 846 tokens → avg 40 output)
- Model names and filesystem paths
- TTFT distribution (median ~0.25s), GPU performance benchmarking
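To illustrate how little processing the exposed telemetry needs, the sketch below reduces Prometheus text-format samples to the request and token counts quoted in this report. The sample lines are illustrative, not a verbatim capture: the `vllm:` metric prefix and the `finished_reason` label are assumptions about how the counters are exported, and the values are the SecAlign figures from the metrics table. The helper name `sum_metric` is hypothetical.

```shell
# Illustrative samples in Prometheus text exposition format (assumed names/labels).
sample_metrics='# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="stop"} 92769.0
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total 78508878.0'

# sum_metric NAME: read exposition text on stdin and sum every sample of NAME
# across all label sets. Comment lines (starting with '#') are skipped.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }
    { key = $1; sub(/[{].*/, "", key) }   # strip the {label="..."} part
    key == name { total += $NF }          # last field is the sample value
    END { printf("%.0f\n", total) }
  '
}

printf '%s\n' "$sample_metrics" | sum_metric 'vllm:request_success_total'   # 92769
printf '%s\n' "$sample_metrics" | sum_metric 'vllm:prompt_tokens_total'     # 78508878
```

The parser assumes label values contain no spaces, which holds for the sample above but not for every exporter.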
F4: Filesystem Path and Researcher Identity Disclosure (MEDIUM)
Three distinct storage tiers and a researcher username disclosed via unauthenticated API responses:
- akshat: researcher/student username (128.32.48.211 model path)
- /scratch/public_models/huggingface/: shared cluster scratch space
- /storage_slow/models/huggingface/: long-term model storage tier
- /data/<username>/models/: per-user data directory
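Operators can audit their own /v1/models output for this kind of leak before exposing a server. A minimal sketch that scans a saved response for per-user path components; the JSON below is reconstructed from the model inventory for 128.32.48.211 rather than captured verbatim, and the helper name `user_dirs` and the /data and /home prefixes are assumptions for illustration:

```shell
# Illustrative response body, reconstructed from the model inventory table.
sample='{"object":"list","data":[{"id":"Qwen2.5-3B-Instruct","root":"/data/akshat/models/Qwen2.5-3B-Instruct/","max_model_len":20000}]}'

# user_dirs: read a /v1/models JSON body on stdin and print the directory name
# that follows /data/ or /home/ in any "root" field, one unique name per line.
user_dirs() {
  grep -oE '"root": *"/(data|home)/[^/"]+' | sed -E 's#.*/##' | sort -u
}

printf '%s\n' "$sample" | user_dirs   # prints: akshat
```

A pattern match like this is only a first pass; a JSON-aware tool would be more robust against unusual formatting, and other site-specific prefixes would need to be added to the pattern.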
F5: Active Research Pipeline Accessible (HIGH)
128.32.112.120 is the infrastructure for active SecAlign security alignment research. The 78.5M prompt tokens processed at an 89.4% prefix cache hit rate are consistent with an automated evaluation pipeline that reuses long shared prompt prefixes. Any actor can:
- Inject adversarial prompts into the evaluation run
- Measure model responses to edge-case inputs
- Consume GPU allocation from Berkeley’s research budget
- Infer the research team’s workload shape and timing (though not prompt contents) from query volume patterns in /metrics
vLLM Attack Surface Notes
Unlike Ollama, vLLM has no /api/create system-prompt injection vector. The attack surface here is:
- Compute theft: unauthenticated inference drains GPU allocation
- Pipeline disruption: /pause endpoint (SecAlign node specific)
- Operational intelligence: /metrics reveals research activity patterns
- Path/identity disclosure: model root paths in /v1/models responses
UC Berkeley vLLM Footprint
| Node | IP | Service | Research Context |
|---|---|---|---|
| SecAlign | 128.32.112.120 | vLLM 0.14.0 | Security alignment (Meta-SecAlign-8B), 78.5M tokens |
| Qwen3.5 | 128.32.43.204 | vLLM | Research config (2K context) |
| Akshat | 128.32.48.211 | vLLM | Individual researcher, 103K requests |
| Nemotron | 128.32.48.200 | vLLM | Reasoning model benchmark |
| Millennium | 169.229.48.109 | vLLM 0.1.dev15967 | Test or development setup, minimal traffic |
Combined with the existing Ollama node (lal-99-178.reshall.berkeley.edu, 169.229.99.178) and the Course AI Assistant memory injection finding (roar-art.EECS.Berkeley.EDU, documented in CA-berkeley-course-ai.md), Berkeley now has 7 documented unprotected AI inference nodes across residential, research, and production course infrastructure.
Remediation
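# Bind vLLM to localhost only:
vllm serve <model> --host 127.0.0.1 --port 8000
# Or add API key authentication. Verify coverage for your version: the key
# check may apply only to the /v1/* routes, leaving /metrics and admin routes open.
vllm serve <model> --api-key <secret>
# Reduce statistics logging (this may not remove the /metrics route itself):
vllm serve <model> --disable-log-stats
# For /metrics and the admin API endpoints (SecAlign node):
# keep port 8000 off the public network and place the server behind a
# reverse proxy (e.g. nginx) that enforces authentication on every path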
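After applying any of the changes above, it is worth confirming from outside the host that unauthenticated requests are actually refused. A minimal sketch for operators to run against their own deployment; the helper names and the example host are hypothetical, and only GET paths are probed so the check cannot itself pause a running server:

```shell
# is_protected STATUS: succeed when an unauthenticated request was refused.
# 000 is what curl reports when the connection could not be made at all.
is_protected() {
  case "$1" in
    000|401|403|404) return 0 ;;
    *) return 1 ;;
  esac
}

# check_endpoint BASE_URL PATH: request PATH without credentials and report.
check_endpoint() {
  status=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1$2")
  if is_protected "$status"; then
    echo "ok      $2 ($status)"
  else
    echo "EXPOSED $2 ($status)"
  fi
}

# Example (hypothetical host; replace with your own server):
# for p in /v1/models /metrics /version /load; do
#   check_endpoint "https://inference.example.edu" "$p"
# done
```

Any path reported as EXPOSED still answers without credentials and needs to be covered by the proxy or firewall rule.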
Disclosure
- Discovered: 2026-05-03
- Status: Pending outreach to security@berkeley.edu and the SecAlign research team