Case study May 3, 2026

UC Berkeley: vLLM 4-Node Research Cluster, SecAlign + Nemotron

Sector
Universities
Country
ca

NuClide Research · 2026-05-03


Summary

Four vLLM nodes on UC Berkeley’s research computing network (128.32.0.0/16) expose OpenAI-compatible inference APIs without authentication; a fifth node on the Millennium network (169.229.48.109) runs a development build and has seen almost no traffic. The most significant node (128.32.112.120) serves facebook/Meta-SecAlign-8B, Meta’s security-alignment research model, alongside Llama 3.1 8B, with /metrics showing 92,769 completed requests and 78.5M prompt tokens processed. A second node (128.32.48.211) leaks the researcher username akshat via its model path and has processed 103K+ requests. All nodes expose Prometheus /metrics. The SecAlign node additionally exposes /pause, /resume, and /scale_elastic_ep: unauthenticated administrative endpoints that can abort in-flight inference requests and clear the KV cache.


Infrastructure

| Node | IP | Version | Models | Notes |
|---|---|---|---|---|
| SecAlign | 128.32.112.120 | vLLM 0.14.0 | Llama-3.1-8B-Instruct + Meta-SecAlign-8B | Admin endpoints exposed, 78.5M prompt tokens |
| Qwen3.5 | 128.32.43.204 | vLLM | qwen3.5-9b | Short context config (2048 max_len) |
| Akshat-Qwen | 128.32.48.211 | vLLM | Qwen2.5-3B-Instruct | Username akshat in path, 103K+ requests |
| Nemotron | 128.32.48.200 | vLLM | NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 30B reasoning model (MoE, 3B active) |
| Millennium | 169.229.48.109 | vLLM 0.1.dev15967 | Qwen2.5-1.5B-Instruct | Berkeley Millennium network, dev build |

All nodes: AS25 (UC Berkeley), 128.32.0.0/16 and 169.229.0.0/16. Port 8000/tcp public.


Node: 128.32.112.120: SecAlign Research Node

Model Inventory

| Model ID | Root Path | max_len | Notes |
|---|---|---|---|
| /scratch/public_models/huggingface/meta-llama/Llama-3.1-8B-Instruct | same | 16384 | Meta Llama 3.1 8B |
| secalign | /storage_slow/models/huggingface/facebook/Meta-SecAlign-8B | - | Meta’s security alignment model |

Prometheus Metrics: Request Volume

| Metric | Value |
|---|---|
| request_success_total[stop] | 92,769 |
| prompt_tokens_total | 78,508,878 (78.5M) |
| generation_tokens_total | 3,710,842 |
| Avg prompt length | ~846 tokens |
| Avg generation length | ~40 tokens |
| time_to_first_token_seconds_sum | 21,294s total |
| Avg TTFT | ~0.23s |
| prefix_cache_queries_total | 78,535,811 |
| Prefix cache hit rate | 89.4% |

The prompt/generation ratio (78.5M input → 3.7M output, roughly 21:1) suggests automated batch evaluation: thousands of long, security-relevant prompts fed through the model with short outputs collected. This is consistent with a SecAlign evaluation pipeline.
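The derived rows in the table above (average prompt length, average generation length, average TTFT) and the input:output ratio follow directly from the raw counters. A minimal sketch of that arithmetic is below; the helper name `fmt_div` is ours, and the calculation assumes every counted token belongs to one of the 92,769 completed requests.

```shell
#!/usr/bin/env bash
# Counter values copied from the /metrics table above (128.32.112.120).
requests=92769
prompt_tokens=78508878
generation_tokens=3710842
ttft_sum_seconds=21294

# fmt_div NUM DEN FORMAT: print NUM/DEN using the given printf format.
# (Hypothetical helper name, used only for this illustration.)
fmt_div() { awk -v n="$1" -v d="$2" -v f="$3" 'BEGIN { printf(f "\n", n / d) }'; }

echo "avg prompt tokens/request:     $(fmt_div "$prompt_tokens" "$requests" '%.0f')"
echo "avg generation tokens/request: $(fmt_div "$generation_tokens" "$requests" '%.0f')"
echo "avg TTFT (seconds):            $(fmt_div "$ttft_sum_seconds" "$requests" '%.2f')"
echo "prompt:generation ratio:       $(fmt_div "$prompt_tokens" "$generation_tokens" '%.1f'):1"
```

The results round to 846 tokens, 40 tokens, 0.23s, and 21.2:1, matching the table and the ratio discussed above.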

Admin Endpoints (all unauthenticated)

| Endpoint | Parameters | Impact |
|---|---|---|
| POST /pause | Query params: wait_for_inflight_requests, clear_cache | Aborts in-flight requests, pauses pipeline |
| POST /resume | - | Resume paused generation |
| POST /scale_elastic_ep | - | Scale the elastic endpoint |
| GET /load | - | Returns {"server_load": 0} |
| GET /version | - | Returns {"version": "0.14.0"} |

The POST /pause endpoint description reads: “Pause generation requests to allow weight updates.” For the wait_for_inflight_requests parameter it adds: “When False (default), aborts any in-flight requests immediately.” Setting clear_cache=true additionally clears the KV/prefix cache, discarding the cached prefixes behind this node’s 89.4% hit rate.

Filesystem Paths Disclosed

  • /scratch/public_models/huggingface/meta-llama/Llama-3.1-8B-Instruct, shared scratch storage, publicly accessible model directory
  • /storage_slow/models/huggingface/facebook/Meta-SecAlign-8B, separate slow-tier storage for the SecAlign model

Node: 128.32.48.211: Akshat Qwen2.5 Node

Model Inventory

| Model ID | Root Path | max_len |
|---|---|---|
| Qwen2.5-3B-Instruct | /data/akshat/models/Qwen2.5-3B-Instruct/ | 20000 |

Username akshat leaked in model path via unauthenticated /v1/models response.

Metrics

| Metric | Value |
|---|---|
| request_success_total[stop] | 95,836 |
| request_success_total[length] | 7,838 |
| Total requests processed | 103,674 |
| prompt_tokens_total | 159,121,718 (159M) |
| generation_tokens_total | 180,242,852 (180M) |
| num_requests_running (at probe time) | 1 (live traffic) |

This node was serving an active request at probe time. The balanced prompt/generation ratio (159M / 180M) indicates interactive or conversational workloads, not batch evaluation.
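The totals in the metrics table can be cross-checked the same way. The sketch below sums the two finish-reason counters and derives two figures that are not in the original table but follow from it: the share of requests that ended by hitting the length limit, and the prompt:generation token ratio. The helper name `fmt_div` is ours.

```shell
#!/usr/bin/env bash
# Counter values copied from the metrics table above (128.32.48.211).
stop=95836
length=7838
prompt_tokens=159121718
generation_tokens=180242852

# Total completed requests across both finish reasons.
total=$((stop + length))

# fmt_div NUM DEN FORMAT: print NUM/DEN using the given printf format.
# (Hypothetical helper name, used only for this illustration.)
fmt_div() { awk -v n="$1" -v d="$2" -v f="$3" 'BEGIN { printf(f "\n", n / d) }'; }

echo "total requests:          $total"
echo "length-truncated share:  $(fmt_div $((length * 100)) "$total" '%.1f')%"
echo "prompt:generation ratio: $(fmt_div "$prompt_tokens" "$generation_tokens" '%.2f')"
```

A ratio near 1 (about 0.88 here), compared with roughly 21:1 on the SecAlign node, is what separates the two workload profiles.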


Node: 128.32.48.200: NVIDIA Nemotron-3-Nano Node

| Field | Value |
|---|---|
| Model | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 |
| Parameters | 30B total, 3B active (MoE) |
| max_model_len | 32,768 |
| Response type | reasoning field present, chain-of-thought model |

Inference confirmed: chat completions return a reasoning field (extended thinking) before final answer. No system prompt observed.


Node: 169.229.48.109: Millennium Network Qwen2.5 Node

| Field | Value |
|---|---|
| IP | 169.229.48.109 |
| Hostname | brewster.millennium.berkeley.edu |
| Network | UC Berkeley Millennium Computing cluster (169.229.0.0/16) |
| vLLM version | 0.1.dev15967+gf7f52215b, untagged development build from git |
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| max_model_len | 32,768 |
| Traffic | prefix_cache_queries: 36 total, essentially fresh deployment |

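The Millennium cluster is Berkeley’s high-performance research computing environment. This node runs a vLLM development build installed from git rather than a tagged release: the version string 0.1.dev15967+gf7f52215b matches the pattern setuptools-scm emits when no release tag is reachable (15,967 commits counted, commit f7f52215b), so the 0.1 prefix does not by itself indicate an old codebase. The node serves a small 1.5B model with almost no usage, which points to a test or development setup. The hostname brewster may reference Brewster Kahle (of the Internet Archive); Millennium nodes appear to be named after notable technologists.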
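To make the version string easier to read, the sketch below splits it into its parts. It assumes the string follows the PEP 440 layout `<base>.dev<distance>+g<commit>` that setuptools-scm produces for untagged source builds; that interpretation is our assumption, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Split a PEP 440 development version of the assumed form
#   <base>.dev<distance>+g<commit>
# into its components using parameter expansion only.
parse_dev_version() {
  local v="$1"
  local public="${v%%+*}"            # part before '+': 0.1.dev15967
  local local_label="${v#*+}"        # part after '+':  gf7f52215b
  local base="${public%%.dev*}"      # release segment: 0.1
  local distance="${public##*.dev}"  # commit count:    15967
  local commit="${local_label#g}"    # strip the 'g' (git) prefix
  echo "base=$base distance=$distance commit=$commit"
}

parse_dev_version "0.1.dev15967+gf7f52215b"
# prints: base=0.1 distance=15967 commit=f7f52215b
```

The commit hash is the most reliable way to date the build; the base segment is a placeholder when no release tag was visible at build time.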


Node: 128.32.43.204: Qwen3.5-9B Node

| Field | Value |
|---|---|
| Model | qwen3.5-9b (root: Qwen/Qwen3.5-9B) |
| max_model_len | 2,048 (constrained) |
Short context configuration (2048 tokens) suggests controlled experiment settings, limiting context deliberately for research comparison.


Findings

F1: Unauthenticated vLLM Inference on Research Cluster (HIGH)

All four nodes expose OpenAI-compatible inference without authentication. Any internet actor can send inference requests to Berkeley’s GPU allocation using models from active research projects.

curl http://128.32.112.120:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"secalign","messages":[{"role":"user","content":"Hello"}],"max_tokens":100}'

F2: Unauthenticated Admin Endpoints on SecAlign Node (HIGH)

POST /pause and POST /resume require no credentials. An adversary can abort all in-flight inference requests and halt the SecAlign evaluation pipeline.

# Would abort in-flight requests (NOT executed):
# POST http://128.32.112.120:8000/pause?wait_for_inflight_requests=false&clear_cache=true

Operational impact: with clear_cache=true, the call discards the accumulated prefix cache (89.4% hit rate across 78.5M prefix-cache queries) and interrupts any batch evaluation run in progress.

F3: Prometheus /metrics Exposes Research Telemetry (MEDIUM)

All nodes expose /metrics without authentication. For the SecAlign node, this reveals:

  • 92,769 completed inference requests since deployment
  • Prompt token distribution showing batch evaluation pattern (avg 846 tokens → avg 40 output)
  • Model names and filesystem paths
  • TTFT distribution (median ~0.25s, mean ~0.23s), which doubles as a GPU performance benchmark

F4: Filesystem Path and Researcher Identity Disclosure (MEDIUM)

Three distinct storage tiers and a researcher username are disclosed via unauthenticated API responses:

  • akshat, researcher/student username (128.32.48.211 model path)
  • /scratch/public_models/huggingface/, shared cluster scratch space
  • /storage_slow/models/huggingface/, long-term model storage tier
  • /data/<username>/models/, per-user data directory

F5: Active Research Pipeline Accessible (HIGH)

128.32.112.120 hosts active SecAlign security alignment research. With 78.5M prompt tokens processed at an 89.4% prefix cache hit rate, it appears to be an automated evaluation pipeline that is still in use. Any actor can:

  • Inject adversarial prompts into the evaluation run
  • Measure model responses to edge-case inputs
  • Consume GPU allocation from Berkeley’s research budget
  • Infer the shape of the research team’s workload (request volume and token-length patterns in /metrics)

vLLM Attack Surface Notes

Unlike Ollama, vLLM has no /api/create system-prompt injection vector. The attack surface here is:

  • Compute theft, unauthenticated inference drains GPU allocation
  • Pipeline disruption, /pause endpoint (SecAlign node specific)
  • Operational intelligence, /metrics reveals research activity patterns
  • Path/identity disclosure, model root paths in /v1/models responses

UC Berkeley vLLM Footprint

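| Node | IP | Service | Research Context |
|---|---|---|---|
| SecAlign | 128.32.112.120 | vLLM 0.14.0 | Security alignment (Meta-SecAlign-8B), 78.5M tokens |
| Qwen3.5 | 128.32.43.204 | vLLM | Research config (2K context) |
| Akshat | 128.32.48.211 | vLLM | Individual researcher, 103K requests |
| Nemotron | 128.32.48.200 | vLLM | Reasoning model benchmark |
| Millennium | 169.229.48.109 | vLLM 0.1.dev15967 | Test or development deployment, minimal traffic |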

Combined with the existing Ollama node (lal-99-178.reshall.berkeley.edu, 169.229.99.178) and the Course AI Assistant memory injection finding (roar-art.EECS.Berkeley.EDU, documented in CA-berkeley-course-ai.md), Berkeley now has 7 documented unprotected AI inference nodes across residential, research, and production course infrastructure.


Remediation

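# Bind vLLM to localhost only:
vllm serve <model> --host 127.0.0.1 --port 8000

# Or require an API key. vLLM's security documentation describes the key as
# covering the /v1 API routes, so confirm on your version that /metrics, /pause,
# /resume, and /scale_elastic_ep are also rejected before relying on it alone:
vllm serve <model> --api-key <secret>

# /metrics: --disable-log-stats turns off statistics logging; it is not documented
# as removing the /metrics route. Block /metrics at a firewall or reverse proxy:
vllm serve <model> --disable-log-stats

# For the admin API endpoints (SecAlign node):
# Keep vLLM bound to 127.0.0.1 and place it behind nginx (or another reverse proxy)
# with authentication. --uvicorn-log-level only changes log verbosity and does not
# restrict access.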
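
After applying the remediation, an operator can confirm from outside the host that the read-only endpoints named in this report no longer answer anonymously. The following is a minimal sketch for use against one's own deployment: the function names, the `BASE_URL` variable, and the status-code classification are our assumptions, and it deliberately sends only GET requests so that it never triggers /pause or /resume.

```shell
#!/usr/bin/env bash
# Map an HTTP status code to a coarse verdict (hypothetical classification):
#   401/403 -> auth is enforced, 404 -> route not served,
#   000     -> no HTTP response (curl could not connect), other -> still open.
classify_status() {
  case "$1" in
    401|403) echo "protected" ;;
    404)     echo "absent" ;;
    000)     echo "unreachable" ;;
    *)       echo "exposed" ;;
  esac
}

# check_endpoint BASE_URL PATH: unauthenticated GET, print PATH and verdict.
check_endpoint() {
  local code
  code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1$2")
  echo "$2 $(classify_status "$code")"
}

# Runs only when BASE_URL is set, e.g. BASE_URL=http://<your-host>:8000
if [ -n "${BASE_URL:-}" ]; then
  for path in /v1/models /metrics /version /load; do
    check_endpoint "$BASE_URL" "$path"
  done
fi
```

A node bound to 127.0.0.1 should report `unreachable` for every path when checked from another machine; a node behind an authenticating proxy should report `protected`.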

Disclosure