vLLM / OpenAI-Compatible LLM Inference Servers on Public Cloud: Auth Posture Survey
NuClide Research · 2026-05-03
Summary
Reused the 22,765 port-8000 hits from the prior ChromaDB sweep and fingerprinted them for OpenAI-compatible LLM inference servers via GET /v1/models body match ({"object":"list","data":[{"object":"model",...}]}). 44 confirmed instances, all unauthenticated. Of these, 19 are confirmed vLLM (via /version returning a vLLM version string); the remaining 25 are generic OpenAI-compatible servers, a mix of vLLM-with-/version-disabled, llama.cpp-server, text-generation-inference, LM Studio, FastChat, and most concerningly commercial-API reseller proxies (operators standing up unauth gateways in front of paid OpenAI / Anthropic / xAI / Zhipu accounts).
DCWF KSAT coverage
Auto-derived from DCWF AI work-role rule files (ksat-tag).
- 672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, S7068, S7070, S7075, T5904
- 733 (AI Risk & Ethics Specialist): K7051, S7067, T5868, T5893
- overlap (Common AI KSATs (all 5 roles)): K1158, K1159, K22, K6311, K6935, K7003, K942
Each unauth instance is a free LLM for anyone who finds it. For the reseller-proxy class, “free LLM” means direct quota theft from the operator’s paid commercial accounts: an attacker submits prompts, the proxy forwards to OpenAI/Anthropic/etc., the operator’s billing meter spins.
Methodology
Reused IPs from prior ChromaDB port-8000 masscan: 22,765 hosts
vllm-probe.py (200-thread fingerprint)
GET /v1/models → match {"object":"list","data":[{"object":"model",...}]}
GET /version → if returns {"version":"x.y.z"}, classify as confirmed vLLM
GET /metrics → if contains "vllm:"-prefixed metrics, double-confirm vLLM
→ 44 confirmed (19 vLLM + 25 generic OpenAI-compat)
NuClide did not submit any prompt to /v1/chat/completions or /v1/completions. Inference would have used the operator’s compute (and for the reseller-proxy class, would have spent the operator’s commercial-API credits). The model-list endpoint and version probe alone are sufficient to prove exposure.
Findings Summary
| Metric | Value |
|---|---|
| Cloud /16 ranges scanned | 28 (DO/Hetzner/Vultr) |
| Masscan hits on :8000 | 22,765 |
| OpenAI-compatible servers confirmed | 44 |
| Unauthenticated | 44 (100%) |
vLLM (confirmed via /version) | 19 |
| Generic OpenAI-compatible | 25 |
Hosting
| Provider | Confirmed |
|---|---|
| Hetzner | 17 |
| DigitalOcean | 14 |
| Vultr | 13 |
Threat Classes
The 44 instances split across four distinct threat classes:
Class A: Commercial-API reseller proxies (CRITICAL: direct billing theft)
These operators run an OpenAI-compatible gateway in front of paid accounts at OpenAI / Anthropic / xAI / Zhipu / etc. The gateway has no authentication on /v1/chat/completions, submitting a prompt routes through to the upstream commercial API, charging the operator’s account. An attacker can run unlimited inference using the operator’s commercial budget.
The proxy software (/openapi.json + /docs + /admin HTML) is fingerprintable:
| Host | Proxy product | Models exposed |
|---|---|---|
178.62.227.102 | AgentBar LLM Gateway v0.1.0 | 126 models, all OpenAI lineup (gpt-3.5/4/4o/4.1/5/5-mini/5-nano + audio/realtime variants), embeddings (text-embedding-ada-002, text-embedding-3-small/large), STT (whisper-1), TTS (tts-1, tts-1-hd), images (dall-e-2, dall-e-3), moderation (omni-moderation) |
157.90.170.99 | (custom router, route/* namespace) | 43 models: Kimi-K2.5/2.6, GLM-5/4.7 + variants, DeepSeek-v3.2/v4, MiniMax-m2.5/2.7, Qwen3.5/3.6, Gemma-4, ElevenLabs (eleven-v3, multilingual-v2), Whisper-large-v3, Hunyuan-Image-3, FLUX-1-schnell, SDXL |
206.189.152.172 | (uvicorn-served, /admin 307) | 31 models: chatgpt-4o-latest, gpt-4.1, gpt-5/5.1/5.2/5.4/5.5, claude-sonnet-4-6, claude-opus-4-6, claude-sonnet-4-5, gemini-2.5/3.1-pro, grok-3-mini, deepseek, kimi |
138.197.121.229 | Kiro-Go (Chinese-origin Anthropic proxy, zh admin UI) | 21 models: claude-sonnet-4.5/4 + thinking, claude-haiku-4.5 + thinking, deepseek-3.2 + thinking, minimax-m2.5/2.1 + thinking, glm-5 + thinking, qwen3-coder-next + thinking |
138.68.228.210 | Grok2API v2.0.0 (Chinese-origin xAI proxy, zh-CN admin UI) | 6 Grok models: grok-3, grok-3-mini, grok-4.1-thinking, grok-4.2-fast, grok-4.2, grok-expert |
167.71.19.51 | Grok2API v2.0.4.rc3 (newer version, served via granian Rust ASGI server) | 11 models: grok-4.20-0309 + reasoning/non-reasoning/fast/auto/expert variants, grok-imagine-image-lite, claude-sonnet-4, claude-opus-4, claude-haiku-4, claude-3-haiku |
104.236.247.58 | (Zhipu proxy with Anthropic-compat headers) | 11 models: GLM-4.5/4.6/4.7 + thinking + tools + V (vision) variants, GLM-4.5-Air |
138.197.17.168 | (nginx-fronted custom proxy) | 4 OpenAI models: gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo |
138.68.0.205 | (same as above, likely same operator) | Same 4 OpenAI models |
65.108.250.0 | (custom proxy) | 4 OpenAI models: gpt-4.1, gpt-4o, gpt-4o-mini, gpt-3.5-turbo |
Pattern: several of these are Chinese-origin open-source LLM proxy projects (Grok2API, Kiro-Go, AgentBar) deployed on cheap cloud VPSes by operators who want to re-package commercial APIs without authentication. The zh / zh-CN admin UIs and product naming confirm the origin community. This is a recognised abuse pattern in the broader AI-tooling underground, these proxies are typically pointed at shared/leaked commercial-API credentials and resold to users who cannot easily obtain foreign API keys directly.
Per-instance financial exposure: GPT-4o pricing ~$5/1M input + $15/1M output tokens; current xAI Grok pricing ~$3-15/1M tokens; Anthropic claude-sonnet-4.5 ~$3/1M input + $15/1M output. A motivated attacker can drain four-figure dollars per day per exposed proxy without rate limits in their way. The 126-model AgentBar proxy at 178.62.227.102 is the highest-value target in this class, it spans embeddings, audio, image, and multiple LLM families, suggesting sizeable quotas across vendors.
Admin UI exposure note: the admin pages render publicly (HTML layout, table headers like “API Key 列表 / API Key List”, login forms) but the data-API endpoints (/admin/api/keys, /admin/api/stats, /admin/api/users) return 401 on probes. The stored credentials are not directly leaked. However, the publicly visible product name + version enables CVE/default-credential lookup against the specific upstream project, and the /openapi.json discloses the full admin API surface for targeted exploitation if a default-credentials match exists.
Class B: Operator-attributed proprietary fine-tunes (HIGH: IP exposure)
The model name discloses the operator and their work. Each fine-tune is the product of expensive training runs on operator-curated data; the weights and behavior are now externally probeable.
| Host | Model | Operator inferred |
|---|---|---|
65.108.33.72 | sipgate/call-analysis-qwen35-9b-20260302-merged-experimental | sipgate GmbH, German VoIP/cloud-telephony provider. Call-analysis fine-tune with March 2026 training date. Anyone can probe the fine-tune to learn what call-content classifications sipgate runs on customer voice data |
65.109.75.57 | Infomaniak-AI/vllm-translategemma-12b-it | Infomaniak, Swiss cloud-hosting provider. Italian-language Gemma-12B translation fine-tune |
168.119.32.186 | /opt/app-root/src/models/granite-3.3-8b-instruct (vLLM 0.13.0+rhai11) | Red Hat AI vLLM distro deployment of IBM Granite-3.3 8B |
159.203.44.226 | Qwen/Qwen2.5-Coder-14B-Instruct-GPTQ-Int4 | Code-generation fine-tune (could be operator product or upstream model) |
206.189.88.219 | vn-accountant, vn-accountant-fast | Vietnamese accounting AI |
135.181.113.224 | deep_researcher | Custom agent fine-tune |
45.63.76.200 | quinn-glm5 | Custom name-tagged fine-tune |
45.76.253.57 | jarvis | Custom name-tagged assistant |
159.69.80.54 | digistent-rag | RAG-tagged custom fine-tune |
157.90.12.151 | qwen3.5-35b | Possibly fine-tuned Qwen 3.5 35B |
Class C: Big-model production deployments (HIGH: compute theft scale)
These operators pay real money (multi-thousand-dollar GPU instances) to serve large models. Free LLM inference at this scale ties up GPU memory the operator paid for; sustained queries by an attacker degrade legitimate user experience.
| Host | Model | Compute footprint |
|---|---|---|
45.76.23.53 | amd/Llama-3.3-70B-Instruct-FP8-KV (vLLM 0.9.2rc2) | AMD MI300-class GPU; 70B FP8 |
149.28.115.34 | RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic (vLLM 0.19.0) | Llama 4 Scout, FP8 dynamic |
65.108.32.167 | openai/gpt-oss-120b (vLLM 0.16.0) | OpenAI’s open-weight GPT, 120B |
45.76.45.65 | mistralai/Pixtral-12B-2409 (vLLM 0.6.1) | Vision-language Pixtral 12B |
135.181.222.37, 135.181.56.61, 65.108.198.21, 65.108.230.168 | Qwen/Qwen3-235B-A22B-Instruct-2507-FP8 + moonshotai/Kimi-K2.6 | 4 Hetzner IPs, identical model list, single operator running a 4-host cluster of MoE LLMs. Qwen3-235B + Kimi-K2 (1T-class) require multi-A100/H100 hosts each |
144.202.51.41, 149.28.212.148, 45.32.59.191, 45.63.39.13 | Llama-3.1-8B, Llama-3.3-70B, Mistral-7B-Instruct-v0.3, Qwen2.5-7B-Instruct | 4 Vultr IPs, identical model list, single operator running a 4-host load-balanced cluster (vLLM 0.6.6.post1 on every host) |
144.202.53.99 | moonshotai/Kimi-K2.6 | Kimi K2 (1T-class MoE) |
165.227.37.82 | moonshotai/Kimi-K2.6 | Same |
Class D: Specialized / smaller deployments (MEDIUM)
| Host | Model | Purpose |
|---|---|---|
149.28.221.64 | gte-Qwen2-1.5B-instruct | Embedding model |
45.76.153.119 | qwen3-embedding-0.6b-q4_k_m.gguf | Embedding (llama.cpp) |
157.90.34.111 | tts-1, tts-1-hd | Text-to-speech (OpenAI-compat) |
65.109.240.42 | Systran/faster-whisper-tiny, speaches-ai/piper-en_US-ryan-high, silero_vad_v5 | Speech stack: Whisper STT + Piper TTS + Silero VAD |
135.181.48.68 | gemma-4-e2b-q4km.gguf | Quantized Gemma-4 (llama.cpp) |
159.69.114.185, 65.108.121.151 | Generic GGUF models | llama.cpp-server |
157.90.170.113 | ./bin/teuken.gguf | Teuken (German Fraunhofer multilingual LLM) |
65.108.32.170 | Qwen/Qwen3-4B | Small Qwen serve |
65.108.32.167 | openai/gpt-oss-120b | (already in Class C) |
165.227.38.203 | meta-llama/Llama-3.1-8B-Instruct | Standard Llama 3.1 8B |
Per-Class Severity
| Class | Count | Severity | Remediation urgency |
|---|---|---|---|
| A, Commercial-API reseller proxies | 10 | CRITICAL | Same-day, financial bleed |
| B, Operator-attributed proprietary fine-tunes | 10 | HIGH | High, IP exposure |
| C, Big-model production deployments | ~12 (incl. clusters) | HIGH | High, compute theft |
| D, Specialized / smaller | 12 | MEDIUM | Standard 30-day window |
For Class A, the operator may not realize free credits are being burned until a billing alert fires, at which point thousands of dollars may already have been charged to their commercial API accounts.
Cross-Survey Pattern (updated)
| Platform | Sample | Unauth |
|---|---|---|
| Qdrant | 61 | 100% |
| ChromaDB | 48 | 100% |
| Milvus | 33 | 100% |
| Triton | 2 | 100% |
| vLLM / OpenAI-compat | 44 | 100% |
The pattern is now overwhelming: every layer of the modern AI stack we have surveyed, vector DB, model-serving, LLM-inference proxy, ships with no authentication and most operators do not enable it.
Remediation
vLLM / vLLM-class servers
# Start vLLM with API key required
vllm serve <model> --api-key <strong-random-token>
# Or front it with an auth-enforcing reverse proxy:
# Caddy/Nginx with HTTP Basic auth or JWT validation in front of port 8000
Firewall port 8000 to the application backend’s CIDR.
Reseller-proxy class
These operators are running a thin OpenAI-API-compatible router in front of paid commercial accounts. Most such routers (LiteLLM, OpenRouter-self-host, OneAPI) support API-key auth via configuration; the operator has not enabled it. Enabling auth + rotating any compromised commercial-API credentials is the immediate fix; longer-term, putting the proxy behind a customer-facing gateway with per-customer rate-limiting is the architectural fix.
Disclosure Posture
The 10 Class-A reseller proxies are time-sensitive, every hour they remain open is more billable spend on the operator’s commercial-API accounts. Disclosure should target the operators directly via WHOIS / brand-domain pivots where possible, with DigitalOcean / Hetzner / Vultr abuse channels as fallback.
The 10 Class-B operator-attributed fine-tunes have identifiable upstream operators (sipgate, Infomaniak, Red Hat AI deployment customer), direct disclosure to those organizations’ security teams is the highest-bandwidth path.
NuClide is not opening 44 individual disclosure threads. Same-day priority is the Class-A reseller proxies (financial bleed) and the Class-B sipgate / Infomaniak findings (operator-attributable IP exposure).
NuClide Pipeline Artifacts
| Stage | Notes |
|---|---|
| Discovery | Reused 22,765 port-8000 IPs from chromadb-cloud-survey-2026-05 |
| Fingerprint | vllm-probe.py, 200-thread /v1/models body-match + /version + /metrics |
| Findings ledger | To be ingested into data/nuclide.db via VisorLog |
| What was NOT done | No /v1/chat/completions calls, no inference performed against any operator’s compute |
References
- vLLM authentication: https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html#api-key
- OpenAI-compatible API spec: https://platform.openai.com/docs/api-reference
- Cross-survey index: index.md