VisorBishop iter-5: LiteLLM Proxy + Argilla + Promptfoo (gateway + annotation + eval tiers), NuClide Research

NuClide Research · 2026-05-11

Summary

Fifth iteration of the Phase 3 loop. Where iter-1/2/3 extended the IP-direct-shadow port set and iter-4 added adjacent observability platforms (Opik, AgentOps, Phospho), iter-5 expands to adjacent TIERS:

DCWF KSAT coverage

Auto-derived from DCWF AI work-role rule files (ksat-tag).

672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, S7068, S7070, S7075, T5858, T5904
733 (AI Risk & Ethics Specialist): S7067, T5868, T5882, T5893
overlap (Common AI KSATs (all 5 roles)): K1158, K22, K6311, K6935, K7003

LiteLLM Proxy (LLM gateway tier, stores provider API keys, serves OpenAI-compatible API)
Argilla (data annotation tier, labels training data for LLM fine-tuning)
Promptfoo (LLM evaluation tier, runs prompt regression tests)

Trulens and LangChain Hub were also scoped but yielded 0 self-hosted hits on Shodan; not pursued in this iter.

Reproduce with VisorBishop ≥ v0.1.5: visorbishop -i hosts.txt -ip-shadow

Headline findings:

LiteLLM: 25 of 269 confirmed instances are CRITICAL unauth (9.3%). Population-scale LLMjacking primitive. From a sample of 500 of the 5,408 Shodan-listed LiteLLM hosts. Extrapolated population-scale unauth count: ~500 instances. The model catalogs reveal operators paying for Claude Sonnet/Haiku, GPT-4o/GPT-5-nano, Gemini 1.5/3.1 Pro/Flash, Bedrock, Azure OpenAI, Ollama, and various fine-tuned deployments. All of which attackers can prompt at the operator’s expense.
Promptfoo: 10 of 17 self-hosted instances are CRITICAL unauth (59%). /api/results/ returns full eval history publicly. One operator (35.227.160.55:3000 / Google Cloud US) exposes 100 evals. Another (evals.dev.generalwisdom.com) exposes 3.
Argilla: 25 of 37 confirmed Argilla instances; 0 unauth. Auth posture uniformly strong (argilla.api.errors::UnauthorizedError returned on every protected route). Notable operators: Climate Policy Radar (4 hosts), 510.global (Red Cross humanitarian AI), Five9 contact-center, algoan fintech.

Population summary

Platform	Shodan dork	Total hits	Confirmed	Critical (unauth)
LiteLLM Proxy	`http.title:"LiteLLM API"`	5,408	269 of 500 sample	25 (9.3%)
Argilla	`http.title:"Argilla"`	37	25	0
Promptfoo	`http.title:"promptfoo"`	17	11	10
Trulens	`http.title:"TruLens"`	1	—	—
LangChain Hub self-host	various	0	—	—

LiteLLM’s 5,408-hit population is the largest of any platform surveyed in this entire research chain. Phoenix was 377; Langfuse 1,333. LiteLLM’s order-of-magnitude larger population reflects its “every dev tries it” posture. The BerriAI/litellm pip install is ~5 lines to run, and the default Swagger UI exposes the title that matches our dork.

CRITICAL: Promptfoo public eval-history disclosure

10 of 17 Promptfoo self-host instances expose /api/results/ without authentication. Promptfoo’s OSS server has no auth layer by default. Operators run promptfoo view to expose their local eval browser, and when they deploy the same command on a public-facing host, the eval list becomes publicly readable.

The eval data class:

Full prompts under test
Model responses and assertion failures
Configuration including model names, temperature, system prompts
Dataset metadata

Top yield by eval count:

Host	Operator	Eval count	Country
`35.227.160.55:3000`	(Google Cloud US)	100	US
`3.13.69.153:443`	(AWS US-East-2)	12	US
`64.112.124.114:3000`	csdi-admin.local (KhanWebHost)	11	US
`172.96.192.235:12380`	(Cluster Logic)	8	US
`3.218.253.194:443`	`evals.dev.generalwisdom.com` (AWS US-East-1)	3	US
`103.177.248.237:3000`	(Hostup AB)	1	Sweden
`43.204.199.18:80`	(AWS Mumbai)	1	India
`38.105.232.166:3000`	`ped.staging.agence.net.br`	0	BR (Contabo)
`147.78.130.168:3000`	(Contabo DE)	0	DE
`49.12.34.32:3000`	`badllama.pranavg.me`	0	DE (Hetzner)

The General Wisdom find is particularly notable: their evals.dev.* naming implies this is their development-tier eval pipeline.

The Promptfoo unauth rate at population scale is 59% (10/17). Higher than Phoenix’s 25%. Different tier, but same shipping-default problem: the OSS server has no built-in auth and the operator must deploy their own.

Argilla: 0 unauth across 25 confirmed

Argilla’s server enforces API-key auth on every protected route by default. All 25 confirmed Argilla instances returned the same {"detail":{"code":"argilla.api.errors::UnauthorizedError","params":{"detail":"Could not validate credentials"}}} shape on /api/v1/me.

Notable Argilla operators (auth-protected, info-class only):

Operator	Hosts	Sector
Climate Policy Radar	4	UK climate policy AI
510.global	1	Red Cross humanitarian data AI
Five9	2	Contact-center SaaS
algoan.com	1	French fintech
bee.ai	1	AI startup
konzierge.at	1	Austrian AI consultancy
naiveneuron.com	1	Slovak AI shop
radai-systems.com	2	(AI systems vendor)
Other AWS/GCP/elestio.app	12	Various

The annotation-training-data exposure surface is real but operator-controlled. Argilla forces auth from the start, so misconfiguration is rare. Same class as Langfuse (mandatory-auth platforms).

LiteLLM Proxy: order-of-magnitude larger population

LiteLLM is the LLM gateway tier, not observability. Operators deploy it to proxy LLM API calls (OpenAI, Anthropic, Bedrock, Vertex AI, etc.) through a single endpoint while applying rate-limiting, cost-tracking, and key management.

The Shodan population at http.title:"LiteLLM API" is 5,408 hosts. The largest population we’ve seen in this entire research chain. The 500-host sample probe is in progress.

500-host sweep result

Result	Count
Total probed	500
Confirmed LiteLLM	269 (54%)
CRITICAL (unauth `/v1/models`)	25 (9.3% of confirmed, 5.0% of all probed)
Auth-fronted	244

Top deployed versions:

1.82.6 (44 hosts)
1.82.3 (27)
1.83.10 (18)
1.81.14 (17)
1.83.14 (13)
1.81.12 (10)
1.82.1 (8)
1.80.8 (7)

Operators run mostly current LiteLLM. Range from 1.77.x to 1.83.x. Last month or two of releases.

Sample of unauth LiteLLM hosts: what’s behind them

The 25 CRITICAL hosts disclose their model catalogs (no auth needed for /v1/models). The catalogs reveal what attacker-supplied prompts would get billed to:

Host	Models exposed	Notable
`103.52.212.89:4000`	15 incl. `asisten-desa`, `cadangan-groq`, `cadangan-gemini-1`	Indonesian public-sector AI (“village assistant”), Groq+Gemini fallback chain
`204.168.237.160:4000`	44 incl. `cloud-fallback-1..n`	Large fleet with redundancy
`45.124.54.69:4000`	5 incl. `claude-3-5-sonnet-20241022`, `claude-3-5-sonnet-20240620`, `claude-sonnet-4-6`	Anthropic Sonnet across versions including current
`144.124.252.251:4000`	4 incl. `gpt-5-nano`, `gpt-4o-mini`, `text-embedding-3-small`	OpenAI tier with embeddings
`159.69.106.84:4000`	2 incl. `gpt-4o`, `claude-3-5-sonnet`	Dual-provider
`74.48.70.26:4000`	1: `bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0`	AWS Bedrock
`172.203.222.133:8000`	1: `azure/gpt-4o-mini`	Azure OpenAI
`47.99.185.244:4000`	7 incl. `WK-qwen3.6-plus`, `WK-qwen3.5-plus`, `WK-qwen3-max-2026-01-23`	Chinese WK-branded Qwen fine-tunes
`121.31.38.238:4000`	6 incl. `deepseek-v4-pro`, `gpt-4o`, `gpt-4`	DeepSeek + OpenAI multi-provider
`110.164.163.24:4000`, `:5555`, `:443`	2-5 Gemini Flash/Pro variants	One Thai operator running 3 instances on different ports
`135.181.26.207:4000`	2 incl. `gemini-3.1-flash-lite`, `gemini-3.1-pro`	Gemini 3.1
`172.232.115.211:4000`	1: `kimi-k2.5`	Moonshot Kimi
Various	Ollama `qwen2.5`, `deepseek-coder`, `gemma`, `mistral`	Self-hosted models

Why LiteLLM unauth is severe

When LITELLM_MASTER_KEY is not set, the LiteLLM proxy:

Accepts arbitrary /v1/chat/completions requests from anyone on the internet
Forwards them to the operator’s configured upstream providers (OpenAI, Anthropic, Bedrock, etc.) using the operator’s stored API keys
Operator pays the bill

This is the classic LLMjacking primitive. Attacker-supplied prompts get billed to the operator’s OpenAI / Anthropic / Bedrock account at the operator’s rate-tier. A single hour of automated probe traffic against an unauth LiteLLM can run thousands of dollars.

Population-scale extrapolation

If the 9.3% unauth rate in the 500-host sample holds across the full 5,408-host Shodan population, the population-scale count is ~500 unauth LiteLLM instances worldwide. Each is a per-instance LLMjacking exposure.

A focused iter-6 against the full 5,408-host population would catalogue the entire global LiteLLM exposure surface. Worth doing.

Why LiteLLM unauth is severe

When LITELLM_MASTER_KEY is not set, the LiteLLM proxy:

Accepts arbitrary /v1/chat/completions requests from anyone on the internet
Forwards them to the operator’s configured upstream providers (OpenAI, Anthropic, etc.) using the operator’s stored API keys
Operator pays the bill

Bug fixed live during iter-5

Iter-5 surfaced a false-positive bug in the LangSmith prober: it matched ZenML’s /api/v1/info response shape (which also has a version field). Fixed in VisorBishop@7b0185a by requiring LangSmith-specific markers (license_expiration_time, customer_info, or known LangSmith instance_flags) before confirming identity.

This is the third correctness gap iter-N has surfaced (after iter-1’s URL-parser and iter-1’s port-parallelism). All three required real population workloads to manifest. Iterating produces real correctness wins, not just methodology refinement.

This concretely demonstrates Methodology Insight #15 (in draft): Generic API shapes are not platform identifiers. Confirmation requires platform-specific markers; otherwise iter-N will surface false positives that the original probe set didn’t anticipate.

Performance

Argilla 37 hosts: 30.8s (only 1 probe per host)
Promptfoo 17 hosts: 35.1s (only 1 probe per host)
LiteLLM 500 hosts: in progress; estimated 10-15min

The LiteLLM prober makes 3-4 probes per host (root for title check, optional /.well-known/litellm-ui-config, /v1/models, optional /openapi.json for version). With 24 concurrency the per-host wall time is the bottleneck on slow Shodan-listed hosts.

Next steps

~~Build VisorBishop (Phase 3)~~ ✓
~~iter-1 / 2 / 3 / 4: port expansion + platform expansion~~ ✓
iter-5: LiteLLM 500-sweep complete → update this case study with full numbers
iter-6 candidate: full LiteLLM 5,408-host sweep (requires Shodan credit budget approval)
Methodology Insight #14, #15 final writeups
Phase 4 (web UI), visorBishop dashboard with cross-platform attribution

Evidence pack

~/recon/2026-05-10-llm-sweep/visorbishop-results/iter5/

argilla-noshadow.json / .csv. 37-host Argilla sweep
promptfoo-noshadow.json / .csv. 17-host Promptfoo sweep
litellm-noshadow.json / .csv. 500-host LiteLLM sample (in progress)

Source: Nicholas-Kloster/VisorBishop@v0.1.5

Cross-references: