Survey May 28, 2026

AI Evaluation and Red-Team Platform Survey — Promptfoo Population Pass

NuClide Research · 2026-05-28 · Population sweep of 13 AI eval/red-team platforms. Four confirmed unauthenticated Promptfoo instances. LangSmith, TruLens, Inspect AI, HELM, DeepEval, PyRIT, Garak all null at population scale.

Summary

Promptfoo is the only AI eval/red-team platform in the 13-platform scope that produced confirmed unauthenticated exposure at scale. Four instances returned {"email":null} on GET /api/user/email with eval datasets and provider configurations readable without credentials. The best-characterized instance (evals.dev.generalwisdom.com, AWS Ashburn) exposed 60 LLM provider configurations including the Anthropic Claude 4.x model family and Azure GPT-4o, along with active eval datasets including test case corpora, prompt templates, and token usage statistics from runs as recent as 2026-05-01. LangSmith self-hosted instances were auth-enforced across all 30+ sampled hosts; v0.10+ default auth tightening has held. The six remaining platforms with HTTP server modes (TruLens, Inspect AI, HELM, DeepEval, Arthur Shield, Patronus AI) produced zero confirmed instances on Shodan. Six platforms are CLI-only with no HTTP server; Shodan surface is zero by design.

DCWF KSAT coverage

Auto-derived from DCWF AI work-role rule files (ksat-tag).

  • 672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, K7054, S7068, S7070, S7075, T5858, T5904, T5919
  • 733 (AI Risk & Ethics Specialist): K7040, K7051, S7067, T5868, T5893, T5904
  • overlap (Common AI KSATs (all 5 roles)): K108, K1158, K1159, K22, K6311, K6900, K6935, K7003, K942

Thesis fit

Auth-on-default thesis: mixed signal. Promptfoo ships with no auth gate on the web UI or API routes by default — four confirmed instances prove this is not theoretical. LangSmith, which was the thesis anchor for this category (96 hits in prior pass, known AUTH_TYPE=none default pre-v0.10), shows auth enforcement holding across the current sample. Two contradictory outcomes in the same category survey: one platform’s default hardened over time (LangSmith Insight #40 pattern), one platform’s default remains open (Promptfoo).


Per-finding entries

F1. evals.dev.generalwisdom.com (100.50.218.102)

What was found

GET /api/user/email returned {"email":null} — the Promptfoo identity probe for an unauthenticated instance. GET /api/providers returned a 60-provider configuration list including:

  • anthropic:messages:claude-haiku-4-5-20251001 (max_tokens: 2048, temperature: 0.5)
  • anthropic:messages:claude-opus-4-6 (max_tokens: 2048, temperature: 0.5)
  • anthropic:messages:claude-sonnet-4-5-20250929 (max_tokens: 2048, temperature: 0.5)
  • azure:chat:gpt-4o (api_host: your-resource-name.openai.azure.com)
  • azure:chat:gpt-4o-mini
  • Plus 55 additional configured providers

GET /api/datasets returned:

  • Dataset ID 06fb902..., count 3 test cases
  • Test cases from file://../datasets/regression-core.yaml
  • Most recent eval: eval-OJn-2026-05-01T13:29:42
  • Metrics: 32 pass / 0 fail, 213,085 tokens consumed, 547,050ms total latency, cost: $0 (inferred: local/API key with no cost tracking or free tier)

The providers endpoint response was served as JSON with status 200 and no authentication challenge. The Azure api_host value of your-resource-name.openai.azure.com is a placeholder — Azure credentials are not hardcoded in the provider config as stored in the database.
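The identity check used here can be sketched as a small response classifier. This is a minimal sketch, not the survey's own tooling: the function name and the label strings are hypothetical. The only behavior taken from the findings is that an unauthenticated instance answers GET /api/user/email with status 200 and {"email":null}. What a populated email implies about auth state is not established in this pass, so it gets its own label.

```python
import json


def classify_email_probe(status: int, body: str) -> str:
    """Classify a GET /api/user/email response (labels are illustrative)."""
    if status in (401, 403):
        # An authentication challenge: treat as auth-enforced.
        return "auth-enforced"
    if status != 200:
        return "inconclusive"
    try:
        payload = json.loads(body)
    except ValueError:
        # Non-JSON body, e.g. a catch-all route serving index.html.
        return "not-promptfoo"
    if not isinstance(payload, dict) or "email" not in payload:
        return "not-promptfoo"
    # {"email": null} is the signal reported for all four findings.
    return "unauthenticated" if payload["email"] is None else "email-set"
```

Keeping the classifier separate from the HTTP fetch means it can be exercised against saved responses without touching a live host.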

Why it is bad

Verified: operator’s LLM evaluation configuration is readable by anyone with network access. This includes which model families are being evaluated, the temperature/max_tokens parameters tuned for each, and the structure of the regression test suite.

Inferred (not exercised): individual eval run results are likely accessible via /api/evals if the Express catch-all is bypassed or if the API responds to XHR requests from a browser session against the SPA. The identity probe and the providers/datasets endpoints confirmed the platform is unauthenticated, so full eval history access is the logical downstream consequence.

Hypothesized: if this instance is used for comparing proprietary model variants or testing red-team robustness, the adversarial prompt library and model response comparison data would be in the eval history. Not extracted.

Who it affects

Operator: General Wisdom (generalwisdom.com). AWS US-East (Ashburn). Commercial sector. The evals.dev subdomain indicates this is a development/staging eval environment rather than production, but the exposure is real and the provider config is active (recent eval run 2026-05-01).

How it got exposed

Promptfoo’s Express server serves both the static React SPA and the API routes from the same process. Authentication is not enabled by default — the CSRF middleware is present but no auth gate exists on /api/* routes. The --share flag (for cloud.promptfoo.app) is the only auth-gating path; local/self-hosted deployments have no equivalent by default. This instance was deployed on AWS with port 443 directly internet-reachable, TLS via Let’s Encrypt, and no reverse proxy enforcing authentication.

Root cause: shipping default (auth off). Matches Insight #13 (shipping-defaults-are-load-bearing). The eval platform category is particularly sensitive because the exposure class includes adversarial prompt libraries, red-team test configurations, and model comparison data — outputs of security testing work that operators presumably want to keep internal.

Which tools contributed to the finding

Stage | Tool | Contribution
0 - Discover | JAXEN/Shodan (Playwright) | http.title:"promptfoo": 17 hits; evals.dev.generalwisdom.com in page 1 results
1 - Fingerprint | curl | HTTP title + X-Powered-By: Express + Access-Control-Allow-Origin: * confirmed Promptfoo
2 - Verify | curl | GET /api/user/email → {"email":null}; GET /api/providers → 60-provider JSON; GET /api/datasets → eval history with token stats
3 - Attribute | Shodan result | CN: evals.dev.generalwisdom.com; ASN: Amazon.com Inc. US-East
4 - Classify | manual | Commercial, development eval environment
5 - Ledger | VisorLog | Finding #162, severity medium
6 - Score | VisorScuba | AI.C1 violation (unauthenticated AI service)
6 - Exploit-rank | BARE | Attempted; binary sandbox-blocked this session; deferred
7 - Codify | manual | See methodology section below

Tools that ran but did not contribute unique signal: aimap (no Promptfoo fingerprint; ports 80/443 confirmed open but no AI service identified); VisorGraph (sandbox permission denied this session); VisorCorpus (N/A: no LLM corpus surface).

The load-bearing chain: Shodan title dork → HTTP identity probe → API dataset extraction.


F2. 103.177.248.237 (Hostup AB, Sweden)

What was found

GET /api/user/email returned {"email":null}. Response headers: X-Powered-By: Express, Access-Control-Allow-Origin: *. Content-Length: 1034 bytes (index.html). Port 3000.

GET /api/datasets returned an eval dataset: prompt “Tell me a good dadjoke”, vars {model: mistral-small}, 5 test runs, most recent eval eval-8F5-2026-03-30T12:36:45.

Why it is bad

Verified: unauthenticated Promptfoo with eval history readable. Dataset contains LLM evaluation runs using mistral-small, with the test prompt content visible.

Who it affects

Hostup AB, Sweden. VPS hosting. Likely a personal or small-team eval environment given the “dadjoke” test suite. Most recent eval: March 2026.

How it got exposed

Same root cause as F1 — Promptfoo default auth-off. No reverse-proxy auth layer.

Which tools contributed to the finding

Stage | Tool | Contribution
0 - Discover | JAXEN/Shodan (Playwright) | http.title:"promptfoo" page 1
1 - Fingerprint | curl | X-Powered-By: Express + title confirmation
2 - Verify | curl | /api/user/email → {"email":null}; /api/datasets → eval history
3 - Attribute | Shodan | Hostup AB, Sweden
4 - Classify | manual | Commercial/personal, small-scale
5 - Ledger | VisorLog | Finding #163, severity medium
6 - Score | VisorScuba | AI.C1 violation
6 - Exploit-rank | BARE | sandbox-blocked this session

F3. 64.112.124.114 (United States)

What was found

GET /api/user/email returned {"email":null}. Port 3000. GET /api/datasets returned: dataset with test case vars {model: mistral-small}, prompt “Tell me a good dadjoke”, 5 runs, eval date 2026-03-30. Functionally identical dataset to F2.

Why it is bad

Same exposure class as F2.

Who it affects

US-hosted, bare IP, unattributed to a specific org from the Shodan result.

Which tools contributed to the finding

Stage | Tool | Contribution
0 - Discover | JAXEN/Shodan (Playwright) | http.title:promptfoo port:3000 page 2 results
1 - Fingerprint | curl | Express headers
2 - Verify | curl | /api/user/email → {"email":null}; /api/datasets → eval data
3 - Attribute | Shodan | US, bare IP
5 - Ledger | VisorLog | Finding #164, severity medium
6 - Score | VisorScuba | AI.C1 violation

F4. 43.204.199.18 (AWS Asia Pacific — Mumbai, India)

What was found

GET /api/user/email returned {"email":null}. HTTP headers: nginx/1.24.0 (Ubuntu) reverse proxy, X-Powered-By: Express, Access-Control-Allow-Origin: *. Port 80. GET /api/datasets returned: test case “Fun animal adventure story” with vars: {animal: penguin, location: tropical island}, assertions including contains-any and llm-rubric checking for child-friendly content. Most recent eval visible.

This is the only instance with an llm-rubric assertion in the exposed dataset — it evaluates model outputs using a second LLM call as a judge. The rubric and evaluation criteria are readable.

Why it is bad

Verified: eval dataset with LLM-rubric configuration exposed. The rubric criteria and test prompts are visible. If the instance is part of an active pipeline, subsequent eval runs and their rubric-graded outputs are also accessible.

Who it affects

AWS ap-south-1 (Mumbai). Commercial operator, unattributed by hostname.

Which tools contributed to the finding

Stage | Tool | Contribution
0 - Discover | JAXEN/Shodan (Playwright) | http.title:"promptfoo" page 1
1 - Fingerprint | curl | nginx + Express response chain
2 - Verify | curl | /api/user/email → {"email":null}; /api/datasets → llm-rubric eval config
3 - Attribute | Shodan | Amazon.com, Inc. ap-south-1
4 - Classify | manual | Commercial, active dev eval
5 - Ledger | VisorLog | Finding #165, severity medium
6 - Score | VisorScuba | AI.C1 violation

Null results — platforms that ran but produced no confirmed findings

Platform | Shodan query | Hits | Verified | Notes
LangSmith self-hosted | http.title:"LangSmith" | 77 | 0 unauth | All sampled hosts returned 401 on /api/v1/runs; v0.10+ auth default is holding
TruLens | http.title:"TruLens" | 1 | 0 genuine | trulens.asia = Cambodian news website (FP)
Inspect AI | http.title:"Inspect" port:7575 | 0 | 0 | No instances on default port
HELM | http.title:"HELM" port:8000 | 2 | 0 | Both hits = Coolify (Kubernetes Helm package manager) FPs
DeepEval/Confident AI | http.title:"deepeval" | 0 | 0 | No instances; enterprise-only self-hosted
PyRIT | http.title:"PyRIT" | 0 | 0 | CLI-only; no HTTP server
Garak | http.title:"garak" | 4 | 0 | FPs (Chatterbox TTS); CLI-primary per prior pass
Patronus AI | http.title:"Patronus" | 3 | 0 | FPs (Polish hospital, AWS LBs); K8s-only deployment
Arthur Shield | http.title:"Arthur Shield" | 13 | 0 | Cloudflare blocked scrape; K8s-only, auth-enforced by design
RAGAS | http.title:"RAGAS" | 4 | 0 | ragas.app cloud SaaS only; no self-hosted surface
PromptBench | N/A | 0 | 0 | Python library; no HTTP server
OpenAI Evals | N/A | 0 | 0 | CLI-only; OpenAI deprecated self-hosted
LlamaRisk | N/A | 0 | 0 | Out of scope: DeFi org, not AI eval

Cross-survey analysis

Promptfoo population: 17 title hits, 4 confirmed unauth (24% unauth rate on sample). This is lower than the Ollama population (historically ~60-70% unauth) but comparable to Langfuse (~20% in prior surveys) and higher than LangSmith current (effectively 0% unauth on this pass).
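The 24% figure is plain arithmetic on 4 confirmed out of 17 title hits. The sketch below reproduces it and adds a Wilson score interval to show how loosely a 17-host sample pins the rate down. The interval is an illustrative addition, not part of the survey's method, and it assumes the 17 hits behave like a random sample of Promptfoo deployments, which a title dork does not guarantee. Function names are hypothetical.

```python
import math


def unauth_rate(confirmed: int, sampled: int) -> float:
    """Share of sampled hosts confirmed unauthenticated."""
    return confirmed / sampled


def wilson_interval(confirmed: int, sampled: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = confirmed / sampled
    denom = 1 + z * z / sampled
    center = (p + z * z / (2 * sampled)) / denom
    margin = z * math.sqrt(p * (1 - p) / sampled + z * z / (4 * sampled ** 2)) / denom
    return center - margin, center + margin


rate = unauth_rate(4, 17)          # Promptfoo: 4 confirmed of 17 title hits
low, high = wilson_interval(4, 17)
print(f"rate={rate:.1%} interval={low:.1%}..{high:.1%}")
```

With a sample this small the interval is wide, so comparisons against the Langfuse and Ollama rates are best read as rough ordering rather than precise ranking.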

The LangSmith population at 77 title hits is smaller than the 2026-05-04 baseline of 96 and represents a different mix: the prior pass used http.html:"langsmith", which catches page bodies as well as titles. The title-only dork skews toward instances that set a page title explicitly (the nginx-fronted LangSmith Docker deployment does set <title>LangSmith</title>). Auth enforcement is consistent across the sample.

F3 and F2 share an identical dataset (mistral-small dadjoke eval, same eval date 2026-03-30). This could be a template or tutorial dataset that multiple operators used when setting up their instances. If so, the dataset itself is not sensitive — but it confirms the auth state.
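One way to spot shared template datasets across findings is to hash a canonical form of each exposed test case and group findings by hash. This is a minimal sketch under assumptions: the record shape (a prompt string plus a vars mapping) is a simplification of what /api/datasets returned, and the function names are hypothetical. The sample values are the ones reported for F2, F3, and F4.

```python
import hashlib
import json


def dataset_fingerprint(prompt: str, variables: dict) -> str:
    """Hash a canonical JSON form so key order and spacing do not matter."""
    canonical = json.dumps(
        {"prompt": prompt, "vars": variables},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_by_fingerprint(findings: dict) -> dict:
    """Map fingerprint -> list of finding IDs that share that dataset."""
    groups = {}
    for finding_id, (prompt, variables) in findings.items():
        groups.setdefault(dataset_fingerprint(prompt, variables), []).append(finding_id)
    return groups


observed = {
    "F2": ("Tell me a good dadjoke", {"model": "mistral-small"}),
    "F3": ("Tell me a good dadjoke", {"model": "mistral-small"}),
    "F4": ("Fun animal adventure story", {"animal": "penguin", "location": "tropical island"}),
}
shared = [ids for ids in group_by_fingerprint(observed).values() if len(ids) > 1]
print(shared)
```

A fingerprint that recurs across unrelated hosts points to a template or tutorial dataset, which lowers the sensitivity rating of the data while leaving the auth finding intact.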

F1 (generalwisdom.com) is the operationally significant finding: 60 configured providers including the full current Anthropic model lineup suggests this is an active enterprise evaluation environment, not a tutorial instance.


Methodology — what this case study adds

Candidate Insight #50: Promptfoo default-auth-off is a live exposure class. At 4/17 confirmed unauth (24% on sampled population), Promptfoo joins the roster of AI infrastructure tools where shipping defaults produce real-world exposure. The signal is distinct from Ollama (model execution) and Langfuse (trace data) — Promptfoo exposes the adversarial test design itself: which models are being compared, on what criteria, with what prompts. For operators doing red-team work, this is the methodology, not just the data.

This also reveals a fingerprinting gap in aimap v1.9.36: Promptfoo is not in the fingerprint corpus. The /api/user/email → {"email":null} probe is a reliable identity signal. A Promptfoo fingerprint (port 3000/15500, X-Powered-By: Express, Access-Control-Allow-Origin: *, /api/user/email returns {"email":null} or {"email":"user@example.com"}) should be added.
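The proposed fingerprint can be expressed as a small matcher over one recorded probe of /api/user/email. This is a sketch only: aimap's actual fingerprint format is not shown in this survey, so the record type and function names below are hypothetical. Because F1 and F4 were found on ports 443 and 80, the port is treated as a hint rather than a requirement.

```python
import json
from dataclasses import dataclass

# Ports named in the proposed fingerprint; used as a hint only.
DEFAULT_PORTS = frozenset({3000, 15500})


@dataclass
class HttpObservation:
    """One recorded response to GET /api/user/email (hypothetical shape)."""
    port: int
    headers: dict
    body: str


def matches_promptfoo(obs: HttpObservation) -> bool:
    """True when the header and body signals from the findings all line up."""
    headers = {name.lower(): value for name, value in obs.headers.items()}
    if headers.get("x-powered-by") != "Express":
        return False
    if headers.get("access-control-allow-origin") != "*":
        return False
    try:
        payload = json.loads(obs.body)
    except ValueError:
        return False  # HTML or other non-JSON body
    # Both {"email": null} and {"email": "<address>"} count as a match.
    return isinstance(payload, dict) and "email" in payload


def on_default_port(obs: HttpObservation) -> bool:
    """Secondary hint: is the service on a port named in the fingerprint?"""
    return obs.port in DEFAULT_PORTS
```

Requiring the body shape in addition to the Express and CORS headers matters because those two headers alone are common to many unrelated Node services.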


Honest negative space

The http.html: Shodan filter requires a paid-plan filter token, which is not available via the authenticated web UI session. All dorks used http.title:, which misses Promptfoo instances running behind a custom title or a reverse proxy that strips or rewrites the page title. The 17 hits on http.title:"promptfoo" are a floor, not a ceiling. The ssl.cert.subject.cn:promptfoo population (25 hits) was not fully probed this session.

LangSmith port 1984 (the backend API) returned 0 results via Shodan title filter — this port likely requires http.html: or response-body content filters to enumerate directly. The 77 title-filter hits are also a floor.


Disclosure queue (verified scope)

Finding | Target | Tier | Status
F1 | evals.dev.generalwisdom.com | MEDIUM (verified) | Queued: development environment; provider configs exposed but no API key leakage confirmed
F2 | 103.177.248.237 | MEDIUM (verified) | Queued: low-sensitivity dataset (dadjoke eval); low priority
F3 | 64.112.124.114 | MEDIUM (verified) | Queued: low-sensitivity dataset; bare IP operator unknown
F4 | 43.204.199.18 | MEDIUM (verified) | Queued: LLM rubric eval exposed; AWS ap-south-1

All four: auth-off by shipping default, not by misconfiguration. Remediation is adding auth middleware or placing the Promptfoo instance behind a reverse proxy with auth (nginx + basic auth, or Authelia/Vouch).


Toolchain provenance

Shodan (Playwright) → 19 dorks executed → 4 platforms with hits (LangSmith 77, Promptfoo 17, TruLens 1, HELM 2)
  → curl direct probes on IP list
  → /api/user/email identity probe → 4 confirmed unauth (Promptfoo)
    → /api/providers → 60 LLM providers (F1)
    → /api/datasets → eval dataset + run history (F1, F2, F3, F4)
  → aimap (target-by-target) → 2 open ports confirmed; no AI service fingerprint match (Promptfoo not in corpus)
  → VisorLog ingest → Findings #162–165 (medium)
  → VisorScuba assess → AI.C1 violation (4/4)
  → BARE → sandbox-blocked this session; deferred
  → VisorGraph → sandbox permission denied this session; deferred

See also

  • shodan/queries/ai-eval-redteam-queries.md — full dork catalog
  • data/platform-intel/ai-eval-redteam-osint-2026-05-27.md — pre-survey platform intel
  • shodan/queries/23-ai-safety-eval.md — prior 2026-05-04 pass (Promptfoo 22, LangSmith 96)