
Survey May 15, 2026

RAG Framework Servers: Population-Scale Survey (2026-05-15)

NuClide Research · 2026-05-15
Category: 07 RAG Stacks (frameworks tier)
Status: complete · 6 platforms targeted · 3 with population data · 2 confirmed Shodan-dark · auth-on-default thesis confirmed across 3 tiers simultaneously


TL;DR

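1,782 confirmed RAG framework instances across the three platforms with population data (1,787 including the 4 PrivateGPT and 1 LlamaIndex hosts). 538 confirmed unauthenticated via primary-source verification (483 AnythingLLM + 55 LightRAG). Auth-on-default thesis confirmed at population scale across three tiers in a single survey: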

| Platform | Tier | Confirmed | Unauth | Unauth % | Thesis verdict |
|---|---|---|---|---|---|
| AnythingLLM | A* (auth-optional, signup-open by default) | 1,242 | 483 | 39% | Confirms thesis: default operates at population scale |
| RAGFlow | C (auth-on-default) | 485 | 0 | 0% | Confirms thesis by contrapositive |
| LightRAG | A (no auth concept) | 55 | 55 | 100% | Confirms thesis (no auth → 100% unauth) |
| PrivateGPT | A (no auth concept) | 4 | (small, not re-verified) | | |
| LlamaIndex | A (no auth concept) | 1 | 1 | | Shodan-dark; see single-host case study |
| Haystack | A* (FastAPI) | 0 across 6 queries | | | Shodan-dark: port-first masscan required |

Within the AnythingLLM unauth subset:

  • 302 of 483 (63%) have ingested corpora (HasExistingEmbeddings: true)
  • 80+ are wired to paid LLM API keys (OpenAI 43, generic-OpenAI 16, Gemini 10, OpenRouter 3, Azure 3, Mistral 2, Cohere 2, LiteLLM 2): the LLMjacking / quota-drain population. LMStudio (2) and LocalAI (1) also appear in the metadata but are local runtimes rather than paid APIs
  • Globally distributed: US 134, CN 84, DE 71, FR 23, SG 21
  • Top hosting operators: AWS 64, Hetzner 39, DigitalOcean 30, Aliyun 29, Contabo 19

Premise & platforms in scope

RAG (Retrieval-Augmented Generation) framework servers sit between vector databases and LLM clients, orchestrating the document-ingestion → chunking → embedding → retrieval → context-injection pipeline. The vector DB layer (Qdrant/Chroma/Milvus) has been surveyed; this is the framework layer above it.

| Platform | Default port | Authoritative auth tier | Identity marker |
|---|---|---|---|
| LlamaIndex servers | 8000 | A (no auth concept) | info.title: "LlamaIndex Chat" in /openapi.json |
| Haystack (hayhooks) | 1416 | A* (auth-optional) | /initialized JSON + haystack/document_store in /openapi.json |
| LightRAG | 9621 | A (no auth concept) | /api/v1/graph/label/list returns list, /docs contains LightRAG |
| AnythingLLM | 3001 | A* (auth-optional, signup-open default) | /api/ping → {"online":true} (or older: pong) |
| RAGFlow | 9380 | C (auth-on-default) | HTML title RAGFlow, /v1/llm/list JSON code field |
| PrivateGPT | 8001 | A (no auth concept) | /openapi.json contains PrivateGPT AND NOT document_store |
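The /openapi.json markers in the table can be sketched as a small classifier. This is an illustrative sketch, not the survey's probe code: the function name `identify_from_openapi` is hypothetical, and treating "haystack/document_store" as "either string present" is an assumption about how the marker is meant to be read.

```python
import json
from typing import Optional


def identify_from_openapi(openapi_text: str) -> Optional[str]:
    """Map an /openapi.json body to a platform name, or None if no marker hits.

    Assumption: the Haystack marker is satisfied by either "haystack" or
    "document_store" appearing anywhere in the document.
    """
    try:
        spec = json.loads(openapi_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(spec, dict):
        return None

    info = spec.get("info")
    title = info.get("title") if isinstance(info, dict) else None
    if title == "LlamaIndex Chat":
        return "LlamaIndex"

    # Haystack is checked before PrivateGPT so that the "AND NOT
    # document_store" half of the PrivateGPT marker is enforced by ordering.
    if "document_store" in openapi_text or "haystack" in openapi_text.lower():
        return "Haystack"
    if "PrivateGPT" in openapi_text:
        return "PrivateGPT"
    return None
```

Checking the negative conjunct by ordering keeps a generic FastAPI document that merely mentions PrivateGPT alongside a document store out of the PrivateGPT bucket.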

Methodology

Discovery: Shodan brand-dork (Stage 0)

38 priority queries in total, built platform by platform with the conjunctive marker-anchored rule (Insight #6): each query combines a platform-unique title, body string, or response pattern. Counts were taken first (shodan host count, 1 query credit each) to size the harvest before fan-out.

Hit-volume per platform:

| Platform | Best brand-dork | Hits |
|---|---|---|
| RAGFlow | http.title:"RAGFlow" | 1,883 |
| AnythingLLM | http.title:"AnythingLLM" | 1,004 |
| LightRAG | http.title:"LightRAG" | 88 |
| PrivateGPT | http.title:"PrivateGPT" | 6 |
| LlamaIndex | http.title:"LlamaIndex Chat" | 1 |
| Haystack | 6 queries (hayhooks, port:1416, deepset-ai/haystack, etc.) | 0 (all) |

Two platforms confirmed Shodan-dark. Insight #21 applies:

  • Haystack: all 6 queries returned 0. Port-1416 alone has 672 worldwide listeners but none match hayhooks or uvicorn strings. Per Insight #21, port-first masscan against tier-2 ranges is the only path.
  • LlamaIndex Chat: 1-2 hits across all queries. The create-llama generated HTML’s <title> tag is the only brand marker, and it isn’t always indexed. Population requires port-first.

The two Shodan-dark platforms validate Insight #21 empirically, in a single survey, on two platforms simultaneously.

Harvest

shodan download --limit N (paginating beyond JAXEN’s 50/query cap) into per-query .json.gz files. Total spend: ~50 query credits for the productive queries. Deduped to 3,773 unique IP:port candidates across the four productive platforms.
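The dedup step can be sketched as follows, assuming the usual shodan download output of gzip-compressed, newline-delimited JSON banners carrying `ip_str` and `port` fields. The function name `dedupe_candidates` is hypothetical; this is not the survey's own tooling.

```python
import gzip
import json
from pathlib import Path
from typing import Iterable, Set, Tuple


def dedupe_candidates(paths: Iterable[Path]) -> Set[Tuple[str, int]]:
    """Collect unique (ip, port) pairs across several per-query .json.gz files."""
    seen: Set[Tuple[str, int]] = set()
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue
                try:
                    banner = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip damaged lines rather than abort the harvest
                if not isinstance(banner, dict):
                    continue
                ip, port = banner.get("ip_str"), banner.get("port")
                if isinstance(ip, str) and isinstance(port, int):
                    seen.add((ip, port))
    return seen
```

Keying on the (ip, port) pair rather than the IP alone preserves hosts that expose the same framework on several ports, which matters when 80/443 reverse-proxy fronts coexist with a default port.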

Probe iterations (three probe rounds plus a re-classification pass; each caught a FP class)

| Round | Probe | Yield | Bug found |
|---|---|---|---|
| 1 | Existing rag-framework-probe.py (HTTP-only) | 538 confirmed (RAGFlow 504, LightRAG 32, PrivateGPT 1, LlamaIndex 1; AnythingLLM 0/1505) | HTTP-only; can’t probe 443 SNI vhosts |
| 2 | New probe-https.py (HTTPS-aware, both schemes) | 545 (AnythingLLM still 0) | /api/ping checked for pong (old release) but newer AnythingLLM returns {"online":true} |
| 3 | probe-https.py with corrected /api/ping marker | 1,787 confirmed | HTTP status ≠ auth state (Insight #16): auth_required=False was based on 200 status, but AnythingLLM /api/system/check-token returns 200 + body "No auth token found" and RAGFlow returns 200 + body "code:401" |
| 4 (re-classify) | reprobe-anyllm-strict.py + reprobe-ragflow.py, parsing JSON body fields directly | Correct counts | Logic refined: AnythingLLM auth from /api/setup-complete results.RequiresAuth; RAGFlow auth from /v1/llm/list JSON code field |

Final auth-state classification uses only primary-source body fields, never HTTP status alone. This is the Insight #16 discipline operationalized.


Per-platform findings

AnythingLLM: 483 unauth / 1,242 confirmed (39%)

Probe: GET /api/ping returns {"online":true} or pong. Cross-check GET /api/setup-complete results.RequiresAuth field.

Auth-state distribution (primary-source /api/setup-complete):

  • 483 (38.9%) UNAUTH-ANONYMOUS (RequiresAuth: false). Anyone visiting can register / become admin / use the LLM and corpora
  • 732 (58.9%) auth-required (RequiresAuth: true). Properly locked down
  • 20 non-JSON responses (probe degradation, likely proxy-fronted)
  • 7 unreachable
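The body-field classification behind these buckets can be sketched as below. This is a minimal illustration assuming only the `results.RequiresAuth` path named above; the function name and the extra "indeterminate" label (JSON that lacks the field) are hypothetical.

```python
import json


def classify_anythingllm(setup_complete_body: str) -> str:
    """Classify auth state from the /api/setup-complete body, ignoring HTTP status."""
    try:
        payload = json.loads(setup_complete_body)
    except json.JSONDecodeError:
        return "non-json"
    results = payload.get("results") if isinstance(payload, dict) else None
    if not isinstance(results, dict):
        return "indeterminate"
    requires_auth = results.get("RequiresAuth")
    if requires_auth is True:
        return "auth-required"
    if requires_auth is False:
        return "unauth-anonymous"
    # Field missing or not a boolean: do not guess either way.
    return "indeterminate"
```

Requiring a literal boolean, rather than truthiness, means a missing or malformed field never lands in the unauthenticated bucket by accident.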

Within the 483-host unauth subset, enriched per-host metadata:

| EmbeddingEngine | Hosts | LLM-quota-drain risk class |
|---|---|---|
| native (bundled MintplexLabs) | 307 | low: local embedder, no operator API cost |
| ollama (local Ollama) | 82 | low: local |
| openai | 43 | HIGH: operator’s OpenAI key is wired |
| generic-openai (custom endpoint) | 16 | HIGH: operator’s custom LLM-provider key |
| gemini | 10 | HIGH |
| unknown/missing | 10 | unknown |
| openrouter, azure, mistral, cohere, litellm, lmstudio, localai | 15 combined | mixed |

302 of 483 (63%) have HasExistingEmbeddings: true, meaning the operator has already ingested documents into the workspace and persisted embeddings. The corpus exists and is queryable via the chat UI on an unauthenticated session.
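The enrichment above uses only two metadata fields, and the tally can be sketched as follows. The bucket assignments mirror the risk table; the function names, the lowercase bucket labels, and the input shape (one dict of `results` fields per host) are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

HIGH_RISK = {"openai", "generic-openai", "gemini"}
LOW_RISK = {"native", "ollama"}


def risk_class(engine: Optional[str]) -> str:
    """Map an EmbeddingEngine value to a quota-drain risk bucket."""
    if not engine:
        return "unknown"
    name = engine.lower()
    if name in HIGH_RISK:
        return "high"
    if name in LOW_RISK:
        return "low"
    return "mixed"  # openrouter, azure, mistral, cohere, litellm, lmstudio, localai


def summarize(hosts: Iterable[Dict[str, object]]) -> Tuple[Counter, int]:
    """Return (hosts per risk bucket, hosts with HasExistingEmbeddings true)."""
    by_risk: Counter = Counter()
    with_corpus = 0
    for results in hosts:
        engine = results.get("EmbeddingEngine")
        by_risk[risk_class(engine if isinstance(engine, str) else None)] += 1
        if results.get("HasExistingEmbeddings") is True:
            with_corpus += 1
    return by_risk, with_corpus
```

Nothing here touches corpus content: both counts come from fields the endpoint already returns to an anonymous visitor.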

Geographic distribution (unauth subset):

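| Country | Count | Country | Count |
|---|---|---|---|
| United States | 134 | India | 12 |
| China | 84 | Canada | 10 |
| Germany | 71 | Hong Kong | 10 |
| France | 23 | Netherlands | 9 |
| Singapore | 21 | UAE | 9 |
| Japan | 13 | Brazil | 8 |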

Top hosting orgs (unauth subset):

AWS 64 (Amazon Technologies 34 + Amazon.com 30) · Hetzner 39 · DigitalOcean 30 · Aliyun 29 · Contabo 19 · OVH 14 · Google 13 · Oracle 12 · Tencent 8.

No single dominant operator on the unauth subset. This is a platform-default-misconfiguration class, not a single-operator misdeploy.

RAGFlow: 0 unauth / 485 confirmed (0%). Auth-on-default thesis confirmed by contrapositive

Probe: HTML title RAGFlow plus /v1/dataset/list reachability. Auth-state re-classification: GET /v1/llm/list and parse JSON code field. code:401 = auth-required, code:0 = unauth-with-data, code:100 = method/not-found.

Auth-state distribution (re-verified):

  • 462 (95.3%) auth-required (code: 401). Proper login gate
  • 17 other-code-None. Non-standard responses, likely proxy-fronted variants
  • 6 non-JSON responses
  • 0 confirmed unauthenticated
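The re-classification rule can be sketched as a lookup on the body's code field. The code-to-meaning mapping is taken from the probe description above; the function name and the returned label strings are hypothetical.

```python
import json

# Mapping as described in the probe notes for GET /v1/llm/list.
RAGFLOW_CODES = {
    401: "auth-required",
    0: "unauth-with-data",
    100: "method-or-not-found",
}


def classify_ragflow(llm_list_body: str) -> str:
    """Classify RAGFlow auth state from the JSON body; HTTP status is ignored."""
    try:
        payload = json.loads(llm_list_body)
    except json.JSONDecodeError:
        return "non-json"
    code = payload.get("code") if isinstance(payload, dict) else None
    # bool is a subclass of int in Python; exclude it so `false` is not read as 0.
    if isinstance(code, bool) or not isinstance(code, int):
        return "other-code"
    return RAGFLOW_CODES.get(code, "other-code")
```

Excluding booleans matters because a body of `{"code": false}` would otherwise match code 0 and be counted as unauthenticated.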

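This is a negative result that confirms the thesis by its contrapositive: RAGFlow is Tier-C (auth-on-default), and the population follows the default. 0 of 485 instances were confirmed unauthenticated, 462 were positively confirmed auth-required, and the remaining 23 returned non-standard or non-JSON responses. This parallels the prior 2026-05 findings on Langfuse and Phoenix (both Tier-C with 0% unauth in their respective surveys).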

Heavy Chinese deployment: 343 of 485 (71%) China-hosted. Aliyun 100, Tencent 33+22, Huawei 32, China Mobile 23, Volcano Engine 19. RAGFlow’s developer InfiniFlow is China-based; the population follows the brand-origin.

LightRAG: 55 unauth / 55 confirmed (100%)

Probe: /api/v1/graph/label/list returns a JSON list, OR /docs swagger contains LightRAG.

LightRAG has no authentication concept in its default deploy. Every confirmed instance is, by construction, unauthenticated. The 100% rate is the expected Tier-A result.

Geographic spread: Germany 13, China 11, US 5, Russia 5, France 4, Netherlands 3, Finland 3. Diverse. No operator cluster.

Port distribution: 80 (14), 443 (13), 8000 (6), 8080 (2), 9621 (2, the documented default). Operators routinely run LightRAG behind reverse proxies; default-port-9621 is the minority deployment pattern. 42 HTTP / 13 HTTPS.

PrivateGPT: 4 confirmed, auth state not re-verified

Small population. Confirmed via /openapi.json containing PrivateGPT and NOT document_store (to discriminate from Haystack). Re-verification deferred. Sample size is too small to be population-meaningful.

LlamaIndex: 1 confirmed (via Shodan), Shodan-dark at population scale

Only 1-2 hits across 6 brand-dorks. The confirmed instance is the single-host case study at 23.239.19.219 (Linode, operator gochatus.org). See the dedicated case study.

LlamaIndex Chat is brand-dork-extinct on Shodan. The create-llama-generated HTML title is inline, Vite-bundled, and not consistently indexed. Population study requires masscan-tier-2 + uvicorn-fingerprint + /openapi.json post-probe with info.title:"LlamaIndex Chat" verification. Exactly the Insight #21 lane.

Haystack: 0 confirmed via Shodan, complete brand-dork blackout

All 6 Haystack queries returned 0 hits. Even raw port:1416 (the canonical hayhooks default) has 672 worldwide listeners, but none of them carry the hayhooks or uvicorn strings that would Shodan-identify them as Haystack.

The 672 port-1416 listeners are most likely unrelated, non-Haystack services (the IANA registry assigns port 1416 to novell-lu6.2, Novell LU6.2). Haystack/hayhooks operators have either deployed near-zero instances reachable from the internet, deployed entirely behind reverse proxies that strip the brand string, or both.

Haystack at population scale needs:

  1. masscan tier-2 (Scaleway/OVH/Linode = 3.55M IPs) on port 1416 + 8000
  2. Probe /initialized for {"initialized": true/false} JSON
  3. Cross-check /openapi.json for haystack/document_store

Not run in this survey. Flagged as the Haystack-specific follow-up.


Cross-platform queries: universal 0

| Query | Hits |
|---|---|
| "openapi" "RAG" "document_store" | 0 |
| "X-Powered-By: LlamaIndex" | 0 |
| "X-Powered-By: Haystack" | 0 |

None of these queries matched, so no Shodan-indexed RAG-framework host was found setting a brand-tagging response header. Cross-platform “is this thing a RAG server” detection from raw banner alone is not feasible. Every detection is platform-by-platform.


Insight codifications

Insight #21 re-confirmed on 2 simultaneous platforms (Haystack + LlamaIndex)

The 2026-05-14 AutoGen Studio survey produced Insight #21: port-first beats brand-dork for low-footprint platforms. This survey produces two new confirming cases in one session:

  • Haystack: 6 brand queries → 0 hits each. Confirmed Shodan-dark.
  • LlamaIndex Chat: 6 brand queries → 1-2 total hits across all of them. Confirmed Shodan-dark.

Both have well-known default ports (1416 / 8000) and well-known FastAPI surface, but the brand string sits in HTML titles that Shodan doesn’t reliably crawl through. Port-first masscan is the only path to population data.

Insight #16 re-applied: auth state from JSON body, never HTTP status

Both AnythingLLM and RAGFlow returned HTTP 200 even when auth was required, with the actual auth-state signal in the response body:

  • AnythingLLM /api/system/check-token returns 200 with body {"error":"No auth token found."} when auth is on but no token presented. Probing on HTTP status alone marks this as “unauth”. Wrong.
  • RAGFlow /v1/llm/list, /v1/conversation/list, /v1/user/info all return HTTP 200 with body {"code":401,"data":null,"message":"<Unauthorized '401: Unauthorized'>"} when auth required. The HTTP-status signal is uniformly wrong; the body’s code field is the truth.

Two platforms in one survey, both with this exact pattern. The first iteration of this survey published numbers off the HTTP-status signal and was corrected during sample-verification (8/8 AS63949 AnythingLLM sample showed RequiresAuth: true despite probe marking them unauth). The corrected probe parses /api/setup-complete results.RequiresAuth for AnythingLLM and /v1/llm/list code field for RAGFlow.

New Insight candidate #23: fingerprint marker drift across platform versions

AnythingLLM /api/ping returns:

  • older versions: plain text pong
  • newer versions: JSON {"online":true}

The existing rag-framework-probe.py checked for pong only and missed the entire current-release population (0 of 1,505 candidates confirmed against the canonical pong marker). A platform fingerprint is not static. Markers drift across versions. Probes must check for every documented historical marker, and a fingerprint should be flagged for re-validation when a survey’s confirmation rate is suspiciously low.

This pairs naturally with Insight #6 (conjunctive markers). The conjunction is the catch, but each conjunct’s exact string is version-dependent and needs maintenance.
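A drift-tolerant version of the /api/ping check, plus a low-confirmation-rate alarm, might look like the sketch below. Both function names are hypothetical, and the 5% floor is an arbitrary illustrative threshold, not a value from the survey.

```python
import json


def is_anythingllm_ping(body: str) -> bool:
    """Accept every documented /api/ping marker: old plain-text and new JSON."""
    text = body.strip()
    if text == "pong":  # older releases
        return True
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return False
    # newer releases: {"online": true}
    return isinstance(payload, dict) and payload.get("online") is True


def needs_revalidation(confirmed: int, candidates: int, floor: float = 0.05) -> bool:
    """Flag a fingerprint when the confirmation rate is suspiciously low.

    `floor` is a hypothetical threshold; tune it per platform.
    """
    if candidates <= 0:
        return False
    return confirmed / candidates < floor
```

With these inputs, the round-1 result of 0 confirmations out of 1,505 candidates would trip the alarm immediately instead of being read as an empty population.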


Auth-on-default thesis: single-survey triple confirmation

This is the rare survey where three different platform tiers were measured at population scale in one pass:

Tier-A   (no auth concept)        — LightRAG     → 55/55  unauth (100%)  ✓ confirmed
Tier-A*  (auth-optional, default-open) — AnythingLLM → 483/1242 unauth (39%) ✓ confirmed
Tier-C   (auth-on-default)        — RAGFlow      → 0/485   unauth (0%)   ✓ confirmed (contrapositive)

The thesis predicts each. The data delivers each. The pattern is not platform-specific. It tracks the shipping default, not the operator’s skill.

This is also the first survey to put a number on Tier-A* “auth-optional with signup-open” at population scale. The prediction was somewhere between 0% and 100% depending on how aggressive the platform’s first-run wizard is. AnythingLLM lands at 39%, somewhat below the midpoint and consistent with a tutorial-default where the wizard nudges but doesn’t enforce.


Operator-side risk surface: within the 483 AnythingLLM unauth

For the 483 confirmed unauth AnythingLLM instances, the operator-side risk surface decomposes into three classes:

  1. Corpus exposure (302 hosts): Operator has ingested documents. Anyone visiting the web UI can register, become admin, and read the entire vector store contents via the chat UI. Document classes will vary widely: personal notes, internal-business docs, course materials, legal/medical/financial corpora.
  2. LLM-quota-drain / LLMjacking (80+ hosts): Operator wired a paid-tier LLM key (OpenAI, Gemini, OpenRouter, Azure, Mistral, Cohere, LiteLLM, etc.) into the workspace. Anonymous users can drive completions against that key. Direct billing impact.
  3. Local-LLM compute theft (389 hosts on native+ollama): Less monetary cost to the operator but free inference for whoever finds the host.

Restraint ethic: no extraction was performed. The 302/80/389 counts are derived from the /api/setup-complete metadata only (HasExistingEmbeddings, EmbeddingEngine), not from reading any corpus content. The metadata IS the finding (Insight #2, collection/experiment/project names ARE the finding).


Toolchain provenance

2026-05-15 15:42Z  Shodan API key rotation (basic plan, 9072 query credits)
2026-05-15 15:43Z  shodan count × 38 queries                            → hit-volume sizing
2026-05-15 15:46Z  jaxen hunt --clean × 10 productive queries           → first 50/query into empire.db
2026-05-15 15:48Z  shodan download --limit N × full populations         → 14 .json.gz files, 3,773 candidates deduped
2026-05-15 15:53Z  rag-framework-probe.py (HTTP-only) on full corpus    → 538 confirmed, AnythingLLM 0
2026-05-15 15:55Z  probe-https.py (HTTPS aware) iter-1                  → 545 confirmed, AnythingLLM still 0
2026-05-15 15:56Z  manual debug on AnythingLLM corpus                   → /api/ping returns {"online":true}, not "pong"
2026-05-15 15:57Z  probe-https.py iter-2 (corrected marker)             → 1,787 confirmed total
2026-05-15 16:00Z  sample-verify 8 AS63949 AnythingLLM                  → all RequiresAuth=true; HTTP-status auth signal is bogus
2026-05-15 16:02Z  sample-verify 5 unauth RAGFlow                       → /v1/dataset/list returns 200 + "404 not found" body
2026-05-15 16:03Z  reprobe-anyllm-strict.py (RequiresAuth from body)    → 483 UNAUTH-ANONYMOUS / 1,242 confirmed
2026-05-15 16:04Z  reprobe-ragflow.py (/v1/llm/list code field)         → 0 UNAUTH / 485 confirmed
2026-05-15 16:05Z  ASN/country/org enrichment from shodan metadata      → operator clusters identified
2026-05-15 16:06Z  visorlog ingest population-findings.ndjson            → rows 1038-1041 in nuclide.db
2026-05-15 16:08Z  case-study writeup                                    → this document

See also

  • llamaindex-chat-23-239-19-219-2026-05-15.md: single-host arsenal-fanout case study; the LlamaIndex confirmed instance in this survey
  • rag-framework-cloud-survey-2026-05.md: the prior cross-cloud survey (2026-05-04); this 2026-05-15 survey supersedes it on population data and corrects its 119-host PrivateGPT bucket as FastAPI-FP class
  • autogen-studio-survey-2026-05-14.md: the survey that produced Insight #21; re-confirmed here for Haystack + LlamaIndex
  • SYNTHESIS-2026-05.md: auth-on-default thesis evidence base; this survey adds three simultaneous tier confirmations
  • New: Insight candidate #23, fingerprint marker drift across platform versions (AnythingLLM pong → {"online":true})