research corpus · SOURCE DATA · 37 categories · 9 layers · 84 methodology insights

The research corpus.

§ 01 Reference topology

09 layer

Chat UIs user

● Open WebUI ● AnythingLLM ● LobeChat LibreChat custom front-ends

3,400+ unauthenticated chat front-ends

08 layer

Agent / RAG APIs orchestration

● LiteLLM ● LangServe ● LangFlow ● Flowise ● custom routers

1,200+ open Agent / RAG endpoints

07 layer

Model servers inference

● Ollama ● llama.cpp ● vLLM ● TGI Triton ● LocalAI

16,473 unauthenticated Ollama · 1,200+ vLLM

06 layer

Vector DBs retrieval

● Qdrant ● Milvus ● Weaviate ● Chroma Pinecone (hosted)

2,100+ open vector indices

05 layer

Search / docs retrieval

● Elasticsearch ● ClickHouse ● Solr ● Meilisearch Typesense

5,037 ES with dense_vector schema

04 layer

Browser automation agents

● Browserless ● Selenium Grid ● Playwright ● CDP proxies ● ComfyUI

548 unauthenticated ComfyUI · 6 live CDP sessions

03 layer

Data layer storage

Postgres MongoDB MinIO / S3 Redis ● etcd ● Vault

3,014 etcd · 912 Vault · 4,105 Consul

02 layer

Orchestration compute

Kubernetes ● Docker Compose Nomad systemd

Docker defaults are the proximate cause across most layers above

01 layer

GPU compute hardware

H100 H200 L40S A100 RTX 5090 consumer cards

10× L40S in one fleet observed

the public IPv4 internet

§ 02 Ports

port	service	note
80 / 443	Generic HTTP(S), Dify, Flowise, reverse-proxied everything	Filter by `http.title:` / HTML fingerprint
1337	Jan, Devika	Hacker-cute defaults
1984	LangSmith
2375	Docker daemon (unauth)	RCE → host foothold
2379	etcd (Milvus metadata), Kubernetes control plane
3000	Flowise, Open WebUI, AnythingLLM, AgentGPT, SuperAGI, Langfuse, Promptfoo, OpenDevin, Grafana	Most crowded port in AI
3001	AnythingLLM
4000	LiteLLM Proxy	Provider keys live here
4040	Apache Spark UI	Often co-deployed with ML pipelines
4317	OpenTelemetry gRPC (OTLP)	LLM observability transport
4318	OpenTelemetry HTTP (OTLP)	LLM observability transport
4567	Rivet
5000	MLflow	Models, artifacts, experiments
5001	KoboldCpp
5050	pgAdmin	Often default creds
5432	PostgreSQL + pgvector, Supabase, Neon, Timescale
5500	ChromaDB (alt)
5601	Kibana, OpenSearch Dashboards	Vector index admin
5678	n8n	AI workflow automation
6006	Phoenix/Arize, TensorBoard	Traces + training viz
6333	Qdrant (HTTP)	Snapshots downloadable
6334	Qdrant (gRPC)
6379	Redis / Redis Stack (vector search)	Often no auth
6443	Kubernetes API server	ML workload orchestration
6900	Argilla	RLHF/annotation data
7474	Neo4j Browser	Graph memory stores
7501	Lightning AI
7687	Neo4j Bolt, Memgraph
7860	Gradio, LangFlow, unsloth, text-generation-webui	HuggingFace Spaces default
7997	Infinity (embeddings)
8000	LangChain, vLLM, Triton, FastAPI generic, ChromaDB, AutoGPT, BentoML, Ray Serve, MetaGPT, Mem0, many `/v1/*` OpenAI-compat	Single most common LLM port
8001	RedisInsight
8008	ClearML
8080	LocalAI, llama.cpp, Vespa, BabyAGI, Axolotl, Determined AI, Kubeflow, Airflow, Helicone, Dgraph, NVIDIA, Vast.ai, HF TEI/TGI, Phidata	Generic "alt-HTTP"
8081	mongo-express
8088	Hadoop YARN ResourceManager	Training data pipelines
8089	Splunk HEC	Sometimes LLM log sink
8108	Typesense	API key enumeration risk
8123	LangGraph Studio, ClickHouse
8161	ActiveMQ Web Console	ML pipeline message broker
8265	Ray Dashboard	Cluster job submission, RCE
8443	SageMaker Notebook, alt-HTTPS
8501	Streamlit
8529	ArangoDB
8787	Cloudflare AI Gateway, Portkey, RStudio Server
8882	Marqo
8888	Jupyter, RunPod	RCE if no token
9000	MinIO (Milvus backing), Portainer	Vector blobs in buckets
9090	Prometheus	Every ML stack exports metrics here
9091	Milvus metrics, Zilliz
9092	Apache Kafka	LLM event streams, training pipelines
9200	Elasticsearch / OpenSearch	`dense_vector` / kNN
9400	NVIDIA DCGM	GPU telemetry
9870	Hadoop NameNode (HDFS)	Training data at rest
9998	Apache Tika	Document ingestion
10250	Kubelet	K8s node attack surface
11434	Ollama	Most-exposed LLM runtime in 2025-26
19530	Milvus (gRPC)
27017	MongoDB	Increasingly used as vector store
50070	Hadoop NameNode (legacy)
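The table reads naturally as a lookup from port number to the services worth fingerprinting there. The following is a minimal Python sketch using a handful of rows copied from the table; the names `PORT_CANDIDATES`, `candidates`, and `is_crowded` are illustrative and do not come from any tool in the corpus. A port match only nominates candidates for a later fingerprint step.

```python
# Illustrative subset of the ports table above; the names are hypothetical.
PORT_CANDIDATES: dict[int, list[str]] = {
    2375: ["Docker daemon (unauth)"],
    3000: ["Flowise", "Open WebUI", "AnythingLLM", "Grafana"],
    4000: ["LiteLLM Proxy"],
    6333: ["Qdrant (HTTP)"],
    8000: ["vLLM", "Triton", "ChromaDB", "FastAPI generic"],
    8888: ["Jupyter", "RunPod"],
    11434: ["Ollama"],
}


def candidates(port: int) -> list[str]:
    """Return the services worth fingerprinting on this port (may be empty)."""
    return list(PORT_CANDIDATES.get(port, []))


def is_crowded(port: int, threshold: int = 3) -> bool:
    """True when several unrelated platforms share the port by default."""
    return len(candidates(port)) >= threshold
```

Crowded ports such as 3000 and 8000 are the ones where a port match says least about which platform is actually listening.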

§ 03 Catalogue

Every workflow is a JSON document containing the credentials of every service it touches. The default n8n install exposes the editor at / with no auth on first boot; the operator is supposed to enable basic auth themselves. Many don't. Flowise has the same shape: visit the IP, see the canvas, see the API keys baked into the OpenAI node, see which CRM is wired to which Gmail account. A single exposed instance can leak the API keys for the operator's entire SaaS stack, plus a list of every workflow they run.

How we test

We fingerprint the editor by its asset bundle, then read the workflow list through the public REST API (no auth in the default config). Each workflow's JSON exposes credentials by reference. We resolve the reference through the credentials endpoint and confirm the secret is present without ever exfiltrating it. We catalogue the workflow names because they tell the operator's story better than any banner: "Daily-report-to-CEO", "Sync-Stripe-to-Notion", etc.
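Reading credentials "by reference" can be sketched as a pure parsing step over one workflow document. The snippet below assumes a workflow export in which each node may carry a `credentials` object mapping a credential type to an `{id, name}` reference; that shape, the sample workflow, and the `credential_refs` helper are assumptions for illustration. It lists references only and never resolves a secret.

```python
import json

# Hypothetical workflow document; node and credential names are made up.
SAMPLE_WORKFLOW = json.dumps({
    "name": "Daily-report-to-CEO",
    "nodes": [
        {"name": "Cron", "type": "example.cron"},
        {"name": "OpenAI", "type": "example.openAi",
         "credentials": {"openAiApi": {"id": "12", "name": "OpenAI account"}}},
        {"name": "Gmail", "type": "example.gmail",
         "credentials": {"gmailOAuth2": {"id": "7", "name": "Ops mailbox"}}},
    ],
})


def credential_refs(workflow_json: str) -> list[tuple[str, str, str]]:
    """List (node name, credential type, credential name) references."""
    workflow = json.loads(workflow_json)
    refs = []
    for node in workflow.get("nodes", []):
        for cred_type, ref in (node.get("credentials") or {}).items():
            # Tolerate a bare string reference as well as an {id, name} object.
            name = ref.get("name", "") if isinstance(ref, dict) else str(ref)
            refs.append((node.get("name", ""), cred_type, name))
    return refs
```

The output is a list of which node uses which named credential, which is enough to show what the workflow touches without holding any secret value.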

A Jupyter server with no token (or a token shared in a public Slack, or a token from a screencast, or a token in a Docker Compose file pushed to GitHub) is a remote Python shell with the operator's full filesystem, GPU, and cloud credentials available via the imported boto3/google-cloud SDKs. The exposure isn't the notebook. It's the kernel behind it. Anyone reaching the port can spawn a new kernel and run arbitrary code under the operator's identity.

How we test

We probe for the token-prompt page, then the API at /api/sessions to enumerate live kernels (this works without auth in surprisingly many configs, and the response is a perfect operator-attribution payload: kernel paths contain user homedirs, repo names, and dataset filenames). We never spawn a new kernel on the target. The session list alone is sufficient to attribute, draft the disclosure, and demonstrate impact in evidence form.
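The attribution step over a session list can be sketched as a pure function. This assumes the response body is a JSON list of session objects, each with a `path` and a nested `kernel` object, as the Jupyter Server sessions API returns; the sample payload and the `summarise_sessions` helper are illustrative. Nothing here spawns or touches a kernel.

```python
import json
from pathlib import PurePosixPath

# Illustrative session list; the user, repo, and notebook names are made up.
SAMPLE_SESSIONS = json.dumps([
    {"id": "a1", "path": "alice/churn-model/train.ipynb", "type": "notebook",
     "kernel": {"id": "k1", "name": "python3"}},
    {"id": "b2", "path": "alice/churn-model/eval.ipynb", "type": "notebook",
     "kernel": {"id": "k2", "name": "python3"}},
])


def summarise_sessions(body: str) -> dict:
    """Reduce a session list to the metadata useful for attribution."""
    sessions = json.loads(body)
    if not isinstance(sessions, list):
        raise ValueError("expected a JSON list of sessions")
    paths = sorted({s["path"] for s in sessions
                    if isinstance(s, dict) and s.get("path")})
    # The first path component is often a username or repository name.
    top_dirs = sorted({PurePosixPath(p).parts[0] for p in paths})
    kernels = sum(1 for s in sessions if isinstance(s, dict) and s.get("kernel"))
    return {"live_kernels": kernels, "paths": paths, "top_level_dirs": top_dirs}
```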

Reranker servers ship the same way embedding servers do: OpenAI-compatible HTTP, no auth, on the assumption that only the upstream RAG pipeline calls them. When exposed they leak two things: (1) the model identifier, which indicates how seriously the operator is doing RAG, and (2) the queries the operator is processing, since some servers log recent inputs to a status endpoint for debugging. The query log is the more damaging signal because queries often contain the original user prompt verbatim.

How we test

We probe /v1/rerank for the version banner and /v1/models for the model inventory. We do not submit reranking workloads. Where a debug or status endpoint exposes recent traffic we capture only the count and timing, not the query content. The model identifier and traffic profile together characterise the operator's RAG seriousness without our ever reading queries.

vLLM exposes an OpenAI-compatible API on port 8000 by default. There is an --api-key flag. Most operators don't set it. An exposed vLLM instance is a free GPU compute pool serving whichever model the operator loaded (often a 70B parameter fine-tune that costs $5k/month to host on commercial infra), with token throughput high enough to be useful for an attacker running their own quota-heavy workloads. The /v1/models endpoint reveals the model name and architecture, which is often enough to identify the operator's research lab.

How we test

We probe /v1/models for the model inventory and /metrics for the Prometheus exposition (vLLM publishes detailed per-model token statistics here, including average request size, which is useful for inferring deployment age and traffic). For research instances we map the model name back to the publishing institution via Hugging Face. Disclosure goes to the lab's security contact directly, not the cloud abuse desk.
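Both responses can be reduced to a small summary offline. The sketch below assumes the OpenAI-compatible list shape for /v1/models (a `data` array of objects with an `id`) and the Prometheus text exposition format for /metrics. The `vllm:` metric prefix, the sample metric names, and both helper functions are assumptions for illustration rather than a statement of what any given vLLM version exports.

```python
import json

SAMPLE_MODELS = '{"object": "list", "data": [{"id": "example-org/example-70b", "object": "model"}]}'

SAMPLE_METRICS = """\
# HELP vllm:prompt_tokens_total Number of prefill tokens processed.
# TYPE vllm:prompt_tokens_total counter
vllm:prompt_tokens_total{model_name="example-org/example-70b"} 1200.0
vllm:generation_tokens_total{model_name="example-org/example-70b"} 300.0
process_cpu_seconds_total 12.5
"""


def model_ids(models_body: str) -> list[str]:
    """Extract model identifiers from an OpenAI-compatible model list."""
    payload = json.loads(models_body)
    return [m["id"] for m in payload.get("data", []) if isinstance(m, dict) and "id" in m]


def metric_totals(metrics_text: str, prefix: str = "vllm:") -> dict[str, float]:
    """Sum Prometheus samples by metric name, keeping names under `prefix`."""
    totals: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line and "}" in line:
            name = line[: line.index("{")]
            fields = line[line.rindex("}") + 1:].split()
        else:
            name, *fields = line.split()
        if not name.startswith(prefix) or not fields:
            continue
        try:
            value = float(fields[0])  # an optional timestamp may follow
        except ValueError:
            continue
        totals[name] = totals.get(name, 0.0) + value
    return totals
```

Summing across label sets gives per-instance totals; keeping the labels instead would give the per-model breakdown.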

The framework defaults vary across the layer:

- Docker daemon on TCP 2375 ships without auth in the official documentation's "remote API" examples; operators copy-paste the config and forget the TLS step. Population-scale unauth rate: high.
- etcd v2 API (/v2/keys) ships without auth in older deployments; v3 default is gRPC-auth-on but operators frequently turn it off.
- Vault is auth-on-default at the framework layer; the only unauth surface is the /v1/sys/init bootstrap endpoint, which is intentionally open until the first init call. Uninitialized Vaults are a one-shot full-takeover surface.
- Consul ships with ACLs disabled by default in framework config (Tier-A); 100% of reachable Consul instances at population scale have ACL off.
- Argo CD is auth-on-default (Tier-C). 99.93% of the population is properly gated; ~0.07% set the anonymous-read template-config and leak app inventories.

How we test

Each substrate platform has its own identity-and-state probe. Docker: GET /version. etcd: GET /version + GET /v2/keys?recursive=false (top-level keys only). Vault: GET /v1/sys/seal-status + GET /v1/sys/init (sealed / unsealed / uninitialized). Consul: GET /v1/agent/self + GET /v1/catalog/services. We never read secret values and never send PUT, DELETE, or POST requests to /v1/sys/init. The presence of the substrate at the public boundary is the finding; the operator's k8s topology, secret-engine mounts, and service catalog leak as metadata even when the data layer is gated.
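The three-way Vault classification (sealed / unsealed / uninitialized) can be sketched from the two GET responses alone. This assumes the /v1/sys/init body carries an `initialized` boolean and the /v1/sys/seal-status body carries a `sealed` boolean; the `vault_state` helper is illustrative and only inspects bodies that were already fetched.

```python
import json


def vault_state(init_body: str, seal_body: str) -> str:
    """Classify a Vault as 'uninitialized', 'sealed', or 'unsealed'."""
    initialized = bool(json.loads(init_body).get("initialized"))
    if not initialized:
        # The bootstrap endpoint is still open: the takeover-risk case.
        return "uninitialized"
    # Treat a missing field as sealed rather than overstating exposure.
    sealed = bool(json.loads(seal_body).get("sealed", True))
    return "sealed" if sealed else "unsealed"
```

For example, `vault_state('{"initialized": false}', '{"sealed": true}')` returns `"uninitialized"`, the state that matters most for disclosure priority.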

Data Labeling

Label Studio, Argilla, CVAT, Doccano, Prodigy

ports 6900 · 8080 surveyed

348 confirmed · ~99% auth-on at /v1/projects (auth-off thesis breaks here)

What it is

Models learn from labels. A data-labeling platform is the editing environment where humans annotate the raw data: boxes around objects, classifications on text, transcriptions of audio, span-level reasoning traces for RLHF. Label Studio (HumanSignal) is the universal multi-modal one; Argilla (Hugging Face) is the LLM-centric one; CVAT (Intel/Roboflow) owns computer-vision; Doccano is the lightweight NLP option; Prodigy (Explosion) is the paid serious one. The dataset that comes out of these tools is what the next model gets trained on. The labelling stack is upstream of model behaviour itself.

What goes wrong

The platform exposes two things. First, the raw data being labelled: often unredacted medical images, customer support transcripts, legal documents. Second, the labels themselves, which encode the operator's labelling rubric and frequently the model bias they are trying to amplify or correct. Default deployments have weak credentials (admin/admin is alarmingly common in the Label Studio Docker Compose examples) or token-based auth that operators share in Slack and forget to rotate.

How we test

We confirm the platform via its /version endpoint, then list projects via the unauthenticated API surface (Label Studio's /api/projects works without auth on the default install). Project names plus task counts tell the story: a project called "medical-imaging-batch-7" with 12,000 tasks is a healthcare operator; a project called "red-team-prompts" with a few hundred tasks is an AI lab's safety team. We never download tasks. The metadata characterises both the data class and the operator function.

Document Parsers

Unstructured, LlamaParse, marker, MinerU, Docling

ports 9998

Apache Tika · ingestion-time exposure

What it is

Before a document gets embedded, it has to be turned into clean text. PDFs have layout. Word documents have tables. Slide decks have hierarchy. The document-parsing layer extracts all of that into the markdown-or-JSON the embedder expects. Unstructured.io is the multi-format incumbent. LlamaParse (LlamaIndex) is the cloud-API competitor optimised for RAG. marker is the open PDF-to-markdown specialist; MinerU (OpenDataLab) is the high-quality alternative. Docling (IBM) is the newer research-grade option. Every serious RAG pipeline runs documents through one of these before they ever reach the vector DB.

What goes wrong

The parser server processes a stream of operator-uploaded documents and caches them on disk. When the parser is exposed without auth, the document queue and the parsed-output cache are both reachable. That cache often contains documents the operator has marked private. Internal contracts, HR files, partner agreements, all sitting in plaintext markdown form on the parser's disk. The parser is also a high-CPU service: an attacker can submit large or malicious PDFs to either burn the operator's compute or trigger known parsing-library RCEs.

How we test

We probe the parser's REST endpoint for the version banner and check the status endpoint for queue depth and recent-job filenames. The filenames characterise the operator's document inventory and frequently identify the legal or business unit the parser is serving. We do not submit documents.

We probe /api/prompts, /v1/prompts, or the platform-specific equivalent for the prompt inventory and read prompt names plus version counts. We do not read prompt bodies. The names alone ("customer-support-system-v3", "jailbreak-defense", "tone-formal") characterise the operator's product strategy without our needing to see the actual text.

§ 04 Methodology insights

84 methodology insights are codified in the corpus; each lives at /methodology/insight-NN-*

01

Protocol-strict surveys self-filter honeypots

The protocol-shape gate is a stronger honeypot filter than IP-based blocklists.

read insight 01 →
02

Single-template auth-off failures propagate at population scale

Pattern detection on response uniformity is a powerful "single root-cause / many victims" classifier.

read insight 02 →
03

Capabilities-object tool-schema leak

Auth-gated invocation surfaces still leak structural information at the unauthenticated handshake layer.

read insight 03 →
04

WHOIS-driven contact resolution is non-negotiable

ARIN/RIPE/APNIC OrgName + OrgAbuseEmail from IP-WHOIS is the authoritative input for any disclosure recipient derivation. Filename-friendly identifiers are not institution-domain mappings.

read insight 04 →
05

Same-day-remediation feedback loop

Structured disclosures with embedded one-line fixes have an order-of-magnitude faster remediation rate than vague advisories.

read insight 05 →
06

Single-word substring matching is unsound at population scale

A platform fingerprint must require, at minimum: (a) a specific endpoint that the platform alone serves, (b) structured response (JSON parse + named field, or specific HTML title format), (c) anchored keyword match conjoined with (a) and (b).

read insight 06 →
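The three-part rule in Insight 06 can be sketched as a conjunctive matcher. In the Python below, "exampleserve", its `/api/version` path, and its `service` field are a hypothetical platform invented for illustration, and the `Fingerprint` and `matches` names are not taken from the corpus tooling.

```python
import json
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Fingerprint:
    path: str            # (a) an endpoint the platform alone serves
    field: str           # (b) a named field expected in the JSON body
    pattern: re.Pattern  # (c) keyword that must match the whole field value


# Hypothetical platform used only to exercise the matcher.
EXAMPLE = Fingerprint("/api/version", "service", re.compile(r"exampleserve"))


def matches(fp: Fingerprint, path: str, status: int, body: str) -> bool:
    """True only when all three conditions hold together."""
    if path != fp.path or status != 200:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        return False  # not structured JSON: a bare substring hit is rejected
    value = doc.get(fp.field) if isinstance(doc, dict) else None
    return isinstance(value, str) and fp.pattern.fullmatch(value) is not None
```

An HTML page that merely contains the word "exampleserve" fails at the JSON parse, which is the substring false-positive class the insight describes.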
07

Shodan-facet bucketing inherits the substring-FP class

Shodan's http.html: and product: matches are themselves substring-style filters at the indexer level. Apply Insight #6's conjunctive-matcher rule at the seed layer, not just the probe layer.

read insight 07 →
08

Auth-bypass-via-misconfiguration is missed by entry-point-only fingerprints

For application-tier surveys (RAG framework, LLM orchestration, BI dashboards, anything with a documented public-role config), entry-point fingerprints are insufficient. The probe must follow redirects and check for authenticated-state-only tokens on the post-redirect target.

read insight 08 →
09

Cross-survey-correlation is a Shodan-free discovery vector with stacked-finding bias

The existing nuclide.db ledger of confirmed exposures is itself a discovery substrate. Every IP NuClide has previously confirmed running an unauth Tier-A platform is a candidate for additional unauth platforms on adjacent ports. Cross-survey-correlation probes must always sweep alt-ports, not…

read insight 09 →
10

Research/lab-instrument vendors ship web stacks with auth-disabled defaults

Population-scale exposure is the default-config decision of the vendor, not a misconfiguration by the operator. Vendor-template means population-scale exposure.

read insight 10 →
11

Source code is authoritative; bug reports are framing

When a bug report claims that a vendor wrote X to a config, verify against the vendor's source repository and current release tag before accepting the framing. Config mutators that preserve keys they don't manage are a misattribution attack surface; the right verification path is grep on the…

read insight 11 →
12

Hostname-routed SSO doesn't protect the IP-direct shadow

When an operator deploys SSO at the application layer (authentik, OAuth proxy, Keycloak, oauth2-proxy, Traefik forward-auth, etc.) and binds it via the reverse proxy's hostname routing, every service that listens on the underlying host's IP, at any port, answers requests by IP and bypasses the…

read insight 12 →
13

Shipping defaults are load-bearing for population-scale security posture

When two products in the same category have similar customer overlap but ship with opposite security defaults, the population-scale security outcomes follow the defaults. Not the operators. A single env-var default (AUTH_ENABLE=False vs no toggle at all) can produce population-scale…

read insight 13 →
14

Recon yield aligns with port-class operator intent, not port number

When sweeping IP-direct-shadow ports for hidden surfaces on hosts already fronted by an SSO reverse proxy (see Insight #12), the productive selector is what class of service the operator was deploying, not the port's formal IANA assignment, popularity rank, or even whether the port number is…

read insight 14 →
15

Shodan dork hits are not platform instances (the 50% rule)

The number of hits returned by a Shodan dork is not the number of platform instances. Across the AI/LLM infrastructure surveys in 2026-04 and 2026-05, the population of hits that match a single-token title-based dork contains roughly half false positives, services that are not the target…

read insight 15 →
16

A 200 from a platform endpoint is identity, not auth state

When a platform endpoint returns HTTP 200 to an unauthenticated probe, that response confirms platform identity (the platform is alive at the URL, accepts requests, and chose to answer), but it does NOT classify the auth posture. The fingerprint must observe the actual data layer behind the…

read insight 16 →
17

Platform-class operators are mono-platform at population scale

When two platforms solve the same problem (e.g. LLM observability, vector storage, prompt management), operators install one of them per host. Across 789 hosts spanning four AI-observability platforms (Phoenix + Langfuse + Helicone + LangSmith), there are zero genuine IP-level overlaps. The…

read insight 17 →
18

Storage-tier hygiene exceeds tracker-tier hygiene at population scale

Across 49 cloud-provider buckets extracted from the artifact URIs of 120 critically-exposed unauthenticated MLflow trackers, 48 buckets (97.96%) are locked at the storage tier. One container has an anonymous-list ACL, and it was empty at probe time.

read insight 18 →
19

SPA + headless API is a high-severity exposure tell

When a single-page application is hosted on a CDN platform (Vercel, Cloudflare Pages, Netlify, GitHub Pages, etc.) and its bundled JavaScript calls a same-brand API host of the form https://api.<brand>.<tld>/..., the API host is almost always on infrastructure the operator manages directly, and…

read insight 19 →
20

aimap's AI-service classifier needs the ML data tier, not just the inference tier

aimap classifies a target by what AI/ML services it can fingerprint on that target's open ports. The catalog has been built incrementally around the inference and observability tiers: Ollama, vLLM, llama.cpp, MLflow, Phoenix, Langfuse, LangSmith, Helicone, Open WebUI, ChromaDB, Qdrant, Milvus,…

read insight 20 →
21

Port-first discovery beats brand-dork discovery for low-footprint platforms

The standard population survey is dork-then-confirm: write a Shodan dork that matches the platform's brand string, harvest the hits, confirm each one. That works when the platform's web frontend carries Shodan-indexable distinctive text.

read insight 21 →
22

Protocol-strict handshakes are the only verifier for multi-protocol honeypot fleets

Insight #1 established that protocol-strict handshakes filter honeypots: an exact JSON-RPC initialize envelope dropped AS63949 Linode honeypot pollution from 91.6% to 1.1% in the MCP survey. The medical/edge AI survey extends this, and surfaces the second-order pattern: modern honeypot fleets…

read insight 22 →
23

Discovery-channel coverage is multiplicative

A population survey can be sourced two ways: masscan-on-cloud-prefixes (scope a set of cloud /16 ranges, scan a port across all of them) or Shodan-walk (page through the Shodan-indexed result set for a brand dork or service-product facet). Each method has a coverage profile, and those profiles…

read insight 23 →
24

Operator workload visibility via Ollama /api/show Modelfile SYSTEM prompts

When Ollama is unauthenticated, the /api/tags endpoint discloses what models the operator installed. That is the canonical finding.

read insight 24 →
25

Tier-C platforms produce ~0% unauth at population scale

The auth-on-default thesis is falsifiable: a Tier-C platform (auth-on-default in framework) that landed at 5–25% unauth at population scale would break it. None have. The cumulative evidence base across the 2026-05 survey series:

read insight 25 →
26

Shodan-facet FP rate escalates with token commonality

Codified by Insight #15 (http.title:"LiteLLM API" → 5,391 hits, 2,710 real LiteLLM = 50% FP). Sharpened by the 2026-05-15 RVC voice-cloning survey (http.title:"RVC" → ~34 hits, ~6 real = ~82% FP). Now further sharpened by the 2026-05-16 ComfyUI survey:

read insight 26 →
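The false-positive percentages quoted in Insight 26 follow directly from the hit and confirmed counts. Below is a short arithmetic check in Python, where `fp_rate` is an illustrative helper; the RVC figures are the approximate counts given in the text, so its rate is approximate too.

```python
def fp_rate(hits: int, real: int) -> float:
    """Fraction of dork hits that are not the target platform."""
    if hits <= 0 or not 0 <= real <= hits:
        raise ValueError("need hits > 0 and 0 <= real <= hits")
    return (hits - real) / hits


litellm = fp_rate(5391, 2710)  # (5391 - 2710) / 5391 = 2681 / 5391
rvc = fp_rate(34, 6)           # (34 - 6) / 34 = 28 / 34

print(f"LiteLLM title dork: {litellm:.1%} false positives")
print(f"RVC title dork:     {rvc:.1%} false positives")
```

The first rate rounds to 50% and the second to 82%, matching the figures in the insight.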
27

Docker-image-template dominance

Three independent surveys on 2026-05-16 surfaced the same shape:

read insight 27 →
28

A population state is not a daily rate (RETRACTED)

The first version of this insight claimed 71.6% of the 5,037-host population was wiped by an automated extortion campaign in a 24-hour window. That framing is wrong as a 24-hour event rate. The corrected numbers come from re-probing the same host list 24 hours later.

read insight 28 →
29

Snapshot vs delta

A single observation of a population says one thing. Two observations say another. When a campaign has been running long enough to saturate the population, the snapshot reports history. Only the delta reports today.

read insight 29 →
30

Multi-port identical responses identify honeypot fleets

A real service occupies one port. A honeypot fleet that ships the same canned response on every port it has open is identifiable by that uniformity alone, with no need to decode any specific protocol.

read insight 30 →
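The uniformity test in Insight 30 needs no protocol decoding, only a comparison of raw response bytes across ports. The following is a minimal sketch, assuming the responses for one host have already been collected into a port-to-bytes mapping; the `min_ports` threshold of 3 and the function names are assumptions, not values from the corpus.

```python
import hashlib
from collections import Counter


def uniform_ports(responses: dict[int, bytes], min_ports: int = 3) -> list[int]:
    """Ports sharing the most common non-empty response body, if enough do."""
    digests = {
        port: hashlib.sha256(body).hexdigest()
        for port, body in responses.items()
        if body  # empty replies are trivially identical, so ignore them
    }
    if not digests:
        return []
    digest, count = Counter(digests.values()).most_common(1)[0]
    if count < min_ports:
        return []
    return sorted(port for port, d in digests.items() if d == digest)


def looks_like_honeypot(responses: dict[int, bytes], min_ports: int = 3) -> bool:
    """Flag a host whose open ports all answer with the same canned body."""
    return bool(uniform_ports(responses, min_ports))
```

A host answering with the same bytes on 8000, 8080, and 11434 is flagged; a host whose ports each return a different body is not.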
31

App-builder tools brand the OUTPUT, not the AGENT — anchor on agent API contract

_Source: code-assistants survey verification, 2026-05-18. Extends Insight #6 (conjunctive marker-anchored matchers) and Insight #15 (~50% real-rate on single-token dorks)._

read insight 31 →
32

Multi-service deception fleets emulate target-specific services for Shodan scanners; filter on body markers, not title

_Source: Jetson/TensorRT-edge population survey, 2026-05-18. Two distinct deception fleets surfaced in one survey: 22 hosts emulating Triton, 576 hosts emulating Shinobi. Distinct from the AS63949 Linode honeypot fleet documented in Insight #1's source case._

read insight 32 →
33

Side-channel attribution via Docker registry catalog content when direct fingerprinting fails

When the direct fingerprint for a target class (Shodan dork on title, body, port, banner) returns mostly false positives at population scale, look for an adjacent service the operator runs whose content reveals what the direct probe could not. Docker Registry V2 is the canonical such service:…

read insight 33 →
34

Persistence without pressure — operator-unauth populations don't self-remediate

_Source: code-assistants population follow-up survey, 2026-05-18. Cross-referenced against Insight #28 (extortion-driven decay)._

read insight 34 →
35

Side-channel attribution has high precision and low recall; it is for targeted investigation, not population discovery

Insight #33 establishes that operator-class attribution via adjacent-service content (Docker Registry /v2/_catalog) works when the operator's content carries class signals. The yield is high when the population is already selected for the class, and very low when the population is not.

read insight 35 →
36

PaaS deployment automation bakes build-time env-vars into client JS bundles; secrets prefixed with NEXT_PUBLIC_ / VITE_ leak to every visitor

When an operator deploys a Next.js or Vite app via a self-hosted PaaS (Dokploy, Coolify, Caprover, Easypanel) and declares a secret like LANGFUSE_SECRET_KEY with one of:

read insight 36 →
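A pre-deploy check for the pattern in Insight 36 can be sketched as a scan over variable names. Next.js inlines `NEXT_PUBLIC_`-prefixed variables into the client bundle at build time and Vite exposes `VITE_`-prefixed ones to client code, so a secret-looking name under either prefix is suspect. The secret-token pattern and the `leaked_names` helper are illustrative assumptions; the scan looks at names only, never values.

```python
import re

PUBLIC_PREFIXES = ("NEXT_PUBLIC_", "VITE_")
# Hypothetical heuristic for secret-class names; tune for a real codebase.
SECRET_TOKENS = re.compile(r"SECRET|TOKEN|PASSWORD|PRIVATE|API_KEY", re.IGNORECASE)


def leaked_names(env_names: list[str]) -> list[str]:
    """Names that are both client-exposed by prefix and secret-class by name."""
    return sorted(
        name for name in env_names
        if name.startswith(PUBLIC_PREFIXES) and SECRET_TOKENS.search(name)
    )
```

Given `["NEXT_PUBLIC_LANGFUSE_SECRET_KEY", "LANGFUSE_SECRET_KEY", "VITE_APP_TITLE"]`, only the first name is flagged: the second is not client-exposed and the third is not secret-class.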
37

Asymmetric auth gating: the dashboard requires login but the API does not; observability platforms accept unauthenticated trace ingestion even when the UI is locked

Many AI observability + telemetry platforms ship with two distinct authentication surfaces on the same port:

read insight 37 →
38

Hard-proof verification chain for exfiltrated-credential class findings; six steps from HTML-exposed key to verified operator data

A finding involving a credential exposed in public HTML cannot be tiered without traversing the six-step verification chain. Each step verifies a discrete claim. Tier promotion happens at each step; the finding's final tier is determined by the deepest step verified.

read insight 38 →
39

Pooled-account upstream proxy as attribution-laundering layer; one paid API account fans out to N unauthorized end-customers through a middle-tier relay

A subset of LLM-resale fraud operations route through a three-tier architecture that flattens attribution from the upstream vendor's perspective:

read insight 39 →
40

Auth-on-default thesis shifts rightward in successor OSS generations

Codified: 2026-05-19 (sub2api population survey) · Family: Insight #25 (auth-on-default thesis), Insight #36 (PaaS build-arg secret baking), Insight #39 (pooled-account attribution laundering) · Falsifiability tier: medium (pattern needs at least one more successor-generation pair to confirm or break)

read insight 40 →
41

Admin-endpoint field-name enumeration is the Stage-2-deep verify primitive; secret-class field names at documented paths are the finding, no value read required

For admin-style endpoints that return a long structured JSON dump (Envoy /config_dump, Spring Actuator /env and /configprops, Kong admin /config, Consul /v1/agent/self, Vault /sys/config/state/sanitized, Traefik /api/rawdata, NATS /varz), the Stage-2-deep verify primitive is enumeration of…

read insight 41 →
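Field-name enumeration as described in Insight 41 can be sketched as a walk over the parsed JSON that records key paths and discards values. The secret-class name pattern, the sample document, and the `secret_field_paths` helper are assumptions for illustration; a real survey would use the documented field names of each platform.

```python
import json
import re

# Hypothetical pattern for secret-class field names.
SECRET_NAME = re.compile(r"password|secret|token|api[_-]?key|private[_-]?key", re.IGNORECASE)

# Made-up config dump; field names and values are illustrative only.
SAMPLE_DUMP = json.dumps({
    "Config": {"Datacenter": "dc1", "ACLMasterToken": "value-never-recorded"},
    "listeners": [{"tls": {"private_key": "value-never-recorded"}}],
})


def secret_field_paths(body: str) -> list[str]:
    """Return dotted paths of secret-class field names, never their values."""
    found: list[str] = []

    def walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, child in node.items():
                child_path = f"{path}.{key}" if path else key
                if SECRET_NAME.search(key):
                    found.append(child_path)  # the name is the finding
                walk(child, child_path)
        elif isinstance(node, list):
            for index, child in enumerate(node):
                walk(child, f"{path}[{index}]")

    walk(json.loads(body), "")
    return sorted(found)
```

The result for the sample is a list of two paths; the strings stored under them never leave the function.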
42

LLM gateway model-name mismatch: proxies advertise premium model IDs while serving different backends. /v1/model/info is the authoritative discriminator; the motive (convenience alias vs fraud) requires per-host verification.

Initial framing of "fraud" was incorrect. The operator is Jo Lab (jolab.ai, jolab.app), an academic biomedical AI research lab marketing "AI for Disease Prediction & Early Diagnosis." swatweb.org is their SWAT-web Sliding Window Association Test bioinformatics tool. No customer-facing "Claude…

read insight 42 →
43

VisorSD multi-ASN grouped-OR query construction returns zero even when Shodan direct returns hundreds; the bug is in VisorSD's query templating, not Shodan's index.

VisorSD's multi-ASN grouped-OR query construction can silently return zero where Shodan direct queries return hundreds. A zero-result VisorSD run against a known-populated ASN is a tooling failure signal, not a population signal. Always cross-validate a zero VisorSD result with a direct Shodan…

read insight 43 →
44

Parallel aimap passes cannibalize each other's throughput via client-side socket pool contention; default to sequential or staged execution with the largest corpus running alone first.

Running multiple aimap processes in parallel against large corpora degrades total throughput by roughly 3× compared to sequential execution, and can cause complete hangs (zero output after 36+ minutes). The bottleneck is client-side socket pool exhaustion: N concurrent aimap binaries each…

read insight 44 →
45

Niche Shodan dork yield follows a stable class hierarchy: Server-header > frontend-bundle-ID body > route-slug body. Route-slug dorks fail because Shodan crawls root HTML, not JS bundle source.

Shodan dork yield for AI/LLM infrastructure follows a stable three-tier class hierarchy:

read insight 45 →
46

TLS certificate subject CN is a precise operator-attribution surface; operators who embed platform brand names in cert CN are doing intentional TLS termination, making cert-CN dorks stable against CDN proxying and more precise than HTML body matching.

An operator who names a TLS certificate after the AI platform they're running (openai.mycompany.com, litellm-prod, ollama-inference) has:

read insight 46 →
47

TLS cert subject CN is an operator-attribution surface, NOT a platform-confirmation or auth-state surface. CN-identified operators are the intentionally-configured class; they are inversely correlated with auth-off-default posture.

Two populations. Inverse correlation with auth posture.

read insight 47 →
49

Ollama-Cloud-signin × public-exposure = LLMjacking surface; the operator's Ollama Cloud subscription quota is billable by any public caller

An Ollama instance meeting BOTH of these conditions exposes the signed-in operator's Ollama Cloud subscription quota to public invocation:

read insight 49 →
50

OVMS Backend Co-location: FastAPI Wrapper + OpenVINO Model Server Both Exposed

Custom FastAPI embedding services often sit in front of an Intel OpenVINO Model Server (OVMS) backend on a co-located port. When the FastAPI wrapper is exposed without auth, the OVMS backend is also exposed without auth — and on a different port than the wrapper.

read insight 50 →
51

A port number names a candidate, not a finding

Codified: 2026-05-21 (global university LLM-exposure map, service-verification pass) · Family: Insight #25 (auth-on-default thesis), Insight #16 (no status code is identity). This is the population-scale measurement of the METHODOLOGY's load-bearing claim that verification, not scanning, produces…

read insight 51 →
52

An HTTP 200 at an API path is not that API

Codified: 2026-05-21 (global university LLM-exposure map, per-host arsenal triage) · Family: Insight #16 (no status code is identity), Insight #51 (a port number names a candidate). This is the layer-7 analogue of #51: where #51 is a TCP connect mistaken for a service, #52 is an HTTP 200 mistaken…

read insight 52 →
53

A hostname label is not a cloud project identifier

Codified: 2026-05-21 (global university LLM-exposure map, per-host arsenal triage, Firebase candidate verification) · Family: Insight #51 (a port number names a candidate), Insight #52 (an HTTP 200 is not that API), Insight #16 (no status code is identity). This one is the attribution-stage…

read insight 53 →
54

Metabase setup-token: a self-authorizing credential class

Codified: 2026-05-21 (embedding-tier2-2026-05-21 session, masscan sweep of OVH/Scaleway tier-2 cloud ranges) · Family: Insight #39 (install-wizard-open / pooled-account attribution laundering), Insight #16 (no status code is identity), Insight #25 (auth-on-default thesis) · Falsifiability tier:…

read insight 54 →
55

Auth-gated API + Open Signup = Uncontrolled Account Creation

Date: 2026-05-22 · Survey anchor: Agenta LLMOps (14-host population) · Finding class: First-party authentication configuration

read insight 55 →
56

LangGraph self-identifying JSON root as primary fingerprint

Date codified: 2026-05-25 · Survey anchor: LangGraph Server population survey · File: case-studies/commercial/langgraph-server-survey-2026-05-25.md

read insight 56 →
57

Partial-auth failure: auth on collection endpoints, none on individual resource endpoints

Date codified: 2026-05-25 · Survey anchor: Survey-38 LangGraph, Stock.ai / EMOR AI (20.193.252.230) · File: case-studies/commercial/stock-ai-emor-partial-auth-2026-05-25.md

read insight 57 →
58

Vite dev server left running in production exposes full TypeScript source

Date codified: 2026-05-25 · Survey anchor: Survey-38 LangGraph: Assistent Tècnic Intel·ligent / Docu Companion (157.180.21.126) · File: case-studies/commercial/docu-companion-vite-dev-server-2026-05-25.md

read insight 58 →
59

Date: 2026-05-25 · Survey anchor: n8n discovery, 38.102.86.8

read insight 59 →
60

Redis Stack FT._LIST as Vector-Tier Enumeration Primitive

Date: 2026-05-25 · Survey anchor: Redis Stack / RedisInsight population survey (2026-05-25)

read insight 60 →
61

RedisInsight /api/databases Returns Redis Passwords in Plaintext

Date: 2026-05-26 · Survey anchor: Redis Stack / RedisInsight population survey (2026-05-25)

read insight 61 →
62

Survey anchor: Cat-09 code assistants, 2026-05-26 · Codified: 2026-05-26 · Status: Confirmed, population-verified

read insight 62 →
63

Date: 2026-05-26 · Survey anchor: Cat-04 stragglers: Prefect, Dask, ClearML, BentoML · Status: Confirmed

read insight 63 →
64

Date: 2026-05-26 · Survey anchor: Cat-06 stragglers: Agno (AIRIAD Risk Advisor, Collision Analysis AgentOS) · Status: Confirmed

read insight 64 →
65

Date: 2026-05-27 · Survey: Argo Workflows (Category 29: K8s Workflow Orchestration) · Anchoring data: 67 confirmed instances (ssl:"ArgoProj" population, 0 auth-bypass); 200 additional instances (ssl:"Argo Workflows" population, auth status pending)

read insight 65 →
66

Date: 2026-05-27 · Survey: Argo Workflows (Category 29), aimap v1.9.35 fix · Anchoring data: 156 hosts, 111 on port 443, 0 on port 2746

read insight 66 →
67

Insight #67: Voice/audio AI API servers are Shodan-dark behind JSON-only roots; only the demo UI indexes

For the entire voice/audio AI category, the highest-severity surfaces are the ones Shodan cannot see. The OpenAI-compatible TTS/ASR API servers (GPT-SoVITS, Orpheus, Kokoro's API path, Deepgram on-prem, WhisperLive) return a JSON-only root or a non-root JSON status endpoint that the Shodan…

read insight 67 →
68

Insight #68: The verification-rung grid. Label every claim by a depth-and-breadth pair, and never use language above the rung its evidence reached

Every finding carries a verification status expressed as a pair: an inner rung (depth, code vs live) and an outer rung (breadth, host vs population). The two axes are logically orthogonal, so they must not be collapsed into one ladder. The claim language is bound to the pair. State the pair in…

read insight 68 →
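
One way to picture insight 68's two-axis label is a pair of enums attached to every finding. This is a minimal sketch under stated assumptions: the type names, the reading of "code" as "read from source, not exercised", and the phrasing strings in `ALLOWED_PHRASING` are all hypothetical and are not the corpus's own vocabulary.

```python
from dataclasses import dataclass
from enum import Enum


class Depth(Enum):
    """Inner rung: how deep the evidence goes."""
    CODE = "code"  # assumed meaning: read from source or docs, not exercised
    LIVE = "live"  # assumed meaning: observed against a running instance


class Breadth(Enum):
    """Outer rung: how wide the evidence goes."""
    HOST = "host"
    POPULATION = "population"


# Hypothetical wording ceiling for each (depth, breadth) pair.
ALLOWED_PHRASING = {
    (Depth.CODE, Breadth.HOST): "the code path suggests this host could",
    (Depth.CODE, Breadth.POPULATION): "the code path suggests deployments could",
    (Depth.LIVE, Breadth.HOST): "this host was observed to",
    (Depth.LIVE, Breadth.POPULATION): "the surveyed population was observed to",
}


@dataclass(frozen=True)
class Finding:
    summary: str
    depth: Depth
    breadth: Breadth

    def label(self) -> str:
        # The pair is kept as two values and never collapsed into one ladder.
        return f"[{self.depth.value}/{self.breadth.value}]"

    def render(self) -> str:
        phrase = ALLOWED_PHRASING[(self.depth, self.breadth)]
        return f"{self.label()} {phrase} {self.summary}"


finding = Finding("return its thread list without credentials",
                  Depth.LIVE, Breadth.HOST)
print(finding.render())
```

Keeping the axes as separate fields means a live single-host observation can never be rendered with population-level wording by accident.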
69

Insight #69: A curated-port scan's negative is not a host-level negative; run a full-range population (Censys) as a standing complement

When aimap (our AI-intent-curated port scanner) reports "no AI/ML service," that is a true statement about the ports and fingerprints it checked, not a statement about the host. The two are easy to conflate, and conflating them ships a confident, wrong "clean host" conclusion.

read insight 69 →
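
The scoping rule in insight 69 can be made hard to violate by having the scan result carry the set of ports it checked. The sketch below is illustrative only: `ScanResult`, its fields, and the four-port list are hypothetical and do not reflect aimap's real output format or port selection.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ScanResult:
    host: str
    ports_checked: FrozenSet[int]
    services_found: Tuple[str, ...]

    def verdict(self, port_universe: int = 65535) -> str:
        """Phrase the result so that a negative stays scoped to what was checked."""
        if self.services_found:
            return f"{self.host}: found {', '.join(self.services_found)}"
        unchecked = port_universe - len(self.ports_checked)
        return (
            f"{self.host}: no AI/ML service on the {len(self.ports_checked)} "
            f"ports checked; {unchecked} ports not examined"
        )


# 198.51.100.7 is a documentation address; the port list is a made-up example.
result = ScanResult("198.51.100.7", frozenset({3000, 4000, 8000, 11434}), ())
print(result.verdict())
```

A report built from `verdict()` cannot say "clean host" without a full-range source supplying the remaining ports.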
70

Insight #70: Censys is a dual primitive — full-range ports give identity, protocol decoders give auth-state; never conflate the label with the decoder

A Censys cross-reference returns two separable things, and treating them as one ships a wrong number. The first is identity: the full-range port sweep shows which services a host actually runs, including the data tier and second apps a curated AI-port scan never touches. The second is…

read insight 70 →
71

The auth-on-default thesis has, until now, measured platforms that have an authentication layer and ship it on or off by default (Phoenix ENABLE_AUTH=False vs Langfuse no-toggle). Service-mesh introspection planes are a different and worse class: they have no authentication layer at all. Their…

read insight 71 →
72

There is a failure class between "auth off by default" (#13) and "no auth layer at all" (#71): a platform that ships real authentication and a real authorization layer, both on by default, and then ships a self-registration knob that defaults open. The data endpoints are correctly gated, the…

read insight 72 →
73

A fingerprinter that does not send the platform's content-negotiation header will get zero results from a platform that uses header-based API versioning, even when the platform is present, exposed, and unauthenticated at the identity endpoint. The absence is a tool artifact, not a population fact.

read insight 73 →
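
The failure mode in insight 73 can be reproduced offline with a stand-in server. In this sketch the media type `application/vnd.example.v2+json`, the marker string, and the `fake_fetch` behaviour are all hypothetical; the teaser does not name the platform or its header, so nothing here describes a real product.

```python
from typing import Callable, Dict, Tuple

# A fetch function takes (url, headers) and returns (status, body).
Fetch = Callable[[str, Dict[str, str]], Tuple[int, str]]


def fingerprint(fetch: Fetch, url: str, marker: str, accept: str) -> str:
    """Probe once without and once with the content-negotiation header."""
    status, body = fetch(url, {})
    if status == 200 and marker in body:
        return "identified"
    status, body = fetch(url, {"Accept": accept})
    if status == 200 and marker in body:
        return "identified-with-header"
    return "not-identified"


def fake_fetch(url: str, headers: Dict[str, str]) -> Tuple[int, str]:
    # Stand-in for a platform that answers only when the versioned
    # media type is requested, and rejects everything else.
    if headers.get("Accept") == "application/vnd.example.v2+json":
        return 200, '{"service": "example-platform", "version": "2"}'
    return 406, ""


outcome = fingerprint(
    fake_fetch,
    "http://198.51.100.7/",
    marker="example-platform",
    accept="application/vnd.example.v2+json",
)
print(outcome)
```

A fingerprinter that stops after the first probe would record this host as absent, which is the tool artifact the insight warns about.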
74

An exposed AI gateway is categorically different from an exposed model server. A single unauth Ollama instance leaks one operator's inference surface. A single unauth AI gateway yields every upstream LLM provider API key the operator has wired in, across every provider (OpenAI, Anthropic,…

read insight 74 →
75

Cert-pivot (VisorGraph / crt.sh) works only on HTTPS endpoints: a plaintext HTTP port has no TLS handshake to intercept and no certificate to extract. AI gateway admin APIs run HTTP-only by design:

read insight 75 →
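
As a small illustration of insight 75, a triage step can separate endpoints where a certificate pivot is possible from those that need another attribution method. The function name and both endpoints below are hypothetical; port 8001 is a placeholder, not a documented admin port.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def split_by_cert_pivot(endpoints: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition endpoints into (cert-pivot possible, plaintext only)."""
    pivotable: List[str] = []
    plaintext: List[str] = []
    for endpoint in endpoints:
        scheme = urlsplit(endpoint).scheme.lower()
        # Only an HTTPS endpoint presents a certificate to pivot on.
        if scheme == "https":
            pivotable.append(endpoint)
        else:
            plaintext.append(endpoint)
    return pivotable, plaintext


endpoints = ["https://gateway.example.org/", "http://198.51.100.7:8001/"]
pivotable, plaintext = split_by_cert_pivot(endpoints)
print(pivotable)
print(plaintext)
```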
76

Survey: Cat-31 Data Labeling (Extended), 2026-06-01.

read insight 76 →
77

Survey: MCP server population survey, 2026-06-02.

read insight 77 →
78

Survey: Single-host ad-hoc assessments, xTom Japan (AS3258), 2026-06-05/06.

read insight 78 →
79

Survey: Cat-OW (Open WebUI population survey) calibration pass, 2026-06-06.

read insight 79 →
80

Status: confirmed at n=31 known-stage subset; validation at n>=100 blocked on funding-stage data, not method.

read insight 80 →
81

Codified: 2026-06-07. Lane 1B of the 9-item plan. Source survey: data/platform-intel/mta-fingerprint-catalog-2026-06-07.md (6 MTAs probed; 4 live in docker, 2 source-characterized). Family: reference-haraka-docker-compose-leak (parent observation), Insight #78 (shared deployment kit operator…

read insight 81 →
82

Codified: 2026-06-07. Cat-33 Phase 3B Lane B survey. Promoted to HIGH: 2026-06-07 (later same day). Cat-33 Phase 5 Lane D Slice B extension. 6/6 strict confirmations across two independent surveys. Source: data/platform-intel/cat33-lane-b-vendors-2026-06-07.md (Lane B, 3 vendors) +…

read insight 82 →
83

Codified: 2026-06-07. Cat-33 Phase 3B Lane C survey. Source: data/platform-intel/cat33-lane-c-vendors-2026-06-07.md (3 vendors). Family: Insight #75 (HTTP admin ports kill cert-pivot), Insight #65 (TLS cert dork selection bias), Insight #71 (network placement as auth). Falsifiability tier:…

read insight 83 →
84

Codified: 2026-06-07. Cat-33 Phase 5 Lane D Slice D survey over LiteLLM cloud-native guardrail hooks. Source: data/platform-intel/cat33-lane-d-slice-d-cloud-deltas-2026-06-07.md (3 cloud-native + 1 OSS framework). Family: Insight #74 (gateway-as-master-key-multiplier), Insight #78…

read insight 84 →
85

Codified: 2026-06-07. Cat-33 Phase 5 Lane D Slice C survey. Source: data/platform-intel/cat33-lane-d-slice-c-specialized-2026-06-07.md (10 vendors probed, 2 confirmed stubs). Family: Insight #17 (platform-class-operators-are-mono-platform), Insight #51…

read insight 85 →

37 categories · nine-layer topology · one public IPv4 internet · compiled live from the AI-LLM-Infrastructure-OSINT corpus