Case study May 19, 2026

.edu LLM infrastructure dork-map — 1,584 verified-dork × hostname:.edu sweep (2026-05-19)

Sector: Universities
Country: edu

NuClide Research, 2026-05-19. Sub-survey of Session 24.

TL;DR

The repo’s 1,629-dork verified Shodan catalog (29 categories, hand-curated and FP-tested across 50+ prior commercial surveys) was scoped to hostname:.edu and run through shodan count (free per query, no scan credit). After dropping 45 dorks that already had a hostname: filter, 1,584 scoped queries ran in 48 minutes with a 1.2s delay between queries. 382 dorks (24%) returned ≥1 hit, 1,143 returned 0, and 59 errored (a ~3.7% rate-limit blip). The resulting dork map establishes which platform classes have material .edu exposure surface, what populations to expect at Stage 1 verify, and which dork classes are productive vs noise on the academic surface.

The session-5 university survey (2026-05-03) found 252 JupyterHub instances on .edu via a single dork. This Stage 0 catches the same population plus 36 other productive JupyterHub-related dork variants, and it extends coverage to the rest of the LLM tier (LiteLLM, n8n, Dify, Streamlit, etc.) that session 5 missed because it stopped after Ollama/Open WebUI/JupyterHub.

Methodology

Source dorks: parsed every backtick-wrapped query line from the 29 shodan/queries/*.md files in this repo. Each is a hand-verified dork from a prior commercial survey, with hit-count notes and FP-trap documentation already in place.
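
A minimal sketch of the indexing step, assuming that a "dork" is any inline backtick span in a query file and that the category is the file's stem (for example 18-jupyter). The names `extract_dorks` and `index_dorks` are hypothetical and are not taken from the repo's actual tooling.

```python
import re
from pathlib import Path

# Assumption: every inline `...` span in a query file is one dork.
BACKTICK_SPAN = re.compile(r"`([^`\n]+)`")


def extract_dorks(markdown_text: str) -> list[str]:
    """Return every inline-backtick span in the text, in document order."""
    dorks = []
    for line in markdown_text.splitlines():
        for match in BACKTICK_SPAN.finditer(line):
            dork = match.group(1).strip()
            if dork:
                dorks.append(dork)
    return dorks


def index_dorks(query_dir: Path) -> list[tuple[str, str]]:
    """Build (category, dork) rows from every *.md file in query_dir."""
    rows = []
    for md_file in sorted(query_dir.glob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        rows.extend((md_file.stem, dork) for dork in extract_dorks(text))
    return rows
```

The sketch does not deduplicate across files, which matches the sweep output, where the same Open WebUI dork appears under two categories.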

Scope filter: appended hostname:.edu to each dork. Dorks already containing hostname: (45 of 1,629) were dropped to avoid filter conflicts.
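
The scope filter can be sketched as follows. The function name `scope_dorks` is hypothetical; the rule itself (append hostname:.edu, drop any dork that already carries a hostname: filter) is the one described above.

```python
SCOPE = "hostname:.edu"


def scope_dorks(dorks: list[str], scope: str = SCOPE) -> tuple[list[str], list[str]]:
    """Split dorks into (scoped, dropped).

    A dork that already contains a hostname: filter is dropped rather than
    scoped, so the appended filter never conflicts with an existing one.
    """
    scoped, dropped = [], []
    for dork in dorks:
        if "hostname:" in dork:
            dropped.append(dork)
        else:
            scoped.append(f"{dork} {scope}")
    return scoped, dropped
```

Applied to the full catalog, this split should reproduce the 1,584 scoped and 45 dropped dorks reported here.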

Execution: shodan count per scoped dork, 1.2s sleep between queries, 50-min hard deadline. Output captured as count<TAB>category<TAB>scoped_dork.
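
A hedged sketch of the sweep loop. The original run used the shodan count CLI; here the counting call is injected as `count_fn` (a hypothetical parameter) so the pacing, deadline, and ERR handling can be shown without network access. The `sleep` and `clock` parameters are likewise illustrative conveniences for testing.

```python
import time
from typing import Callable, Iterable, Iterator


def sweep(
    rows: Iterable[tuple[str, str]],
    count_fn: Callable[[str], int],
    delay_s: float = 1.2,
    deadline_s: float = 50 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield one count<TAB>category<TAB>scoped_dork line per query.

    Stops early once the hard deadline has passed. Any exception raised by
    count_fn (for example a rate-limit error) is recorded as ERR.
    """
    start = clock()
    for category, dork in rows:
        if clock() - start > deadline_s:
            break
        try:
            count = str(count_fn(dork))
        except Exception:
            count = "ERR"
        yield f"{count}\t{category}\t{dork}"
        sleep(delay_s)
```

With the official shodan Python library, `count_fn` could plausibly be `lambda q: api.count(q)["total"]` for an `api = shodan.Shodan(key)` client; that wiring is an assumption and was not part of the original run.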

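Rate-limit empirical data: an earlier per-university burst sweep at 0s delay produced 62% ERRs (~165/264). At 1.2s delay this sweep produced 3.7% ERRs (59/1,584). A 1.2s sleep caps the request rate at 50 queries/min per API key, and the Shodan freelance tier sustained that pacing with few errors. Measured end to end, throughput was about 33 queries/min (1,584 queries in 48 minutes) once per-query latency is included. Both figures are empirical and useful for sizing future sweeps.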
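
For sizing future sweeps, the arithmetic can be sketched as below. Both helper names are hypothetical. The only inputs are figures from this sweep (1,584 queries, 1.2s delay, 48 minutes wall-time); the per-query latency is inferred from them and was not measured directly.

```python
def sweep_minutes(n_queries: int, delay_s: float, latency_s: float = 0.0) -> float:
    """Estimated wall-time in minutes for a sequential sweep."""
    return n_queries * (delay_s + latency_s) / 60


def implied_latency_s(n_queries: int, wall_minutes: float, delay_s: float) -> float:
    """Per-query latency implied by an observed wall-time, in seconds."""
    return wall_minutes * 60 / n_queries - delay_s


# The sleep alone accounts for about 31.7 of the 48 minutes.
sleep_only = sweep_minutes(1584, 1.2)

# The remainder implies roughly 0.62s of request latency per query.
latency = implied_latency_s(1584, 48, 1.2)
```

Under these assumptions, a sweep of the full 1,629-dork catalog at the same pacing would be expected to take roughly 49 minutes.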

No subagent parallelization for the count phase: Shodan’s per-key throttle is the bottleneck, not local concurrency.

Headline numbers

| Metric | Value |
| --- | --- |
| Total scoped queries fired | 1,584 |
| Non-zero hit count | 382 (24%) |
| Zero hit count | 1,143 |
| ERR count (rate-limit) | 59 (3.7%) |
| Categories with ≥1 productive dork | 24 of 29 |
| Categories with 0 productive dorks | 5 of 29 |
| Sweep wall-time | 48 min |
| Shodan scan credits consumed | 0 (count queries are free) |

Productive dorks per category

Sorted by productivity rate (productive ÷ total):

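| Category | Productive | Total | Productivity |
| --- | --- | --- | --- |
| 18-jupyter | 37 | 62 | 60% |
| 02-vector-databases | 74 | 160 | 46% |
| 16-bi-dashboard | 69 | 156 | 44% |
| 04-training-experiments | 28 | 67 | 42% |
| 01-llm-orchestration | 18 | 49 | 37% |
| 12-containers | 4 | 13 | 31% |
| 03-model-serving | 16 | 57 | 28% |
| 19-streamlit | 11 | 44 | 25% |
| 25-elasticsearch | 17 | 71 | 24% |
| 21-browser-agents | 14 | 62 | 23% |
| 22-data-labeling | 16 | 71 | 23% |
| 10-mcp-servers | 3 | 14 | 21% |
| 05-gateways-monitoring | 8 | 40 | 20% |
| 27-embedding-services | 11 | 59 | 19% |
| 24-observability | 10 | 61 | 16% |
| 26-mem0-agent-memory | 5 | 33 | 15% |
| 09-code-assistants | 4 | 35 | 11% |
| 24-llm-safety-guardrail-policy | 12 | 115 | 10% |
| 15-fingerprinting | 2 | 21 | 10% |
| 23-ai-safety-eval | 4 | 43 | 9% |
| 17-voice-audio-ai | 8 | 123 | 7% |
| 06-agent-frameworks | 5 | 79 | 6% |
| 20-gradio | 3 | 53 | 6% |
| 26-exfiltrated-credentials-in-html | 3 | 49 | 6% |
| 07-rag-stacks | 0 | 10 | 0% |
| 08-image-generation | 0 | 11 | 0% |
| 11-credential-leaks | 0 | 14 | 0% |
| 13-backup-snapshot | 0 | 7 | 0% |
| 14-gpu-compute | 0 | 5 | 0% |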
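
A sketch of how the per-category productivity figures could be derived from the count<TAB>category<TAB>scoped_dork output. It assumes ERR rows count toward a category's total but not toward its productive count, which is consistent with the totals summing to 1,584. The function name `productivity` is hypothetical.

```python
from collections import defaultdict


def productivity(count_lines: list[str]) -> list[tuple[str, int, int, float]]:
    """Return (category, productive, total, rate) rows, highest rate first."""
    totals: dict[str, int] = defaultdict(int)
    productive: dict[str, int] = defaultdict(int)
    for line in count_lines:
        count, category, _dork = line.rstrip("\n").split("\t", 2)
        totals[category] += 1
        # ERR and 0 are both non-productive; only a positive integer counts.
        if count.isdigit() and int(count) > 0:
            productive[category] += 1
    rows = [
        (cat, productive[cat], totals[cat], productive[cat] / totals[cat])
        for cat in totals
    ]
    return sorted(rows, key=lambda row: row[3], reverse=True)
```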

Observation: .edu is dominated by notebooks (Jupyter), data substrate (PostgreSQL/MongoDB/Redis/Elasticsearch/OpenSearch), observability (Grafana), and the core LLM frontend tier (Ollama/Open WebUI/n8n/LiteLLM/Streamlit). Universities do NOT (publicly) run RAG framework servers, image-generation servers, GPU-compute exporters, or backup/snapshot services at meaningful population on .edu hostnames — those categories had 0 productive dorks. Most of those surface on commercial cloud (AWS/Azure) infra instead.

Top LLM-tier dorks (excluding data-substrate, observability, and noise)

Filtered to platforms that are unambiguously LLM-tier infra:

| Hits | Category | Dork |
| --- | --- | --- |
| 800 | 04-training | http.html:"jupyter" hostname:.edu |
| 539 | 18-jupyter | http.title:"Jupyter" hostname:.edu |
| 521 | 18-jupyter | http.title:"Jupyter" -port:443 hostname:.edu |
| 513 | 18-jupyter | http.html:"jupyter" http.html:"token" hostname:.edu |
| 510 | 18-jupyter | http.title:"Jupyter Server" hostname:.edu |
| 497 | 18-jupyter | http.title:"Jupyter" country:US hostname:.edu |
| 466 | 18-jupyter | http.title:"Jupyter" org:"university" hostname:.edu |
| 297 | 18-jupyter | http.html:"jupyterhub" hostname:.edu |
| 284 | 04-training | "Jupyter" hostname:.edu |
| 275 | 18-jupyter | "JupyterHub" hostname:.edu |
| 233 | 18-jupyter | http.title:"JupyterHub" hostname:.edu |
| 179 | 18-jupyter | http.title:"JupyterHub" port:443 hostname:.edu |
| 171 | 18-jupyter | http.html:"/hub/login" hostname:.edu |
| 167 | 19-streamlit | port:8501 hostname:.edu |
| 161 | 18-jupyter | http.title:"JupyterHub" country:US hostname:.edu |
| 146 | 19-streamlit | port:8501 country:US hostname:.edu |
| 133 | 10-mcp-servers | http.title:"Open WebUI" hostname:.edu |
| 133 | 01-llm-orchestration | http.title:"Open WebUI" hostname:.edu |
| 133 | 01-llm-orchestration | http.html:"open-webui" hostname:.edu |
| 95 | 10-mcp-servers | http.html:"open-webui" "uvicorn" hostname:.edu |
| 90 | 04-training | http.html:"streamlit" hostname:.edu |
| 90 | 01-llm-orchestration | product:"n8n" hostname:.edu |
| 87 | 01-llm-orchestration | http.html:"Ollama is running" -port:443 hostname:.edu |
| 83 | 01-llm-orchestration | product:Ollama port:11434 hostname:.edu |
| 50 | 02-vector-databases | "MinIO Console" port:9001 hostname:.edu |
| 44 | 02-vector-databases | product:"Docker Registry" hostname:.edu |
| 41 | 02-vector-databases | "Docker Registry" hostname:.edu |
| 35 | 01-llm-orchestration | http.title:"LiteLLM" hostname:.edu |
| 33 | 02-vector-databases | "MongoDB" port:27017 "vector" hostname:.edu |
| 29 | 02-vector-databases | http.title:"Harbor" hostname:.edu |
| 16 | 01-llm-orchestration | http.html:"dify" hostname:.edu |
| 13 | 24-observability | http.title:"Phoenix" hostname:.edu |
| 11 | 01-llm-orchestration | http.html:"Chainlit" hostname:.edu |
| 11 | 01-llm-orchestration | http.title:"Open WebUI" port:8080 hostname:.edu |
| 5 | 01-llm-orchestration | title:"Flowise" port:443 hostname:.edu |
| 3 | 01-llm-orchestration | "Jan" port:1337 hostname:.edu |

Noise observations (Insight-class)

  1. org:"Airtable, Inc" port:443 hostname:.edu returned 46,444, the highest hit count of any single dork in the sweep. This is NOT 46K Airtable-hosted university apps. It is an artifact of how Shodan combines facets: customer .edu domains point at Airtable’s IPs, so the customer’s .edu name lands in Shodan’s hostname index while the actual host is Airtable. Discarded from the LLM-tier table. Class: SaaS-customer-CNAME noise; mirrors the org:"Cloudflare" problem in commercial surveys.

  2. port:4444 hostname:.edu returned 1,672 — Selenium Grid default port, but .edu campus networks routinely run port 4444 for many non-Selenium services (krb524 in particular). The cat-21 browser-agents subset of this is a real signal at much smaller population; will need conjunctive verify to extract.

  3. http.html:".co" hostname:.edu returned 1,104 under cat-24-llm-safety-guardrail-policy — substring match on the .co TLD as a body string. Generic FP class, drop.

  4. PostgreSQL/MongoDB/Redis/OpenSearch returning 1007/170/118/476 — universities run lots of data-tier infra publicly. Not LLM-specific by themselves, but pre-Insight: when colocated with a confirmed Ollama/Open WebUI/Jupyter host, these are the Pharos-class shadow-port findings to look for (per Insight #11/#12).
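
The colocation check in item 4 could be sketched as a simple join between confirmed LLM-tier hosts and data-tier port observations. This is an illustration only: the function name `shadow_ports`, the input shapes, and the port map (standard default ports for each service) are assumptions, and the IPs in the test are documentation-range placeholders.

```python
# Default ports for the data-tier services named in item 4 (assumed mapping).
DATA_TIER_PORTS = {
    5432: "PostgreSQL",
    27017: "MongoDB",
    6379: "Redis",
    9200: "OpenSearch",
}


def shadow_ports(
    confirmed_llm_ips: set[str],
    observations: list[tuple[str, int]],
) -> dict[str, list[str]]:
    """Map each confirmed LLM host to the data-tier services seen on the same IP.

    observations is a list of (ip, port) tuples from any port-level source.
    Hosts with no colocated data-tier port are omitted from the result.
    """
    findings: dict[str, list[str]] = {}
    for ip, port in observations:
        if ip in confirmed_llm_ips and port in DATA_TIER_PORTS:
            findings.setdefault(ip, []).append(DATA_TIER_PORTS[port])
    return findings
```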

Dork-class hierarchy on .edu

Confirming this session’s earlier Insight #45 at the academic scope:

| Class | Example | .edu population |
| --- | --- | --- |
| Frontend-bundle-ID body | http.html:"jupyter" | 800 |
| Bundle-ID body alt | http.html:"jupyterhub" | 297 |
| Server-header banner | http.html:"Ollama is running" | 87 |
| Product facet | product:Ollama port:11434 | 83 |
| Title | http.title:"Jupyter" | 539 |
| Default-port facet | port:8501 (Streamlit) | 167 |
| JSON-config body | http.html:"Chainlit" | 11 |
| Route-slug body | http.html:"/hub/login" | 171 |

Title and bundle-ID body are the highest-yield on .edu. Server-header banners under-represent because the .edu Ollama population is dominated by older versions that pre-date the Server: ollama header (added v0.5+).

Carry-forward — Stage 1 (next)

This case study captures Stage 0 only — the dork-mapping output. Stages 1–4 are pending:

Stage 1: Per high-yield dork (≥3 hits), shodan download --limit N to pull sample IPs. ~100 dorks × avg 50 hosts = ~5K sample tuples to verify.
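
A possible shape for the Stage 1 selection, assuming the Stage 0 TSV as input. It builds shodan download argument lists for every dork at or above the hit threshold. The function name, the output filename scheme, and the default limit of 50 are hypothetical; no commands are executed here.

```python
def stage1_commands(
    count_lines: list[str],
    min_hits: int = 3,
    limit: int = 50,
) -> list[list[str]]:
    """Build one `shodan download` argv per dork with at least min_hits hits."""
    commands = []
    for index, line in enumerate(count_lines):
        count, category, dork = line.rstrip("\n").split("\t", 2)
        # Skip ERR rows and anything below the yield threshold.
        if not count.isdigit() or int(count) < min_hits:
            continue
        outfile = f"{category}-{index:04d}"  # hypothetical naming scheme
        commands.append(["shodan", "download", "--limit", str(limit), outfile, dork])
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)`, or printed with `shlex.join(cmd)` for review first. Known noise dorks, such as the Airtable and port:4444 cases above, would need to be excluded before running.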

Stage 2: Inline-probe verify per platform-class (proven 21s/1000-host asyncio pattern from this session’s earlier .edu work). Drop substring-FP candidates per Insight #15.
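
The session's actual probe code is not reproduced in this record, so the following is only a generic bounded-concurrency asyncio sketch of the kind of pattern Stage 2 refers to. The `probe` coroutine is injected and hypothetical; in practice it would make the platform-specific request and return True on a confirmed match. The concurrency and timeout defaults are illustrative.

```python
import asyncio
from typing import Awaitable, Callable


async def verify_hosts(
    hosts: list[str],
    probe: Callable[[str], Awaitable[bool]],
    concurrency: int = 200,
    timeout_s: float = 5.0,
) -> dict[str, bool]:
    """Run probe(host) for every host with bounded concurrency.

    Timeouts and connection-level errors count as unverified (False).
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(host: str) -> tuple[str, bool]:
        async with semaphore:
            try:
                return host, await asyncio.wait_for(probe(host), timeout_s)
            except (asyncio.TimeoutError, OSError):
                return host, False

    results = await asyncio.gather(*(guarded(host) for host in hosts))
    return dict(results)
```

A caller would run it with `asyncio.run(verify_hosts(hosts, probe))`. A substring-FP filter would sit inside `probe`, for instance by requiring a platform-specific API response rather than a body substring.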

Stage 3: Hostname → institution attribution via the local world_universities_and_domains.json (2,349 US institutions with .edu domains). Suffix-match gpu.cs.example.edu → Example University.
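
The suffix match could look like the sketch below. It assumes each record in the institutions file carries a "name" string and a "domains" list. Those key names and both function names are assumptions for illustration.

```python
from typing import Optional


def build_domain_map(records: list[dict]) -> dict[str, str]:
    """Map each registered domain to its institution name."""
    return {
        domain.lower(): record["name"]
        for record in records
        for domain in record.get("domains", [])
    }


def attribute(hostname: str, domain_map: dict[str, str]) -> Optional[str]:
    """Return the institution whose domain is the longest suffix of hostname."""
    labels = hostname.lower().rstrip(".").split(".")
    # Try the full hostname first, then strip one leading label at a time.
    # The bare TLD ("edu") is never tested.
    for start in range(len(labels) - 1):
        candidate = ".".join(labels[start:])
        if candidate in domain_map:
            return domain_map[candidate]
    return None
```

Matching on whole labels rather than raw string suffixes prevents a host such as notexample.edu from being attributed to example.edu.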

Stage 4: Diff confirmed hosts against the 81 existing case studies (49 cross-validated institutions per known-from-overview.tsv). Surface NEW institutions only. Per-institution case study files under US/ and international/.

Optional Stage 5: Disclosure routing decisions (Cowboy’s call, per feedback_no_disclosure_recommendations).

Toolchain provenance

Stage 0 — index ~/AI-LLM-Infrastructure-OSINT/shodan/queries/*.md → 1,629 verified dorks
         → drop dorks with existing hostname: filter → 1,584 scoped dorks
         → shodan count per dork, 1.2s rate-limit, 50-min deadline
         → scoped-counts.tsv (382 non-zero rows of 1,584)

Artifacts

All raw data lives at ~/recon/edu-llm-infra-2026-05-19/:

  • verified-dorks-master.tsv — the 1,629 source dorks indexed from the repo’s shodan/queries/
  • scoped-dorks-edu.tsv — the 1,584 scoped dorks fed to the sweep
  • scoped-counts.tsv — the canonical Stage-0 output (1,584 rows of count<TAB>category<TAB>scoped_dork)
  • scoped-counts.log — sweep progress + done signal
  • us-universities.tsv — local Hipo world_universities_and_domains.json filtered to 2,349 US institutions (for Stage 3 hostname → institution attribution)
  • known-from-overview.tsv — 181-row baseline of institutions already documented in OVERVIEW.md + index.md + SESSION.md
  • PLAN.md — live stage tracker for the multi-stage sub-survey

Reference