3. Model Serving & Inference, NuClide Reference

Section verified: April 22, 2026 11:38

The runtime layer that exposes models over HTTP. Most modern serving stacks emulate the OpenAI API surface (/v1/chat/completions, /v1/models, /v1/embeddings), making them trivially fingerprintable, and often trivially abusable as free compute (LLMjacking).

Inference Servers

Shodan Query	Notes
`"SillyTavern"`	9,030 hits, LLM frontend (chat/RP workloads); most-exposed model-serving surface in this catalogue
`http.html:"/v1/chat/completions"`	6,238 hits, canonical OpenAI-compat fingerprint (path appears in HTML responses; covers vLLM, llama.cpp, LMDeploy, LM Studio, most inference servers)
`http.html:"/v1/models"`	6,089 hits, same category, paired canonical OpenAI-compat path
`"llama.cpp"`	1,483 hits, most-deployed specific server (bare banner match)
`"llama.cpp" port:8080`	461 hits, llama.cpp on default port
`"vLLM"`	356 hits, banner match
`"/v1/chat/completions"`	225 hits, banner-level path leak
`"/v1/models"`	216 hits, banner-level path leak
`http.html:"vllm"`	188 hits, vLLM in HTML body
`http.html:"llama.cpp"`	81 hits, llama.cpp in HTML
`"sglang"`	60 hits, SGLang structured-generation server, OpenAI-compat
`"koboldcpp"`	59 hits, KoboldCpp banner (all ports)
`"koboldcpp" port:5001`	54 hits, KoboldCpp default port
`("koboldcpp" OR "koboldai")`	47 hits, grouped OR catches AI Horde workers (unparenthesized breaks)
`"koboldai"`	46 hits, KoboldAI web UI
`"Seldon"`	33 hits, Seldon Core model deployment (k8s)
`http.html:"lm studio"`	173 hits, LM Studio in HTML body (best LM Studio fingerprint)
`port:1234 "model"`	71 hits, LM Studio default port with model term
`http.html:"sglang"`	74 hits, SGLang in HTML body
`http.html:"seldon"`	20 hits, Seldon in HTML body
`http.title:"Seldon"`	10 hits, Seldon title match
`"LM Studio"`	12 hits, LM Studio banner match
`http.html:"nvidia nim"`	12 hits, NVIDIA NIM microservices in HTML
`http.html:"/v1/embeddings"`	90 hits, embeddings path in HTML
`"/v1/embeddings"`	8 hits, embeddings path banner leak
`"Triton Inference Server"`	7 hits, NVIDIA Triton (use exact phrase; bare “Triton” returns 1,809 polluted)
`"lmdeploy"`	5 hits, InternLM LMDeploy engine
`"vLLM" port:8000`	2 hits, vLLM on default port
`port:8000 "openai" "model"`	1 hit, OpenAI-compat on 8000 with model list
`http.html:"GPT4All"`	2 hits, GPT4All in HTML body
`"llama" "completion"`	2 hits, llama completion term combo (no port)
`"GPT4All"`	1 hit, GPT4All desktop server banner
`"NVIDIA NIM"`	1 hit, NIM exact phrase (2026 deployments still sparse on public internet)
`"aphrodite"`	362 hits, ⚠️ bare term is Greek-mythology noise, not the vLLM fork
`"Aphrodite Engine"`	1 hit, exact phrase (specific-but-rare)

Canonical query recommendation: For OpenAI-compatible inference discovery, http.html:"/v1/chat/completions" (6,238) is the best single query, it catches every server emulating the OpenAI API surface regardless of underlying implementation. Pair with http.html:"/v1/models" for a narrow overlap.

GET /v1/models HTTP/1.1  {"object":"list","data":[{"id":"meta-llama/Llama-3.1-70B-Instruct",
  "object":"model","created":1713300000,"owned_by":"vllm"}]}

Typical vLLM response on an exposed port. Model name reveals what’s hosted; /v1/chat/completions on the same port is reachable with no auth.

Embedding / Reranking

Shodan Query	Notes
`http.html:"TEI"`	332 hits, TEI in HTML (best TEI fingerprint)
`http.html:"jina"`	83 hits, Jina in HTML
`http.html:"sentence-transformers"`	31 hits, sentence-transformers in HTML
`"text-embeddings-inference"`	5 hits, TEI banner match
`"sentence-transformers"`	5 hits, banner match
`"Jina" "embeddings"`	4 hits, Jina banner + term
`"Jina" "embeddings" port:8080`	2 hits, Jina on default port
`"infinity" "embedding"`	1 hit, Infinity embedding server
`"Cohere"`	1,603 hits, bare brand name, contaminated by SaaS references / unrelated docs

Pollution flagged: http.html:"infinity" alone returns 7,794 hits, but "infinity" "embedding" collapses to 1 hit. “Infinity” is a generic English word polluting 7,793 unrelated hits. Do not use http.html:"infinity" as a fingerprint.

Cohere note: Cohere’s rerank is primarily a cloud API product; self-hosted rerank instances are rare on the public internet. Bare "Cohere" at 1,603 is brand-name noise, not server fingerprints, narrow before trusting.

Audio / Speech / Vision Inference

Shodan Query	Notes
`http.title:"Whisper"`	444 hits, Whisper ASR web UI
`http.html:"faster-whisper"`	190 hits, faster-whisper in HTML
`http.title:"Bark"`	120 hits, Suno Bark text-to-audio
`http.title:"Piper"`	103 hits, Piper TTS title
`http.html:"coqui"`	55 hits, Coqui in HTML
`"faster-whisper"`	32 hits, faster-whisper banner
`"PaddleOCR"`	28 hits, PaddleOCR service (banner form)
`"whisper.cpp" "/inference"`	16 hits, whisper.cpp HTTP server, free ASR compute
`"coqui" "tts"`	13 hits, Coqui TTS banner + term
`"piper" "tts"`	8 hits, Piper neural TTS (RPi/edge)
`"vocode"`	4 hits, Vocode voice-agent framework
`http.html:"vocode"`	2 hits, Vocode in HTML
`"paddleocr" port:8080`	1 hit, PaddleOCR (default port, narrow)