27. Embedding Services
Section verified: 2026-05-09
The vector-conversion layer that sits between raw text and vector databases. Embedding servers ingest documents or queries and return dense float vectors; without them, RAG pipelines and semantic search cannot run. Every observed implementation ships with authentication off. The attack class is compute theft plus embedding oracle: an attacker pre-computes query vectors to probe downstream vector stores.
Survey note: Embedding services are Shodan-dark compared to LLM inference servers. The root / path of canonical servers (TEI, infinity) returns either a redirect or API JSON rather than HTML, so Shodan’s HTTP crawler indexes thin banners. Model-name queries (BAAI/bge, nomic-embed) are the highest-signal Shodan approach; the population-scale survey is masscan-driven on tier-2 cloud ranges, not Shodan-driven.
HuggingFace Text Embeddings Inference (TEI)
The canonical standalone embedding server from HuggingFace. Single Rust binary, ships without auth, exposes /info (model metadata), /embed (POST), /rerank (optional), and /metrics (Prometheus).
Warning — Docker Registry false positive: "text-embeddings-inference" in Shodan banner (6 hits) matches Docker Registry catalog responses that list ghcr.io/huggingface/text-embeddings-inference as a cached image. These are not live TEI servers. Narrow with port constraints or model_pipeline_tag checks.
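The registry false positive described above can be separated from a live TEI response by looking at the JSON shape rather than the matched string. The sketch below is a minimal illustration, assuming the body has already been fetched; the function name and return labels are hypothetical, and the only external fact relied on is that a Docker Registry catalog listing carries a top-level "repositories" array.

```python
import json


def classify_tei_banner_hit(body: str) -> str:
    """Classify a body that matched the string "text-embeddings-inference".

    Returns one of "docker-registry-catalog", "tei", or "unknown".
    The labels are illustrative, not part of any tool's output format.
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return "unknown"
    if not isinstance(doc, dict):
        return "unknown"

    # A registry catalog listing is {"repositories": ["name", ...]}.
    repos = doc.get("repositories")
    if isinstance(repos, list) and any(
        isinstance(name, str) and "text-embeddings-inference" in name
        for name in repos
    ):
        return "docker-registry-catalog"

    # A live TEI /info response carries model_pipeline_tag.
    if "model_pipeline_tag" in doc:
        return "tei"
    return "unknown"
```

A hit classified as "docker-registry-catalog" only shows that the image is cached there, not that a TEI server is running.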
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"text-embeddings-inference" | 2 | Low: Shodan rarely indexes the HTML body of API-only roots |
| "text-embeddings-inference" | 6 | FP-heavy: mostly Docker Registry catalogs, not live TEI |
| product:"Text Embeddings Inference" | 0 | No Shodan product facet registered |
| port:80 http.html:"embed" http.html:"model_id" | 0 | TEI /info fields not indexed in HTML |
Live fingerprint (aimap / curl): GET /info → {"model_id": "BAAI/bge-small-en-v1.5", "model_pipeline_tag": "feature-extraction", "max_concurrent_requests": 512, "max_batch_total_tokens": 16384, "version": "2.x.x"}. The model_pipeline_tag: "feature-extraction" field is unique to TEI and not present in any LLM inference server.
Canonical aimap fingerprint: status_code:200 + json_field:model_pipeline_tag + body_contains:feature-extraction on GET /info.
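The three conditions in the canonical fingerprint can be expressed as one predicate over a GET /info response. This is a minimal sketch, not aimap's implementation: the function name is hypothetical, and treating json_field as "top-level key present" is an assumption.

```python
import json


def matches_tei_info(status_code: int, body: str) -> bool:
    """Apply status_code:200 + json_field:model_pipeline_tag +
    body_contains:feature-extraction to a GET /info response."""
    if status_code != 200:
        return False
    if "feature-extraction" not in body:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    # Assumption: json_field means the key exists at the top level.
    return isinstance(doc, dict) and "model_pipeline_tag" in doc
```

All three checks must pass; a body that merely mentions feature-extraction in HTML or in a non-JSON payload does not match.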
Ports: 80 (Docker default internal), 8080 (common Docker mapping -p 8080:80), 3000 (older versions).
infinity-embedding (michaelfeil/infinity)
OpenAI-compatible embedding server. Default port 7997. FastAPI with /v1/embeddings (POST), /v1/models (GET), and /openapi.json. OpenAPI title is "Infinity Emb".
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"infinity_emb" | 0 | Python package name not indexed |
| "Infinity Emb" | 0 | Not in Shodan index |
| port:7997 http.html:"embedding" | 0 | No HTML match on the default port; hosts there are indexed by banner/TLS only (see Methodology Notes) |
| http.html:"infinity" http.html:"/v1/embeddings" | 1 | Combined term, 1 confirmed hit |
Live fingerprint: GET /openapi.json → body contains "Infinity Emb". Alt: GET /v1/models → JSON with data[] and infinity_emb in model paths.
Canonical aimap fingerprint: status_code:200 + body_contains:Infinity Emb on GET /openapi.json.
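Both infinity checks (the canonical /openapi.json match and the /v1/models alternative) reduce to small predicates over an already-fetched response. The sketch below uses hypothetical function names; because the source does not say which field of a /v1/models entry holds the model path, the second function searches the whole serialized entry, which is an assumption.

```python
import json


def matches_infinity_openapi(status_code: int, body: str) -> bool:
    """status_code:200 + body_contains:Infinity Emb on GET /openapi.json."""
    return status_code == 200 and "Infinity Emb" in body


def matches_infinity_models(status_code: int, body: str) -> bool:
    """Alternative check on GET /v1/models: a JSON object with a data[]
    array in which some entry mentions infinity_emb."""
    if status_code != 200:
        return False
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    if not isinstance(doc, dict) or not isinstance(doc.get("data"), list):
        return False
    # Assumption: the model path may sit in any field of an entry,
    # so each entry is searched in serialized form.
    return any("infinity_emb" in json.dumps(entry) for entry in doc["data"])
```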
Custom FastAPI Embedding APIs
The dominant shape in the wild. Operators wrap BAAI/bge, nomic-embed, multilingual-e5, and other embedding models in custom FastAPI services. Root GET / returns a JSON status object that leaks: embedding model name, embedding dimension, reranker model, LLM backend, and internal filesystem paths (index_dir, docs_dir). Auth-off on every observed instance.
Fingerprint by model name (Shodan-indexed because model names appear in HTML pages that embed API response data, e.g., Swagger UI and React dashboards):
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"BAAI/bge" | 41 | Best signal: BAAI/bge family dominates; Contabo/Hetzner/Scaleway hosts |
| http.html:"nomic-embed" | 22 | nomic-embed-text model family; uvicorn-served FastAPI |
| http.html:"multilingual-e5" | 27 | intfloat multilingual-e5 family |
| http.html:"all-MiniLM" | 404 | sentence-transformers MiniLM; high count, heavily polluted by Reposify/honeypot fleet |
| http.html:"sentence-transformers" | 31 | sentence-transformers library reference |
| "sentence-transformers" | 4 | Banner match, higher confidence |
| http.html:"jina" http.html:"embedding" | 12 | Jina embedding models in HTML |
| http.html:"jinaai" | 9 | Jina AI package name in page source |
Pollution note: http.html:"all-MiniLM" at 404 hits is dominated by Server: Reposify honeypots returning identical Content-Length: 3151 across disparate IPs and ASNs. Filter: exclude server:"Reposify" and Content-Length:3151 responses.
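The pollution filter can be applied to exported results before counting. This is a minimal sketch under stated assumptions: each result is a dict with hypothetical keys "server" and "content_length" (not a Shodan export schema), and a record is dropped if either marker matches, which is the stricter reading of the filter.

```python
from typing import Dict, Iterable, List


def is_reposify_pollution(record: Dict) -> bool:
    """True if the record shows either honeypot marker from the note:
    a Reposify Server header or the fixed Content-Length of 3151."""
    server = str(record.get("server", ""))
    return "reposify" in server.lower() or record.get("content_length") == 3151


def drop_pollution(records: Iterable[Dict]) -> List[Dict]:
    """Return only the records that do not look like the honeypot fleet."""
    return [r for r in records if not is_reposify_pollution(r)]
```

Dropping on the Content-Length alone can remove a legitimate page that happens to be 3151 bytes; requiring both markers is the looser alternative.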
Fingerprint by endpoint shape:
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"/v1/embeddings" -http.html:"chat" | 46 | OpenAI-compat embedding-only (excludes LLM gateways) |
| http.html:"/embed" http.html:"model" | 1,541 | Too broad: includes any page with “embed” + “model” |
| http.html:"/embedding" http.html:"llama" | 480 | llama.cpp /embedding endpoint in HTML |
| http.html:"fastembed" | 5 | fastembed library reference |
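The queries in this table follow one pattern: one or more quoted http.html terms, optionally followed by negated terms. A small helper (hypothetical, shown only to make the anchoring and exclusion pattern explicit) can build such strings consistently; it assumes the terms themselves contain no double quotes.

```python
from typing import Sequence


def html_query(include: Sequence[str], exclude: Sequence[str] = ()) -> str:
    """Build a query of http.html terms, with excluded terms prefixed by '-'.

    Assumes no term contains a double quote.
    """
    parts = [f'http.html:"{term}"' for term in include]
    parts += [f'-http.html:"{term}"' for term in exclude]
    return " ".join(parts)


# Reproduces the embedding-only query from the table above.
embedding_only = html_query(["/v1/embeddings"], exclude=["chat"])
```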
Live fingerprint (aimap): GET / → json_field:embedding_dimension (OpenVINO pattern) OR json_field:embed (RAG config pattern). GET /health → json_field:embedding_dimension.
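The live fingerprint for custom wrappers is a per-path field check. The sketch below is illustrative only: the function name is hypothetical, json_field is assumed to mean a top-level key, and no status-code condition is applied because the fingerprint above does not state one.

```python
import json


def matches_embedding_api(path: str, body: str) -> bool:
    """Field checks for custom embedding APIs.

    GET /       -> embedding_dimension (OpenVINO pattern) OR embed (RAG config pattern)
    GET /health -> embedding_dimension
    """
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    if not isinstance(doc, dict):
        return False
    if path == "/":
        return "embedding_dimension" in doc or "embed" in doc
    if path == "/health":
        return "embedding_dimension" in doc
    return False
```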
Ports: 8000, 8001 (most common), 8002, 8080, 8100, 5000.
Sentence-Transformers / bert-as-service
Older embedding infrastructure. bert-as-service uses ZMQ (port 5555 pull, 5556 push) rather than HTTP — not directly Shodan-scannable. Newer sentence-transformers HTTP wrappers are indistinguishable from Custom FastAPI Embedding APIs above.
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"sentence-transformers" | 31 | Library reference in page HTML |
| "sentence-transformers" | 4 | Banner match |
| port:5555 "sentence" | 0 | bert-as-service ZMQ not HTTP-indexed |
Jina Embeddings Self-Hosted
Jina provides self-hosted jina-embeddings-v3 and jina-reranker via their jinaai package. Typically runs on 8080/8000, FastAPI.
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"jinaai" | 9 | Package name in page source |
| http.html:"jina" http.html:"embedding" | 12 | Combined term |
| "jina" "embeddings" | 4 | Banner match |
| "Jina" "embeddings" port:8080 | 2 | Default port, higher confidence |
llama.cpp Embedding Server Mode
llama.cpp exposes /embedding (POST) when started with the --embedding flag. It is frequently co-deployed with the chat endpoint on the same instance.
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"/embedding" http.html:"llama" | 480 | llama.cpp embedding endpoint in HTML |
| "llama.cpp" http.html:"/embedding" | 0 | Banner + HTML combined; banner and HTML are rarely co-indexed |
OpenAI-compat Embedding Endpoints (Generic)
Any service emulating the OpenAI /v1/embeddings interface. Catches TEI, infinity, LocalAI, and custom wrappers.
| Shodan Query | Hits | Notes |
|---|---|---|
| http.html:"/v1/embeddings" -http.html:"chat" | 46 | Embedding-only, excludes LLM gateways |
| http.html:"/v1/embeddings" | 90 | Includes LLM gateways that also serve embeddings |
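Because several different servers emulate the same interface, a generic check can look at the response shape instead of the product name. The sketch below assumes the documented OpenAI embeddings response layout (a data array whose entries hold an embedding list of floats); the function name is hypothetical, and base64-encoded embeddings are deliberately not handled.

```python
import json
from typing import Optional


def embedding_dimension(body: str) -> Optional[int]:
    """Return the vector length if body looks like an OpenAI-style
    /v1/embeddings response, else None."""
    try:
        doc = json.loads(body)
    except ValueError:
        return None
    if not isinstance(doc, dict):
        return None
    data = doc.get("data")
    if not isinstance(data, list) or not data or not isinstance(data[0], dict):
        return None
    vector = data[0].get("embedding")
    if not isinstance(vector, list) or not vector:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in vector):
        return None
    return len(vector)
```

The returned dimension also helps narrow down which model family sits behind an unnamed wrapper, though it does not identify the model on its own.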
Round-2 expansion (2026-05-09 — 144 queries run, 818 unique IPs surfaced)
After the initial section, a comprehensive Shodan sweep ran 144 total queries (46 baseline + 50 zero-hit variants + 48 anchored/extension). Headline new findings worth promoting to first-class queries:
New Tier 1 model families (round-2 verified)
| Query | Hits | Notes |
|---|---|---|
| http.html:"bge-m3" | 56 | BAAI BGE-M3 multi-functional (dense+sparse+ColBERT); biggest single new family |
| http.html:"text-embedding-3-large" | 55 | OpenAI v3-large model name in proxy gateways |
| http.html:"feature-extraction" | 64 | TEI’s HF tag in HTML (without model_ prefix that Round 1 used) |
| http.html:"e5-large" | 15 | intfloat/e5-large family |
| http.html:"bge-reranker" | 14 | Reranker co-deployed with embedder |
| http.html:"jina-embeddings" | 13 | The -v suffix from Round 1 was the bug |
| http.html:"mxbai-embed" | 12 | mixedbread-ai/mxbai-embed-large |
| http.html:"e5-base" | 8 | intfloat/e5-base |
| http.html:"max_input_length" | 8 | TEI parameter; closest indexable structural marker |
| http.html:"text-embedding-ada" | 7 | OpenAI ada-002 in proxy gateways |
| http.html:"openai/clip" | 7 | CLIP image embedding (multimodal) |
| http.html:"siglip" | 8 | Google SigLIP image embedding |
| http.html:"Snowflake/arctic" | 6 | Snowflake arctic-embed (namespace required) |
| http.html:"intfloat/e5" | 4 | Catches whole intfloat/e5 family |
| http.html:"jina-reranker" | 4 | Jina reranker co-deployed |
| http.html:"clip-vit-base-patch32" / clip-vit-large | 3+3 | CLIP variants |
| http.html:"jina-embeddings-v3" | 2 | Jina v3 explicit |
| http.html:"pipeline_tag" | 2 | Closest TEI structural marker |
| http.html:"feature-extraction" http.html:"model_id" | 1 | TEI signature pair |
New Tier 5 platforms (embedding-hosting)
| Query | Hits | Notes |
|---|---|---|
| http.html:"xinference" | 484 | Xinference: Chinese multi-model platform (98% confirmed via title=“Xinference”) |
| http.html:"localai" | 190 | LocalAI multi-model (versions in title: v3.0.0, v3.8.0, v3.9.0, v3.12.1) |
| http.html:"nvidia" http.html:"embedding" | 13 | Anchored; many empty titles = API-only embedders, needs aimap |
Documented false-positive queries (do NOT use)
| Query | Apparent hits | Why FP |
|---|---|---|
| http.html:"stella" | 865 | AMEX Stella voice assistant, Genshin Impact servers, Italian hotels (Stella di Mare), Japanese 株式会社ステラ; name-collision class |
| http.html:"infinity" | 8,330 | Generic word: telecom, gaming, ISP. Use only when anchored with an embed term, and only with aimap follow-up |
| http.html:"NIM" | 883 | Network adapters (NIM cards), VirtualBox NIM driver, Polish parishes (Nim = “no” in some Slavic langs) |
| http.html:"cohere" | 157 | SDK references, “cohere” as adjective |
| http.html:"voyage" http.html:"embed" | 181 | “voyage” = trip in French; matches French tourism sites; only voyageai (7) is real |
| http.html:"truncate" http.html:"embed" | 113 | truncate is a CSS class, embed is an HTML tag; the pair appears on every page using truncated text |
| http.html:"ColBERT" | 68 | “Colbert” surname, John D. Colbert & Associates, etc. |
Confirmed truly Shodan-dark (no variant fired)
- michaelfeil/infinity, infinity-emb: infinity-embedding’s package metadata never appears in HTML; only model + endpoint references do
- model_pipeline_tag, max_batch_total_tokens, Infinity Emb (banner/title): TEI/infinity API JSON / OpenAPI title not indexed
- embed-multilingual-v3, embed-english-v3, voyage-large-2, voyage-2/3: Cohere/Voyage are SaaS-native; rarely operator-deployed
- NV-Embed, nemo-retriever, nvidia/NV-Embed: NVIDIA NIM not population-deployed in self-host space
- hkunlp/instructor, instructor-large/xl: instructor-models HTML-rare
- Alibaba-NLP/gte, thenlper/gte, gte-base/multilingual/Qwen2-7B: GTE family Shodan-dark despite known popularity in Chinese ecosystem
Methodology Notes
Why masscan supplements Shodan here: TEI, infinity, and custom FastAPI embedding servers all return API JSON at GET /, not HTML. Shodan’s crawler indexes the root-path HTTP response; a JSON blob with {"model_pipeline_tag":"feature-extraction"} looks like a non-page and gets minimal indexing. The port:7997 -http.html:"chat" query (630 hits) demonstrates this: 630 hosts sit on infinity’s default port, but combining port:7997 with the http.html terms openapi, /v1, model, or embed returns 0 hits, which means Shodan recorded these hosts only via banner/TLS, not HTML body. Port-targeted aimap fingerprinting is the only way to confirm them. Shodan handles the population that exposes HTML dashboards; masscan/aimap handle the API-only population.
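The split between the two populations can be illustrated by classifying a host's root response. This is a rough sketch, assuming the status code and Content-Type header of GET / are already known; the function names and labels are hypothetical, and real crawler indexing behaviour is more nuanced than a media-type check.

```python
def classify_root_response(status_code: int, content_type: str) -> str:
    """Label a GET / response as 'redirect', 'html', 'api-json', or 'other'."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if 300 <= status_code < 400:
        return "redirect"
    if media_type == "text/html":
        return "html"
    if media_type == "application/json" or media_type.endswith("+json"):
        return "api-json"
    return "other"


def survey_method(status_code: int, content_type: str) -> str:
    """HTML roots are reachable through Shodan's http.html index; everything
    else needs port scanning plus live fingerprinting."""
    kind = classify_root_response(status_code, content_type)
    return "shodan" if kind == "html" else "masscan+aimap"
```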
aimap fingerprints (3 new — count 66 → 69):
- HuggingFace TEI: GET /info → json_field:model_pipeline_tag + body_contains:feature-extraction
- infinity-embedding: GET /openapi.json → body_contains:Infinity Emb
- Embedding API: GET / → json_field:embedding_dimension OR json_field:embed
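The three fingerprints can be read as a small rule table evaluated against whatever paths were probed. The sketch below is not aimap's rule format, which is not shown here; the Probe structure, the helper names, and the blanket status-200 requirement are all assumptions made for illustration.

```python
import json
from typing import Callable, Dict, List, NamedTuple, Tuple


def _json_has_any(body: str, *fields: str) -> bool:
    """True if body is a JSON object with at least one of the given top-level keys."""
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    return isinstance(doc, dict) and any(f in doc for f in fields)


class Probe(NamedTuple):
    name: str
    path: str
    match: Callable[[str], bool]


PROBES: List[Probe] = [
    Probe("HuggingFace TEI", "/info",
          lambda b: _json_has_any(b, "model_pipeline_tag") and "feature-extraction" in b),
    Probe("infinity-embedding", "/openapi.json",
          lambda b: "Infinity Emb" in b),
    Probe("Embedding API", "/",
          lambda b: _json_has_any(b, "embedding_dimension", "embed")),
]


def identify(responses: Dict[str, Tuple[int, str]]) -> List[str]:
    """responses maps path -> (status_code, body). Returns matching probe names."""
    hits = []
    for probe in PROBES:
        status, body = responses.get(probe.path, (0, ""))
        # Assumption: every rule requires a 200 response.
        if status == 200 and probe.match(body):
            hits.append(probe.name)
    return hits
```

Keeping the rules as data makes it straightforward to add further probes for endpoint shapes that no single canonical fingerprint covers.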
Threat class: Compute theft (GPU/CPU cost borne by operator) + embedding oracle (attacker pre-computes query vectors to probe downstream vector DBs semantically without holding the embedding key). Severity: medium on auth-off; elevated to high when paired with an exposed vector DB on the same host (dual-stack attack surface).
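The severity rule stated above is a two-input decision and can be written down directly, for example when triaging one's own inventory. The function name and the "unrated" label are hypothetical; the text does not assign a severity to authenticated instances, so the sketch leaves them unrated rather than guessing.

```python
def embedding_exposure_severity(auth_off: bool, exposed_vector_db_same_host: bool) -> str:
    """Medium when auth is off; high when an exposed vector DB shares the host."""
    if not auth_off:
        # Not covered by the stated rule.
        return "unrated"
    return "high" if exposed_vector_db_same_host else "medium"
```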
Key population finding from Shodan sample: Custom FastAPI wrappers around BAAI/bge and nomic-embed dominate over canonical TEI and infinity deployments. Operators build their own embedding servers rather than deploying the reference implementation, which means varied endpoint shapes and no canonical fingerprint covers the full population — aimap’s multi-probe approach is necessary.
See also
- 02-vector-databases.md: downstream targets of embedding oracles
- 03-model-serving.md: Embedding / Reranking subsection with initial counts
- 07-rag-stacks.md: full RAG stack context