Embedding Servers, Model Layer, NuClide Stack

What it is

Embedding servers turn text (or images, or audio) into high-dimensional vectors that vector databases can search. Every RAG pipeline has one of these in the middle. Text Embeddings Inference (TEI) is Hugging Face’s production-grade Rust runtime; Infinity (Michael Feil) is the fast Python alternative; the original sentence-transformers library ships its own HTTP server; Ollama also serves embeddings via /api/embeddings for the small-deployment crowd. They look like miniature inference servers because that’s exactly what they are.
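To make "vectors that vector databases can search" concrete, here is a minimal sketch of similarity search over embeddings. The three-dimensional vectors and the labels are made up for illustration (real models emit hundreds or thousands of dimensions), and the helper names `cosine_similarity` and `nearest` are our own, not part of any server's API.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query: Sequence[float], corpus: Dict[str, Sequence[float]]) -> List[str]:
    """Return corpus keys ordered from most to least similar to the query."""
    return sorted(
        corpus,
        key=lambda name: cosine_similarity(query, corpus[name]),
        reverse=True,
    )


# Toy, hand-written vectors standing in for real embeddings (assumption).
toy_corpus = {
    "cats": [1.0, 0.0, 0.0],
    "dogs": [0.9, 0.1, 0.0],
    "tax law": [0.0, 0.0, 1.0],
}
ranking = nearest([1.0, 0.0, 0.0], toy_corpus)
```

A vector database does the same ranking at scale, usually with an approximate index rather than the exhaustive sort shown here.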

What goes wrong

Embedding servers expose an OpenAI-compatible /v1/embeddings endpoint by default and are typically deployed without auth, on the assumption that “only my RAG pipeline talks to it.” When the host ends up reachable on a public IP, an attacker gets a free embedding service for their own RAG pipelines. More damaging: the model the server loads is often a fine-tuned variant trained on the operator’s private corpus, and those custom weights are often what sets the operator’s product apart from the generic alternative.
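As a rough illustration of why the missing auth matters, the sketch below builds the request a RAG pipeline would send to an OpenAI-compatible /v1/embeddings endpoint. The base URL and model name are placeholders, and the function names are ours. The request and response shapes follow the OpenAI embeddings format; whether a given server accepts them unchanged is an assumption to verify against its own documentation.

```python
import json
import urllib.request
from typing import List


def build_embeddings_request(base_url: str, model: str, texts: List[str]) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible embeddings endpoint."""
    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        # Note what is absent: no Authorization header, no API key.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def vectors_from_response(payload: dict) -> List[List[float]]:
    """Pull the vectors out of an OpenAI-style response, in input order."""
    rows = sorted(payload["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


# Placeholder host and model name (assumptions for illustration only).
request = build_embeddings_request("http://localhost:8080/", "example-model", ["hello"])
```

Nothing in that request identifies the caller, so a server deployed this way cannot tell the operator's pipeline apart from anyone else who can reach the port.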

How we test

We hit /v1/models (or the /info endpoint TEI exposes) for the model inventory and tokenizer metadata, then capture the model identifier. If the model name maps to a known Hugging Face artefact, we attribute via the publishing org. If it’s a private fine-tune, we capture the architecture and tokenizer fingerprint, which is sufficient evidence of operator intellectual property without our needing to issue any embedding requests.
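The metadata-only check described above could be sketched as follows, for use against a host you operate or are authorised to assess. This is an illustration under assumptions, not our production tooling: the `model_id` field name reflects what TEI's /info response is generally understood to contain, the `data[].id` shape follows the OpenAI models-list format, and the helpers `fetch_json`, `model_ids`, and `publishing_org` are hypothetical names. An `org/name` shaped identifier only suggests a Hugging Face repo; it still has to be confirmed against the Hub.

```python
import json
import urllib.request
from typing import List, Optional


def fetch_json(base_url: str, path: str, timeout: float = 5.0) -> Optional[dict]:
    """GET base_url + path and parse the body as JSON; None on any failure."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + path, timeout=timeout) as resp:
            parsed = json.load(resp)
    except (OSError, ValueError):
        # Covers connection errors, HTTP errors, timeouts, and invalid JSON.
        return None
    return parsed if isinstance(parsed, dict) else None


def model_ids(info: Optional[dict], models: Optional[dict]) -> List[str]:
    """Collect model identifiers from an /info body and a /v1/models body."""
    found: List[str] = []
    if isinstance(info, dict) and isinstance(info.get("model_id"), str):
        found.append(info["model_id"])  # assumed TEI-style field name
    data = models.get("data") if isinstance(models, dict) else None
    if isinstance(data, list):
        for entry in data:
            if isinstance(entry, dict) and isinstance(entry.get("id"), str):
                found.append(entry["id"])
    return list(dict.fromkeys(found))  # de-duplicate, keep first-seen order


def publishing_org(model_id: str) -> Optional[str]:
    """Return the org from an 'org/name' identifier, else None (e.g. local paths)."""
    parts = model_id.split("/")
    if len(parts) == 2 and all(parts) and not model_id.startswith("."):
        return parts[0]
    return None
```

A run would call `fetch_json(base, "/info")` and `fetch_json(base, "/v1/models")`, pass both results to `model_ids`, and treat any identifier for which `publishing_org` returns None as a candidate private fine-tune. No embedding request is issued at any point.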

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Queued

We haven't surveyed this category yet. The technology is on our map; the receipts will follow when the cross-cloud survey lands. Browse the research feed for what's already published, or watch this page.
