What it is
Embedding servers turn text (or images, or audio) into high-dimensional
vectors that vector databases can search. Most self-hosted RAG pipelines
have one in the middle. Text Embeddings Inference (TEI) is Hugging Face’s
production-grade Rust runtime; Infinity, by Michael Feil, is the fast
Python alternative; the original sentence-transformers library is commonly
served behind a thin HTTP wrapper; and Ollama serves embeddings via
/api/embeddings for the small-deployment crowd. They look like miniature
inference servers because that’s exactly what they are.
What goes wrong
Embedding servers typically expose an OpenAI-compatible /v1/embeddings
endpoint and are usually deployed without auth, on the assumption that
“only my RAG pipeline talks to it.” When the host ends up reachable on a
public IP, an attacker gets a free embedding service for their own RAG
pipelines. More damaging: the model the server loads is often a fine-tuned
variant trained on the operator’s private corpus, and those custom weights
are often what sets the operator’s product apart from the generic
alternative. An open endpoint hands anyone query access to that model.
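
To make “without auth” concrete, here is a minimal sketch of the request
an OpenAI-compatible embeddings endpoint accepts. The helper name
build_embedding_request and the example host are hypothetical; the body
fields (model, input) follow the OpenAI embeddings API shape. The function
only constructs the request and never sends it. The point is that nothing
in it carries a credential.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, text: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style embeddings request.

    Hypothetical helper for illustration: note the absence of any
    Authorization header or API key.
    """
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example host is an assumption, not a real target.
req = build_embedding_request("http://localhost:8080", "example-model", "hello")
```

If a server answers a request like this from an arbitrary network location,
the only thing protecting the model is the network boundary.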
How we test
We query /v1/models (or the /info endpoint that TEI exposes) for the model
inventory and tokenizer metadata, then record the model identifier. If
the model name maps to a known Hugging Face artefact, we attribute it via
the publishing org. If it’s a private fine-tune, we record the architecture
and tokenizer fingerprint, which is sufficient evidence of operator
intellectual property without our needing to issue any embedding requests.
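
The metadata-only probe described above might be sketched as follows. This
is an illustration under stated assumptions, not our production tooling:
the function names are hypothetical, the /v1/models response is assumed to
follow the OpenAI list shape ({"data": [{"id": ...}]}), and the /info
response is assumed to carry the identifier in a model_id field. The sketch
only splits the org prefix out of an org/name identifier; confirming that
the name maps to a published artefact is a separate lookup not shown here.

```python
import json
import urllib.request
from typing import Optional

# Metadata endpoints only; no embedding request is ever issued.
PROBE_PATHS = ("/v1/models", "/info")


def extract_model_ids(payload: dict) -> list[str]:
    """Pull model identifiers out of either assumed response shape."""
    ids = []
    # OpenAI-style listing: {"data": [{"id": "..."}, ...]}
    data = payload.get("data")
    if isinstance(data, list):
        for entry in data:
            if isinstance(entry, dict) and isinstance(entry.get("id"), str):
                ids.append(entry["id"])
    # TEI-style info object: {"model_id": "..."} (assumed field name)
    if isinstance(payload.get("model_id"), str):
        ids.append(payload["model_id"])
    return ids


def publishing_org(model_id: str) -> Optional[str]:
    """Return 'org' from 'org/name'; None for bare names or filesystem paths."""
    org, sep, name = model_id.partition("/")
    if sep and org and name and "/" not in name:
        return org
    return None


def probe(base_url: str, timeout: float = 5.0) -> list[str]:
    """GET each metadata path and collect distinct model identifiers."""
    found = []
    for path in PROBE_PATHS:
        url = base_url.rstrip("/") + path
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                payload = json.load(resp)
        except (OSError, ValueError):
            # Unreachable, HTTP error, or non-JSON body: skip this path.
            continue
        if isinstance(payload, dict):
            for model_id in extract_model_ids(payload):
                if model_id not in found:
                    found.append(model_id)
    return found
```

An identifier with no org prefix, or one that looks like a local path, is a
hint that the model is a private fine-tune rather than a published one.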