What it is
A vector database stores high-dimensional embeddings (numerical fingerprints of text, images, and audio) and answers nearest-neighbour queries against them. It is the memory of most RAG systems. The popular ones each carve out a slightly different niche: ChromaDB (the developer-friendly default), Qdrant (Rust-fast, popular in production), Milvus (the heavyweight enterprise option), Weaviate (schema-rich), Pinecone (managed-only), pgvector (a Postgres extension), and Elasticsearch with its dense_vector type (the one that already lives in your stack).
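To make "nearest-neighbour query" concrete, here is a minimal brute-force sketch in plain Python. It is illustrative only: real engines use approximate indexes rather than a full scan, and the record IDs, three-dimensional vectors, and function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbours(query, records, k=2):
    """Return the IDs of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in records.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy "collection": real embeddings have hundreds of dimensions.
records = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.9, 0.1, 0.0],
    "doc-c": [0.0, 1.0, 0.0],
}

result = nearest_neighbours([0.2, 0.9, 0.0], records, k=2)
print(result)
```

The query vector points mostly along the second axis, so doc-c ranks first and doc-b second.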
What goes wrong
Most of the self-hosted engines above ship with authentication off by
default, and many deployments leave them on a public listen socket. The
exposure isn’t theoretical: the problem is the contents. Every collection is
named (often after the project: customer-support-knowledge,
legal-discovery-q4, patient-notes-v2); collections typically carry the
embedded text in their metadata; many also contain the original source URL
or document ID. Reading a single collection can let an attacker reconstruct
much of the operator’s internal corpus, plus the prompts the operator has
been embedding (which are often customer queries).
How we test
We hit the heartbeat endpoint to confirm the engine, list collections via the unauthenticated metadata API, and read the first record’s metadata only (never the raw vectors, never the bulk content). The collection names plus the metadata-schema fields are sufficient evidence of exposure. For operators we already know (universities, medical centres, financial institutions), we draft the disclosure on collection names alone, which avoids touching the contents at all.
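The data-minimisation step can be sketched as follows. This is not our actual tooling. It assumes ChromaDB v1-style paths (other engines and versions differ), and the two fetch callables are hypothetical stand-ins for an HTTP client, stubbed here so the example runs offline. The point it illustrates is that the evidence keeps collection names and metadata field names, never the values.

```python
# Assumed ChromaDB v1-style paths; check the engine's own docs before relying on them.
HEARTBEAT_PATH = "/api/v1/heartbeat"
COLLECTIONS_PATH = "/api/v1/collections"

def collect_evidence(base_url, fetch_json, fetch_first_metadata):
    """Return collection names plus metadata field names, or None if no heartbeat.

    fetch_json and fetch_first_metadata are hypothetical callables supplied by
    the caller; metadata values are discarded and never leave this function.
    """
    if not fetch_json(base_url + HEARTBEAT_PATH):
        return None  # engine not confirmed, stop here
    evidence = []
    for collection in fetch_json(base_url + COLLECTIONS_PATH):
        metadata = fetch_first_metadata(base_url, collection) or {}
        evidence.append({
            "collection": collection["name"],
            "metadata_fields": sorted(metadata.keys()),  # schema only, no values
        })
    return evidence

# Offline stubs with made-up responses, for illustration only.
def fake_fetch_json(url):
    if url.endswith(HEARTBEAT_PATH):
        return {"ok": True}
    return [{"name": "patient-notes-v2"}]

def fake_fetch_first_metadata(base_url, collection):
    return {"text": "<embedded text>", "document_id": "<id>"}

evidence = collect_evidence("http://localhost:8000",
                            fake_fetch_json, fake_fetch_first_metadata)
print(evidence)
```

Keeping only the sorted key names means the report can show that the schema exposes text and document identifiers without reproducing any record.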