What it is
llama.cpp is a C/C++ implementation of LLM inference, originally written for LLaMA. The project introduced the GGUF model format and runs quantized LLMs on commodity CPU and small-GPU hardware. Its built-in HTTP server (llama-server) exposes an OpenAI-compatible API at /v1/models and /v1/chat/completions, plus the native /props and /completion endpoints. Operators frequently run llama-server on Ollama's default port (:11434) so existing Ollama clients can swap backends transparently.
What goes wrong
llama-server runs without authentication by default. It accepts an optional --api-key flag, but the design assumption (shared with Ollama, vLLM, and Triton) is that access control comes from a reverse proxy. Population-scale surveys find ~70% of :11434 ports running llama.cpp instead of (or alongside) Ollama, all unauthenticated. The /props endpoint discloses the loaded chat template (sometimes a custom-trained one), the model's n_ctx, the total number of slots, and the operator's quantization config. /completion accepts arbitrary prompts, so anyone who can reach the port can burn operator compute. When the operator has loaded a custom-finetuned model (Xiyan_FT_14B, Baichuan_32B_medical, etc.), the model itself is the operator's intellectual property, and its identity and configuration are readable by anyone.
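To make the /props disclosure concrete, here is a minimal sketch that reduces a /props response body to the fields an unauthenticated caller learns. The function name summarize_props is hypothetical, and the exact key names (n_ctx nested under default_generation_settings, total_slots, model_path) are assumptions that may differ between llama.cpp builds. The sample body is fabricated for illustration.

```python
import json


def summarize_props(props_body: str) -> dict:
    """Extract the config-disclosing fields from a /props JSON body.

    Key names are assumptions about the response layout; missing keys
    simply come back as None rather than raising.
    """
    props = json.loads(props_body)
    settings = props.get("default_generation_settings") or {}
    template = props.get("chat_template") or ""
    return {
        "n_ctx": settings.get("n_ctx"),
        "total_slots": props.get("total_slots"),
        "model_path": props.get("model_path"),
        "has_chat_template": bool(template),
        "chat_template_chars": len(template),
    }


if __name__ == "__main__":
    # Fabricated sample body, not captured from a real host.
    sample = json.dumps({
        "default_generation_settings": {"n_ctx": 4096},
        "total_slots": 2,
        "model_path": "/models/example.gguf",
        "chat_template": "{{ messages }}",
    })
    print(summarize_props(sample))
```

The summary records only the length of the chat template rather than its text, which keeps a report from re-publishing a template the operator may consider proprietary.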
How we test
We check three signals to distinguish llama.cpp from a co-deployed Ollama: GET /v1/models should return JSON containing "owned_by":"llamacpp"; GET /props should return the server-info JSON with default_generation_settings and chat_template; and the HTTP Server: header reads llama.cpp on most builds. We never POST to /completion or /v1/chat/completions; the disclosure of model identity and configuration is itself the finding. The llama.cpp fingerprint was added to aimap in v1.9.4 (2026-05-15) after a field instance was caught running a custom BitNet-b1.58-2B-4T on a Contabo SG host.
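The three checks described above can be sketched as a pure function over responses that were already fetched with GET requests. This is an illustration under stated assumptions, not aimap's actual implementation: the function name llamacpp_signals, the signal labels, and the assumption that /v1/models wraps its entries in a "data" list (the OpenAI list shape) are ours.

```python
import json
from typing import List, Mapping, Optional


def _parse_object(body: Optional[str]) -> Optional[dict]:
    """Return the body as a dict, or None if it is absent or not a JSON object."""
    if not body:
        return None
    try:
        data = json.loads(body)
    except ValueError:
        return None
    return data if isinstance(data, dict) else None


def llamacpp_signals(
    headers: Mapping[str, str],
    models_body: Optional[str],
    props_body: Optional[str],
) -> List[str]:
    """List which llama.cpp fingerprint signals are present.

    Inputs come from GET requests only; nothing here sends a prompt.
    """
    signals = []

    # Signal 1: Server header (header names are case-insensitive).
    server = next((v for k, v in headers.items() if k.lower() == "server"), "")
    if "llama.cpp" in server.lower():
        signals.append("server_header")

    # Signal 2: /v1/models entry with owned_by == "llamacpp".
    models = _parse_object(models_body)
    entries = models.get("data") if models else None
    if isinstance(entries, list) and any(
        isinstance(e, dict) and e.get("owned_by") == "llamacpp" for e in entries
    ):
        signals.append("models_owned_by")

    # Signal 3: /props carrying both keys named in the text.
    props = _parse_object(props_body)
    if props and "default_generation_settings" in props and "chat_template" in props:
        signals.append("props_shape")

    return signals


if __name__ == "__main__":
    # Fabricated responses for illustration.
    print(llamacpp_signals(
        {"Server": "llama.cpp"},
        '{"object": "list", "data": [{"id": "example", "owned_by": "llamacpp"}]}',
        '{"default_generation_settings": {}, "chat_template": ""}',
    ))
```

Returning the list of matched signals instead of a single boolean leaves the decision threshold to the caller. Since the Server header is only present on most builds, a caller would probably not want to require all three.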