What it is
Ollama is the easiest way to run a large language model on your own hardware.
One binary, one command: ollama pull llama3 and you have a local OpenAI-style
API on port 11434. It pulls quantised model weights from its own registry,
manages the GPU layout, and serves an OpenAI-compatible chat endpoint. It is
genuinely beautifully engineered, and it is the reason most of the world’s
self-hosted AI exists.
What goes wrong
The framework has no authentication concept by design. The maintainers’
position is that auth is an upstream concern (run it behind a reverse proxy,
behind Tailscale, behind your firewall). Most operators don’t. Any host
running Ollama on a public IP is a free, unauthenticated, unlimited model
endpoint: an attacker can list the model inventory at /api/tags, chat
through /api/chat, and even pull arbitrary new models via /api/pull,
which silently downloads gigabytes onto the operator’s disk and bills any
attached cloud egress.
How we test
We hit /api/tags to enumerate the loaded models (this is the population-scale
fingerprint behind our cross-cloud surveys), capture the response, and
attribute via the kind of models loaded: a host serving gemma3:e4b and
nothing else is a hobbyist; a host serving fifteen fine-tuned variants of
llama3:70b with custom Modelfiles is a commercial operator. We do not issue
chat completions. We do not call /api/pull. The model inventory tells the
whole story.