What it is
Triton is NVIDIA’s enterprise inference server: the heavyweight runtime
designed for production model serving across every hardware target NVIDIA
makes. It supports TensorRT, ONNX Runtime, PyTorch, TensorFlow, vLLM, and Python
backends; it runs ensemble pipelines across models; and it speaks both a binary
protocol (gRPC) and HTTP/REST. When you see a tritonserver container in a
Kubernetes deployment, you’re looking at someone serious about ML throughput.
What goes wrong
Triton’s HTTP endpoints (/v2/models/{name}, /v2/repository/index, /v2/health/ready)
are unauthenticated by design (NVIDIA’s position: enforce auth at the
ingress). The model repository index is a verbatim list of model names, their
versions, and their load state; each model’s backend is one metadata request
away. For commercial operators these
names are their intellectual property: fraud-detection-v3, recommender-cold-start-v7,
biometric-match-v2. We’ve found Triton instances exposing classifier models
that are clearly pulled from the operator’s product, alongside the safety
classifiers the operator hopes nobody bypasses.
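
To make the exposure concrete, here is a minimal sketch of how an index response collapses into a catalogue. The sample payload is hypothetical: it reuses the illustrative model names above, and the field names (name, version, state, reason) are our understanding of Triton's model repository extension, not captured output. The summarise_index helper is ours, not part of any Triton client.

```python
import json

# Hypothetical body of a /v2/repository/index response. The model names are
# the illustrative ones from the text, not real findings.
SAMPLE_INDEX = json.loads("""
[
  {"name": "fraud-detection-v3", "version": "1", "state": "READY"},
  {"name": "recommender-cold-start-v7", "version": "2", "state": "READY"},
  {"name": "biometric-match-v2", "version": "1",
   "state": "UNAVAILABLE", "reason": "unloaded"}
]
""")


def summarise_index(entries):
    """Group index entries by model name, collecting (version, state) pairs."""
    catalogue = {}
    for entry in entries:
        # version and state may be missing for models that were never loaded
        pair = (entry.get("version"), entry.get("state"))
        catalogue.setdefault(entry["name"], []).append(pair)
    return catalogue


if __name__ == "__main__":
    for name, versions in summarise_index(SAMPLE_INDEX).items():
        print(name, versions)
```

Nothing here requires credentials or an inference call: a single response already yields the operator's model inventory.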
How we test
We hit /v2 for the version banner, /v2/repository/index (a POST) for the catalogue,
and /v2/models/{name} for the model metadata (the full configuration sits at
/v2/models/{name}/config; both expose input/output tensor shapes, sufficient to
reverse-engineer the model’s purpose without
ever invoking it). When the model is a published architecture (a known LLM, a
known vision backbone) we do not issue inference. When it’s a custom
fine-tune we capture only the metadata.
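
The procedure above can be sketched as follows, for use against deployments you are authorised to assess. This is a minimal illustration, not our production tooling: probe_metadata and http_fetch are hypothetical names, the base URL is a placeholder, and we assume the index is requested with POST and that only models reported as READY answer metadata requests. The fetch function is injectable so the logic can be exercised without a live server.

```python
import json
import urllib.request
from urllib.parse import quote


def http_fetch(method, url, timeout=5.0):
    """Issue one request and decode the JSON response body."""
    # Assumption: the index route expects POST; an empty JSON object is sent.
    data = b"{}" if method == "POST" else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def probe_metadata(base_url, fetch=http_fetch):
    """Collect the server banner, repository index, and per-model metadata.

    Only metadata routes are requested; no inference route is ever called.
    """
    base = base_url.rstrip("/")
    report = {
        "server": fetch("GET", f"{base}/v2"),
        "index": fetch("POST", f"{base}/v2/repository/index"),
        "models": {},
    }
    for entry in report["index"]:
        name = entry["name"]
        # Skip models that are not loaded, and query each name only once.
        if entry.get("state") != "READY" or name in report["models"]:
            continue
        meta = fetch("GET", f"{base}/v2/models/{quote(name, safe='')}")
        report["models"][name] = {
            "platform": meta.get("platform"),
            "inputs": meta.get("inputs", []),
            "outputs": meta.get("outputs", []),
        }
    return report


if __name__ == "__main__":
    # Placeholder address; point this at a server you are authorised to test.
    print(json.dumps(probe_metadata("http://localhost:8000"), indent=2))
```

Keeping the probe to three route families makes the no-inference rule easy to audit: the request log for a target should never contain a path ending in /infer.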