Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / MODEL LAYER

Triton Inference Server

NVIDIA model serving

The runtime that actually executes the model: where the weights run.

What it is

Triton is NVIDIA’s enterprise inference server: the heavyweight runtime designed for production model serving across every hardware target NVIDIA makes. It supports TensorRT, ONNX Runtime, PyTorch, TensorFlow, vLLM, and Python backends; it runs ensemble pipelines that chain several models; and it speaks both a binary protocol (gRPC) and an HTTP/REST one. When you see a tritonserver container in a Kubernetes deployment, you’re looking at someone serious about ML throughput.

What goes wrong

Triton’s HTTP endpoints (/v2/models/{name}, /v2/repository/index, /v2/health/ready) are unauthenticated by design (NVIDIA’s position: enforce auth at the ingress). The model repository index is a verbatim list of model names, their versions, and their load state; the per-model metadata adds the backend. For commercial operators these names alone disclose intellectual property: fraud-detection-v3, recommender-cold-start-v7, biometric-match-v2. We’ve found Triton instances exposing classifier models that are clearly pulled from the operator’s product, alongside the safety classifiers the operator hopes nobody bypasses.
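To make the exposure concrete, here is a minimal sketch of how little work it takes to turn a repository index into a product inventory. The payload is illustrative only: the names are the examples above, and the field layout (name, version, state) is an assumption about what the parsed JSON looks like. The helper summarize_index is hypothetical, not part of any Triton client.

```python
from collections import defaultdict

# Illustrative payload only: the names are the examples from the text,
# and the field layout (name / version / state) is an assumption about
# what a repository index response looks like once parsed from JSON.
SAMPLE_INDEX = [
    {"name": "fraud-detection-v3", "version": "3", "state": "READY"},
    {"name": "recommender-cold-start-v7", "version": "7", "state": "READY"},
    {"name": "biometric-match-v2", "version": "2", "state": "UNAVAILABLE"},
]


def summarize_index(entries):
    """Group model names by load state, tolerating missing fields."""
    by_state = defaultdict(list)
    for entry in entries:
        state = entry.get("state", "UNKNOWN")
        by_state[state].append(entry.get("name", "<unnamed>"))
    # Sort names so the summary is stable across runs.
    return {state: sorted(names) for state, names in by_state.items()}


if __name__ == "__main__":
    for state, names in summarize_index(SAMPLE_INDEX).items():
        print(state, names)
```

Nothing here touches the models themselves; the names and states are enough to sketch what the operator builds.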

How we test

We hit /v2 for the version banner, /v2/repository/index for the catalogue, and /v2/models/{name} for the model metadata (which exposes input/output tensor names, datatypes, and shapes, sufficient to reverse-engineer the model’s purpose without ever invoking it). In neither case do we run the model: when it is a published architecture (a known LLM, a known vision backbone) we do not issue inference, and when it’s a custom fine-tune we capture only the metadata.
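The sequence above can be sketched as a small metadata-only probe. This is an illustration under stated assumptions, not our actual tooling: the function names (http_fetch, metadata_probe) are hypothetical, the index is assumed to be requested with POST as Triton’s repository extension describes, and the transport is injectable so the logic can be exercised without touching a live server.

```python
import json
import urllib.request
from urllib.parse import quote


def http_fetch(method, url, timeout=5.0):
    """Default transport: one HTTP request, JSON body decoded."""
    # Assumption: the index endpoint accepts an empty JSON object as body.
    data = b"{}" if method == "POST" else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def metadata_probe(base_url, fetch=http_fetch):
    """Collect banner, catalogue, and per-model metadata. Never infers."""
    base = base_url.rstrip("/")
    report = {"server": fetch("GET", base + "/v2"), "models": {}}
    index = fetch("POST", base + "/v2/repository/index")
    for entry in index:
        name = entry.get("name")
        if not name:
            continue  # skip malformed entries rather than guess a name
        # Only the metadata path is requested; no /infer URL is ever built.
        url = base + "/v2/models/" + quote(name, safe="")
        report["models"][name] = fetch("GET", url)
    return report
```

Passing a stub as fetch makes it easy to confirm that the only paths requested are the three named above, which is the property that matters when the target belongs to someone else.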