Corpus index built 2026-06-07 23:58 UTC


AI Safety & Evals

Guardrails AI, NeMo Guardrails, lm-eval

Tracing, evaluation, and policy enforcement around the entire stack.

What it is

Eval harnesses measure model behaviour against benchmarks: capabilities, biases, refusal patterns, jailbreak resistance. lm-evaluation-harness (EleutherAI, usually shortened to lm-eval) is the most widely used capability-eval harness; EleutherAI’s safety eval forks track refusal and harm rates; NVIDIA NeMo Guardrails and Guardrails AI sit in front of production models and constrain output in real time; Inspect (UK AI Safety Institute) and Anthropic’s evals are the research-grade options. Together they are how a serious AI deployment knows whether the model is doing what it is supposed to do.
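To make "sit in front of production models and constrain output in real time" concrete, here is a minimal sketch of the guardrail pattern in plain Python. It is not the NeMo Guardrails or Guardrails AI API; `make_guarded`, `echo_model`, the regex policy, and the refusal string are all hypothetical names chosen for illustration.

```python
import re
from typing import Callable, List, Pattern

# Hypothetical refusal message; real guardrail frameworks make this configurable.
REFUSAL = "I can't help with that."


def make_guarded(
    model: Callable[[str], str],
    blocked: List[Pattern[str]],
    refusal: str = REFUSAL,
) -> Callable[[str], str]:
    """Wrap a model callable with a simple input rail and output rail."""

    def guarded(prompt: str) -> str:
        # Input rail: refuse before the model is ever called.
        if any(p.search(prompt) for p in blocked):
            return refusal
        reply = model(prompt)
        # Output rail: refuse if the model's reply violates the policy.
        if any(p.search(reply) for p in blocked):
            return refusal
        return reply

    return guarded


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption for this sketch)."""
    return "You said: " + prompt


# Toy policy: block anything mentioning "ssn" as a whole word.
policy = [re.compile(r"\bssn\b", re.IGNORECASE)]
guarded_echo = make_guarded(echo_model, policy)
```

Note that the policy list is exactly the kind of material discussed in the next section: whoever can read `policy` knows where the boundary sits.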

What goes wrong

Eval and guardrail systems hold the operator’s threat model: the prompts they consider harmful, the responses they consider unacceptable, the policy they want enforced. When an eval-harness server is exposed unauthenticated, an attacker reads the full set of red-team prompts the operator uses, learns which of those prompts the model currently fails, and gets a precise roadmap for bypassing the operator’s guardrails. Exposing a guardrail configuration also discloses the policy boundary itself.
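The distinction between a protected and an exposed endpoint can be sketched as a small classifier over the response to an unauthenticated request. This is an illustrative assumption about how one might label results, not a description of any particular scanner; `Exposure` and `classify` are hypothetical names, and the rule that a non-empty 2xx body counts as exposure is a simplification.

```python
from enum import Enum


class Exposure(Enum):
    PROTECTED = "protected"        # server demanded credentials
    EXPOSED = "exposed"            # server answered with content, no credentials
    INCONCLUSIVE = "inconclusive"  # anything else (404, redirect, empty body, ...)


def classify(status: int, body: str) -> Exposure:
    """Label the response to a request sent WITHOUT credentials."""
    if status in (401, 403):
        return Exposure.PROTECTED
    if 200 <= status < 300 and body.strip():
        return Exposure.EXPOSED
    return Exposure.INCONCLUSIVE
```

A redirect to a login page falls under `INCONCLUSIVE` here; a stricter sketch might follow the redirect before deciding.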

How we test

We probe for harness control endpoints (lm-eval’s WebSocket UI, NeMo Guardrails’ REST API on port 8000) and read the policy/eval inventory via their unauthenticated metadata endpoints. We never run new evals; the eval names alone are the disclosure evidence. Names like “jailbreak-bench”, “medical-refusals”, and “copyright-output” characterise the operator’s concerns, and can identify the team behind them, without our needing to read prompt bodies.
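A name-only inventory read might be sketched as follows. The listing path `/v1/rails/configs` and the `{"id": ...}` response shape are assumptions about a NeMo Guardrails server and should be checked against the deployed version; `CONCERN_KEYWORDS`, `fetch_inventory`, `extract_names`, and `characterise` are hypothetical names for this illustration. The sketch issues a single GET and never requests prompt bodies or triggers an eval run.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Hypothetical keyword-to-concern map, for illustration only.
CONCERN_KEYWORDS = {
    "jailbreak": "adversarial robustness",
    "refusal": "refusal policy",
    "medical": "medical advice",
    "copyright": "intellectual property",
}


def fetch_inventory(base_url: str, timeout: float = 5.0) -> Any:
    """Single read-only GET of the config listing (path is an assumption)."""
    url = base_url.rstrip("/") + "/v1/rails/configs"
    req = urllib.request.Request(url, method="GET")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def extract_names(inventory: Any) -> List[str]:
    """Keep only names/ids; ignore every other field in the listing."""
    names: List[str] = []
    for item in inventory if isinstance(inventory, list) else []:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and "id" in item:
            names.append(str(item["id"]))
    return names


def characterise(names: List[str]) -> Dict[str, List[str]]:
    """Map each name to the concerns its keywords suggest."""
    findings: Dict[str, List[str]] = {}
    for name in names:
        lowered = name.lower()
        findings[name] = sorted(
            {label for kw, label in CONCERN_KEYWORDS.items() if kw in lowered}
        )
    return findings


# Usage with the example names from the text (no network call involved).
findings = characterise(["jailbreak-bench", "medical-refusals", "copyright-output"])
```

Working from names alone keeps the evidence minimal: the output shows what the operator worries about without copying any red-team prompt out of their system.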