LLM Observability, Observability & Safety, NuClide Stack

What it is

Once an operator runs an LLM in production, they need to see what it is doing. LLM observability platforms record every prompt, completion, tool call, and retrieved document, with token-cost and latency overlays. Langfuse is the leading open-source, self-hostable option; Helicone is the proxy-based one; LangSmith (LangChain) is the SaaS option; Arize AI’s Phoenix is the open-source agent development and evaluation platform; Lunary sits in the same space. Together they are the AI equivalent of Datadog: the system of record for everything the model has done.

What goes wrong

The trace store is the operator’s most sensitive AI artefact. It contains every customer prompt verbatim (often including customer PII), every retrieved document (often drawn from the operator’s private corpus), and every tool call with full arguments (often including credentials in plain text). Langfuse ships with a project-key model, but operators sometimes bypass it by enabling the public-projects feature for “share a trace with my colleague” workflows and then forgetting to disable it. After that, the traces are indexed and crawlable by default.
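
One way to limit the credential exposure described above is to mask secret-looking tool-call arguments before they are written to any trace store. The following is a minimal sketch, not a feature of any particular platform: the function name `redact_tool_args`, the key-name pattern, and the `[REDACTED]` marker are all illustrative assumptions.

```python
import re
from typing import Any

# Hypothetical key-name pattern (an assumption, not taken from any platform).
# It is deliberately broad, so it also masks harmless keys such as "max_tokens".
SENSITIVE_KEY = re.compile(
    r"(api[_-]?key|secret|token|password|authorization)", re.IGNORECASE
)


def redact_tool_args(value: Any) -> Any:
    """Return a copy of a tool-call argument tree with secret-looking values masked."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]"
            if isinstance(key, str) and SENSITIVE_KEY.search(key)
            else redact_tool_args(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_tool_args(item) for item in value]
    # Scalars are returned unchanged; only values under sensitive keys are masked.
    return value
```

Matching on key names only catches secrets that are labelled as such; a credential embedded in a free-text prompt or a URL would pass through untouched, so this complements access control on the trace store rather than replacing it.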

How we test

We probe /api/public/projects and /api/public/traces for the trace inventory; the response shape confirms that the host is Langfuse and reveals project names along with first-seen and last-seen timestamps. We never read trace bodies. Project names help attribute the operator (typical names are “customer-support-prod”, “sales-enrichment”, etc.), and the date range characterises the corpus volume. Trace counts in the millions on a single project warrant priority disclosure.
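
The triage step above can be sketched as a metadata-only summary of an inventory response that has already been fetched and parsed. This is a sketch under stated assumptions: the field names `data`, `name`, `firstSeen`, `lastSeen` and `traceCount` are hypothetical placeholders rather than a documented response schema, and the one-million threshold is our reading of “in the millions”.

```python
from dataclasses import dataclass
from typing import Any, Optional

# Assumed threshold for "trace counts in the millions on a single project".
PRIORITY_TRACE_COUNT = 1_000_000


@dataclass
class ProjectSummary:
    name: str
    first_seen: Optional[str]
    last_seen: Optional[str]
    trace_count: Optional[int]
    priority: bool


def summarise_inventory(payload: dict[str, Any]) -> list[ProjectSummary]:
    """Summarise project metadata only; trace bodies are never inspected."""
    summaries = []
    # All field names below are hypothetical; adjust to the observed response shape.
    for item in payload.get("data", []):
        count = item.get("traceCount")
        has_count = isinstance(count, int)
        summaries.append(
            ProjectSummary(
                name=str(item.get("name", "")),
                first_seen=item.get("firstSeen"),
                last_seen=item.get("lastSeen"),
                trace_count=count if has_count else None,
                priority=has_count and count >= PRIORITY_TRACE_COUNT,
            )
        )
    return summaries
```

Keeping the summary limited to names, timestamps and counts mirrors the rule that trace bodies are never read: nothing in the function touches prompt, completion or tool-call content.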

Research

Cross-cloud surveys

The Auth-on-Default Landscape of OSS AI/LLM Infrastructure

Langfuse Population Survey — 816/918 Open Registration (88.9%)

Arize Phoenix Population Survey — 41/55 Unauthenticated Project Disclosure

AI Evaluation and Red-Team Platform Survey — Promptfoo Population Pass

AI Cost / Billing / Usage Analytics population survey: Langfuse secret-key exposures + Dokploy frontend-secret leak class

AI observability tier, Phase 2 synthesis (cross-cuts + version-deltas)

VisorBishop loop-iteration #1: Re-sweep all Phase 1 corpora, surface gaps

VisorBishop: Phase 3 meta-fingerprinter for the AI observability tier

Helicone deep-dive: Phase 2 (default ClickHouse exposure on benchmarkit.solutions)

Helicone LLM-observability population survey (21-host self-hosted population)

Langfuse deep-dive: Phase 2 (source audit + latent primitives + extended IP-shadow)

Langfuse LLM-observability population survey (1,333-host population, 0% unauth)

LangSmith deep-dive: Phase 2 (customer identity disclosure on 19 enterprise operators)

LangSmith LLM-observability population survey (27-host self-hosted population)

AI observability tier: Small platforms population sweep (Lunary, OpenLIT, Pezzo)

AI observability tier: Cross-platform synthesis (Phase 1)

Agent frameworks cross-survey, planning + dork catalog (2026-05-07)

Langfuse cross-survey-correlation single-host case study (2026-05-06)

LLM Observability + Training Telemetry: Auth Posture Survey

Field cases

Langfuse Postgres Cert Pivot — Data Tier Survey + CygnusAlpha Production Finding

Coordinated disclosures

Artsypetz Langfuse Open Signup (2026-05-25)

In the Langfuse container env (or .env file):

Other categories in this layer

AI Safety & Evals

Prompt Management