Data Labeling, Data Layer, NuClide Stack

What it is

Models learn from labels. A data-labeling platform is the editing environment where humans annotate raw data: boxes around objects, classifications on text, transcriptions of audio, span-level annotations on reasoning traces for RLHF. Label Studio (HumanSignal) is the general-purpose multi-modal option; Argilla (Hugging Face) is the LLM-centric one; CVAT (originally Intel, now CVAT.ai) dominates computer vision; Doccano is the lightweight NLP option; Prodigy (Explosion) is the commercial one. The dataset that comes out of these tools is what the next model is trained on, which puts the labeling stack upstream of model behavior itself.

What goes wrong

The platform exposes two things. First, the raw data being labeled: often unredacted medical images, customer support transcripts, legal documents. Second, the labels themselves, which encode the operator’s labeling rubric and frequently the model bias they are trying to amplify or correct. Default deployments often ship with weak credentials (admin/admin is alarmingly common in the Label Studio Docker Compose examples) or rely on token-based auth whose tokens operators share in Slack and forget to rotate.

How we test

We confirm the platform via its /version endpoint, then list projects via the unauthenticated API surface (Label Studio’s /api/projects works without auth on the default install). Project names plus task counts tell the story: a project called “medical-imaging-batch-7” with 12,000 tasks points to a healthcare operator; a project called “red-team-prompts” with a few hundred tasks points to an AI lab’s safety team. We never download tasks; the metadata alone characterizes both the data class and the operator function.
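
As an illustration of metadata-only triage, here is a minimal sketch in Python. It works on an already-parsed project listing and makes no network calls. The field names `title` and `task_number` are assumed to match what a Label Studio project listing returns. The keyword table and the names `HINTS`, `ProjectSummary`, `classify_title`, and `summarize_projects` are hypothetical, chosen for this example rather than taken from any survey tooling.

```python
from dataclasses import dataclass

# Hypothetical keyword hints for this sketch; a real rubric would be broader.
HINTS = {
    "healthcare": ("medical", "radiology", "patient", "clinical"),
    "ai-safety": ("red-team", "redteam", "jailbreak", "safety"),
}


@dataclass(frozen=True)
class ProjectSummary:
    title: str
    task_count: int
    guess: str  # coarse operator-function guess derived from the title only


def classify_title(title: str) -> str:
    """Return the first hint label whose keyword appears in the title."""
    lowered = title.lower()
    for label, keywords in HINTS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return "unknown"


def summarize_projects(projects: list[dict]) -> list[ProjectSummary]:
    """Reduce project metadata to (title, task count, guess); tasks are never read."""
    summaries = []
    for project in projects:
        title = str(project.get("title", ""))
        # Assumed field name; a missing or null count is treated as zero.
        count = int(project.get("task_number") or 0)
        summaries.append(ProjectSummary(title, count, classify_title(title)))
    return summaries


if __name__ == "__main__":
    sample = [
        {"title": "medical-imaging-batch-7", "task_number": 12000},
        {"title": "red-team-prompts", "task_number": 300},
    ]
    for summary in summarize_projects(sample):
        print(summary)
```

Keyword matching on titles is deliberately crude: it only needs to separate data classes at a glance, and any ambiguous title falls through to "unknown" rather than being forced into a category.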

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Cross-cloud surveys

4

Survey May 31, 2026

Argo Workflows Population Survey — Cat-29 (2026-05-31)

DCWF KSAT coverage

Survey May 31, 2026

Data Labeling & Annotation: the registration knob that re-opens the door

Data-labeling platforms sit at the input boundary of every supervised-learning and RLHF pipeline. They hold the raw data being labeled: PII-dense text, scanned documents, medical and facial imagery, a…

Survey May 16, 2026

Data-Labeling Population Survey (2026-05-16)

Survey of the data-labeling platform population. The systems that store training-data annotation tasks, often containing PII or sensitive labels. Smaller surface than other categories surveyed today;…

Survey May 1, 2026

Data Labeling / Annotation Servers: Cross-Cloud Survey (2026-05)

Data-labeling and annotation servers (Argilla, LabelStudio, Prodigy, doccano, CVAT) sit at the input boundary of every supervised-learning ML pipeline. They host the raw data being labeled, frequently…

Other categories in this layer

Vector Databases

Search Engines

OLAP / Analytics Backends

MLOps Tracking

Agent Memory

Object Storage

Compute Orchestration

GPU Compute & Telemetry

Container Orchestration

Medical / Edge AI

Backup & Snapshots

Fine-tuning Runtimes

Document Parsers

Model Hubs & Registries