Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / DATA LAYER

Data Labeling

Label Studio, Argilla, CVAT, Doccano, Prodigy: training-data annotation

Vector stores, registries, memory, datasets: what the model knows and remembers.

What it is

Models learn from labels. A data-labeling platform is the editing environment where humans annotate the raw data: boxes around objects, classifications on text, transcriptions of audio, span-level reasoning traces for RLHF. Label Studio (HumanSignal) is the universal multi-modal one; Argilla (Hugging Face) is the LLM-centric one; CVAT (originally from Intel, now maintained by CVAT.ai) owns computer vision; Doccano is the lightweight NLP option; Prodigy (Explosion) is the paid serious one. The dataset that comes out of these tools is what the next model gets trained on, so the labelling stack sits upstream of model behaviour itself.

What goes wrong

An exposed platform leaks two things. First, the raw data being labelled: often unredacted medical images, customer support transcripts, legal documents. Second, the labels themselves, which encode the operator’s labelling rubric and frequently the model bias they are trying to amplify or correct. Default deployments have weak credentials (admin/admin is alarmingly common in the Label Studio Docker Compose examples) or rely on API tokens that operators share in Slack and forget to rotate.
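For operators who want to catch the weak-credential case before deployment, a minimal sketch of a self-audit follows. It assumes the deployment's environment variables have already been loaded into a dictionary. The variable names in the usage example (such as LABEL_STUDIO_PASSWORD), the weak-value list, and the 12-character minimum are all illustrative assumptions, not a statement of how any of these platforms is configured.

```python
import re
from typing import Dict, List

# Illustrative weak defaults; an assumption, not an exhaustive catalogue.
WEAK_VALUES = {"admin", "password", "changeme", "labelstudio", "123456"}

# Only variables whose names look like secrets are inspected.
SECRET_KEY_PATTERN = re.compile(r"(PASSWORD|TOKEN|SECRET)", re.IGNORECASE)

# Hypothetical policy threshold for this sketch.
MIN_SECRET_LENGTH = 12


def find_weak_credentials(env: Dict[str, str]) -> List[str]:
    """Return the sorted names of secret-like variables with weak values."""
    findings = []
    for key, value in env.items():
        if not SECRET_KEY_PATTERN.search(key):
            continue  # not a secret-like variable, skip it
        if value.lower() in WEAK_VALUES or len(value) < MIN_SECRET_LENGTH:
            findings.append(key)
    return sorted(findings)


if __name__ == "__main__":
    # Hypothetical environment taken from a compose file.
    example_env = {
        "LABEL_STUDIO_USERNAME": "admin",
        "LABEL_STUDIO_PASSWORD": "admin",
        "LOG_LEVEL": "debug",
    }
    print(find_weak_credentials(example_env))
```

A check like this only flags obvious defaults. It says nothing about tokens that are strong but have already been pasted into a chat channel, which still need rotation.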

How we test

We confirm the platform via its /version endpoint, then list projects via the unauthenticated API surface (Label Studio’s /api/projects works without auth on the default install). Project names plus task counts tell the story: a project called “medical-imaging-batch-7” with 12,000 tasks is a healthcare operator; a project called “red-team-prompts” with a few hundred tasks is an AI lab’s safety team. We never download tasks. The metadata characterises both the data class and the operator function.
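The metadata-only characterisation described above can be sketched as a small offline function. This is a hedged illustration, not our actual tooling: it takes an already-retrieved project listing (parsed JSON), reads only the project name and task count, and never touches task contents. The field names `title` and `task_number`, the optional `results` envelope, and the keyword map are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

# Hypothetical keyword map; categories and keywords are illustrative only.
DATA_CLASS_KEYWORDS = {
    "healthcare": ("medical", "imaging", "radiology", "patient"),
    "ai-safety": ("red-team", "redteam", "jailbreak", "safety"),
    "legal": ("legal", "contract", "court"),
    "support": ("support", "ticket", "transcript"),
}


@dataclass(frozen=True)
class ProjectSummary:
    name: str
    task_count: int
    data_class: str


def classify_name(name: str) -> str:
    """Map a project name to the first data class whose keyword it contains."""
    lowered = name.lower()
    for data_class, keywords in DATA_CLASS_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return data_class
    return "unknown"


def summarise_projects(
    payload: Union[List[Dict[str, Any]], Dict[str, Any]]
) -> List[ProjectSummary]:
    """Summarise a project listing using names and task counts only."""
    # Accept either a bare list or a paginated {"results": [...]} envelope.
    projects = payload.get("results", []) if isinstance(payload, dict) else payload
    return [
        ProjectSummary(
            name=project.get("title", ""),
            task_count=int(project.get("task_number") or 0),
            data_class=classify_name(project.get("title", "")),
        )
        for project in projects
    ]


if __name__ == "__main__":
    listing = [
        {"title": "medical-imaging-batch-7", "task_number": 12000},
        {"title": "red-team-prompts", "task_number": 300},
    ]
    for summary in summarise_projects(listing):
        print(summary.name, summary.task_count, summary.data_class)
```

Keyword matching on names is deliberately crude. It is enough to separate a healthcare operator from a safety team, and it keeps the analysis on the metadata side of the line.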