Data Labeling / Annotation Servers: Cross-Cloud Survey (2026-05)
NuClide Research · 2026-05-04 (in progress)
Status: Discovery + deep-probe complete (2026-05-04). 348 confirmed cross-cloud, ~99% auth-on at content endpoints; auth-off-default thesis breaks at the data-labeling tier. Single-platform dominance: every confirmed instance is doccano.
DCWF KSAT coverage
Auto-derived from DCWF AI work-role rule files (ksat-tag).
- 672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, K7054, S7068, S7075, T5904
- 733 (AI Risk & Ethics Specialist): K7040, K7051, S7067, T5854, T5868, T5893
- overlap (Common AI KSATs (all 5 roles)): K108, K1157, K1158, K1159, K22, K6311, K6900, K6935, K7003, K942, S7065
Premise
Data-labeling and annotation servers (Argilla, LabelStudio, Prodigy, doccano, CVAT) sit at the input boundary of every supervised-learning ML pipeline. They host the raw data being labeled (frequently real customer PII, internal documents, facial imagery, medical scans, support-ticket transcripts, and financial filings) along with labeling-team workforce metadata.
Operators stand them up quickly to crowd-source annotation, then forget to lock them down before walking away from the project. The auth posture varies sharply by platform:
- Argilla ships with auth on by default since v1.x, but anonymous workspaces and default-public settings are common in tutorial deployments.
- LabelStudio ships with mandatory auth out-of-the-box, but instances deployed without RBAC expose /api/projects reads.
- Prodigy has no built-in auth; operators are expected to bolt on a reverse proxy. They often don’t.
- doccano has auth, but its /v1/projects endpoint can be made public for collaborative annotation.
- CVAT has auth on by default, but /api/server/about and project listing are sometimes left readable.
The auth-off-default thesis predicts that Prodigy, which has no auth concept, will be 100% unauthenticated at population scale, and that the other platforms will show lower unauthenticated rates. This survey tests that prediction.
Methodology
Discovery
Same tier-2 cross-cloud pattern as the existing surveys: Scaleway 7 + OVH 33 + Linode 36 = 76 prefixes ≈ 3.55M IPs.
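The prefix arithmetic above can be sketched with the standard-library ipaddress module. This is a minimal illustration, assuming the scan list is a set of non-overlapping CIDR prefixes; the example prefixes are documentation ranges chosen for illustration, not the survey's actual targets, and total_addresses() is a hypothetical helper name.

```python
import ipaddress

def total_addresses(prefixes):
    """Sum the address count of each CIDR prefix (overlaps are not deduplicated)."""
    return sum(ipaddress.ip_network(p, strict=False).num_addresses for p in prefixes)

# Per-provider prefix counts as stated in the survey text.
provider_prefix_counts = {"Scaleway": 7, "OVH": 33, "Linode": 36}
total_prefixes = sum(provider_prefix_counts.values())  # 7 + 33 + 36 = 76

# Hypothetical example prefixes (RFC 5737 documentation ranges), not the real scan list.
example_prefixes = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/25"]

print(total_prefixes)                     # 76
print(total_addresses(example_prefixes))  # 256 + 256 + 128 = 640
```

Running the same sum over the real 76 prefixes is what would yield the ≈ 3.55M figure.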
Ports scanned: 6900 (Argilla default), 8000 (doccano default), 8080 (LabelStudio / CVAT / Prodigy reverse-proxy default).
Note: ports 8000 and 8080 collide with dozens of other AI platforms (vLLM, OpenAI-compat servers, MCP HTTP+SSE, Airflow, Spark UI, Weaviate, etc.). Platform identification therefore relies on response signatures, not port alone.
Probe
data/datalabel-probe.py is a multi-platform fingerprint prober. For each (ip, port), it tries each platform’s distinctive endpoint in turn:
| Platform | Probe endpoint | Match signature |
|---|---|---|
| Argilla | GET /api/_info | JSON with version + elasticsearch keys |
| LabelStudio | GET /version | JSON with release and label-studio-os-* keys |
| doccano | GET /v1/health + GET /v1/projects | health responds; /v1/projects returns paginated {count, results} |
| CVAT | GET /api/server/about | JSON with name containing “Computer Vision Annotation Tool” |
| Prodigy | GET / | HTML body containing prodigy markers |
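The match signatures in the table can be read as predicates over response bodies. The following is a minimal sketch under stated assumptions, not the contents of data/datalabel-probe.py: all function names are hypothetical, the HTTP fetch is left out (the caller supplies bodies already retrieved), and because the source does not specify what the "prodigy markers" are, a case-insensitive substring check stands in for them.

```python
import json

def _as_json_object(body):
    """Return the parsed JSON object, or None if the body is not a JSON object."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return None
    return data if isinstance(data, dict) else None

def match_argilla(body):
    data = _as_json_object(body)
    return data is not None and "version" in data and "elasticsearch" in data

def match_labelstudio(body):
    data = _as_json_object(body)
    return (data is not None and "release" in data
            and any(key.startswith("label-studio-os-") for key in data))

def match_doccano_projects(body):
    data = _as_json_object(body)
    return data is not None and "count" in data and "results" in data

def match_cvat(body):
    data = _as_json_object(body)
    return data is not None and "Computer Vision Annotation Tool" in str(data.get("name", ""))

def match_prodigy(body):
    # Assumption: the real markers are unspecified; a substring check is a stand-in.
    return isinstance(body, str) and "prodigy" in body.lower()

# (platform, probe path, matcher), tried in turn as in the table.
SIGNATURES = [
    ("Argilla", "/api/_info", match_argilla),
    ("LabelStudio", "/version", match_labelstudio),
    ("doccano", "/v1/projects", match_doccano_projects),
    ("CVAT", "/api/server/about", match_cvat),
    ("Prodigy", "/", match_prodigy),
]

def identify(bodies):
    """bodies maps probe path -> response body for GETs that returned a response."""
    for platform, path, matcher in SIGNATURES:
        body = bodies.get(path)
        if body is None or not matcher(body):
            continue
        # doccano additionally requires that /v1/health responded.
        if platform == "doccano" and "/v1/health" not in bodies:
            continue
        return platform
    return None
```

Matching on response shape rather than port is what lets unrelated services on 8000 and 8080 fall through to None.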
For each confirmed instance, capture: platform, version, project/workspace count (if reachable unauth), auth posture (401/403 or 200 on /api/projects), raw match signature.
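One way to picture the captured record is as a small dataclass with the auth posture derived from the status code. This is a sketch only: the field names, the "unknown" bucket for other status codes, and the choice to drop the project count when the endpoint is not open are assumptions for illustration, not the probe's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

def classify_auth(status):
    """Map the HTTP status of the projects endpoint to an auth posture label."""
    if status in (401, 403):
        return "auth_required"
    if status == 200:
        return "open"
    return "unknown"  # assumption: anything else is left for manual review

@dataclass
class InstanceRecord:
    ip: str
    port: int
    platform: str
    version: Optional[str]
    project_count: Optional[int]  # only meaningful when reachable unauthenticated
    auth_posture: str
    raw_signature: str

def build_record(ip, port, platform, version, projects_status, project_count, raw_signature):
    posture = classify_auth(projects_status)
    count = project_count if posture == "open" else None
    return InstanceRecord(ip, port, platform, version, count, posture, raw_signature)

# Hypothetical example using a documentation-range IP.
record = build_record("192.0.2.10", 8000, "doccano", None, 401, None, '{"status": "ok"}')
print(record.auth_posture)  # auth_required
```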
Filters
- AS63949 honeypot fleet: apply the standard filter (393-host Akamai/Linode honeypot list at ~/recon/ollama-tier2-2026-05-04/as63949-honeypot-fleet.txt).
- Common port-8080 false positives: anything that returns Spark/Airflow/Weaviate signatures is excluded by the probe’s signature-matching (those don’t return Argilla/LabelStudio/CVAT JSON shapes).
- Auth-on-default instances: record presence (auth_required: true) but exclude from the “exposed projects/data” enumeration.
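The three filters can be sketched as a single pass over probe results. This is an illustrative outline, assuming the honeypot list is a text file with one IP per line and that each result is a dict with ip, platform, and auth_required keys; those shapes and the function names are assumptions, not the survey's actual tooling.

```python
def load_honeypot_ips(lines):
    """Parse one IP per line, skipping blanks and '#' comments (file format is assumed)."""
    ips = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            ips.add(entry)
    return ips

def apply_filters(records, honeypot_ips):
    """Return (confirmed, exposed) lists after the three filters."""
    confirmed = [
        r for r in records
        if r["ip"] not in honeypot_ips   # filter 1: honeypot fleet
        and r["platform"] is not None    # filter 2: no labeling-platform signature matched
    ]
    # Filter 3: auth-on instances stay in the confirmed count but not in the exposed set.
    exposed = [r for r in confirmed if not r["auth_required"]]
    return confirmed, exposed

# Hypothetical sample data using documentation-range IPs.
honeypots = load_honeypot_ips(["# fleet list", "192.0.2.1", "", "192.0.2.2 "])
sample = [
    {"ip": "192.0.2.1", "platform": "doccano", "auth_required": False},    # honeypot
    {"ip": "198.51.100.5", "platform": None, "auth_required": False},      # e.g. Airflow
    {"ip": "198.51.100.6", "platform": "doccano", "auth_required": True},
    {"ip": "198.51.100.7", "platform": "doccano", "auth_required": False},
]
confirmed, exposed = apply_filters(sample, honeypots)
print(len(confirmed), len(exposed))  # 2 1
```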
Data-classification taxonomy
For each confirmed unauth instance, classify by what kind of data the project metadata reveals:
| Class | Examples | Risk |
|---|---|---|
| Healthcare / clinical | radiology images, EHR text, drug-trial transcripts | HIPAA / GDPR Art. 9 |
| Financial / KYC | identity documents, transaction logs, AML-flagged content | PCI / regional financial-data laws |
| Government / law-enforcement | police body-cam, surveillance footage, immigration docs | jurisdiction-dependent |
| Personal / consumer | user-generated content, customer support transcripts, social-media DMs | GDPR / CCPA |
| Facial / biometric | face recognition training, age-detection, emotion-tagging | GDPR Art. 9 (biometric special-category) |
| NLP corpus | document classification, NER, sentiment | mostly low risk unless internal docs |
| Computer vision (non-faces) | object detection, segmentation, retail | low to medium |
| Internal-business | invoices, contracts, ID cards from operator’s own org | confidentiality + sometimes PII |
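A first-pass triage against this taxonomy could be keyword-based, with manual review afterwards. The sketch below is hypothetical: the keyword sets are illustrative guesses drawn from the Examples column, the "Unclassified" bucket is a placeholder that is not part of the taxonomy, and classify_project() is not a function from the survey's tooling.

```python
import re

# Illustrative keyword sets only; a real pass would need tuning and manual review.
TAXONOMY_KEYWORDS = {
    "Healthcare / clinical": {"radiology", "ehr", "clinical"},
    "Financial / KYC": {"kyc", "aml", "transaction"},
    "Government / law-enforcement": {"police", "surveillance", "immigration"},
    "Personal / consumer": {"customer", "ticket", "dms"},
    "Facial / biometric": {"face", "facial", "biometric", "emotion"},
    "NLP corpus": {"ner", "sentiment", "classification"},
    "Computer vision (non-faces)": {"detection", "segmentation", "retail"},
    "Internal-business": {"invoice", "invoices", "contract", "contracts"},
}

def classify_project(name, description=""):
    """Return every taxonomy class whose keywords appear as whole tokens in the metadata."""
    tokens = set(re.findall(r"[a-z0-9]+", f"{name} {description}".lower()))
    matches = [cls for cls, keywords in TAXONOMY_KEYWORDS.items() if tokens & keywords]
    return matches or ["Unclassified"]

print(classify_project("Radiology report NER"))
# ['Healthcare / clinical', 'NLP corpus']
```

Matching on whole tokens rather than substrings avoids false hits such as "ner" inside "partner"; a project can land in more than one class.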
Discovery results
Cross-cloud final. Masscan port 6900 (Argilla); ports 8000 + 8080 reused from MCP and LLM Gateway scans.
| Source | Probe targets | Confirmed | Notes |
|---|---|---|---|
| Combined tier-2 (3 providers) | (large) | 348 | Single-platform sweep, all 348 are doccano |
By platform
| Platform | Confirmed | Notes |
|---|---|---|
| doccano | 348 (100%) | NLP text-annotation Django app; all surfaced via /v1/health returning JSON status + /v1/projects returning paginated {count, results} shape |
| Argilla | 0 | None confirmed in tier-2 sample. Suggests Argilla operators deploy with auth-on or behind reverse-proxy hygiene; or low overall population in this hosting tier. |
| LabelStudio | 0 | Same, none surfaced. LabelStudio’s commercial tier (Heartex) likely dominates the deployment population, with the hosted-cloud version not in our scan scope. |
| Prodigy | 0 | Prodigy operators tend to deploy with reverse-proxy auth, so the no-auth default surfaces few public hosts. |
| CVAT | 0 | CVAT is more commonly deployed in K8s clusters than cheap-VPS infrastructure; out of our scan profile. |
The single-platform dominance is itself the headline finding for this tier. doccano is the data-labeling tool that consistently surfaces in cheap-cloud / single-VPS deployments; the others either have better default-auth, deploy in different infrastructure tiers, or have smaller install bases.
Project-content classification
(to be populated)
| Class | Hosts | Notable examples |
|---|---|---|
| Healthcare / clinical | TBD | TBD |
| Financial / KYC | TBD | TBD |
| Government / law-enforcement | TBD | TBD |
| Personal / consumer | TBD | TBD |
| Facial / biometric | TBD | TBD |
| NLP corpus | TBD | TBD |
| Computer vision (non-faces) | TBD | TBD |
| Internal-business | TBD | TBD |
Notable findings
F1: Single-platform dominance: 348 of 348 are doccano
The data-labeling tier in tier-2 cloud is a single-platform population. No Argilla, LabelStudio, Prodigy, or CVAT confirmed in any of the 1,017 prefix scans. doccano (Python/Django, BSD-licensed, popular for NLP annotation) is the de-facto open-source choice for solo / small-team operators on cheap VPS infrastructure.
F2: Auth-on at content endpoints: ~99% rate
Deep-probe at /v1/projects returned HTTP 401/403 across 344 of 348 hosts (98.9%). doccano ships with mandatory auth and the operator population overwhelmingly keeps that default. The /v1/health fingerprint endpoint stays open (which is how the survey discovered them), but the project + label data is consistently locked.
F3: /openapi.json exposure: 20 hosts (5.7%)
A small subset (20 of 348) leaks the OpenAPI route map at /openapi.json. This is the same disclosure shape as the RAG framework finding: the full API design and Pydantic schemas are readable, but there is no actual content access. It has reconnaissance value but does not enable direct data exfil.
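The rates quoted in F2 and F3 follow directly from the raw counts; a short worked check (variable names are illustrative):

```python
confirmed = 348        # confirmed doccano hosts
auth_on = 344          # hosts returning 401/403 at /v1/projects
openapi_exposed = 20   # hosts serving /openapi.json

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)

print(pct(auth_on, confirmed))          # 98.9
print(pct(openapi_exposed, confirmed))  # 5.7
# Hosts that did not return 401/403 (not necessarily open):
print(confirmed - auth_on)              # 4
```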
F4: Auth-off-default thesis breaks at the data-labeling tier
Same shape as the RAG framework finding: data-labeling tools ship as end-user applications (with login flows, project ownership, collaborator roles), so operators keep auth on. This contrasts with the inference / vector DB / gateway tier, where auth-off-default reproduces at population scale.
F5: Negative finding for Argilla / LabelStudio / Prodigy / CVAT in this hosting tier
Zero confirmed instances of any non-doccano platform in 1,017 scanned prefixes. Three possible interpretations:
1. These platforms have effective default auth at the fingerprint endpoint, so our probe could not detect them.
2. Their operator populations deploy in different infrastructure tiers (managed cloud, K8s, on-prem).
3. The install base is genuinely smaller among the small-VPS operators this survey covers.
Likely a mix of (1) and (2). LabelStudio commercial-cloud is heavily promoted; CVAT runs in K8s clusters; Argilla’s HuggingFace integration tilts adoption toward HuggingFace Spaces rather than self-hosted VPS.
Threat classes
- Direct dataset exfil: when /api/projects is unauth, the project list discloses operator identity, business domain, and (often) the actual labeled records.
- PII leak via raw labeling content: annotation projects routinely contain customer support transcripts with names + emails + phones, medical records, identity documents.
- Biometric data exposure (GDPR Art. 9): facial-recognition labeling projects expose face crops + identifiers; same regulatory class as the tweet-optimize.com Milvus finding.
- Annotator credential leak: some platforms expose user lists, sometimes with email + role.
- Model fingerprinting: Argilla integrates with HuggingFace models; the project schema reveals which models the operator is fine-tuning.
- Operational intel: the “label classes” defined in a project disclose the operator’s classification taxonomy (often proprietary business logic).
Honest negative space
- Authenticated platforms with read-only public projects: some operators intentionally publish public-domain corpora via Argilla/doccano. Manual review is needed to distinguish “intentional public” from “misconfigured.”
- Reverse-proxied Prodigy: Prodigy’s no-auth-by-design is mitigated when operators correctly add nginx + basic-auth in front. Those return 401 at the network edge and are out of scope (but the proxy + Prodigy combination is the recommended deployment path; population data on whether operators actually do it is the survey’s value-add).
- CVAT enterprise / SaaS deployments: the SaaS-hosted version (cvat.ai) is auth-on by design; this survey targets self-hosted instances only.
Disclosure plan
For each unauthenticated instance with high-risk content classes (healthcare, financial, biometric, government), draft a coordinated-disclosure email per the standard NuClide template, routed via WHOIS-derived institution identification (per the contact-resolver rule from the Buffalo State misroute lesson).
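The selection rule in the plan above can be sketched as a filter over classified instances. This is a hypothetical outline: the record shape (auth_required flag plus a list of taxonomy classes) and the function name are assumptions, and contact resolution and email drafting are deliberately left out.

```python
# High-risk classes named in the disclosure plan, using the taxonomy's labels.
HIGH_RISK_CLASSES = {
    "Healthcare / clinical",
    "Financial / KYC",
    "Facial / biometric",
    "Government / law-enforcement",
}

def disclosure_queue(instances):
    """Keep unauthenticated instances that carry at least one high-risk class."""
    return [
        inst for inst in instances
        if not inst["auth_required"] and HIGH_RISK_CLASSES & set(inst["classes"])
    ]

# Hypothetical sample data using documentation-range IPs.
sample_instances = [
    {"ip": "198.51.100.7", "auth_required": False, "classes": ["Healthcare / clinical"]},
    {"ip": "198.51.100.8", "auth_required": False, "classes": ["NLP corpus"]},
    {"ip": "198.51.100.9", "auth_required": True, "classes": ["Financial / KYC"]},
]
print([inst["ip"] for inst in disclosure_queue(sample_instances)])  # ['198.51.100.7']
```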
See also
- SYNTHESIS-2026-05.md: companion cross-survey synthesis
- FUTURE-SURVEYS.md: broader unsurveyed roadmap
- data/datalabel-probe.py: multi-platform fingerprint prober used for this survey