Most recent
navigate open esc close Corpus index built 2026-06-07 23:58 UTC

← All research

Survey May 16, 2026

Data-Labeling Population Survey (2026-05-16)

NuClide Research · 2026-05-16 (third survey in the day’s 4-category batch) Closes: category 22 (data-labeling). Label Studio / CVAT / Doccano / Argilla / Prodigy


Summary

Survey of the data-labeling platform population. The systems that store training-data annotation tasks, often containing PII or sensitive labels. Smaller surface than other categories surveyed today; the mixed result is informative.

DCWF KSAT coverage

Auto-derived from DCWF AI work-role rule files (ksat-tag).

  • 672 (AI Test & Evaluation Specialist): K7003, K7004, S7068, S7070, T5858, T5904
  • 733 (AI Risk & Ethics Specialist): K7040, K7051, S7067, T5854, T5868, T5893
  • overlap (Common AI KSATs (all 5 roles)): K108, K1158, K22, K6935, K7003
  • 772 unique candidates harvested across the 5 platforms
  • Probed via fast_enum_data_labeling.py (~2 min total)
  • 16 unauth Prodigy instances (Prodigy is by-design auth-free for its annotation web UI, see methodology section)
  • 0 Label Studio unauth at the data layer (3 real instances on legacy 0.7.x; all auth-gated when projects requested)
  • 0 CVAT unauth at the data layer (auth-gated where reachable)
  • 0 Doccano unauth at the data layer (auth-gated where reachable)
  • 0 Argilla unauth (same 4 instances surfaced in agent-memory-population-survey-2026-05-16.md, all auth-gated)

Headline: the Label Studio / CVAT / Doccano / Argilla quartet is Tier-C at population scale. Framework defaults enforce auth. Prodigy is a deliberate exception: its annotation UI is intentionally auth-free as a workflow choice.


Prodigy: auth-free by-design

Prodigy (Explosion AI) is a paid annotation tool. Its server-side architecture is deliberately single-user with no built-in auth. Operators are expected to run it on localhost or behind their own access control. 16 hosts confirmed via root-page title <title>Prodigy</title> and /sessions returning JSON.

Per the auth-on-default thesis: Prodigy is *Tier-A (auth optional, off-by-default per framework design)**. It’s the cleanest example of “the framework chose not to ship auth at all; the operator’s job is to provide it externally.” Same family as Ollama, vLLM, etc., but in the data-labeling tier.

The risk is not the platform; the risk is that operators expose Prodigy to the public internet without the expected localhost-only deployment. 16 hosts crossed that line. Each represents an open annotation workflow. Labels assigned to whatever the operator was annotating, fully visible.


Label Studio: the legacy-version probe gap

Label Studio at v1.x exposes version via GET /api/version:

{"label-studio-os-package":{"version":"1.20.0",...},"label-studio-frontend":{...}}

But at v0.7.x (legacy), the path was GET /version returning a different shape:

{"label-studio-backend":"0.7.4","label-studio-frontend":{...}}

My initial probe only checked /api/version. Re-probe with both paths surfaced:

  • 3 real Label Studio v0.7.x hosts (116.202.182.74:8140, 116.202.182.74:8139, etc., legacy version, /version path)
  • 0 real Label Studio v1.x with unauth data layer (/api/projects requires auth)
  • 60 of 63 re-probed candidates were Shodan FPs (the Shodan title-match was on unrelated services or stale references)

Confirms [[insight-15-dork-hits-vs-platform-instances]] again: http.title:"Label Studio" returned ~1,663 Shodan candidates → ~3 real Label Studio instances confirmed → FP rate ~99.8%. The title string is uncommon enough to assume specificity, but the dork still over-counts dramatically.

Label Studio’s data layer (/api/projects?page_size=1) is auth-gated at the population-scale population observed. Tier-C confirmed.


CVAT / Doccano: Tier-C confirmed

CVAT (Computer Vision Annotation Tool, originally by Intel/Datumaro) and Doccano (text annotation, by Hironsan) both ship with auth-on-default. The probes:

  • CVAT: /api/server/about returns version unauth (informational), but /api/tasks requires Django session auth → 401/403
  • Doccano: /v1/projects returns DRF paginated 200, but the response is empty until operator logs in to create projects, AND when projects exist, listing returns 401

The few real instances surfaced (~10 across CVAT + Doccano combined) all returned auth-gated at the data layer. Tier-C confirmed for both.


Cross-survey colocation

Argilla appears in both this survey and agent-memory-population-survey-2026-05-16.md. The same 4-host set is observed from both angles. Argilla intentionally sits at the data-labeling / agent-feedback boundary, so cross-category coverage is appropriate. All 4 instances auth-gated under both probe angles.

No colocation with same-day ComfyUI (548) or vector-DB-stragglers (881) unauth populations.


Methodology placement

Adds 4 platforms (Label Studio, CVAT, Doccano, Argilla) to Tier-C confirmed-at-population-scale and 1 platform (Prodigy) to Tier-A* (auth optional, off-by-design).

The data-labeling tier as a whole confirms the auth-on-default thesis. The thesis predicts: platforms designed with multi-tenant annotation workflows (Label Studio, CVAT, Doccano) ship auth-on-default, observed. Platforms designed as single-user annotation tools (Prodigy) don’t ship auth, observed. The mapping from framework intent to population behavior is the load-bearing principle.


Toolchain Provenance

0. shodan search × 9 dorks → 772 unique ip:port candidates
1. fast_enum_data_labeling.py (threads=60, 142s) → initial classify
   (Prodigy 16, others 0 unauth, 135 auth-gated, 28 shell-only, 440 dead, 56 unrelated)
2. Label Studio re-probe (legacy /version + new /api/version) → 0 unauth, 3 real (legacy), 60 FP
3. CVAT/Doccano spot checks → confirmed Tier-C
4. (queued) visorlog ingest → 16 Prodigy events into nuclide.db source='data-labeling-survey-2026-05-16'

Honest negative space

  • Label Studio v1.x population not separately surveyed at high N. The candidate pool surfaced mostly legacy or non-LS instances. A future port-first masscan on Label Studio’s default port (8080) on tier-2 cloud would surface the modern LS population for a fair Tier-C confirmation.
  • CVAT 354 Shodan candidates not exhaustively probed. The probe identified CVAT only when /api/server/about returned a JSON with cvat/computer vision annotation markers. Of the 354 candidates, only a small fraction were real CVAT. The rest are Shodan FPs (CVAT is an abbreviation that conflicts with CVAT Insurance and similar). A second-conjunct dork (http.title:"CVAT" "computer vision annotation") would tighten the candidate set.
  • Prodigy’s 16 unauth instances are operator-attribution-rich. Each runs a research/annotation project; reading the <title> plus /sessions JSON discloses what’s being labeled. Not pursued in this survey (out of scope, the platform is intentionally auth-free; the per-host disclosure would be a workflow recommendation to put it behind a VPN).
  • No data sampling. Restraint: read project counts (when reachable) but never read annotation contents. Several Doccano / Label Studio hosts could probably be queried for project metadata, but the auth-gated outcome means no data is read.

See also