Most recent
navigate open esc close Corpus index built 2026-06-07 23:58 UTC

← All reference

Reference

22. Data Labeling / Annotation Servers

Source: https://github.com/nuclide-research/AI-LLM-Infrastructure-OSINT/blob/main/shodan/queries/22-data-labeling

Section created: 2026-05-09

Data-labeling and annotation servers sit at the input boundary of every supervised-learning ML pipeline. They host raw data being labeled — frequently real customer PII, internal documents, facial imagery, medical scans, support-ticket transcripts, and financial filings. Operators stand them up quickly for crowd-sourced annotation projects, then leave them running after project completion.

Survey result (2026-05-04): 348 confirmed cross-cloud (Scaleway + OVH + Linode, 3.55M IPs). Single-platform dominance: 348/348 are doccano. Auth posture inverts the thesis — 98.9% auth-on at content endpoints (/v1/projects returns 401/403). doccano ships with mandatory auth; operators keep it. The /v1/health fingerprint endpoint stays open (enabling discovery) but project/label data is consistently locked. Notable gap: Argilla, LabelStudio, Prodigy, CVAT returned zero confirmed instances in this hosting tier; those platforms deploy predominantly in managed cloud or K8s rather than cheap-VPS.

Auth posture by platform:

  • doccano: T2 — auth on by default; /v1/health open for fingerprinting, content auth-gated. /openapi.json exposed on 5.7% of instances (20/348).
  • Prodigy: T1 — no built-in auth concept; operators expected to add proxy. Rare on public internet.
  • Argilla: T2 — auth on since v1.x; default-public workspace misconfiguration sometimes seen.
  • LabelStudio: T2 — mandatory auth; /api/projects sometimes misconfigured readable.
  • CVAT: T2 — auth on by default; /api/server/about and project list occasionally left open.

CVE watch (corrected 2026-05-31 vs NVD; see data/platform-intel/data-labeling-osint-2026-05-31.md):

  • Argilla: NO published CVE. CVE-2023-38686 cited in earlier drafts is Sydent/Matrix, NOT Argilla (confirmed via NVD). The real Argilla exposure is documented DEFAULT CREDENTIALS (owner.apikey / argilla.apikey / password 12345678) shipped and never force-rotated.
  • Label Studio: CVE-2022-25011 is RESERVED with no public Label Studio association (do not cite). Real issues: GHSA-cpmr-mw4j-99r7 (UNAUTH nginx alias path-traversal to SECRET_KEY, <=1.7.0), CVE-2023-43791 (hardcoded SECRET_KEY <1.8.2, 9.8), CVE-2025-25297 (S3 SSRF to cloud metadata), CVE-2026-22033 (stored XSS + token theft <=1.22.0). The default-open knob is open self-registration (DISABLE_SIGNUP_WITHOUT_LINK=False).
  • CVAT: CVE-2022-31188 (SSRF no-auth 9.8 <2.0.0), CVE-2025-23045 (Nuclio RCE 8.7), CVE-2024-47172 (BOLA 7.1). Collocated Grafana ships anonymous-admin by docker-compose default.
  • doccano: CVE-2024-40441 (SSRF via auto-labeling, requires project admin). Otherwise auth-bypass misconfig + /swagger/ + /openapi.json schema disclosure.
  • Prodigy: no CVE; no built-in auth by default.

doccano

Shodan QueryNotes
port:8000 "doccano"doccano on default port 8000
http.html:"doccano" port:8000HTML-scoped on default port
http.html:"doccano"Any port; doccano sometimes reverse-proxied
http.html:"/v1/projects" port:8000doccano projects API path in source
http.html:"/v1/health" port:8000Health endpoint path in source
"doccano"Bare-string in any indexed field
http.title:"doccano"Page title
http.title:"doccano" port:8000Title + default port
http.html:"doccano" port:80Reverse-proxied on port 80
http.html:"doccano" port:443Reverse-proxied HTTPS
hostname:"doccano"rDNS pattern
ssl.cert.subject.cn:"doccano"TLS cert CN
http.html:"doccano" org:"ovh"OVH-hosted (highest density in survey)
http.html:"doccano" org:"linode"Linode-hosted
http.html:"doccano" org:"scaleway"Scaleway-hosted
http.html:"doccano" org:"hetzner"Hetzner-hosted
http.html:"doccano" org:"digitalocean"DigitalOcean-hosted
http.html:"doccano" country:USUS-scoped
http.html:"doccano" country:CNChina
http.html:"doccano" country:DEGermany
http.html:"doccano" country:JPJapan (doccano is Japan-origin, Recruit Holdings)
http.html:"doccano" country:KRKorea

Argilla

Shodan QueryNotes
port:6900Argilla default port; near-unique signal
port:6900 "argilla"Argilla on default port
port:6900 http.html:"argilla"HTML-scoped
port:6900 http.status:200Live + responding
http.html:"argilla" port:6900HTML-scoped on default port
http.html:"/api/_info"Argilla API info endpoint path in source
http.html:"/api/workspaces"Argilla workspace API path
"argilla" port:6900Bare-string on default port
"argilla"Bare-string broadest
http.title:"Argilla"Page title
http.title:"Argilla" port:6900Title + default port
http.html:"argilla" port:80Reverse-proxied on port 80
http.html:"argilla" port:443Reverse-proxied HTTPS
hostname:"argilla"rDNS pattern
ssl.cert.subject.cn:"argilla"TLS cert CN
http.html:"argilla" org:"hetzner"Hetzner-hosted
http.html:"argilla" org:"amazon"AWS-hosted
http.html:"argilla" org:"google"GCP-hosted
http.html:"argilla" org:"university"Academic deployments (fine-tuning research)
http.html:"argilla" org:"hospital"Healthcare (highest-impact labeling context)

Label Studio

Shodan QueryNotes
port:8080 "label-studio"Label Studio on common alt port
port:8080 "LabelStudio"Capitalized form
http.html:"label-studio" port:8080HTML-scoped
http.html:"LabelStudio"Any port; Label Studio reverse-proxied
http.html:"/api/version"Label Studio version endpoint in source
http.html:"/api/projects"Label Studio projects API path
"label-studio"Bare-string in any field
"Label Studio"Product name
http.title:"Label Studio"Page title
http.title:"Label Studio" port:8080Title + common port
hostname:"label-studio"rDNS pattern
hostname:"labelstudio"Compressed rDNS
ssl.cert.subject.cn:"labelstudio"TLS cert CN

Prodigy (Explosion AI)

Shodan QueryNotes
port:8080 "prodigy"Prodigy on reverse-proxy port (no default server port; ships with --port flag)
http.html:"prodigy" port:8080HTML-scoped
http.html:"prodigy" port:80Reverse-proxied
http.html:"prodigy"Any port; highest false-positive rate of the platforms here
http.html:"spacy" http.html:"prodigy"Prodigy + spaCy co-occurrence (same vendor, often co-deployed)
"prodigy" "annotation"Conjunction reduces false positives
hostname:"prodigy"rDNS; check results carefully (name collision with Prodigy Music)

CVAT (Computer Vision Annotation Tool)

Shodan QueryNotes
port:8080 "CVAT"CVAT on default port
http.html:"cvat" port:8080HTML-scoped
http.html:"Computer Vision Annotation Tool"Full product name; high precision
http.html:"/api/server/about"CVAT server-info endpoint path in source
"CVAT" port:8080Bare-string on default port
http.title:"CVAT"Page title
hostname:"cvat"rDNS pattern
ssl.cert.subject.cn:"cvat"TLS cert CN
http.html:"cvat" org:"university"Academic CV research deployments

Combined

Shodan QueryNotes
(http.html:"doccano" OR port:6900 OR http.html:"Label Studio" OR http.html:"CVAT")Full data-labeling sweep
(http.html:"doccano" OR http.html:"argilla" OR http.html:"label-studio") org:"university"Academic annotation sweep
(http.html:"doccano" OR http.html:"argilla" OR http.html:"label-studio") org:"hospital"Healthcare annotation (HIPAA/GDPR risk)
port:6900 OR (port:8000 http.html:"doccano")Argilla + doccano combined port sweep
(http.html:"doccano" OR http.html:"CVAT" OR http.html:"Label Studio") http.html:"medical"Medical-data labeling projects
(http.html:"doccano" OR http.html:"argilla") country:JPJapan-scoped (doccano origin country)

VERIFIED RESULTS (2026-05-31 survey, aimap v1.9.44 + datalabel-probe v0.2)

Managed-cloud tier (title dorks, no cheap-VPS org filter). 80 candidates.

  • Label Studio http.title:"Label Studio": 21 hits, 17 confirmed; 16/17 open self-registration (/user/signup reachable, DISABLE_SIGNUP_WITHOUT_LINK=False default) = effective-unauth (register then read /api/projects + /api/tasks). 1 signup-closed. http.html:"label-studio-os-package" = 0 (Shodan does not index the /api/version JSON key).
  • CVAT http.title:"Computer Vision Annotation Tool": 30 hits, 20 confirmed, 20/20 auth-on. PROBE NOTE: /api/server/about needs Accept: application/vnd.cvat+json (DRF AcceptHeaderVersioning); a plain application/json probe and aimap both get 0/30 (Insight #73). Versions 2.5-2.64.1 (many outdated, applicable-CVE class).
  • doccano http.title:"doccano": auth-on (consistent with the 2026-05-04 348/348). FP TRAP: do NOT fingerprint doccano on /v1/health {“status”:…} alone, it FPs on Label Studio hosts (whose proxy also serves /v1/health); anchor on the /-page “doccano” marker.
  • Argilla port:6900 "argilla" = 0, Prodigy http.html:"prodigy.js" = 0: Shodan-dark. Argilla is HF-Spaces-dominant (needs HF-Hub enum); both carry default-open knobs (Argilla owner.apikey/12345678, Prodigy no-auth) this Shodan tier could not measure.

Insight #72: the discriminator is the default value of the registration knob (LS open vs CVAT/doccano closed). Insight #73: header-versioned APIs (CVAT vendor media type) are invisible to header-less fingerprinters.