ML Governance / Data Catalog — Shodan Query Catalog
Generated: 2026-05-27 from pre-survey OSINT pass (13 platforms) See: data/platform-intel/ml-governance-osint-2026-05-27.md for full intel
OpenMetadata
Auth default: on — but CVE-2024-28255 (CVSS 9.8) allows auth bypass on all versions < 1.3.1; exploited in wild against Kubernetes clusters Exposure class: Full data catalog — table schemas, PII tags, database connection metadata, pipeline lineage, env vars with credentials on compromised container
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.title:"OpenMetadata" port:8585 | Default port + page title unique to OpenMetadata | Low |
| secondary | http.html:"open-metadata" port:8585 | HTML class/asset paths contain open-metadata | Low |
| tertiary | html:"openmetadata" port:8585 | Broader catch for customized installs | Low-Med |
| k8s-variant | http.html:"openmetadata" port:8080 | Some K8s ingress rewrites on 8080 | Med |
| identity-probe | GET /api/v1/system/version → {"version":"...","revision":"..."} | Unauthenticated version disclosure; confirms identity | — |
| bypass-probe | GET /api/v1/tables;v1=x/ → 200 | Path param injection — confirms pre-1.3.1 vulnerable instance | — |
DataHub (LinkedIn)
Auth default: off on GMS backend (port 8080) by default; frontend accepts datahub/datahub; JWT not cryptographically verified even when auth “enabled” Exposure class: Full org data inventory — all database entities, table/column lineage, ownership maps, PII classification, ingestion source configs
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.title:"DataHub" port:9002 | Frontend UI title unique to DataHub | Low-Med |
| secondary | http.html:"datahubproject" port:9002 | React bundle contains datahubproject references | Low |
| gms-direct | port:8080 http.html:"datahub-gms" | GMS container banner/health endpoint | Med |
| gms-restli | port:8080 "X-RestLi-Protocol-Version" | Distinctive Rest.li header emitted by GMS | Low |
| identity-probe-ui | GET /authenticate with datahub/datahub → session cookie | Confirms default creds on frontend | — |
| identity-probe-gms | GET /config on port 8080 → JSON with "noCode" field | Unauthenticated GMS config endpoint | — |
| gms-entity | GET /entities?urns[0]=urn:li:corpuser:datahub on 8080 → entity JSON | Confirms unauth GMS access | — |
Apache Atlas
Auth default: default-creds (admin/admin) — no unauthenticated access, but creds are universally known and rarely changed Exposure class: Full Hadoop/big data inventory — Hive tables, HDFS paths, HBase, Kafka topics, Spark jobs, PII classification, entity lineage
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | port:21000 http.title:"Atlas" | Port 21000 is nearly exclusive to Apache Atlas | Low |
| secondary | port:21000 html:"Apache Atlas" | Page content contains Apache Atlas branding | Low |
| api-path | http.html:"/api/atlas/v2" port:21000 | API path in HTML links/JS confirms Atlas | Low |
| https-variant | port:21443 http.title:"Atlas" | HTTPS variant (SSL enabled deployments) | Low |
| identity-probe | GET /api/atlas/admin/version with -u admin:admin → {"Description":"Metadata Management...","Version":"2.x.x"} | Default creds confirm; Description field unique | — |
| entity-dump | GET /api/atlas/v2/search/basic?typeName=hive_table with admin:admin → returns table inventory | Data access confirmation | — |
Amundsen (Lyft)
Auth default: off — auth is entirely absent unless flaskoidc manually configured for all three microservices Exposure class: Table/column metadata, ownership, PII tags, table statistics, data lineage — full catalog read without credentials
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.title:"Amundsen" port:5000 | Title match on frontend port | Med (port 5000 busy) |
| secondary | http.html:"amundsen" port:5000 | HTML body contains amundsen class names | Med |
| multi-port | http.html:"amundsen" port:5001 | Search service port — less FP noise | Low-Med |
| identity-probe | GET /healthcheck on 5001 and 5002 → {"status":"ok"} | Confirms metadata + search services running | — |
| data-probe | GET /api/metadata/v0/table_detail/<table_key> → table JSON | Confirms unauthenticated catalog read | — |
Marquez (OpenLineage)
Auth default: off — documented as having no auth by default Exposure class: Full pipeline lineage graph — job names, dataset names, run history, SQL queries in facets, schema snapshots, connection URIs in OpenLineage facets
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.html:"marquez" port:5000 | HTML references Marquez on default port | Med (port 5000 busy) |
| secondary | http.html:"marquezproject" port:5000 | More specific — MarquezProject branding | Low |
| ui-port | http.html:"openlineage" port:3000 | UI on port 3000 with OpenLineage references | Med |
| jar-variant | http.html:"marquezproject" port:8080 | Java jar deployment on 8080 | Med |
| graphql | http.html:"graphql-playground" port:5000 html:"marquez" | GraphQL playground is a distinctive co-signal | Low |
| identity-probe | GET /api/v1/namespaces → {"namespaces":[{"name":"...","createdAt":"..."}]} | Unauthenticated; namespaces array unique to Marquez | — |
| lineage-read | GET /api/v1/jobs?namespace=<name> → job list with run history | Confirms full lineage access | — |
OpenLineage (Ecosystem)
Auth default: depends on server — Marquez (off), Airflow transport (inherits Airflow auth) Exposure class: Pipeline topology, dataset schemas, job runs, SQL query text in facets
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.html:"openlineage" port:5000 | OpenLineage server (Marquez) on 5000 | Med |
| airflow-variant | http.html:"openlineage" port:8080 | Airflow with OpenLineage integration | High (Airflow also on 8080) |
| identity-probe | POST /api/v1/lineage with minimal OpenLineage event → 201 | Confirms write access to lineage store | — |
Great Expectations
Auth default: off when Data Docs are served externally; library mode has no server Exposure class: Validation results, expectation suite definitions, column statistics, null rates, value distributions — schema fingerprint of all profiled datasets
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.html:"great_expectations" port:5000 | CSS class name unique to GX Data Docs | Med (port 5000) |
| secondary | http.title:"Data Docs" http.html:"great_expectations" | Title + class combo narrows FP | Low |
| any-port | http.html:"great_expectations" | Catch deployments on 80/443 via proxy | Med |
| identity-probe | GET / → HTML with great_expectations in <link> or <script> tags | Confirms GX Data Docs page | — |
Monte Carlo
Auth default: N/A — SaaS only; no self-hosted server Exposure class: Not applicable
Not a Shodan target. Monte Carlo agent is outbound-only. Skip.
Soda Core / Soda Cloud
Auth default: N/A — SaaS; self-hosted agent is outbound-only Exposure class: Not applicable
Not a Shodan target. Soda Agent listens on no public port. Skip.
Atlan
Auth default: on (OAuth2) — SaaS primary; self-deployed runtime is outbound worker only Exposure class: Not applicable for survey purposes
Not a meaningful Shodan target. Self-deployed runtime is a Kubernetes worker pod with no inbound listener.
Collibra
Auth default: on — default creds Admin/Admin documented; session-based auth enforced at login
Exposure class: Full enterprise data governance inventory — all data assets, business glossary, lineage policies, PII rules, stewardship assignments
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | port:4402 http.html:"Collibra" | Console port distinctive; Collibra in HTML | Low |
| secondary | port:4402 http.title:"Collibra" | Title match on console UI port | Low |
| search-api | port:4421 http.html:"collibra" | Search REST API port | Low |
| agent-port | port:4401 http.html:"collibra" | Agent port — less likely public-facing | Low |
| identity-probe | POST /rest/2.0/auth/sessions {"username":"Admin","password":"Admin"} → session cookie | Default creds; JSESSIONID confirms auth | — |
| data-probe | GET /rest/2.0/assets with session → asset list | Confirms full governance inventory access | — |
CKAN
Auth default: partially open by design — read operations unauthenticated; write requires API token Exposure class: Full dataset inventory, organization structure, resource URLs (may embed API keys/tokens in dataset records), public government data catalogs
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.html:"ckan" port:5000 | Development server on 5000 with CKAN in HTML | Med |
| secondary | http.html:"/api/3/action" http.html:"ckan" | API path reference in HTML unique to CKAN | Low |
| prod-80 | http.title:"CKAN" port:80 | Production on 80 via proxy | Med |
| version-field | http.html:"ckan_version" | ckan_version field in API response HTML/JSON | Low |
| identity-probe | GET /api/3/action/status_show → {"success":true,"result":{"ckan_version":"...","site_title":"..."}} | Unauthenticated; ckan_version is definitive | — |
| dataset-dump | GET /api/3/action/package_list → full dataset name list | Unauthenticated enumeration of entire catalog | — |
MLflow (Model Registry — registry-specific gaps)
Auth default: off — no auth unless explicitly configured with auth plugin; default server exposes registry read/write Exposure class: Model names, versions, stages (Staging/Production), artifact URIs (s3://, gs://, azure:// paths), model signatures — model poisoning vector via unauthenticated registry write
| Label | Query | Rationale | FP Risk |
|---|---|---|---|
| primary | http.html:"registered-models" port:5000 | Model registry UI contains this string | Low-Med |
| secondary | http.html:"/api/2.0/mlflow/registered-models" port:5000 | API path in UI JS references | Low |
| ui-registry | http.title:"MLflow" http.html:"Model Registry" port:5000 | UI page title + registry tab | Low |
| identity-probe | GET /api/2.0/mlflow/registered-models/list → {"registered_models":[...]} | Registry-specific endpoint; array key unique to MLflow registry | — |
| artifact-leak | GET /api/2.0/mlflow/model-versions/get-download-uri?name=<model>&version=1 → presigned URI | Cloud storage URI exposure — s3:// bucket enumeration | — |
| write-probe | POST /api/2.0/mlflow/registered-models/create with {"name":"test"} → 200 | Confirms unauthenticated write = model poisoning possible | — |
Priority Ranking for Harvest
| Rank | Platform | Reason |
|---|---|---|
| 1 | OpenMetadata | CVE-2024-28255 actively exploited; auth bypass confirmed; high K8s prevalence |
| 2 | DataHub | GMS auth off by default; JWT non-verification even when “on”; LinkedIn pedigree = widespread enterprise use |
| 3 | Apache Atlas | Famous default creds; Hadoop ecosystem ubiquity; high-value targets |
| 4 | Marquez | No auth documented; full lineage graph exposed; growing OpenLineage adoption |
| 5 | Amundsen | No auth by default; full catalog read; Lyft-influenced orgs |
| 6 | MLflow (registry) | Auth off by default; model poisoning vector; prior surveys show large Shodan population |
| 7 | CKAN | Massive government deployment footprint; read is open by design; credential leak in resources |
| 8 | Collibra | Low population but extremely high-value enterprise targets |
| 9 | Great Expectations | Low population; exposure only when Data Docs served externally |
| — | Monte Carlo | Not a Shodan target |
| — | Soda Core/Cloud | Not a Shodan target |
| — | Atlan | Not a Shodan target |