ML Governance / Data Catalog — Shodan Query Catalog

Source: https://github.com/nuclide-research/AI-LLM-Infrastructure-OSINT/blob/main/shodan/queries/ml-governance-queries

Generated: 2026-05-27 from pre-survey OSINT pass (13 platforms)

See: data/platform-intel/ml-governance-osint-2026-05-27.md for full intel


OpenMetadata

Auth default: on, but CVE-2024-28255 (CVSS 9.8) allows auth bypass on all versions < 1.3.1; exploited in the wild against Kubernetes clusters

Exposure class: Full data catalog (table schemas, PII tags, database connection metadata, pipeline lineage, and env vars with credentials on a compromised container)

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.title:"OpenMetadata" port:8585` | Default port + page title unique to OpenMetadata | Low |
| secondary | `http.html:"open-metadata" port:8585` | HTML class/asset paths contain `open-metadata` | Low |
| tertiary | `html:"openmetadata" port:8585` | Broader catch for customized installs | Low-Med |
| k8s-variant | `http.html:"openmetadata" port:8080` | Some K8s ingress rewrites on 8080 | Med |
| identity-probe | `GET /api/v1/system/version` → `{"version":"...","revision":"..."}` | Unauthenticated version disclosure; confirms identity | |
| bypass-probe | `GET /api/v1/tables;v1=x/` → 200 | Path param injection; confirms pre-1.3.1 vulnerable instance | |
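
The version disclosure above is enough to triage an instance you operate without sending any bypass request. The following is a minimal sketch that parses a saved response body and compares it against the 1.3.1 threshold quoted in this catalog. The function names are hypothetical, and the assumption that the `version` field starts with a dotted numeric string (possibly followed by a suffix such as `-SNAPSHOT`) is not confirmed by the source.

```python
import json
import re

# Threshold as stated in this catalog; not independently verified here.
FIXED_VERSION = (1, 3, 1)


def parse_version(text: str) -> tuple:
    """Return the leading dotted-numeric part of a version string as a tuple of ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"no numeric version in {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_pre_fix(version_body: str, fixed: tuple = FIXED_VERSION) -> bool:
    """True if the reported version sorts strictly below the fixed release.

    Suffixes such as -SNAPSHOT are ignored, so a pre-release build of the
    fixed version is treated as fixed; review those cases by hand.
    """
    reported = parse_version(json.loads(version_body)["version"])
    # Pad to equal length so that (1, 3) compares as (1, 3, 0).
    width = max(len(reported), len(fixed))
    reported += (0,) * (width - len(reported))
    fixed += (0,) * (width - len(fixed))
    return reported < fixed
```

Comparing integer tuples rather than strings matters here: a plain string comparison would rank `1.10.0` below `1.3.1`.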

DataHub (LinkedIn)

Auth default: off on the GMS backend (port 8080); the frontend accepts datahub/datahub; JWT is not cryptographically verified even when auth is “enabled”

Exposure class: Full org data inventory (all database entities, table/column lineage, ownership maps, PII classification, ingestion source configs)

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.title:"DataHub" port:9002` | Frontend UI title unique to DataHub | Low-Med |
| secondary | `http.html:"datahubproject" port:9002` | React bundle contains `datahubproject` references | Low |
| gms-direct | `port:8080 http.html:"datahub-gms"` | GMS container banner/health endpoint | Med |
| gms-restli | `port:8080 "X-RestLi-Protocol-Version"` | Distinctive Rest.li header emitted by GMS | Low |
| identity-probe-ui | `GET /authenticate` with `datahub/datahub` → session cookie | Confirms default creds on frontend | |
| identity-probe-gms | `GET /config` on port 8080 → JSON with `"noCode"` field | Unauthenticated GMS config endpoint | |
| gms-entity | `GET /entities?urns[0]=urn:li:corpuser:datahub` on 8080 → entity JSON | Confirms unauth GMS access | |
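
The `gms-restli` query relies on a response header, so a result can be re-checked offline from the stored banner text. Below is a minimal sketch, assuming the banner is a raw HTTP response (status line, headers, blank line, body). The helper names are hypothetical.

```python
import re


def banner_headers(banner: str) -> dict:
    """Parse the header block of a raw HTTP response into a lowercase-keyed dict."""
    # Everything before the first blank line is the status line plus headers.
    head = re.split(r"\r?\n\r?\n", banner, maxsplit=1)[0]
    headers = {}
    for line in head.splitlines()[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def has_restli_header(banner: str) -> bool:
    """True if the banner carries the Rest.li protocol header named in the table."""
    return "x-restli-protocol-version" in banner_headers(banner)
```

Header names are compared in lowercase because HTTP header names are case-insensitive.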

Apache Atlas

Auth default: default-creds (admin/admin); no unauthenticated access, but creds are universally known and rarely changed

Exposure class: Full Hadoop/big data inventory (Hive tables, HDFS paths, HBase, Kafka topics, Spark jobs, PII classification, entity lineage)

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `port:21000 http.title:"Atlas"` | Port 21000 is nearly exclusive to Apache Atlas | Low |
| secondary | `port:21000 html:"Apache Atlas"` | Page content contains Apache Atlas branding | Low |
| api-path | `http.html:"/api/atlas/v2" port:21000` | API path in HTML links/JS confirms Atlas | Low |
| https-variant | `port:21443 http.title:"Atlas"` | HTTPS variant (SSL enabled deployments) | Low |
| identity-probe | `GET /api/atlas/admin/version` with `-u admin:admin` → `{"Description":"Metadata Management...","Version":"2.x.x"}` | Default creds confirm; `Description` field unique | |
| entity-dump | `GET /api/atlas/v2/search/basic?typeName=hive_table` with `admin:admin` → returns table inventory | Data access confirmation | |
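
Most rows in this catalog combine a port filter with one or more quoted string filters. The sketch below composes such query strings from (filter, value) pairs so that quoting stays consistent. `build_query` is a hypothetical helper, not part of any Shodan client, and it rejects embedded double quotes because no escaping rule is assumed here.

```python
from typing import List, Tuple, Union


def build_query(filters: List[Tuple[str, Union[str, int]]]) -> str:
    """Join (filter, value) pairs into one query string.

    Integers are emitted bare (port:21000); strings are double-quoted.
    """
    parts = []
    for name, value in filters:
        if isinstance(value, int):
            parts.append(f"{name}:{value}")
        else:
            if '"' in value:
                raise ValueError(f"double quote not supported in value: {value!r}")
            parts.append(f'{name}:"{value}"')
    return " ".join(parts)


# Reproduces the Atlas primary and https-variant rows from the table above.
ATLAS_PRIMARY = build_query([("port", 21000), ("http.title", "Atlas")])
ATLAS_HTTPS = build_query([("port", 21443), ("http.title", "Atlas")])
```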

Amundsen (Lyft)

Auth default: off; auth is entirely absent unless flaskoidc is manually configured for all three microservices

Exposure class: Table/column metadata, ownership, PII tags, table statistics, data lineage; full catalog read without credentials

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.title:"Amundsen" port:5000` | Title match on frontend port | Med (port 5000 busy) |
| secondary | `http.html:"amundsen" port:5000` | HTML body contains `amundsen` class names | Med |
| multi-port | `http.html:"amundsen" port:5001` | Search service port; less FP noise | Low-Med |
| identity-probe | `GET /healthcheck` on 5001 and 5002 → `{"status":"ok"}` | Confirms metadata + search services running | |
| data-probe | `GET /api/metadata/v0/table_detail/<table_key>` → table JSON | Confirms unauthenticated catalog read | |

Marquez (OpenLineage)

Auth default: off; documented as having no auth by default

Exposure class: Full pipeline lineage graph (job names, dataset names, run history, SQL queries in facets, schema snapshots, connection URIs in OpenLineage facets)

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.html:"marquez" port:5000` | HTML references Marquez on default port | Med (port 5000 busy) |
| secondary | `http.html:"marquezproject" port:5000` | More specific: MarquezProject branding | Low |
| ui-port | `http.html:"openlineage" port:3000` | UI on port 3000 with OpenLineage references | Med |
| jar-variant | `http.html:"marquezproject" port:8080` | Java jar deployment on 8080 | Med |
| graphql | `http.html:"graphql-playground" port:5000 html:"marquez"` | GraphQL playground is a distinctive co-signal | Low |
| identity-probe | `GET /api/v1/namespaces` → `{"namespaces":[{"name":"...","createdAt":"..."}]}` | Unauthenticated; `namespaces` array unique to Marquez | |
| lineage-read | `GET /api/v1/jobs?namespace=<name>` → job list with run history | Confirms full lineage access | |

OpenLineage (Ecosystem)

Auth default: depends on server; Marquez (off), Airflow transport (inherits Airflow auth)

Exposure class: Pipeline topology, dataset schemas, job runs, SQL query text in facets

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.html:"openlineage" port:5000` | OpenLineage server (Marquez) on 5000 | Med |
| airflow-variant | `http.html:"openlineage" port:8080` | Airflow with OpenLineage integration | High (Airflow also on 8080) |
| identity-probe | `POST /api/v1/lineage` with minimal OpenLineage event → 201 | Confirms write access to lineage store | |
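
For reference, a "minimal OpenLineage event" is a small JSON document. The sketch below only builds one, for use against a lineage server you operate. The function name is hypothetical, and the `schemaURL` value (spec version 2-0-2) is an assumption that should be matched to the spec version your server expects.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed spec version; adjust to the version your server supports.
SCHEMA_URL = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"


def minimal_run_event(namespace: str, job_name: str, producer: str) -> dict:
    """Build a START run event carrying only run, job, and envelope fields."""
    return {
        "eventType": "START",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "producer": producer,
        "schemaURL": SCHEMA_URL,
    }


# Placeholder values for illustration only.
event = minimal_run_event("demo-namespace", "demo-job", "https://example.com/my-producer")
payload = json.dumps(event)
```

No facets are attached, so the event carries no SQL text, schema, or connection details.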

Great Expectations

Auth default: off when Data Docs are served externally; library mode has no server

Exposure class: Validation results, expectation suite definitions, column statistics, null rates, value distributions; in effect a schema fingerprint of all profiled datasets

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.html:"great_expectations" port:5000` | CSS class name unique to GX Data Docs | Med (port 5000) |
| secondary | `http.title:"Data Docs" http.html:"great_expectations"` | Title + class combo narrows FP | Low |
| any-port | `http.html:"great_expectations"` | Catch deployments on 80/443 via proxy | Med |
| identity-probe | `GET /` → HTML with `great_expectations` in `<link>` or `<script>` tags | Confirms GX Data Docs page | |

Monte Carlo

Auth default: N/A; SaaS only, no self-hosted server

Exposure class: Not applicable

Not a Shodan target. Monte Carlo agent is outbound-only. Skip.


Soda Core / Soda Cloud

Auth default: N/A; SaaS, and the self-hosted agent is outbound-only

Exposure class: Not applicable

Not a Shodan target. Soda Agent listens on no public port. Skip.


Atlan

Auth default: on (OAuth2); SaaS primary, and the self-deployed runtime is an outbound worker only

Exposure class: Not applicable for survey purposes

Not a meaningful Shodan target. Self-deployed runtime is a Kubernetes worker pod with no inbound listener.


Collibra

Auth default: on; default creds Admin/Admin documented; session-based auth enforced at login

Exposure class: Full enterprise data governance inventory (all data assets, business glossary, lineage policies, PII rules, stewardship assignments)

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `port:4402 http.html:"Collibra"` | Console port distinctive; Collibra in HTML | Low |
| secondary | `port:4402 http.title:"Collibra"` | Title match on console UI port | Low |
| search-api | `port:4421 http.html:"collibra"` | Search REST API port | Low |
| agent-port | `port:4401 http.html:"collibra"` | Agent port; less likely public-facing | Low |
| identity-probe | `POST /rest/2.0/auth/sessions` `{"username":"Admin","password":"Admin"}` → session cookie | Default creds; `JSESSIONID` confirms auth | |
| data-probe | `GET /rest/2.0/assets` with session → asset list | Confirms full governance inventory access | |

CKAN

Auth default: partially open by design; read operations are unauthenticated, write requires an API token

Exposure class: Full dataset inventory, organization structure, resource URLs (may embed API keys/tokens in dataset records), public government data catalogs

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.html:"ckan" port:5000` | Development server on 5000 with CKAN in HTML | Med |
| secondary | `http.html:"/api/3/action" http.html:"ckan"` | API path reference in HTML unique to CKAN | Low |
| prod-80 | `http.title:"CKAN" port:80` | Production on 80 via proxy | Med |
| version-field | `http.html:"ckan_version"` | `ckan_version` field in API response HTML/JSON | Low |
| identity-probe | `GET /api/3/action/status_show` → `{"success":true,"result":{"ckan_version":"...","site_title":"..."}}` | Unauthenticated; `ckan_version` is definitive | |
| dataset-dump | `GET /api/3/action/package_list` → full dataset name list | Unauthenticated enumeration of entire catalog | |
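
Because `status_show` is a public, read-only endpoint by design, its response is a convenient identity check. The sketch below validates a saved response body against the shape shown in the `identity-probe` row and extracts the two fields named there. `ckan_identity` is a hypothetical helper. It returns `None` on any mismatch.

```python
import json
from typing import Optional


def ckan_identity(body: str) -> Optional[dict]:
    """Return ckan_version and site_title if the body looks like status_show output."""
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict) or doc.get("success") is not True:
        return None
    result = doc.get("result")
    # ckan_version is the field the table treats as definitive.
    if not isinstance(result, dict) or "ckan_version" not in result:
        return None
    return {
        "ckan_version": result["ckan_version"],
        "site_title": result.get("site_title"),
    }
```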

MLflow (Model Registry — registry-specific gaps)

Auth default: off; no auth unless explicitly configured with an auth plugin, and the default server exposes registry read/write

Exposure class: Model names, versions, stages (Staging/Production), artifact URIs (s3://, gs://, azure:// paths), model signatures; unauthenticated registry write is a model poisoning vector

| Label | Query | Rationale | FP Risk |
| --- | --- | --- | --- |
| primary | `http.html:"registered-models" port:5000` | Model registry UI contains this string | Low-Med |
| secondary | `http.html:"/api/2.0/mlflow/registered-models" port:5000` | API path in UI JS references | Low |
| ui-registry | `http.title:"MLflow" http.html:"Model Registry" port:5000` | UI page title + registry tab | Low |
| identity-probe | `GET /api/2.0/mlflow/registered-models/list` → `{"registered_models":[...]}` | Registry-specific endpoint; array key unique to MLflow registry | |
| artifact-leak | `GET /api/2.0/mlflow/model-versions/get-download-uri?name=<model>&version=1` → presigned URI | Cloud storage URI exposure; `s3://` bucket enumeration | |
| write-probe | `POST /api/2.0/mlflow/registered-models/create` with `{"name":"test"}` → 200 | Confirms unauthenticated write = model poisoning possible | |
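
Several platforms in this catalog share port 5000, which is why many rows carry a "Med" FP risk. The sketch below is one way to disambiguate a saved HTML body offline, using only the marker strings that appear in the tables above. The mapping and function name are hypothetical, and a match is a hint rather than a confirmation.

```python
# Marker strings copied from the http.html filters in this catalog.
PORT_5000_MARKERS = {
    "Amundsen": ["amundsen"],
    "Marquez": ["marquezproject", "marquez"],
    "OpenLineage": ["openlineage"],
    "Great Expectations": ["great_expectations"],
    "CKAN": ["/api/3/action", "ckan"],
    "MLflow": ["/api/2.0/mlflow/registered-models", "registered-models"],
}


def candidate_platforms(html: str) -> list:
    """Return every platform whose marker occurs in the HTML, case-insensitively."""
    lowered = html.lower()
    return [
        name
        for name, markers in PORT_5000_MARKERS.items()
        if any(marker in lowered for marker in markers)
    ]
```

A Marquez page that also mentions OpenLineage returns both names, which matches how the catalog treats OpenLineage as an ecosystem signal rather than a separate server.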

Priority Ranking for Harvest

| Rank | Platform | Reason |
| --- | --- | --- |
| 1 | OpenMetadata | CVE-2024-28255 actively exploited; auth bypass confirmed; high K8s prevalence |
| 2 | DataHub | GMS auth off by default; JWT non-verification even when “on”; LinkedIn pedigree = widespread enterprise use |
| 3 | Apache Atlas | Famous default creds; Hadoop ecosystem ubiquity; high-value targets |
| 4 | Marquez | No auth documented; full lineage graph exposed; growing OpenLineage adoption |
| 5 | Amundsen | No auth by default; full catalog read; Lyft-influenced orgs |
| 6 | MLflow (registry) | Auth off by default; model poisoning vector; prior surveys show large Shodan population |
| 7 | CKAN | Massive government deployment footprint; read is open by design; credential leak in resources |
| 8 | Collibra | Low population but extremely high-value enterprise targets |
| 9 | Great Expectations | Low population; exposure only when Data Docs served externally |
| | Monte Carlo | Not a Shodan target |
| | Soda Core/Cloud | Not a Shodan target |
| | Atlan | Not a Shodan target |
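
If the ranking is consumed by tooling, it can be held as plain data so that unranked platforms are filtered out explicitly rather than by position. This is a minimal sketch. The `Platform` class and `survey_targets` function are hypothetical, and the reasons are shortened from the table above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Platform:
    name: str
    rank: Optional[int]  # None means "not a Shodan target"
    reason: str


RANKING = [
    Platform("Monte Carlo", None, "Not a Shodan target"),
    Platform("OpenMetadata", 1, "CVE-2024-28255; auth bypass"),
    Platform("DataHub", 2, "GMS auth off by default"),
    Platform("Apache Atlas", 3, "Default creds"),
    Platform("Marquez", 4, "No auth documented"),
    Platform("Amundsen", 5, "No auth by default"),
    Platform("MLflow (registry)", 6, "Auth off by default"),
    Platform("CKAN", 7, "Read is open by design"),
    Platform("Collibra", 8, "Low population, high value"),
    Platform("Great Expectations", 9, "Only when Data Docs served externally"),
    Platform("Soda Core/Cloud", None, "Not a Shodan target"),
    Platform("Atlan", None, "Not a Shodan target"),
]


def survey_targets(platforms: List[Platform]) -> List[Platform]:
    """Drop unranked platforms and return the rest in rank order."""
    ranked = [p for p in platforms if p.rank is not None]
    return sorted(ranked, key=lambda p: p.rank)
```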