ML Governance / Data Catalog survey, 2026-05-29
Survey type: new-category population survey. Data catalogs, ML metadata stores, lineage trackers. Pre-survey intel: data/platform-intel/ml-governance-osint-2026-05-27.md.
Summary
Nine dorks. Six platforms. The category is well-secured at population scale, and that is the finding. The auth-on platforms run patched versions. The auth-off platforms are either Shodan-dark or empty demos. One unauthenticated Marquez server confirmed, and it held no production data.
DCWF KSAT coverage
Auto-derived from DCWF AI work-role rule files (ksat-tag).
- 672 (AI Test & Evaluation Specialist): K7003, K7004, K7044, S7068, S7070, S7075, T5904, T5919
- 733 (AI Risk & Ethics Specialist): K7040, K7051, S7067, T5854, T5893
- overlap (Common AI KSATs (all 5 roles)): K108, K1158, K1159, K22, K6311, K6900, K6935, K7003, K942
This is the auth-on-default thesis confirmed by its secure branch. Voice-AI (cat-17, same day) shipped auth-off and bled. ML governance ships auth-on at the catalog tier, operators patch, and the population holds. The shipping default is load-bearing in both directions.
Stage 0, Discover
| Platform | Dork | Total | Verdict |
|---|---|---|---|
| OpenMetadata | http.title:"OpenMetadata" port:8585 | 56 | clean, all real |
| DataHub | http.title:"DataHub" port:9002 | 27 | clean frontends |
| Apache Atlas | port:21000 http.title:"Atlas" + 2 variants | 0 | Shodan-dark |
| CKAN | http.html:"ckan" port:5000 | 53 | gov open-data, reads open by design |
| Marquez | http.title:"Marquez" | 50 | ~50% real, surname collisions |
Atlas returned zero across three variants (title, api/atlas/v2 path, "Apache Atlas"). Atlas lives in Cloudera and Hortonworks deployments on internal
networks, not public tier-2 cloud. The variant space is exhausted. This is a
category-tenancy negative, the same shape as NVIDIA Clara on the medical-edge
survey.
CKAN’s ckan_version JSON-field dork returned zero because the field lives in
the /api/3/action/status_show JSON body, not the crawled HTML. The HTML dork
found 53 government open-data portals (Turkish and Indonesian municipalities).
CKAN reads are open by design for public data portals, so an open dataset list
is not a finding.
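The reason the JSON-field dork misses can be shown with a small parser. This is a minimal sketch, not the survey tooling: it assumes the standard CKAN action-API envelope (a top-level "result" object) around the ckan_version field named above, and the function name ckan_version_from_status is hypothetical. Fed the status_show body it returns the version; fed crawled HTML it returns nothing, which is all a crawler that indexed only the HTML can see.

```python
import json
from typing import Optional


def ckan_version_from_status(body: str) -> Optional[str]:
    """Return ckan_version from a status_show response body, or None.

    Assumption: the body uses the usual CKAN action-API envelope,
    {"success": ..., "result": {"ckan_version": "..."}}.
    """
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        # Crawled HTML lands here: the field is simply not in that document.
        return None
    if not isinstance(doc, dict):
        return None
    result = doc.get("result")
    if not isinstance(result, dict):
        return None
    version = result.get("ckan_version")
    return version if isinstance(version, str) else None


if __name__ == "__main__":
    # Hypothetical sample bodies, not data from the surveyed hosts.
    api_body = '{"success": true, "result": {"ckan_version": "2.10.4"}}'
    html_body = "<html><head><title>Open data portal</title></head></html>"
    print(ckan_version_from_status(api_body))
    print(ckan_version_from_status(html_body))
```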
Stage 2, Verify
The verification was version-bucketing, and it inverted the headline.
OpenMetadata, the CVSS 9.8 that was not there. CVE-2024-28255 is an auth-bypass scored 9.8, exploited in the wild against Kubernetes clusters for cryptomining. It affects versions below 1.3.1. The dork found 56 OpenMetadata hosts. A survey that stopped at the dork would report 56 exposed catalogs with a 9.8 auth bypass.
The version endpoint settled it. GET /api/v1/system/version is unauthenticated
by design and returns the version string. Ten of ten sampled hosts ran 1.10.5
through 1.12.6. Every one is years past 1.3.1. None carry the bypass. The catalog
data endpoint confirmed the lock: GET /api/v1/tables returned 401 on every host
tested. The version string leaks, which is information disclosure at the INFO
tier. The data estate behind it is closed.
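The bucketing step can be sketched as a numeric comparison against the 1.3.1 fix boundary. This is an illustration under stated assumptions, not the survey's own code: the names FIXED_IN, parse_version, and bucket are hypothetical, and the version string is assumed to begin with three dot-separated integers. The comparison has to be numeric, because as plain strings "1.10.5" sorts below "1.3.1" and would be misfiled as affected.

```python
import re
from typing import Optional, Tuple

# First release that is not affected by CVE-2024-28255, per the text above.
FIXED_IN: Tuple[int, ...] = (1, 3, 1)


def parse_version(raw: str) -> Optional[Tuple[int, ...]]:
    """Parse a leading MAJOR.MINOR.PATCH triple; ignore any suffix."""
    match = re.match(r"\s*v?(\d+)\.(\d+)\.(\d+)", raw)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def bucket(raw: str) -> str:
    """Classify a version string as 'affected', 'patched', or 'unknown'."""
    version = parse_version(raw)
    if version is None:
        return "unknown"
    # Tuple comparison is element-wise and numeric, so (1, 10, 5) > (1, 3, 1).
    return "affected" if version < FIXED_IN else "patched"


if __name__ == "__main__":
    for sample in ("1.10.5", "1.12.6", "1.3.0", "not-a-version"):
        print(sample, bucket(sample))
```

A host lands in the affected bucket only when the parsed triple is strictly below the fix boundary; anything unparseable is kept as unknown rather than assumed safe.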
DataHub, the backend that was not exposed. DataHub’s GMS service on port 8080 is authentication-off by default and is the high-value surface. Ten of ten sampled hosts refused the connection on 8080. Only the frontend on 9002 faces the internet, behind its login. The auth-off surface exists in the software and was not deployed to the public internet on any sampled host. One host, 3.30.235.161, sits on AWS us-gov-west-1, frontend only.
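A connection-level check of this kind can be sketched with the standard library. The function name port_state and its four labels are assumptions made for illustration, and the target shown is a loopback placeholder, not a survey host. The point is the distinction between a refused connection, which is what the sampled hosts returned on 8080, and a timeout, which would suggest filtering rather than a closed port.

```python
import socket


def port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and report how it ended.

    Labels are illustrative: 'open', 'refused', 'filtered', 'unreachable'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: nothing is listening on this port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout: likely dropped by a firewall.
        return "filtered"
    except OSError:
        # Routing or name-resolution failure, or any other socket error.
        return "unreachable"


if __name__ == "__main__":
    # Placeholder target; only probe hosts that are in scope for you.
    print(port_state("127.0.0.1", 8080, timeout=1.0))
```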
Marquez, the one open door, empty. Marquez is authentication-off by default.
One host, 136.114.205.74 on Google Cloud, served /api/v1/namespaces with no
credentials. The namespaces were default and metalake_demo. The default
namespace held zero jobs and zero datasets. metalake_demo is the namespace
name from the Marquez tutorial. This is a demo deployment, open as the software
ships, holding no production lineage. A second Marquez host, 48.217.48.173,
returned HTTP 500 with a PostgreSQL connection error. The server is up and the
database is down, so it serves no lineage. The remaining sampled hosts did not
answer on the API ports.
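The three outcomes above (open, erroring, closed) can be separated from the status code and body alone. The sketch below is hedged: classify_namespaces and its verdict labels are hypothetical, and the response shape, a top-level "namespaces" list of objects with a "name" key, is an assumption about the Marquez API rather than something taken from the survey output.

```python
import json
from typing import List, Tuple


def classify_namespaces(status: int, body: str) -> Tuple[str, List[str]]:
    """Map a GET /api/v1/namespaces response to (verdict, namespace names)."""
    if status in (401, 403):
        return ("auth-required", [])
    if status >= 500:
        # Server up, backend failing (for example a database outage).
        return ("server-error", [])
    if status != 200:
        return ("other", [])
    try:
        doc = json.loads(body)
    except json.JSONDecodeError:
        return ("other", [])
    if not isinstance(doc, dict):
        return ("other", [])
    entries = doc.get("namespaces", [])
    if not isinstance(entries, list):
        return ("other", [])
    # Assumed shape: each entry is an object carrying a "name" string.
    names = [
        entry["name"]
        for entry in entries
        if isinstance(entry, dict) and isinstance(entry.get("name"), str)
    ]
    return ("unauthenticated", names)


if __name__ == "__main__":
    sample = '{"namespaces": [{"name": "default"}, {"name": "metalake_demo"}]}'
    print(classify_namespaces(200, sample))
    print(classify_namespaces(500, "database connection error"))
```

An unauthenticated verdict is only the start of triage: whether the namespaces hold any jobs or datasets decides if the exposure carries real lineage.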
Stage 1 through 7, the arsenal
aimap confirmed the picture: twelve services, one unauthenticated. It flagged the
ten OpenMetadata hosts at fingerprint-default critical but its enumerator
correctly downgraded each to MEDIUM “unauthenticated version disclosure”, the
same hardcoded-default-severity pattern aimap shows on MLflow and Flowise. Trust
the enumerator verdict, not the fingerprint default. It found the Marquez
unauthenticated namespaces and a Grafana on the same host as one OpenMetadata
instance, login-gated.
menlohunt swept the unauthenticated Marquez host for adjacent services and found only SSH and the Marquez port. No stacked Redis, no exposed Postgres, zero attack chains. Unlike the stacked voice-AI host surveyed the same day, this operator isolated the service. VisorGraph and VisorBishop returned nothing: bare cloud IPs with no certificate SAN to pivot on and no adjacent unauthenticated surface. BARE found no Metasploit coverage for either finding class. VisorScuba scored every host as passing, which is correct for the version-disclosure INFO tier and a gap for the Marquez unauthenticated API, which maps to no control.
Impact
The category-level result is low risk and high confidence. The data catalog is a reconnaissance goldmine when open: one exposed OpenMetadata or DataHub reveals every database, schema, PII tag, and connection string in an organization. None of the sampled production catalogs were open. The one open Marquez held no data.
The reconnaissance value of an exposed catalog is the reason to keep checking this category as it grows. An open OpenMetadata below 1.3.1, or a DataHub with GMS on a public 8080, would be a map to an entire data estate. The population did not contain one today.
Remediation
- OpenMetadata operators below 1.3.1 must upgrade now. CVE-2024-28255 is exploited in the wild.
- DataHub operators must keep GMS on 8080 off the public internet and change the default datahub/datahub frontend credentials.
- Marquez and Amundsen ship authentication-off. Operators must put them behind a reverse proxy with auth or bind them to localhost.
What the method could not see
Apache Atlas is Shodan-dark on public cloud and was not enumerated. A Cloudera or Hortonworks census needs internal-network access, not Shodan. CKAN’s open-by-design reads mask the real finding classes for that platform: API tokens leaked in resource descriptions and the path-traversal RCE CVE-2023-32321, neither pursued here. The OpenMetadata and Marquez samples were page-one of the result sets, not the full populations.
Toolchain provenance
| Tool | Result |
|---|---|
| JAXEN | Playwright web UI; 9 dorks, 6 platforms (Cloudflare paced) |
| aimap v1.9.39 | lean, 25 hosts x 11 ports; 12 services, 1 unauth |
| aimap-profile | Marquez + gov-adjacent DataHub: unclassified/commercial, no honeypot |
| VisorGraph | 0 nodes/edges (bare cloud IP, no cert SAN) |
| VisorBishop | 3 hosts severity=none, no ip-shadow findings |
| VisorSD | N/A: no Shodan key |
| VisorGoose | N/A: gov/edu scope |
| menlohunt | Marquez host: SSH + Marquez only, 0 chains (isolated) |
| recongraph | N/A: Shodan-dependent |
| nu-recon | N/A: simulated-only without live key |
| VisorPlus | components run individually |
| VisorLog | 25 events via aimap adapter -> nuclide.db |
| VisorScuba | 25 hosts passing (version-disclosure INFO; Marquez-unauth unmapped, gap) |
| BARE | no MSF coverage (0.547/0.427), first-party/novel |
| VisorCorpus | N/A: no LLM-adjacent surface (catalogs, not inference) |
| VisorAgent | N/A: controlled-target only; not fired at survey hosts |
| VisorRAG | N/A: no RAG surface |
| VisorHollow | N/A: Windows-only |
| cortex | run at codify on analysis |
| JS-bundle | N/A: catalogs serve own UI, no CDN-SPA secret bundle |