Corpus index built 2026-06-07 23:58 UTC

§ THE STACK / DATA LAYER

Object Storage

MinIO, S3, model & dataset stores

Vector stores, registries, memory, datasets: what the model knows and remembers.

What it is

Models and datasets are large (gigabytes to terabytes per artefact), and the de facto storage substrate for them is S3-compatible object storage. MinIO is the self-hosted, on-prem option (also bundled with most RAG distributions like Dify); AWS S3, Google Cloud Storage, and Cloudflare R2 are the cloud variants; Garage and SeaweedFS are the smaller open-source alternatives. Almost every model registry, fine-tuning job, and RAG document loader reads from or writes to one of these.

What goes wrong

MinIO ships with the default root credentials minioadmin / minioadmin and a web console that is conventionally published on port 9001. Most operators change the password but leave the console reachable; many also leave the S3 API on port 9000 with a public bucket policy that lets anyone list the bucket's contents. The buckets are typically named after the project (model-weights, training-data-2026, customer-uploads), and the keys inside them describe the artefact lifecycle. S3 buckets exhibit the same pattern at a different scale: misconfigured bucket policies, public ACLs left over from old aws s3 sync --acl public-read mistakes, and the now-classic enumeration weakness in which the bucket name is simply the company name plus “production”.
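To make the "public bucket policy" failure concrete, here is a minimal sketch of how an operator might flag a policy that grants listing or reading to anonymous callers. The policy document, the bucket name in it, and the helper names (public_grants, EXAMPLE_POLICY) are hypothetical; the sketch assumes the standard S3 policy JSON shape and deliberately ignores Condition blocks, so treat a hit as a prompt to review the policy, not as proof of exposure.

```python
import fnmatch
import json

# Concrete actions that matter for disclosure: listing keys and reading objects.
SENSITIVE_ACTIONS = ("s3:ListBucket", "s3:GetObject")


def _as_list(value):
    """Policy fields may be a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _is_anonymous(principal):
    """True when the principal is the wildcard, i.e. any caller at all."""
    if principal == "*":
        return True
    if isinstance(principal, dict):
        return "*" in _as_list(principal.get("AWS"))
    return False


def public_grants(policy_json):
    """Return the sensitive actions a policy allows for anonymous callers."""
    policy = json.loads(policy_json)
    found = set()
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow" or not _is_anonymous(stmt.get("Principal")):
            continue
        # Assumption: Condition blocks (IP allow-lists etc.) are not evaluated here.
        for pattern in _as_list(stmt.get("Action")):
            for action in SENSITIVE_ACTIONS:
                # Action patterns may contain wildcards such as "s3:*" or "s3:Get*".
                if fnmatch.fnmatchcase(action.lower(), pattern.lower()):
                    found.add(action)
    return sorted(found)


# Hypothetical policy of the kind described above: anyone may list and read.
EXAMPLE_POLICY = """
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Principal": {"AWS": ["*"]},
     "Action": ["s3:GetBucketLocation", "s3:ListBucket"],
     "Resource": ["arn:aws:s3:::model-weights"]},
    {"Effect": "Allow", "Principal": {"AWS": ["*"]},
     "Action": ["s3:GetObject"],
     "Resource": ["arn:aws:s3:::model-weights/*"]}
  ]
}
"""

if __name__ == "__main__":
    print(public_grants(EXAMPLE_POLICY))
```

An empty result means no anonymous Allow statement covers listing or reading; any non-empty result is worth a manual look at the statement and its conditions.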

How we test

We list buckets through the unauthenticated MinIO admin API where reachable, and check S3 buckets via pattern-based name enumeration (no brute force, only the candidates that follow from the operator’s known naming conventions). We confirm exposure with a single HEAD request against a bucket-listing URL; we do not download objects. Bucket names plus their key-prefix structure are the disclosure evidence.
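The enumeration and confirmation steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual tooling: the function names, the "acme" stem, and the suffix list are hypothetical, and the status-code reading reflects the usual S3 HEAD-bucket behaviour (200 when the caller may list, 403 when the bucket exists but access is denied, 404 when it does not exist), which other S3-compatible servers may not mirror exactly.

```python
import urllib.error
import urllib.request


def candidate_names(stems, suffixes, separators=("-",)):
    """Combine known naming stems with observed suffixes; no exhaustive wordlist."""
    names = []
    for stem in stems:
        variants = [stem] + [f"{stem}{sep}{suf}" for sep in separators for suf in suffixes]
        for name in variants:
            name = name.lower()
            # S3 bucket names are between 3 and 63 characters long.
            if 3 <= len(name) <= 63 and name not in names:
                names.append(name)
    return names


def classify(status):
    """Map the HTTP status of an anonymous HEAD on a bucket URL to a finding."""
    if status == 200:
        return "exists, anonymously listable"
    if status == 403:
        return "exists, access denied"
    if status == 404:
        return "no such bucket"
    if status == 301:
        return "exists, served from another region or endpoint"
    return "inconclusive"


def head_status(url, timeout=5.0):
    """Send one HEAD request and return the status code; no body is downloaded.

    Only point this at systems you are authorised to test.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses arrive as exceptions; the code is still the answer.
        return err.code


if __name__ == "__main__":
    # Hypothetical stem and suffixes, standing in for an operator's conventions.
    for name in candidate_names(["acme"], ["production", "model-weights"]):
        print(name)
```

Keeping the candidate list derived from a handful of stems and suffixes is what separates this from brute force: the request volume stays small, and each probe is a single HEAD whose status code is the only thing recorded.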