Document Parsers, Data Layer, NuClide Stack

What it is

Before a document can be embedded, it has to be turned into clean text. PDFs have layout, Word documents have tables, and slide decks have hierarchy. The document-parsing layer extracts all of that into the markdown or JSON the embedder expects. Unstructured.io is the multi-format incumbent. LlamaParse (LlamaIndex) is the cloud-API competitor optimised for RAG. marker is the open-source PDF-to-markdown specialist; MinerU (OpenDataLab) is the high-quality alternative. Docling (IBM) is the newer research-grade option. Every serious RAG pipeline runs documents through one of these before they reach the vector DB.

What goes wrong

The parser server processes a stream of operator-uploaded documents and caches the parsed output on disk. When the parser is exposed without auth, the document queue and the parsed-output cache are both reachable. That cache often contains documents the operator has marked private: internal contracts, HR files, and partner agreements, all sitting as plaintext markdown on the parser’s disk. The parser is also a high-CPU service, so an attacker can submit large or malicious PDFs either to burn the operator’s compute or to trigger known RCEs in the parsing libraries.
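Both failure modes above come down to the parser accepting work from anyone. A minimal sketch of an admission check an operator might place in front of the parse endpoint is shown here. It assumes a shared bearer token and a fixed upload cap; the function name `admit_upload`, the 20 MiB limit, and the status codes chosen are illustrative assumptions, not the behaviour of any of the parsers named above.

```python
import hmac

# Hypothetical cap; tune to the largest document the pipeline legitimately sees.
MAX_UPLOAD_BYTES = 20 * 1024 * 1024


def admit_upload(headers, body_size, expected_token, max_bytes=MAX_UPLOAD_BYTES):
    """Return (http_status, reason) for an incoming parse request.

    headers:        dict of request headers (assumed already case-normalised)
    body_size:      declared or measured size of the upload in bytes
    expected_token: the shared secret the operator has provisioned
    """
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    # Constant-time comparison avoids leaking the token through timing.
    token_ok = hmac.compare_digest(token.encode(), expected_token.encode())
    if scheme != "Bearer" or not token_ok:
        return 401, "missing or invalid bearer token"
    # Reject oversized documents before they reach the CPU-heavy parser.
    if body_size > max_bytes:
        return 413, "document exceeds upload cap"
    return 202, "accepted for parsing"
```

A size cap does not stop a small, deliberately malformed PDF, so a check like this complements patching the parsing libraries and isolating the parser process; it does not replace either.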

How we test

We probe the parser’s REST endpoint for the version banner and check the status endpoint for queue depth and recent-job filenames. The filenames characterise the operator’s document inventory and frequently identify the legal or business unit the parser is serving. We do not submit documents.
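As a rough illustration of what the status check yields, the sketch below summarises a status response that has already been retrieved. The JSON field names (`version`, `queue_depth`, `recent_jobs`, `filename`) are assumptions made for the example; real parsers expose different shapes, and nothing here submits a document or writes to the service.

```python
import json
from collections import Counter
from pathlib import PurePosixPath


def summarise_status(raw_json):
    """Extract version, queue depth, and job filenames from a status payload.

    The payload layout is hypothetical: a top-level object with "version",
    "queue_depth", and a "recent_jobs" list of objects carrying "filename".
    """
    status = json.loads(raw_json)
    jobs = status.get("recent_jobs", [])
    # Keep only jobs that actually report a filename.
    names = [job["filename"] for job in jobs if job.get("filename")]
    # File extensions give a quick picture of the document inventory.
    extensions = Counter(
        PurePosixPath(name).suffix.lower() or "(none)" for name in names
    )
    return {
        "version": status.get("version"),
        "queue_depth": status.get("queue_depth"),
        "filenames": names,
        "extensions": dict(extensions),
    }
```

Operators can run the same summary against their own parser to see exactly what an unauthenticated visitor would learn from the status endpoint.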

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Queued

We haven't surveyed this category yet. The technology is on our map; the receipts will follow when the cross-cloud survey lands. Browse the research feed for what's already published, or watch this page.
