What it is
Before a document gets embedded, it has to be turned into clean text. PDFs have layout. Word documents have tables. Slide decks have hierarchy. The document-parsing layer extracts all of that into the markdown or JSON the embedder expects. Unstructured.io is the multi-format incumbent. LlamaParse (LlamaIndex) is the cloud-API competitor optimised for RAG. marker is the open-source PDF-to-markdown specialist; MinerU (OpenDataLab) is the high-quality alternative. Docling (IBM) is the newer research-grade option. Most serious RAG pipelines run documents through one of these before they ever reach the vector DB.
What goes wrong
The parser server processes a stream of operator-uploaded documents and caches the parsed output on disk. When the parser is exposed without auth, the document queue and the parsed-output cache are both reachable. That cache often contains documents the operator has marked private: internal contracts, HR files, and partner agreements, all sitting as plaintext markdown on the parser’s disk. The parser is also a high-CPU service, so an attacker can submit large or malicious PDFs either to burn the operator’s compute or to trigger known remote-code-execution (RCE) vulnerabilities in the underlying parsing libraries.
How we test
We probe the parser’s REST endpoint for the version banner and check the status endpoint for queue depth and recent-job filenames. The filenames characterise the operator’s document inventory and frequently identify the legal or business unit the parser is serving. We do not submit documents.
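A minimal sketch of this kind of read-only probe, using only the Python standard library. The endpoint paths (`/version`, `/status`), the JSON field names (`queue_depth`, `recent_jobs`, `filename`), and the keyword map are all hypothetical placeholders: each parser exposes its own routes and response shapes, so they would need adjusting per product. The sketch only issues GET requests and never uploads a document.

```python
import json
import urllib.request
from urllib.error import HTTPError

# Hypothetical paths and field names; real parsers differ.
VERSION_PATH = "/version"
STATUS_PATH = "/status"

# Illustrative keyword map for guessing which unit the parser serves.
UNIT_KEYWORDS = {
    "legal": ("contract", "nda", "agreement"),
    "hr": ("payroll", "offer", "resume"),
}


def get_json(base_url, path, timeout=5.0):
    """Issue one read-only GET; return (status_code, parsed JSON or None)."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            status = resp.status
            body = resp.read()
    except HTTPError as err:
        # The server answered, but with an error code such as 401 or 403.
        return err.code, None
    except OSError:
        # Connection refused, DNS failure, timeout, and similar.
        return None, None
    try:
        return status, json.loads(body)
    except ValueError:
        # Body was not valid JSON (for example, an HTML login page).
        return status, None


def summarize_status(payload):
    """Extract queue depth, job filenames, and guessed units from a status body."""
    jobs = payload.get("recent_jobs", [])
    filenames = [
        job["filename"]
        for job in jobs
        if isinstance(job, dict) and "filename" in job
    ]
    units = sorted(
        unit
        for unit, words in UNIT_KEYWORDS.items()
        if any(word in name.lower() for name in filenames for word in words)
    )
    return {
        "queue_depth": payload.get("queue_depth"),
        "filenames": filenames,
        "likely_units": units,
    }


def probe(base_url):
    """Read-only probe: two GETs, no document submission."""
    _, version = get_json(base_url, VERSION_PATH)
    status_code, status = get_json(base_url, STATUS_PATH)
    return {
        "version": version,
        # A 200 without credentials suggests the status endpoint is open.
        "unauthenticated": status_code == 200,
        "status": summarize_status(status) if isinstance(status, dict) else None,
    }
```

Keeping `summarize_status` separate from the network calls means the filename classification can be checked against a captured response without touching the target again.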