aimap's AI-service classifier needs the ML data tier, not just the inference tier, NuClide Methodology

The insight

aimap classifies a target by what AI/ML services it can fingerprint on that target’s open ports. The catalog has been built incrementally around the inference and observability tiers: Ollama, vLLM, llama.cpp, MLflow, Phoenix, Langfuse, LangSmith, Helicone, Open WebUI, ChromaDB, Qdrant, Milvus, etc. It does not yet treat the standard ML data tier as AI-relevant: PostgreSQL, Redis, S3-compatible MinIO buckets, MailHog sinks, Kafka brokers, RabbitMQ.

On the PENTECH host that anchored the SmartShop AI case study, aimap’s 55-minute deep-enum across 19 ports returned exactly one “AI service found” (Apache Airflow on port 8080) despite the same host running an exposed MLflow tracker (port 5000), an exposed Redis (port 6379), an exposed PostgreSQL (port 5432) that backs the MLflow tracker, and a Postfix mail server. The visible AI/ML attack surface was understated by 4x.

Why this matters

The undercount has three downstream effects:

Operator-impact framing. Disclosure emails generated from aimap’s classification understate the operator’s actual blast radius. A recipient reading “1 unauth AI service” responds with different urgency than one reading “AI service running on the same host as its backing PostgreSQL, Redis cache, and orchestration scheduler.”
Risk scoring. VisorScuba and BARE consume aimap’s output. A host that exposes the full ML data tier should score worse than a host that exposes only the tracker, but the current pipeline treats them equivalently.
Operator attribution. Same-host port adjacency is a high-signal attribution feature (“the team that runs the MLflow tracker also runs the Postgres on the same VM”). Without the data tier in the catalog, this signal is lost.

The aimap-profile companion does catch some of this via Shodan-passive port enumeration, but the active aimap scan is where the per-port fingerprint evidence lives. The split classification weakens both stages.

What the catalog should add

Six ports/services worth treating as ML-data-tier AI signals when adjacent to an inference- or tracker-tier service on the same host:

Port	Service	AI-context signal
5432	PostgreSQL	MLflow backend store, Langfuse DB, embeddings tables
6379	Redis	Inference cache, session store for serving stacks
9000 / 9001	MinIO / S3-compatible	Local artifact store, RAG document corpus
1025 / 8025	MailHog	Inference-pipeline notification sink
9092	Kafka	Streaming inference, event-driven RAG
5672 / 15672	RabbitMQ	Inference queueing

Standalone, none of these are “an AI service.” Adjacent to a confirmed AI service on the same host, every one of them is part of the ML pipeline and should classify accordingly. The conjunctive matcher pattern (Insight #6: status_code + json_field + body_contains) extends naturally to a “adjacent-port” predicate: port:5432 alongside port:5000 becomes an MLflow-backend-store fingerprint that the standalone Postgres probe cannot detect.

How to apply

The rule was implemented in aimap v1.8.3 (2026-05-13). Implementation shape:

Run the standard port enum.
After Phase 2 fingerprinting, derive AdjacencyMatch records: for each open port on a host with at least one confirmed AI/ML service, if the port appears in the data-tier catalog, emit an adjacency finding scaled to the catalog’s per-port severity.
Adjacencies appear as a new section in the terminal output (“ML-ADJACENT INFRASTRUCTURE”) and as a separate adjacencies key in the JSON report.
Severity counts in the report summary include adjacency findings.

Reference implementation: adjacency.go in the aimap repository, covered by adjacency_test.go (6 tests). Live-validated 2026-05-13 against 78.135.66.61 (PENTECH BILISIM / SmartShop AI host). The host’s exposed Postgres on :5432 and Redis on :6379 now both emit as ML-adjacent findings tied to the MLflow + Airflow services on the same host.

Operational shape:

aimap -list ips.txt -ports "5000,80,443,8080" -o report.json
# Adjacencies present in the report under the "adjacencies" key,
# rendered in the terminal under ML-ADJACENT INFRASTRUCTURE.

Self-critical note

This is a tooling gap our own work surfaced against our own tooling. The PENTECH chain ran six tools across one host in parallel, and the discrepancy between aimap’s “1 AI service” and the full Shodan host record’s “11 ports / 27 CVEs / full ML pipeline” became visible only because we cross-checked the outputs.

The lesson generalizes: single-tool classifications should always be cross-checked against the broader infrastructure record when the host is under deep-dive. Catalog gaps in one tool are invisible from within that tool’s output. They only show up against an external reference.

When this could break

A host running just MLflow with co-located Postgres for a personal research project. The data tier exposure is real but the operator impact is genuinely low. The adjacency-based reclassification should preserve severity nuance, not flatten everything to HIGH.
Honeypots running fake ML stacks. Adjacent-port fingerprinting will flag honeypots that mimic the full stack. Insight #1 (honeypot self-filtering) already addresses this; the data-tier classifier should inherit the same protocol-strictness check.

Discovery context

The PENTECH chain in the SmartShop AI case study ran aimap, visorgraph, aimap-profile, Shodan host pull, JS-bundle extraction, and a controlled set of anonymous API probes against 78.135.66.61. The Shodan record showed ports [80, 110, 143, 443, 465, 587, 995, 5000, 5432, 6379, 8080]. aimap’s output reported 5000 as MLflow and 8080 as Airflow, but emitted only Airflow as an “AI service found”. MLflow was fingerprinted but suppressed at the AI-classification stage by a catalog-version mismatch. The Postgres + Redis + Postfix were never classified as AI-related at all.

The visible asymmetry between aimap’s narrow read and the broader Shodan record was the prompt for codifying this gap as an insight rather than a silent bug.

SOURCE · case-studies/commercial/smartshop-ai-pentech-disclosure-2026-05-13.md