AI Exposure Survey System: Analytics & OLAP Design
1. Purpose and Scope
This document describes the analytics and storage design for the AI exposure survey system.
- SQLite remains the system of record for findings and lifecycle state.
- A real-time OLAP layer (ClickHouse) provides population-scale analytics and supports new, aggregated tools for both humans and LLM agents.
- DuckDB is used for embedded, ad-hoc analysis over exported data.
This design is additive: OLAP augments the existing per-finding pipeline; it does not replace it.
2. Current Architecture (Baseline)
2.1 Canonical Stores
nuclide.db(VisorLog)- Lifecycle-tracked findings store (statuses, disclosure state).
empire.db(JAXEN)- Shodan harvest staging DB.
~/recon/...- Raw per-survey artifacts on disk; evidence packs for disclosures.
- JSON shapes
- Stable schemas passed between tools; outputs are valid inputs to downstream tools.
2.2 Tools and Stages (High Level)
- Discovery and staging:
- Shodan → JAXEN (
empire.db) - Masscan, CT logs, DNS, WHOIS, JS bundle analysis
- Shodan → JAXEN (
- Enrichment and attribution:
VisorGraphcert-pivot (IP → operator via SANs & WHOIS)aimapfingerprinting (service/framework tag)
- Policy and scoring:
VisorScuba(Rego) → compliance score (0–10) + criticality tier- Severity from PII presence, unauth admin surface, version disclosure
- Post-processing:
- CVE links from Shodan vulns + version inference
- Honeypot filtering
nuclide-contactfor recipient resolution- Country/sector tagging
VisorLogdedupe by IP
SQLite + files hold all authoritative state and evidence.
3. New Requirements for Population-Scale Analytics
3.1 Scale and Retention
- Bursty ingestion: 500–1000 findings/min during survey runs.
- Steady state: <100 findings/day.
- Ledger growth: ~500 findings/month (across active surveys).
- Retention goals:
- Hot (fast queries): 90 days.
- Warm (trend analysis): 1 year.
- Cold (raw artifacts/evidence): indefinite (on disk).
3.2 Latency and Cadence
- Active exploitation alerts: 5–15 minutes from probe → alert.
- New high-severity findings: visible in dashboards within 1 hour (no paging).
- Trend and population studies: daily/weekly; focus on rates, not per-instance immediacy.
- Drift checks: hourly cron; re-probe stale findings.
- Disclosure status updates: propagate to analytics within 10 minutes of
VisorLogedits.
3.3 Population-Scale Queries (Examples)
Must be “interactive” (sub-second to a few seconds):
- New high-severity findings since last check, grouped by category & framework.
- Auth-off rate per platform class over the last 90 days.
- Per-operator exposure picture for any org/ASN/netblock.
- Disclosure pipeline state: open & undisclosed by severity.
- Cross-platform multi-finding operators (same org leaks ≥2 categories).
- Re-probe staleness queue: findings last verified > N days ago.
- CVE distribution across open findings.
4. Target Architecture
4.1 Storage Roles
-
SQLite (existing)
- Source of record for:
- Findings, lifecycle state, disclosure history.
- Rego outputs, scoring, tags.
- Backed by raw evidence under
~/recon.
- Source of record for:
-
ClickHouse (new, real-time OLAP layer)
- Analytic mirror over findings and enrichment outputs.
- Supports:
- Time-windowed aggregation (last 15m, 1h, 90d, 1y).
- Multi-dimensional group-bys (framework, platform class, sector, country, org, ASN, category, severity, CVE).
- Feeds dashboards, daily/weekly rate studies, and LLM population-level tools.
-
DuckDB (new, embedded analytics)
- Local, ad-hoc analysis engine.
- Reads:
- Parquet exports from ClickHouse,
- or snapshots from SQLite/JSON.
- Used for:
- Prototyping metrics,
- One-off analysis,
- Testing new policy/taxonomy changes on subsets.
4.2 Data Flow
-
Scanners and harvesters populate staging and evidence:
- JAXEN →
empire.db, raw JSON artifacts under~/recon.
- JAXEN →
-
Enrichment pipeline:
VisorGraph→ operator mapping.aimap→ service/framework tagging.VisorScuba→ Rego-based compliance and criticality scores.- CVE mapping, honeypot filtering, recipient resolution.
-
SQLite update:
nuclide.db(VisorLog) records:- Findings, tags, scores, status, disclosure metadata, verification timestamps.
-
OLAP ingestion:
- A small ETL/sync process:
- Reads from
nuclide.db(and/or JSON exports). - Writes to ClickHouse fact and dimension tables.
- Reads from
- Batched updates aligned with:
- Survey runs,
- Hourly drift jobs,
- Disclosure status changes.
- A small ETL/sync process:
-
DuckDB usage:
- Snapshots or exports from ClickHouse/SQLite for:
- Local analysis,
- Testing new Rego policies and metrics on subsets of the data.
- Snapshots or exports from ClickHouse/SQLite for:
5. OLAP Schema (High-Level)
5.1 Fact Table: findings
Example columns (non-exhaustive):
- Identity:
finding_id(PK)survey_idfirst_seen_at,last_verified_at
- Target:
ip,port,hostname,asn,cidrorg_id(operator),country,sector
- Classification:
platform_class(e.g., LLM gateway, vector DB, MCP, RAG endpoint, data layer)framework/framework_versioncategory(taxonomy category)
- Security:
auth_present(bool/enum)severity_tiercompliance_score(0–10, VisorScuba)criticality_tierpii_present(bool/enum)admin_surface_exposed(bool)version_disclosed(bool)cve_ids(array or normalized join table)exploitation_indicator(bool/enum)
- Lifecycle:
status(open, in_progress, closed, etc.)disclosure_sent(bool)disclosure_first_sent_atdisclosure_last_updated_at
Dimensions (optional separate tables if needed for normalization/readability):
org,asn,framework,category,survey,country,sector,cve.
6. OLAP-Backed Queries and Tools
6.1 Human-Facing and Dashboard Queries
Backed directly by ClickHouse:
- New high-severity findings since last check, grouped by
category,framework. - Auth-off rate per
platform_classover last 90 days (per day/week). - Exposure snapshot for a given
org_id/asn/cidr:- Counts by
category,severity, status. - Time series of new vs. closed.
- Counts by
- Disclosure pipeline state:
status in ('open','in_progress') AND disclosure_sent=falsegrouped byseverity,framework,sector.
- Multi-category operators:
- Group by
org_id,count_distinct(category) >= 2.
- Group by
- Re-probe staleness queue:
status != 'closed' AND last_verified_at < now() - N days, sorted by risk score.
- CVE distribution:
- Group by
cve_id,framework,categoryoverstatus='open'.
- Group by
6.2 LLM/AI-Backed Tools (Additions)
New tools that call OLAP instead of SQLite directly:
get_auth_off_rates(window, group_by)get_population_trends(framework, window)get_operator_exposure(org_or_asn_or_cidr)get_disclosure_backlog(group_by)get_multi_category_operators(min_categories)get_reprobe_candidates(N_days)get_cve_distribution(filter)
These are additive to existing per-finding tools:
get_findings,get_finding_historyaimap_fingerprint,visorgraph_pivotnuclide_contact,bare_rankvisorscuba_score,visorcorpus_generatedraft_disclosure,cluster_find
LLM agents can now perform both:
- Local reasoning (per finding), and
- Population-tier reasoning (rates, trends, cross-operator patterns).
7. Decision Logic and Cadences Using OLAP
- Active exploitation alerts:
- OLAP query over last 5–15 minutes for findings with exploitation indicators; drive pages.
- New high-severity dashboards:
- Hourly sync from SQLite → ClickHouse; dashboards query “since last check”.
- Daily/weekly auth-off-rate and population-thesis jobs:
- Scheduled OLAP queries → reports and maintainer-facing evidence packs.
- Hourly drift:
- OLAP-driven
get_reprobe_candidates(N_days)→ scheduler.
- OLAP-driven
- Disclosure status:
- Changes in
nuclide.dbmirrored into OLAP within ~10 minutes → up-to-date pipeline views.
- Changes in
8. Migration and Safety Considerations
- ClickHouse is a mirror, not a replacement:
- All authoritative state remains in SQLite + raw artifacts.
- OLAP ingestion is idempotent and recoverable from SQLite and evidence packs.
- The system already surveys ClickHouse for exposure:
- Apply same standards internally: no unauthenticated exposure of analytic cluster.
- OLAP failure modes:
- If ClickHouse is down or lagging:
- Per-finding tools and disclosures continue from SQLite.
- Population-level views degrade gracefully (stale/absent analytics).
- If ClickHouse is down or lagging:
9. Companion Documents
reference/olap-schema-clickhouse.sql, concreteCREATE TABLEsketches for ClickHouse fact + dimensions. Partition keys, sort keys, optional CVE / org / framework dimension tables. Future: materialized views for the Section 6.1 query set.reference/olap-tools-spec.md, spec for each of the 8 OLAP-backed LLM tools: input schema, output schema, SLO, ClickHouse SQL sketch, fallback behavior when the OLAP layer is unavailable.reference/olap-migration.md, bootstrap procedure for ClickHouse fromnuclide.db+~/recon, change-detection sync mechanics (high-water mark onupdated_at, ReplacingMergeTree for in-place upserts), DuckDB ad-hoc workflow, verification + invariants, security notes (self-dogfooding the survey rules on the analytic cluster), and an implementation checklist.