OLAP Migration: Bootstrapping and Maintaining the Analytics Layer
Source: https://github.com/nuclide-research/AI-LLM-Infrastructure-OSINT/blob/main/reference/olap-migration
Companion to: reference/realtime-olap-architecture.md, reference/olap-schema-clickhouse.sql, reference/olap-tools-spec.md.
This document describes how to:
- Bootstrap ClickHouse from existing SQLite +
~/recon. - Keep ClickHouse in sync with
nuclide.dbandempire.db. - Use DuckDB for ad-hoc analysis.
- Handle failure and correctness concerns.
SQLite remains the system of record. ClickHouse is an analytic mirror for population-scale reasoning; DuckDB is for local experiments.
1. Goals
- Populate ClickHouse with historical findings and enrichment outputs.
- Support the OLAP-backed tools described in
olap-tools-spec.md(get_auth_off_rates,get_operator_exposure, etc.). - Ensure we can rebuild ClickHouse at any time from:
nuclide.db(VisorLog),empire.db(JAXEN, as needed),- Raw artifacts under
~/recon.
No canonical data lives only in ClickHouse.
2. Baseline: Existing State
-
nuclide.db(SQLite):- Lifecycle-tracked findings, current classification, disclosure state, timestamps, Rego scores, tags.
-
empire.db(SQLite):- Shodan staging, intermediate harvest metadata.
-
~/recon/...:- Raw per-survey artifacts (HTTP responses, banners, CT dumps, JS bundles, etc.).
- Evidence packs referenced by findings.
-
JSON shapes:
- Stable contracts between tools (VisorGraph, aimap, VisorScuba, VisorLog, etc.).
These remain unchanged by the migration.
3. ClickHouse Bootstrap
3.1 Create ClickHouse Schema
Apply the schema (or adjusted variant) from olap-schema-clickhouse.sql, including:
findingsfact table.- Optionally
finding_cves,org_dim,framework_dimif/when required.
Run this once against the ClickHouse cluster.
3.2 Export from SQLite
Write a small CLI (e.g., export_findings.py) that:
- Connects to
nuclide.db. - SELECTs all relevant fields for the
findingstable. - Emits either:
- CSV files, or
- Parquet files (preferred), or
- Direct HTTP inserts to ClickHouse.
Recommended: Parquet to allow reuse with DuckDB.
Example shape (pseudo-SQL from SQLite):
SELECT
finding_id,
survey_id,
first_seen_at,
last_verified_at,
created_at,
updated_at,
ip,
port,
hostname,
asn,
cidr,
org_id,
org_name,
country,
sector,
platform_class,
framework,
framework_version,
category,
tags_json, -- JSON array of strings
auth_present,
severity_tier,
compliance_score,
criticality_tier,
pii_present,
admin_surface_exposed,
version_disclosed,
exploitation_indicator,
cve_ids_json, -- JSON array of strings
status,
disclosure_sent,
disclosure_first_sent_at,
disclosure_last_updated_at,
last_status_change_at,
is_honeypot,
source_systems_json,
survey_version,
policy_version
FROM findings; -- or the appropriate table name in nuclide.db
Then transform JSON fields into arrays during export, or let ClickHouse parse JSON.
3.3 Load into ClickHouse
Option A: Use clickhouse-client with CSV/TSV.
Option B (recommended): load Parquet via HTTP or clickhouse-local.
Example (CSV for illustration):
clickhouse-client --query="INSERT INTO findings FORMAT CSV" \
< findings_export.csv
Or Parquet:
INSERT INTO findings
SELECT *
FROM file('findings_export.parquet', Parquet);
Bootstrap should be idempotent if possible (or run once on a fresh DB).
4. Ongoing Sync (SQLite → ClickHouse)
4.1 Change Detection Strategy
We need to propagate:
- New findings.
- Updates to severity/score/classification.
- Status/disclosure changes.
- Re-verification timestamps.
Approach:
- Add a monotonic column in SQLite (if not already present):
updated_atorlast_synced_at.
- Track a high-water mark per sync job:
- e.g.,
last_synced_updated_at.
- e.g.,
4.2 Sync Process
Implement a periodic job (e.g. sync_clickhouse.py) that runs every N minutes (e.g., 5–10 min):
- Read
last_synced_updated_atfrom a local state file/table. - Query
nuclide.dbfor rows withupdated_at > last_synced_updated_at. - Export deltas to a temp file (CSV/Parquet) or in-memory.
- Upsert into ClickHouse.
Because ClickHouse doesn’t have native row-level UPDATE in MergeTree, use one of:
-
Re-insert pattern:
- Use a
ReplacingMergeTreewith aversionorupdated_atcolumn. - Insert a new row for each update; queries use the latest version.
- Use a
-
Insert-only with “current” flag (for now):
- If the volume is low, simpler: insert new rows and filter by latest
updated_atperfinding_idin views.
- If the volume is low, simpler: insert new rows and filter by latest
Example engine tweak:
ENGINE = ReplacingMergeTree(updated_at)
PARTITION BY toYYYYMM(first_seen_at)
ORDER BY (finding_id);
Then sync job simply inserts delta rows; merges resolve to latest state.
- Update
last_synced_updated_atafter a successful batch.
4.3 Failure Handling
- If sync fails:
- Do not advance high-water mark.
- Next run reattempts the same range.
- If ClickHouse is down:
- SQLite + pipeline continue to function; only analytics degrade.
- To rebuild:
- Drop/empty
findings. - Rerun full bootstrap from SQLite +
~/recon.
- Drop/empty
5. DuckDB Integration
DuckDB is for:
- Local, ad-hoc analysis.
- Prototyping new metrics/reports on subsets of data.
- Testing new Rego policies or taxonomy transformations.
5.1 Data Sources for DuckDB
- Parquet exports from ClickHouse (via
SELECT ... INTO OUTFILE). - Direct reads from Parquet generated by SQLite export.
- Possibly small snapshots of
~/reconJSON.
Example workflow:
-- In DuckDB
CREATE TABLE findings AS
SELECT * FROM read_parquet('findings_export.parquet');
-- Experiment with a new aggregate
SELECT platform_class,
count(*) AS total,
avg(compliance_score) AS avg_score
FROM findings
GROUP BY platform_class;
Once a new metric/projection looks useful, port it into a ClickHouse view or a Python/Go tool.
6. Verification and Invariants
After bootstrap and initial sync, verify:
-
Row counts
count(*)in SQLite vs ClickHouse (with latest-version semantics).- Allow for minor skew if you ignore closed/old findings in ClickHouse.
-
Spot-check fields
- Sample N
finding_ids; compare fields between SQLite and ClickHouse.
- Sample N
-
Key queries
- Run a selection of OLAP-backed queries (auth-off rates, backlog, multi-category operators) and sanity-check results against known examples.
Invariants:
- No canonical data exists only in ClickHouse.
- Every ClickHouse row is derivable from:
nuclide.db+empire.db+ artifacts under~/recon.
- Sync is monotonic: we never delete or mutate SQLite based on ClickHouse.
7. Security and Exposure
- Treat ClickHouse as sensitive:
- It aggregates cross-operator, cross-sector exposure data.
- Apply the same survey standards to your own cluster:
- No unauthenticated exposure,
- TLS + auth,
- Network restrictions.
The system already knows how to detect exposed ClickHouse instances; apply those checks locally (self-dogfooding).
8. Implementation Checklist
- Finalize ClickHouse schema (
olap-schema-clickhouse.sql). - Implement
export_findingsfromnuclide.dbto Parquet. - Implement bootstrap loader into ClickHouse.
- Implement periodic
sync_clickhousejob (delta export + insert). - Add a high-water mark store for
updated_at. - Add basic verification script (row counts + spot-checks).
- Add DuckDB example notebook/script for local analysis.
- Add monitoring/alerting on sync failures and ClickHouse health.
This is a living migration plan: adjust schema fields, sync cadence, and engine options as the system evolves, but the core pattern, SQLite as system of record, ClickHouse as analytic mirror, DuckDB as scratchpad, stays stable.
9. Reference Install: rooster (2026-05-06, decommissioned same day)
Status: DECOMMISSIONED 2026-05-06. Container, cron, credentials, data, and log directories were removed several hours after install. Rationale: at the current ledger scale (~600 findings, ~500 added per month), DuckDB embedded against
nuclide.dbanswers every population-tier query the OLAP-tools-spec describes in milliseconds. The ClickHouse mirror was running but no read path was actually using it,data/olap-demo.pyreads SQLite directly via DuckDB’ssqlite_scanner, not ClickHouse. The infrastructure (schema, exporter, sync, bootstrap, cron wrapper) is preserved in the repo and is ready to redeploy when scale or use case warrants, e.g., crossing 100k findings, public-facing dashboards, or sub-second materialized-view alerting.The notes below remain canonical for the next install. Re-deploy effort is ~5 minutes given the gotchas are now documented.
First production install was on rooster (Linux 6.17, Docker 29.4.2, IPv6 disabled). 581 rows bootstrapped end-to-end. Notes for future installs:
Container
ClickHouse 26.3.9.8 official image (clickhouse/clickhouse-server:latest). Run with --network host so the listener binds directly to rooster’s network stack, then lock to localhost via the listen-host config override below. Bridge networking failed on this host because docker-proxy and the IPv6-disabled config interact badly, direct curl to the container’s bridge IP got Connection refused even when the listener was visibly up inside the container.
docker run -d \
--name nuclide-clickhouse \
--restart unless-stopped \
--network host \
-v ~/clickhouse-data:/var/lib/clickhouse \
-v ~/clickhouse-logs:/var/log/clickhouse-server \
-v ~/clickhouse-config/listen.xml:/etc/clickhouse-server/config.d/listen.xml:ro \
--env-file ~/.config/nuclide/clickhouse-credentials.env \
-e CLICKHOUSE_DEFAULT_ACCESS_MANAGEMENT=1 \
--ulimit nofile=262144:262144 \
clickhouse/clickhouse-server:latest
listen.xml config override
<clickhouse>
<!-- Container runs with host networking; bind listener to localhost only.
Dogfoods the survey rule against exposing ClickHouse on a public IP. -->
<listen_host replace="replace">127.0.0.1</listen_host>
</clickhouse>
Two gotchas resolved during this install:
-
IPv6 dual-stack default fails on IPv6-disabled hosts. ClickHouse’s default config uses
<listen_host>::</listen_host>(IPv6 wildcard with IPv4-mapped fallback). On a host with IPv6 disabled the bind returnsEAI: Address family for hostname not supported. Thelisten_try=1fallback to0.0.0.0did not work cleanly here; an explicitreplace="replace"override was required. -
XML comments cannot contain
--. The XML parser rejects--network hostinside a<!-- ... -->comment because-->ends the comment. SAXParseException at parse time put the container into a restart loop. Avoid hyphen-pairs inside comments.
Credentials
~/.config/nuclide/clickhouse-credentials.env (mode 600)
CLICKHOUSE_HOST=127.0.0.1
CLICKHOUSE_PORT=8123
CLICKHOUSE_USER=default
CLICKHOUSE_PASSWORD=<generated>
CLICKHOUSE_DATABASE=nuclide
Generate password with openssl rand -base64 24 | tr -d '/+=' | head -c 32.
Schema notes (reflected in olap-schema-clickhouse.sql)
Two schema corrections shipped during bootstrap:
finding_cvesoriginally hadPARTITION BY toYYYYMM(now()). ClickHouse rejects this, partition keys must be deterministic functions of row data, not of system clock. Removed PARTITION BY (single partition is fine for this small join table).ipanddisclosure_first_sent_at/disclosure_last_updated_atare nowNullable(IPv4)/Nullable(DateTime). Empty string""is not parseable as either type; rows without resolved IP or pending-disclosure findings need actualNULL.
Insert path
Both bootstrap-clickhouse.py and sync-clickhouse.py use ClickHouse’s native JSONEachRow format via clickhouse_connect.client.raw_insert(). Two reasons:
- The exporter emits ISO-8601 timestamp strings (e.g.
"2026-05-02T23:53:51Z");client.insert()expects Pythondatetimeobjects and crashes with'str' object has no attribute 'timestamp'. ClickHouse server-side parses these viaparseDateTimeBestEffortnatively. - The exporter emits a superset of columns (carries
notes,match_path,tldfor evidence-pack tracing). The insert-side projects each row dict to the table’s declared column set before serialization; extras are silently dropped.
Self-dogfood
After install, verify the listener is actually localhost-only:
ROOSTER_IP=$(hostname -I | awk '{print $1}')
curl -s -m 3 "http://$ROOSTER_IP:8123/ping" # should fail (connection refused)
curl -s -m 3 "http://127.0.0.1:8123/ping" # should return "Ok."
Confirms the survey rule “no unauthenticated ClickHouse exposure on a public interface” is being applied to our own analytics layer.
Cron-driven delta sync
data/cron-sync-clickhouse.sh is the documented entrypoint. It sources the credentials env, runs sync-clickhouse.py --execute, and appends UTC-bracketed output to ~/.config/nuclide/clickhouse-sync.log. Suggested crontab entry (every 10 minutes):
*/10 * * * * /home/cowboy/AI-LLM-Infrastructure-OSINT/data/cron-sync-clickhouse.sh
A typical run on no-delta is logged as:
[2026-05-06T03:21:05Z] sync starting
Delta rows: 0
Nothing to sync. State unchanged.
[2026-05-06T03:21:05Z] sync finished (exit 0)
When new findings land in nuclide.db, the next cron tick advances the watermark and inserts only the deltas. Watermark state at ~/.config/nuclide/clickhouse-sync-state.json (mode 600). To force a full re-sync (e.g., after a schema change): python3 data/sync-clickhouse.py --reset then re-run.