Speech & Audio, Model Layer, NuClide Stack

What it is

Speech models translate between text and audio in both directions. Whisper (OpenAI) is the de facto standard speech-to-text engine; Piper is Rhasspy’s small, fast text-to-speech (TTS) engine; Coqui XTTS is the high-quality multi-speaker TTS that outlived the company behind it; RVC (Retrieval-based Voice Conversion) converts one person’s voice into another’s. Servers such as wyoming-piper, openedai-speech, and the reference Whisper server deployments wrap these models in HTTP APIs.

What goes wrong

The model itself isn’t the exposure. The audio it processes is. Whisper servers exposed without auth become free transcription endpoints; we’ve found deployments where the operator was clearly using their server to transcribe internal meetings, with the audio paths in the request log telling the story. RVC servers carry an additional risk: the operator’s trained voice models are stored on disk and served via the API. An attacker can pull a celebrity or executive voice model, then synthesise arbitrary speech in that voice.

How we test

We probe /v1/audio/transcriptions and /api/voices for the model and voice inventory, then characterise the kind of audio the operator processes from the file-extension distribution in the recent-jobs endpoint. We never submit audio. The voice-model catalogue alone is enough to identify problematic deployments: any voice model whose name matches a real person warrants disclosure to that person’s representation, not just to the cloud abuse desk.
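The two analysis steps described above can be sketched offline, with no network calls and no audio involved. This is a minimal illustration, not our tooling: it assumes the job file paths and voice-model names have already been collected into plain lists, and the function names, the watchlist, and the token-matching rule are all hypothetical choices made for the example. The response formats of the endpoints themselves are not assumed here.

```python
import re
from collections import Counter
from pathlib import PurePosixPath


def extension_distribution(job_paths):
    """Tally lowercase file extensions across a list of job file paths.

    Paths without an extension are counted under the key "(none)".
    """
    counts = Counter()
    for path in job_paths:
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


def flag_voice_models(voice_names, watchlist):
    """Return (voice_name, person) pairs where the voice-model name
    contains every token of a watchlist name.

    Matching is a deliberately simple heuristic (hypothetical): names are
    lowercased and split on non-alphanumeric characters, and a model is
    flagged when all tokens of a person's name appear among its tokens.
    A hit is a lead for manual review, not a confirmed identification.
    """
    flagged = []
    for name in voice_names:
        tokens = set(re.split(r"[^a-z0-9]+", name.lower())) - {""}
        for person in watchlist:
            person_tokens = set(person.lower().split())
            if person_tokens and person_tokens <= tokens:
                flagged.append((name, person))
                break  # one match per voice model is enough to flag it
    return flagged
```

Given a list of job paths, `extension_distribution` returns a `Counter` keyed by extension, which is enough to tell a server handling meeting recordings from one handling short voice clips. `flag_voice_models` is intentionally conservative about what it claims: it only surfaces names for a human to check before any disclosure.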

Receipts

Research

Every survey, case study, and disclosure we've published that touches this layer of the stack. Counts on the cells above tally these directly.

Cross-cloud surveys (4)

Survey May 28, 2026

Voice/Audio AI Infrastructure Survey

DCWF KSAT coverage


Survey May 15, 2026

Voice-Cloning Population Survey: Shodan-Reachable Slice (2026-05-15)

Survey of the Shodan-reachable voice-cloning surface (RVC / GPT-SoVITS / Applio / OpenVoice / ChatTTS / F5-TTS) and adjacent voice-TTS platforms. The aimap fingerprints for these platforms were shippe…


Survey May 15, 2026

Whisper ASR Population Survey (2026-05-15)

Population-scale survey of Whisper ASR (speech-to-text) deployments. The canonical OpenAI Whisper plus the popular forks (whisper.cpp, faster-whisper, WhisperX). aimap fingerprints shipped 2026-05-08;…


Survey May 4, 2026

Speech & Audio AI on Tier-2 Cloud: Auth Posture Survey

Mass-scan of port 9000 (whisper-asr-webservice default + faster-whisper-server common) across the same 76 tier-2 /16 ranges (3.55M IPs), Scaleway + OVH + Linode. 10,991 port-open candidates → 6 confir…


Field cases (2)

Case May 29, 2026

Voice/Audio AI re-run: Category 17, 2026-05-29

Fifteen dorks. Twenty-eight candidates. Six confirmed unauthenticated voice services across five hosts. One four-service stacked host. Four false positives killed at the verification stage, including…


Case May 3, 2026

UC Berkeley: Course AI Assistant, Unauthenticated Memory Injection

roar-art.EECS.Berkeley.EDU (128.32.43.210) runs a production FastAPI service called "Course AI Assistant API" serving AI-assisted tutoring across EECS courses. The /api/chat/memory-synopsis endpoint i…



Other categories in this layer

Ollama

vLLM

Triton Inference Server

Embedding Servers

llama.cpp