What it is
Speech models translate between text and audio in both directions. Whisper (OpenAI) is the de facto standard speech-to-text engine; Piper is Rhasspy’s small, fast TTS; Coqui XTTS is the high-quality multi-speaker TTS that outlived the company that built it; RVC (Retrieval-based Voice Conversion) is the model that turns one person’s voice into another’s. Servers like wyoming-piper, openedai-speech, and the Whisper-server reference deployments wrap these models in network APIs: HTTP in most cases, the Wyoming protocol over TCP in the case of wyoming-piper.
What goes wrong
The model itself isn’t the exposure. The audio it processes is. Whisper servers exposed without auth become free transcription endpoints; we’ve found deployments where the operator was clearly using their server to transcribe internal meetings, with the audio paths in the request log telling the story. RVC servers carry an additional risk: the operator’s trained voice models are stored on disk and served via the API. An attacker can pull a celebrity or executive voice model, then synthesise arbitrary speech in that voice.
How we test
We probe /v1/audio/transcriptions and /api/voices for the model and voice
inventory, then characterise what kind of audio the operator processes from the
file-extension distribution in the recent-jobs endpoint. We never submit audio:
the voice-model catalogue is sufficient to identify problematic deployments.
Any voice model whose name matches a real person warrants disclosure to that
person’s representation, not just to the cloud abuse desk.
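
The triage step described above can be sketched offline, once the inventory responses have been collected. This is a minimal illustration, not our actual tooling: it assumes the recent-jobs endpoint yields a list of file paths and the voice catalogue yields a list of model names, and both shapes are hypothetical. The function names and the watchlist of real-person names are also invented for the example. Nothing here sends audio or makes a network request.

```python
import re
from collections import Counter
from pathlib import PurePosixPath


def extension_distribution(job_paths):
    """Count lower-cased file extensions across job file paths.

    `job_paths` is assumed to be a list of path strings taken from a
    recent-jobs listing (hypothetical response shape).
    """
    counts = Counter()
    for path in job_paths:
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


def normalise(name):
    """Lower-case and reduce punctuation to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def flag_voice_models(voice_names, watchlist):
    """Return (voice_name, matched_person) pairs.

    A voice is flagged when a watchlist name appears in it as a run of
    whole words, so 'Jane_Doe-v2' matches 'Jane Doe' but 'joann_leeds'
    does not match 'Ann Lee'. The watchlist is a hypothetical input.
    """
    flagged = []
    for voice in voice_names:
        padded_voice = f" {normalise(voice)} "
        for person in watchlist:
            if f" {normalise(person)} " in padded_voice:
                flagged.append((voice, person))
                break  # one match is enough to warrant review
    return flagged
```

Name matching of this kind only shortlists candidates. A human still has to confirm that a flagged model refers to the real person before any disclosure is made.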